The Ultimate 686 Benchmark Comparison

Re: The Ultimate 686 Benchmark Comparison

Postby kool kitty89 » 2016-2-29 @ 13:46

Further clarification on the Quake 1 vs 2 performance oddities:

P6 CPUs (other than the PII overdrive) and all 3 versions of the VIA C3 core run Quake 2 significantly faster than the same processors do Quake 1 (in 640x480 software mode), while P5 and all other Socket 5/7 processors perform significantly worse in Quake 2 than 1.

The gap between K6 and K6-2 as well as K6-2 and K6-3 (and 2+) also widens with Quake 2 compared to 1. (and the big cache Xeons and PPro have more consistent advantages over their small-cache and half-speed cache counterparts too -the Celeron vs Pentium II gap also narrows a lot more than in Quake 1) In this respect at least, it seems like Quake 2 favors fast L2 caches (even small ones) more than Quake 1 did. The gap is also wider between Nehemiah and both Samuel and Ezra cores for Quake 2 (and while Samuel II and Ezra have L2 caches of similar size to Nehemiah, given the cache performance benchmarks, that cache is very slow, perhaps even clocked at half-speed like the FPU supposedly is on those models) and the gap between the L2-cache-less Samuel core and the others further demonstrates the L2 cache affinity for Quake 2.

Performance among Socket 5/7 CPUs alone seems to be roughly comparable proportionally for both Quake 1 and 2, but Quake 1 just seems to have some odd affinity for Socket 7 in general. (at least among the motherboards tested) The 6x86/MII seems to lose less ground than the others in Quake 2, and there might be a few other shifts like that (don't see any at a glance though), but overall relative performance is still much closer than with non-S5/7/SS7 chips compared. (and no GL Quake to compare with Quake 2's OpenGL performance)

Hmm, though come to think of it, offering an integer based geometry pipeline patch for GL Quake (or Quake MiniGL), or even an MMX geometry pack might have been really helpful for accelerating non-Pentium CPUs at the time. Though given the timing, MMX would've worked better for Quake 2 unless GL Quake and Quake 2's OpenGL engine were similar enough to carry a patch over to both without too much effort. A plain integer math pack might've been less useful given the less dramatic gains, depending on the CPU. (K5 would probably see the biggest improvement) Working entirely in 16-bit fixed-point math might also cache a bit better and speed things up that way.

I haven't seen mention of any game doing that without making the entire breadth/standard engine integer-oriented in general, and I don't think I've heard of any using MMX based geometry engines. (even though fixed-point DSPs or vector processors are what game consoles were using at the time -probably because the P55C's MMX performance -and basic integer math performance- were too weak to be compelling alternatives to its FPU) And yeah, 16-bit resolution introduces more errors and potential artifacts than 32-bit fixed or floating point implementations, but for most stuff at the time it was good enough and worth the performance gain ... outside of the Pentium at least.
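To make the 16-bit fixed-point idea concrete, here is a minimal sketch of a vertex transform done entirely in integer math. The 8.8 format, the function names and the 3x3 matrix layout are all assumptions picked for illustration, not anything taken from Quake or an actual MMX geometry pack. Products are accumulated at 32-bit width before shifting back down, which is similar in spirit to how MMX's PMADDWD sums pairs of 16-bit products.

```python
# Hypothetical sketch: 3x3 matrix * vertex in signed 8.8 fixed point.
FRAC_BITS = 8            # assumed format: 8 integer bits, 8 fraction bits
ONE = 1 << FRAC_BITS     # 1.0 in 8.8 fixed point


def to_fixed(x):
    """Convert a float to 8.8 fixed point, saturating to the signed 16-bit range."""
    v = int(round(x * ONE))
    return max(-32768, min(32767, v))


def to_float(v):
    """Convert an 8.8 fixed-point value back to a float."""
    return v / ONE


def transform(matrix, vertex):
    """Multiply a 3x3 fixed-point matrix by a 3-component fixed-point vertex."""
    out = []
    for row in matrix:
        acc = 0
        for m, v in zip(row, vertex):
            acc += m * v              # 16x16 products summed at wider precision
        out.append(acc >> FRAC_BITS)  # shift back down to 8.8
    return out


# 90-degree rotation about the Z axis applied to (1.5, 2.0, 0.0)
rot_z = [[to_fixed(0), to_fixed(-1), to_fixed(0)],
         [to_fixed(1), to_fixed(0), to_fixed(0)],
         [to_fixed(0), to_fixed(0), to_fixed(1)]]
vert = [to_fixed(1.5), to_fixed(2.0), to_fixed(0.0)]
rotated = [to_float(c) for c in transform(rot_z, vert)]
```

The catch is visible in the format itself: 8.8 only covers roughly -128 to +128 in steps of 1/256, so world coordinates would need rescaling, and that is where the extra error and artifacts mentioned above come from.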
kool kitty89
Member
 
Posts: 272
Joined: 2012-2-15 @ 08:43
Location: San Jose, CA

Re: The Ultimate 686 Benchmark Comparison

Postby kool kitty89 » 2016-3-02 @ 01:53

falloutboy wrote:I have seen Quake 2 benchmarks with the Athlon cpu losing performance when 3DNow! is enabled (not sure what GPU was used).
http://web.archive.org/web/200102030623 ... ivers.html

Forgot to comment on this before, but the inclusion of GLQuake tests there seems to imply that engine has similar CPU affinity to Quake 2 OpenGL (the Athlon wasn't tested, but given the P6 performance, I'd be surprised if it didn't scale similarly to Quake 2). So the odd performance affinity is probably limited to Quake 1's software renderer alone.

Re: The Ultimate 686 Benchmark Comparison

Postby kool kitty89 » 2016-3-08 @ 17:53

As I got into over at the Cyrix Appreciation Thread: viewtopic.php?f=25&t=31111&p=483045#p483045

This article: http://www.azillionmonkeys.com/qed/cpuwar.html

points out that the CXT revision K6-2 added 0-cycle (superscalar) Fxch execution like the P5 and P6, so this would probably be the main gain in FPU performance over the K6 in the benchmarks. (the CXT also fixes the earlier model's lack of pipelined stores, addressing some of the poor memory bandwidth seen in earlier models) On the whole, though, the K6 seems more optimized for low latency than for high bandwidth/throughput compared to the P6, which might be part of the reason it performs relatively well despite slow memory bandwidth/throughput scores (and SS7 chipset slowness compared to S370/Slot1), a trend that widened with the Netburst architecture compared to K7. (and the K7 was more bandwidth-friendly/intensive than the K6, but vastly more latency-optimized over bandwidth than Netburst)

It's not just a matter of the short pipeline on the K6, but general fast operation.

There are also quite a few areas where the 6x86 and K6 are faster, more advanced, more efficient and just better designed parts on paper, but that failed to result in real-world gains when fed code from P5 or P6-optimized compilers. (from FPU scheduling to complete omission of LOOP instruction use in Intel compilers, it's a serious problem) Then again, Intel continues that trend to this day, intentionally designing compilers that not only favor their own processors, but intentionally cripple competing ones or even disable functionality. (which is technically legal, unless you fail to make developers aware you're doing such -as happened with several lawsuits some years back regarding multimedia extensions being disabled on non-Intel CPUs, I think some of the vector processing instructions added in the Phenom)


AMD and Cyrix could/should have promoted their own optimized compilers to compete with this (which would be fairly quick/painless on the developer end to re-compile and offer CPU-specific operating modes for various drivers -mostly on the OS end I'd think, outside proprietary multimedia or video editing programs -and games). Optimized FPU scheduling may very well be why later revisions of Quake II's miniGL performs so much better on the K6-2 even with 3DNow! disabled. (that and possibly better revision of integer operations to favor the K6-2 as well)


These are also the sort of things that, had said processors been used on more closed-box devices (like consoles or non-Windows/DOS home computers -Macintosh/Amiga/Atari ST style/etc) such issues likely wouldn't have materialized as all programs would be oriented towards the single architecture.


Additionally, the 6x86 and K6 had much better legacy support for 16-bit code than the P5 or P6 (obviously more so the PPro but even PII with its enhanced 16-bit code operation) and particularly dramatically so for code not specifically made with Intel compilers or hand-coded using P5 or P6 scheduling rules. (or in short, Cyrix and AMD made better 386/486 code accelerators than Intel did)
For that matter, the 6x86's balanced ALU and FPU execution performance is much better at accelerating code optimized for 486 performance than the P5 is. (as in the 6x86's superscalar integer execution increased roughly proportionally to the FPU execution but the P5 improved FPU performance vastly disproportionately almost to the point of 2:1 disparity -more than that on paper, but roughly so in real world operation)

It did make the 5x86 and Media GX's operation more balanced by comparison. (would've been interesting if they'd made a low-cost gaming/multimedia-oriented companion to the MII out of the MediaGX's core -cut out the DRAM controller and VDC and mate it with the 6x86MX's big cache and S7 FSB and it might have made for a good Winchip-sized core with better ALU and vastly better FPU performance -or ... more like an earlier Winchip2 without 3DNow! and with higher max clock speeds)


I did mistakenly assume the '33 MHz' FSB was a huge bottleneck on the MediaGX, but it's rather misleading given that 'bus' is more like the PCI/DMA/external I/O interface and NOT a memory interface. The memory latency and throughput figures are rather good compared to S7/SS7 6x86/MII performance or several other CPUs, and the performance scales up really well at higher CPU clocks for the Media GX. (the onboard memory controller seems to do rather well on the whole) As such, I'd assume the poor performance scaling at higher clock speeds is due to the small (12kB) L1 cache and lack of L2. Addition of an L2 cache controller and optional board-level cache probably could've pushed it more into S7 or PII/Celeron level performance and made integrated AT/ATX implementations of the Media GX more competitive in the mainstream. (obviously the bottom-end set-top boxes wouldn't use that cache, but lower-end mainstream it would've been necessary -that or expanding that L1 to 64 kB when they moved to 250 nm)

Given the memory controller performance and decent integrated video, it seems even more like Cyrix missed the boat going with a system-on-a-chip rather than an integrated chipset design (might have made serious competition for SiS, especially with their relatively modest S7 memory performance) and have CPU+motherboard combinations of various sorts, possibly some surface-mounted. (and have a standalone S7 MediaGX CPU alongside the 6x86 -and have both matched very well to Cyrix's own chipset ... maybe even beating VIA's performance) An in-house chipset certainly would've given more flexibility for oddball FSB speeds too rather than coordinating with chipset and motherboard manufacturers.

For that matter, given IBM had continued to manufacture the old 5x86C into the late 90s, having a low-end embedded 32-bit/486 bus chipset would've made sense too. (not sure if IBM ever die-shrunk the 5x86 or kept it running on the old .65 micron process that whole time ... given the large die-size and relatively low cost of a straight -non optimized- die-shrink, and ability to safely run 350 nm parts at 3.3-3.6V, I could see them more likely spinning off late models to that process rather than wasting the silicon on the old fab -of course, that'd be at the point when .350 was aging a bit and .250 micron was mainstream, around 1998)





Anyway, on the Athlon again, I'm still a bit baffled by its quake software performance and some of the other benchmarks, including the Sandra ones. They don't match up well with the period benchmarks/reviews here:
http://www.pcstats.com/articleview.cfm? ... 441&page=2
http://www.xbitlabs.com/articles/cpu/di ... thlon.html
(granted the former is a Duron 700, but should still be in the ballpark and not account for the vastly poorer Athlon600 scores in Sandra)

That xbitlabs review has some neat details on 3DNow! performance of the Athlon, though, both standard (K6-2 compatible) 3DNow! and the Enhanced extensions the K7 added. Vanilla FPU usage is definitely far slower on the Athlon. (and with the Enhanced 3DNow! enabled, it's nearly double the CPU 3DMarks of the raw FPU -and superior to PIII SSE performance at the same clock speed, faster than a 650 MHz PIII for that matter even with the slower standard 3DNow! set)


Edit:
I wonder if Quake at 320x200 would shed any more light on this ... probably not for the Athlon, but perhaps for the P5 vs everything else. (on paper, the only thing consistently faster on the P5 family than P6 on the FPU is Fmul, which should have a bigger impact at low res than high res -given Quake's perspective correction is Fdiv-bound and more CPU intensive at higher resolutions vs low res where Fmul is more significant -for vertex computation; which should also show a bigger dive on Cyrix CPUs as their Fdiv is fast but Fmul -and add and sub and xch- is slow -probably would've favored the K5 a lot more too, given its Fmul is fast and Fdiv is very slow compared to all the others; the Media GX and 486/5x86 probably would've shown better at low-res too given the 32-bit bus is less of a bottleneck)
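A rough way to see why the divide load tracks resolution while the vertex load does not: the sketch below counts perspective divides per frame, assuming one divide per 16-pixel span and no overdraw. The span length and the function name are assumptions made for the estimate, not measured values.

```python
SPAN_PIXELS = 16  # assumed: one perspective divide per 16-pixel span


def divides_per_frame(width, height, span_pixels=SPAN_PIXELS):
    """Perspective divides needed to cover every pixel once (overdraw ignored)."""
    spans_per_row = -(-width // span_pixels)  # ceiling division
    return spans_per_row * height


low = divides_per_frame(320, 200)
high = divides_per_frame(640, 480)
ratio = high / low
print(f"320x200: {low} divides, 640x480: {high} divides, ratio {ratio:.1f}x")
```

The same timedemo pushes the same number of vertices at either resolution, so the multiply-heavy vertex work stays fixed while the divide count scales with pixel count. Under these assumptions a 320x200 run should shift the balance toward Fmul, which is the effect speculated about above.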

It also would've been a better 1:1 comparison for Doom.

Unreal's software renderer would've been neat too, but that's probably more worth including in a different benchmark compilation. (one of the best examples of period MMX performance -I don't think the software renderer uses 3DNow! ... but it might; the K6-2's strong MMX performance should come into play though)

Re: The Ultimate 686 Benchmark Comparison

Postby kool kitty89 » 2016-3-10 @ 15:55

Just stumbled on this old Tom's Hardware article on the M2 and noticed the Quake (and overall DOS game performance) figures were far better for the PPro and PII compared to the P55C than the 686 benchmark results manage.

http://www.tomshardware.com/reviews/return-jedi,26.html

It's running 640x480, so same resolution, though it's using Timedemo2 and tests both DOS Quake 1.06 and WinQuake 1.09 (the results seem very similar though). It's not just the P6 performance either, but every single CPU on their list other than the Pentium MMX does significantly better than the results in the 686 benchmarks. (the MMX-200 manages 15.9 FPS compared to 16.1 in the 686 tests, so slightly slower but a small difference compared to all the others and within a reasonable margin of error for the P55C, particularly given the different Timedemo being run)

Also neat that they listed a motherboard performance scale (at least for the WinNT Benchmark with the M2 at 2.5x75 MHz). Though they don't compare the PA-2005 in that list that Red Hill favored for Cyrix parts (especially at 75 MHz -though more for stability than speed) and I don't think the FIC 502 was even available at the time of that review.


Edit:
The 3DBench scores are also vastly better for the PPro and P2 than the 686 test results.

Re: The Ultimate 686 Benchmark Comparison

Postby luckybob » 2016-5-15 @ 01:22

I may have just stumbled on something here. I don't know if it will invalidate my Pentium pro overdrive results, but I would be negligent to not bring it up.

Quake 1 LOVES fast ram. With my project of getting my P65UP8 board going, I ran quake in dos once more. ( viewtopic.php?f=25&t=48043 ) Mostly to see if I could repeat the results I got. And I did, and I got even better scores, but I found out why. Originally I used a P65UP5, and on that board I had 8x 64mb simms from IBM. Long story short, they are 45ns 3.3v edo simms. These simms don't work in the UP8, so I got bog-standard 60ns modules. With these modules and literally default bios settings I get ~16fps. However if I go into the bios settings and torque the ram timings to the lowest possible, the fps jumps suddenly to 29.8fps. For the record, the 128mb modules I have are marked 60ns, but the datasheet tells me this is the slowest speed grade and they came in modules capable of 40ns. I did not feel it would be unrealistic to expect these 60ns modules to be capable of tight timings. Also this was done with dos 6.22 and the included memory manager, nothing else loaded.

Also, this is with the shit-tastic S3 trio onboard video. Switching to the matrox 200 made NO difference in FPS. Picture quality was glorious, but identical fps.

[image of screen]
It is a mistake to think you can solve any major problems just with potatoes.
luckybob
l33t
 
Posts: 2809
Joined: 2009-4-30 @ 04:43

Re: The Ultimate 686 Benchmark Comparison

Postby clueless1 » 2016-5-15 @ 03:12

That's insane to get nearly double the framerate with just memory timings. Do you notice any similar performance improvements in other benchmarks?
The more I learn, the more I realize how much I don't know.
Let's benchmark our systems with cache disabled
DOS PCI Graphics Card Benchmarks
clueless1
l33t
 
Posts: 3271
Joined: 2015-12-22 @ 17:43
Location: Midwest US

Re: The Ultimate 686 Benchmark Comparison

Postby luckybob » 2016-5-15 @ 08:20

All other benchmarks run a little faster than a standard p2-333. Except for ones that really test the memory. Those got massive gains. The benchmarks from aida64 for memory tasks went up considerably. I didn't have the foresight to record everything, but it would be a simple task to reset the bios, and do that.

Re: The Ultimate 686 Benchmark Comparison

Postby feipoa » 2016-5-15 @ 10:35

I didn't even know 45 ns EDO SIMMs existed. Were they hard to find? Any idea why they don't work on the P65UP8?

I would be curious to see what cachechk says about your L1/L2/RAM Read/RAM Write speeds with the different CMOS RAM settings you were playing with, that is, the settings which yielded 16 fps and then 30 fps.

cachechk -d -t6
cachechk -d -w -t6

I would have never expected a 2-fold change in benchmark results. It makes me wonder if the cache speed is also being affected by your choice of RAM timings and if cache is working properly.

Your 30 fps result achieved recently is greater than what you supplied for the 686 benchmarks (27.3 fps). Are the settings identical?
feipoa
l33t
 
Posts: 4290
Joined: 2011-3-07 @ 13:54
Location: Canada

Re: The Ultimate 686 Benchmark Comparison

Postby feipoa » 2016-5-15 @ 11:04

I ran a few tests on my IBM 5x86-133/2x system in DOS Quake.

320x200

RAM Read Wait State = 1ws (fastest stable)
19.8 fps

RAM Read Wait State = 3ws (Slowest possible)
19.3 fps


I didn't even get close to a 2-fold change in the frame rate. Even if I turn off L2 cache and use 1 ws, I still get 18.6 fps.

Re: The Ultimate 686 Benchmark Comparison

Postby mrau » 2016-5-15 @ 12:16

feipoa wrote:I didn't even get close to a 2-fold change in the frame rate.

quite frankly, this is not even intel pentium, your cpu is probably too slow for this to make a visible difference; i bet a ppro in this scenario is just waiting for the main memory most of the time, however i do not understand why, since it has a gigantastic cache;
mrau
Oldbie
 
Posts: 810
Joined: 2015-11-28 @ 12:43

Re: The Ultimate 686 Benchmark Comparison

Postby luckybob » 2016-5-15 @ 21:21

so, in the name of science, I decided to take a closer look at the bios settings. For the record the 128mb simms are KM44C16104BS-6 (16 chips) and the 64mb simms are IBM FRU 42L0225 with KM44V16104BK-4 (8 chips). The only discernible difference is the 64mb ones are 3.3V. I had NOT realized they were 45ns 3.3v chips until I got the UP8 board, so I've been running them at 5v. Something I plan on not doing anymore.

bios defaults: 6.8 fps

changes:
ide hdd block mode sectors > enabled
mps 1.4 support > enabled
memory auto config > enabled 60ns
cpu-to-ide posting > enabled
uswc write posting > enabled
cpu-to pci write post > enabled
pci-to-dram pipeline > enabled
pci burst write combining > enabled
read-around-write > enabled
onboard serial and parallel disabled

new fps: 15.6

changes:
pci vga palette snoop > enabled

new fps: 15.5

changes:
video memory cache mode: UC > USWC

new fps: 15.5

changes:
16-bit i/o recovery time > 1 busclk (from 4)
8-bit > 1 (from 8)

new fps: 15.6

changes:
dram auto > disabled
dram read 2/3/4 > 2/2/3
dram write 3/3/3 > 2/2/3
ras precharge 3t > 3t
ras to cas delay 1t > 0t
ma wait state 1 w/s > 0 w/s

new fps: 16.2

changes:
disable onboard S3 video, install Matrox g200

new fps: 29.1

changes:
revert to bios defaults AGAIN.

new fps: 5.1
here on out I made the same changes in the same order
1: 6.2 fps
2: 6.5 fps
3: 26.1 fps AH-HA!
4: 26.1 fps
5: 29.1 fps

Ok so the memory timing made a ~10% difference, but the real star of the show is USWC mode. Thanks to a bit of google:
"This is yet another BIOS feature with a misleading name. It does not cache the video memory or even graphics data (such data is uncacheable anyway). This BIOS feature allows you to control the USWC (Uncached Speculative Write Combining) write combine buffers.

When set to USWC, the write combine buffers will accumulate and combine partial or smaller graphics writes from the processor and write them to the graphics card as burst writes. When set to UC, the write combine buffers will be disabled. All graphics writes from the processor will be written to the graphics card directly.

It is highly recommended that you set this feature to USWC for improved graphics and processor performance. However, if you are using an older graphics card, it may not be compatible with this feature. Enabling this feature with such graphics cards will cause a host of problems like graphics artifacts, system crashes and even the inability to boot up properly. If you face such problems, you should set this BIOS feature to UC immediately."
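As a toy illustration of the buffering the quoted text describes, the sketch below counts bus write transactions for one scanline with and without combining. It models a single 32-byte combine buffer and flushes only when a store lands in a different line. The buffer size, the single-buffer model and the function names are simplifying assumptions. Real P6 parts have several buffers and more flush conditions.

```python
LINE_BYTES = 32  # assumed combine-buffer size, used here only for illustration


def bus_writes_uncached(addresses):
    """UC: every CPU store is sent to the graphics card as its own write."""
    return len(addresses)


def bus_writes_combined(addresses, line_bytes=LINE_BYTES):
    """USWC model: stores to the same aligned line merge into one burst write."""
    bursts = 0
    current_line = None
    for addr in addresses:
        line = addr // line_bytes
        if line != current_line:
            if current_line is not None:
                bursts += 1   # flush the previously filled buffer as one burst
            current_line = line
    if current_line is not None:
        bursts += 1           # flush whatever is still buffered at the end
    return bursts


# One 640-pixel scanline of 8-bit pixels, written with sequential 32-bit stores
scanline = list(range(0, 640, 4))
uc_writes = bus_writes_uncached(scanline)
wc_writes = bus_writes_combined(scanline)
print(f"UC: {uc_writes} writes, USWC model: {wc_writes} bursts")
```

A burst carries more data than a single write, so the transaction count is not a direct speed ratio, but it shows why a software renderer that streams whole frames to video memory benefits so much.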


Obviously the Matrox 200 is compatible with USWC where my onboard S3 virge is not.
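For a per-step view of the second run, the figures posted above can be turned into relative gains. The step labels are a reading of "same changes in the same order" and the helper name is made up for this sketch. The fps values are the ones listed in the post.

```python
# fps figures from the second run (Matrox installed throughout)
second_run = [
    ("bios defaults", 5.1),
    ("1: first batch of changes", 6.2),
    ("2: pci vga palette snoop", 6.5),
    ("3: video memory cache mode UC > USWC", 26.1),
    ("4: i/o recovery time", 26.1),
    ("5: dram timings", 29.1),
]


def step_gains(run):
    """Return (label, percent change versus the previous step) for each step."""
    gains = []
    for (_, before), (label, after) in zip(run, run[1:]):
        gains.append((label, (after / before - 1.0) * 100.0))
    return gains


gains = step_gains(second_run)
for label, pct in gains:
    print(f"{label}: {pct:+.1f}%")
```

On these numbers the USWC step takes the frame rate to about four times the previous step, while the DRAM timing step adds roughly 11 to 12 percent, in line with the ~10% estimate above.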

Re: The Ultimate 686 Benchmark Comparison

Postby clueless1 » 2016-5-15 @ 21:29

So USWC is disabled by default on the motherboard. Is there an Optimized Settings mode in the BIOS that turns it on, or are you left to figure it out on your own to gain such astounding benefits?

Re: The Ultimate 686 Benchmark Comparison

Postby mrau » 2016-5-15 @ 21:50

i have never seen this setting in any of my biosen, can it be done if bios does not offer the setting? is it the same thing those nifty mtrr/svga tools trigger when accelerating LFB access?

Re: The Ultimate 686 Benchmark Comparison

Postby luckybob » 2016-5-15 @ 21:52

Like everything else, it's up to you to figure it out yourself. It is disabled by default in the bios. Online I see claims that NT4 doesn't play well with USWC (or anything else for that matter).

@mrau
I don't know about your bios. I never really pay much attention to the settings. I have a habit of just turning everything on, and going for it. Supposedly it's only useful for software rendering. So it became rather useless as time went on.

Re: The Ultimate 686 Benchmark Comparison

Postby feipoa » 2016-5-15 @ 22:27

So USWC gives you the 100% performance boost? I don't think I've ever seen USWC on a motherboard as old as a socket 8. Was this common on socket 7/8 boards and my supply set is just too small?

I have always left USWC disabled because of internet comments of instability, and possible (marginal) slow downs in Windows under certain conditions. Now I am considering turning it on again.

What else I found very interesting is that your PPro-PIIOD-333 scores 5.1-6.8 fps with BIOS defaults enabled. My IBM 5x86-133, with the same graphics card, scores 6.2 fps at 640x480.

Re: The Ultimate 686 Benchmark Comparison

Postby mrau » 2016-5-16 @ 03:51

@feipoa yes, slowdowns are reported, even massive ones; this trigger allegedly was introduced with ppro, so im not sure a socket7 would have that
i would love to know if this works by default in mobos where its not a trigger in the setup program

Re: The Ultimate 686 Benchmark Comparison

Postby noop » 2016-7-25 @ 15:43

kool kitty89 wrote:don't think I've heard of any using MMX based geometry engines

I had a working one for DirectX6, in early 2000s, but never did anything with it :( Very basic - transformations and normal-based diffuse lighting. It performed rather well. And at that time it was hardly useful for anything even if it actually had some advantage over SSE math (but my particular implementation used a bit of SSE as well) Motivation was the absence of T&L support in my videocard (Kyro 2)
noop
Newbie
 
Posts: 28
Joined: 2015-7-20 @ 15:42
Location: Minsk, Belarus

Re: The Ultimate 686 Benchmark Comparison

Postby alvaro84 » 2017-6-11 @ 05:39

feipoa wrote:So USWC gives you the 100% performance boost? I don't think I've ever seen USWC on a motherboard as old as a socket 8. Was this common on socket 7/8 boards and my supply set is just too small?

I have always left USWC disabled because of internet comments of instability, and possible (marginal) slow downs in Windows under certain conditions. Now I am considering turning it on again.


Sorry for the following thread necromancy.

USWC is present in my Asus P3B-F's BIOS too, but even if it weren't I could still achieve the same (or even bigger, I don't know why!) speedup with the Fastvid utility. It can do the trick on most (every?) P6-based system. It even worked on my ISA P4 board, I guess the Northwood has the same memory type range registers as the PPro/2/3.
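For anyone curious what "setting a memory type range register" amounts to, here is a sketch that only computes the two values of a variable-range MTRR pair marking a region as write-combining. It follows the PHYSBASE/PHYSMASK layout in Intel's documentation (memory type in the low byte of the base register, valid flag in bit 11 of the mask register, 36-bit physical addresses on the P6 family). The frame buffer address and size are hypothetical, and nothing here writes an MSR. How Fastvid itself goes about it is not something this sketch claims to reproduce.

```python
MTRR_TYPE_UC = 0x00       # uncacheable
MTRR_TYPE_WC = 0x01       # write-combining (what the BIOS option calls USWC)
MTRR_VALID = 1 << 11      # valid flag in the PHYSMASK register
PHYS_ADDR_BITS = 36       # physical address width on the P6 family


def mtrr_pair(base, size, mem_type=MTRR_TYPE_WC):
    """Compute (PHYSBASE, PHYSMASK) values for one variable-range MTRR."""
    if size < 4096 or size & (size - 1):
        raise ValueError("size must be a power of two of at least 4 KiB")
    if base % size:
        raise ValueError("base must be aligned to the region size")
    addr_mask = ((1 << PHYS_ADDR_BITS) - 1) & ~0xFFF
    physbase = (base & addr_mask) | mem_type
    physmask = (~(size - 1) & addr_mask) | MTRR_VALID
    return physbase, physmask


# Hypothetical 4 MiB linear frame buffer at 0xE0000000
physbase, physmask = mtrr_pair(0xE0000000, 4 * 1024 * 1024)
print(f"PHYSBASE = {physbase:#x}, PHYSMASK = {physmask:#x}")
```

The power-of-two size and alignment checks reflect how the mask works: an address falls in the range when its masked bits match the masked base.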

Video access speed greatly affects gaming benchmarks, and its effect grows bigger as frame rates increase. It can easily double a several-hundred-fps 3DBench result. On a strong P3 system it sends it through the roof and makes it guesswork, because 3DBench can't properly display frame rates over 999.9 :lol:
Shame on us, doomed from the start
May God have mercy on our dirty little hearts
alvaro84
Newbie
 
Posts: 72
Joined: 2007-5-24 @ 05:04
Location: Fehérvárcsurgó, Hungary
