VOGONS


The Ultimate 686 Benchmark Comparison


Reply 100 of 145, by kool kitty89

Rank Member

Further clarification on the Quake 1 vs 2 performance oddities:

P6 CPUs (other than the PII overdrive) and all 3 versions of the VIA C3 core run Quake 2 significantly faster than the same processors do Quake 1 (in 640x480 software mode), while P5 and all other Socket 5/7 processors perform significantly worse in Quake 2 than 1.

The gap between K6 and K6-2, as well as K6-2 and K6-3 (and 2+), also widens with Quake 2 compared to 1. (and the big-cache Xeons and PPro have more consistent advantages over their small-cache and half-speed-cache counterparts too - the Celeron vs Pentium II gap also narrows a lot more than in Quake 1) In this respect at least, it seems like Quake 2 favors fast L2 caches (even small ones) more than Quake did. The gap is also wider between Nehemiah and both Samuel and Ezra cores for Quake 2 (and while Samuel II and Ezra have L2 caches of similar size to Nehemiah, given the cache performance benchmarks, theirs are very slow, perhaps even clocked at half speed like the FPU supposedly is on those models), and the gap between the L2-cache-less Samuel core and the others further demonstrates Quake 2's affinity for L2 cache.

Performance among Socket 5/7 CPUs alone seems roughly proportionally comparable for both Quake 1 and 2, but Quake 1 just seems to have some odd affinity for Socket 7 in general. (at least among the motherboards tested) The 6x86/MII seems to lose less ground than the others in Quake 2, and there might be a few other shifts like that (I don't see any at a glance though), but overall relative performance is still much closer than when comparing non-S5/7/SS7 chips. (and there's no GLQuake to compare with Quake 2's OpenGL performance)

Hmm, though come to think of it, offering an integer-based geometry pipeline patch for GLQuake (or Quake MiniGL), or even an MMX geometry pack, might have been really helpful for accelerating non-Pentium CPUs at the time. Though given the timing, MMX would've worked better for Quake 2, unless GLQuake and Quake 2's OpenGL engine were similar enough to carry a patch over to both without too much effort. A plain integer math pack might've been less useful given the less dramatic gains, depending on the CPU. (the K5 would probably see the biggest improvement) Working entirely in 16-bit fixed-point math might also cache a bit better and speed things up that way.

I haven't seen mention of any game doing that without making the entire breadth/standard engine integer-oriented in general, and I don't think I've heard of any using MMX based geometry engines. (even though fixed-point DSPs or vector processors are what game consoles were using at the time -probably because the P55C's MMX performance -and basic integer math performance- were too weak to be compelling alternatives to its FPU) And yeah, 16-bit resolution introduces more errors and potential artifacts than 32-bit fixed or floating point implementations, but for most stuff at the time it was good enough and worth the performance gain ... outside of the pentium at least.
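As a rough illustration of that 16-bit vs 32-bit fixed-point trade-off (a made-up example of mine, not anything taken from Quake or a real geometry engine), here's how 8.8 and 16.16 fixed-point multiplies compare against a floating-point reference:

```python
# Toy comparison of fixed-point precision: 8.8 (16-bit) vs 16.16 (32-bit).
# Purely illustrative - not taken from Quake or any real geometry engine.

def to_fix(x, frac_bits):
    """Convert a float to fixed point with `frac_bits` fractional bits."""
    return int(round(x * (1 << frac_bits)))

def fix_mul(a, b, frac_bits):
    """Fixed-point multiply: the raw product has 2*frac_bits, so shift back down."""
    return (a * b) >> frac_bits

def to_float(x, frac_bits):
    return x / (1 << frac_bits)

x, y = 3.14159, 2.71828
exact = x * y

# 8.8: fits in 16-bit registers, coarse resolution (steps of 1/256)
err_16bit = abs(exact - to_float(fix_mul(to_fix(x, 8), to_fix(y, 8), 8), 8))
# 16.16: needs 32-bit math, much finer resolution (steps of 1/65536)
err_32bit = abs(exact - to_float(fix_mul(to_fix(x, 16), to_fix(y, 16), 16), 16))

print(err_16bit, err_32bit)  # the 16-bit error is noticeably larger
```

For per-pixel texture or vertex work of the era, errors on the order of the 8.8 result were often below what you'd notice at 320x200, which is why "good enough" 16-bit math could be worth the speed on ALU-strong chips.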

Reply 101 of 145, by kool kitty89

Rank Member
falloutboy wrote:

I have seen Quake 2 benchmarks with the Athlon CPU losing performance when 3DNow! is enabled (not sure what GPU was used).
http://web.archive.org/web/200102030623/http: … 3d/drivers.html

Forgot to comment on this before, but the inclusion of GLQuake tests there seems to imply that engine has similar CPU affinity to Quake 2 OpenGL (the Athlon wasn't tested, but given the P6 performance, I'd be surprised if it didn't scale similarly to Quake 2). So the odd performance affinity is probably limited to Quake 1's software renderer alone.

Reply 102 of 145, by kool kitty89

Rank Member

As I got into over at the Cyrix Appreciation Thread: Re: Cyrix appreciation thread

This article: http://www.azillionmonkeys.com/qed/cpuwar.html

Points out that the CXT revision of the K6-2 added 0-cycle (superscalar) FXCH execution like the P5 and P6, so this would probably be the main gain in FPU performance over the K6 in the benchmarks. (the CXT also fixes the earlier model's lack of pipelined stores, addressing some of the poor memory bandwidth seen in earlier models) On the whole, though, the K6 seems more optimized for low latency than for high bandwidth/throughput compared to the P6, which might be part of the reason it performs relatively well despite slow memory bandwidth/throughput scores (and SS7 chipset slowness compared to S370/Slot 1), a trend that widened with the NetBurst architecture compared to the K7. (and the K7 was more bandwidth-friendly/intensive than the K6, but vastly more latency-optimized over bandwidth than NetBurst)

It's not just a matter of the short pipeline on the K6, but general fast operation.

There are also quite a few areas where the 6x86 and K6 are faster, more advanced, more efficient, and just better-designed parts on paper, but where that failed to result in real-world gains when fed with P5- or P6-optimized compilers. (from FPU scheduling to complete omission of LOOP instruction use in Intel compilers, it's a serious problem) Then again, Intel continues that trend to this day, intentionally designing compilers that not only favor their own processors, but intentionally cripple competing ones or even disable functionality. (which is technically legal, unless you fail to make developers aware you're doing such - as happened with several lawsuits some years back regarding multimedia extensions being disabled on non-Intel CPUs, I think some of the vector processing instructions added in the Phenom)

AMD and Cyrix could/should have promoted their own optimized compilers to compete with this (recompiling would be fairly quick/painless on the developer end, offering CPU-specific operating modes for various drivers - mostly on the OS end I'd think, outside proprietary multimedia or video editing programs - and games). Optimized FPU scheduling may very well be why later revisions of Quake II's MiniGL perform so much better on the K6-2 even with 3DNow! disabled. (that and possibly a revision of the integer operations to better favor the K6-2 as well)

These are also the sort of things that, had said processors been used on more closed-box devices (like consoles or non-Windows/DOS home computers -Macintosh/Amiga/Atari ST style/etc) such issues likely wouldn't have materialized as all programs would be oriented towards the single architecture.

Additionally, the 6x86 and K6 had much better legacy support for 16-bit code than the P5 or P6 (obviously more so the PPro but even PII with its enhanced 16-bit code operation) and particularly dramatically so for code not specifically made with Intel compilers or hand-coded using P5 or P6 scheduling rules. (or in short, Cyrix and AMD made better 386/486 code accelerators than Intel did)
For that matter, the 6x86's balanced ALU and FPU execution performance is much better at accelerating code optimized for 486 performance than the P5 is. (as in the 6x86's superscalar integer execution increased roughly proportionally to the FPU execution but the P5 improved FPU performance vastly disproportionately almost to the point of 2:1 disparity -more than that on paper, but roughly so in real world operation)

It did make the 5x86 and Media GX's operation more balanced by comparison. (would've been interesting if they'd made a low-cost gaming/multimedia-oriented companion to the MII out of the MediaGX's core -cut out the DRAM controller and VDC and mate it with the 6x86MX's big cache and S7 FSB and it might have made for a good Winchip-sized core with better ALU and vastly better FPU performance -or ... more like an earlier Winchip2 without 3DNow! and with higher max clock speeds)

I did mistakenly assume the '33 MHz' FSB was a huge bottleneck on the MediaGX, but that's rather misleading given that 'bus' is more like the PCI/DMA/external I/O interface and NOT a memory interface. The memory latency and throughput figures are rather good compared to S7/SS7 6x86/MII performance or several other CPUs, and the performance scales up really well at higher CPU clocks for the MediaGX. (the onboard memory controller seems to do rather well on the whole) As such, I'd assume the poor performance scaling at higher clock speeds is due to the small (12 kB) L1 cache and lack of L2. Addition of an L2 cache controller and optional board-level cache probably could've pushed it more into S7 or PII/Celeron level performance and made integrated AT/ATX implementations of the MediaGX more competitive in the mainstream. (obviously the bottom-end set-top boxes wouldn't use that cache, but for the lower-end mainstream it would've been necessary - that or expanding that L1 to 64 kB when they moved to 250 nm)

Given the memory controller performance and decent integrated video, it seems even more like Cyrix missed the boat going with a system-on-a-chip rather than an integrated chipset design (might have made serious competition for SiS, especially with their relatively modest S7 memory performance) and have CPU+motherboard combinations of various sorts, possibly some surface-mounted. (and have a standalone S7 MediaGX CPU alongside the 6x86 -and have both matched very well to Cyrix's own chipset ... maybe even beating VIA's performance) An in-house chipset certainly would've given more flexibility for oddball FSB speeds too rather than coordinating with chipset and motherboard manufacturers.

For that matter, given IBM had continued to manufacture the old 5x86C into the late 90s, having a low-end embedded 32-bit/486 bus chipset would've made sense too. (not sure if IBM ever die-shrunk the 5x86 or kept it running on the old .65 micron process that whole time ... given the large die-size and relatively low cost of a straight -non optimized- die-shrink, and ability to safely run 350 nm parts at 3.3-3.6V, I could see them more likely spinning off late models to that process rather than wasting the silicon on the old fab -of course, that'd be at the point when .350 was aging a bit and .250 micron was mainstream, around 1998)

Anyway, on the Athlon again, I'm still a bit baffled by its quake software performance and some of the other benchmarks, including the Sandra ones. They don't match up well with the period benchmarks/reviews here:
http://www.pcstats.com/articleview.cfm?articleid=441&page=2
http://www.xbitlabs.com/articles/cpu/display/amd-athlon.html
(granted the former is a Duron 700, but should still be in the ballpark and not account for the vastly poorer Athlon600 scores in Sandra)

That xbitlabs review has some neat details on 3DNow! performance of the Athlon, though, both standard (K6-2 compatible) 3DNow! and the Enhanced extensions the K7 added. Vanilla FPU usage is definitely far slower on the Athlon. (and with Enhanced 3DNow! enabled, it's nearly double the CPU 3DMarks of the raw FPU - and superior to PIII SSE performance at the same clock speed, faster than a 650 MHz PIII for that matter even with the slower standard 3DNow! set)

Edit:
I wonder if Quake at 320x200 would shed any more light on this ... probably not for the Athlon, but perhaps for the P5 vs everything else. (on paper, the only thing consistently faster on the P5 family than P6 on the FPU is Fmul, which should have a bigger impact at low res than high res -given Quake's perspective correction is Fdiv-bound and more CPU intensive at higher resolutions vs low res where Fmul is more significant -for vertex computation; which should also show a bigger dive on Cyrix CPUs as their Fdiv is fast but Fmul -and add and sub and xch- is slow -probably would've favored the K5 a lot more too, given its Fmul is fast and Fdiv is very slow compared to all the others; the Media GX and 486/5x86 probably would've shown better at low-res too given the 32-bit bus is less of a bottleneck)
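For reference, the trick Quake's software renderer is known for (per Michael Abrash's writeups) is amortizing those perspective divides: one true divide per 16-pixel subspan, with cheap linear interpolation in between. A rough sketch of the idea - the setup values and function names here are mine, not from the actual source:

```python
# Sketch of subdivided perspective correction: one expensive divide per
# 16-pixel subspan, cheap linear interpolation for the pixels in between.
# Names and numbers are illustrative, not from the Quake source.

SUBSPAN = 16

def span_u(u_over_z, one_over_z, du, dz, length):
    """Per-pixel u texture coordinate for one horizontal span."""
    coords = []
    x = 0
    u = u_over_z / one_over_z  # one divide to start the span
    while x < length:
        step = min(SUBSPAN, length - x)
        end_u_over_z = u_over_z + du * step
        end_one_over_z = one_over_z + dz * step
        end_u = end_u_over_z / end_one_over_z  # one divide per subspan
        for i in range(step):  # cheap lerp across the subspan
            coords.append(u + (end_u - u) * i / step)
        u, u_over_z, one_over_z = end_u, end_u_over_z, end_one_over_z
        x += step
    return coords

# Compare against an exact per-pixel divide for a gently receding span
approx = span_u(0.0, 1.0, 0.5, 0.001, 32)
exact = [(0.5 * i) / (1.0 + 0.001 * i) for i in range(32)]
print(max(abs(a - e) for a, e in zip(approx, exact)))  # small error
```

On the Pentium specifically the FDIV could execute concurrently with the integer pixel loop, which is part of why the P5's slow-but-overlappable divider mattered less than the raw Fdiv latency would suggest.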

It also would've been a better 1:1 comparison for Doom.

Unreal's software renderer would've been neat too, but that's probably more worth including in a different benchmark compilation. (one of the best examples of period MMX performance -I don't think the software renderer uses 3DNow! ... but it might; the K6-2's strong MMX performance should come into play though)

Reply 103 of 145, by kool kitty89

Rank Member

Just stumbled on this old Tom's Hardware article on the M2 and noticed the Quake (and overall DOS game performance) figures were far better for the PPro and PII compared to the P55C than the 686 benchmark results manage.

http://www.tomshardware.com/reviews/return-jedi,26.html

It's running 640x480, so the same resolution, though it's using Timedemo2 and tests both DOS Quake 1.06 and WinQuake 1.09 (the results seem very similar though). It's not just the P6 performance either: every single CPU on their list other than the Pentium MMX does significantly better than the results in the 686 benchmarks. (the MMX-200 manages 15.9 FPS compared to 16.1 in the 686 tests, so slightly slower, but a small difference compared to all the others and within a reasonable margin of error for the P55C, particularly given the different timedemo being run)

Also neat that they listed a motherboard performance scale (at least for the WinNT Benchmark with the M2 at 2.5x75 MHz). Though they don't compare the PA-2005 in that list that Red Hill favored for Cyrix parts (especially at 75 MHz -though more for stability than speed) and I don't think the FIC 502 was even available at the time of that review.

Edit:
The 3DBench scores are also vastly better for the PPro and P2 than the 686 test results.

Reply 104 of 145, by luckybob

Rank l33t

I may have just stumbled on something here. I don't know if it will invalidate my Pentium pro overdrive results, but I would be negligent to not bring it up.

Quake 1 LOVES fast RAM. With my project of getting my P65UP8 board going, I ran Quake in DOS once more. ( Bitchin' dual P-Pro setup ) Mostly to see if I could repeat the results I got. And I did, and I got even better scores, but I found out why. Originally I used a P65UP5, and in that board I had 8x 64mb SIMMs from IBM. Long story short, they are 45ns 3.3v EDO SIMMs. These SIMMs don't work in the UP8, so I got bog-standard 60ns modules. With these modules and literally default BIOS settings I get ~16fps. However if I go into the BIOS settings and torque the RAM timings to the lowest possible, the fps jumps suddenly to 29.8fps. For the record, the 128mb modules I have are marked 60ns, but the datasheet tells me this is the slowest speed grade and they came in modules capable of 40ns. I did not feel it would be unrealistic to expect these 60ns modules to be capable of tight timings. Also this was done with DOS 6.22 and the included memory manager, nothing else loaded.

Also, this is with the shit-tastic S3 trio onboard video. Switching to the matrox 200 made NO difference in FPS. Picture quality was glorious, but identical fps.

image of screen: lq9l8Pmm.jpg

It is a mistake to think you can solve any major problems just with potatoes.

Reply 105 of 145, by clueless1

Rank l33t

That's insane to get nearly double the framerate with just memory timings. Do you notice any similar performance improvements in other benchmarks?

The more I learn, the more I realize how much I don't know.
OPL3 FM vs. Roland MT-32 vs. General MIDI DOS Game Comparison
Let's benchmark our systems with cache disabled
DOS PCI Graphics Card Benchmarks

Reply 106 of 145, by luckybob

Rank l33t

All other benchmarks run a little faster than a standard p2-333. Except for ones that really test the memory. Those got massive gains. The benchmarks from aida64 for memory tasks went up considerably. I didn't have the foresight to record everything, but it would be a simple task to reset the bios, and do that.


Reply 107 of 145, by feipoa

Rank l33t++

I didn't even know 45 ns EDO SIMMs existed. Were they hard to find? Any idea why they don't work on the P65UP8?

I would be curious to see what cachechk says about your L1/L2/RAM Read/RAM Write speeds with the different CMOS RAM settings you were playing with, that is, the settings which yielded 16 fps and then 30 fps.

cachechk -d -t6
cachechk -d -w -t6

I would have never expected a 2-fold change in benchmark results. It makes me wonder if the cache speed is also being affected by your choice of RAM timings and if cache is working properly.

Your 30 fps result achieved recently is greater than what you supplied for the 686 benchmarks (27.3 fps). Are the settings identical?

Plan your life wisely, you'll be dead before you know it.

Reply 108 of 145, by feipoa

Rank l33t++

I ran a few tests on my IBM 5x86-133/2x system in DOS Quake.

320x200

RAM Read Wait State = 1ws (fastest stable)
19.8 fps

RAM Read Wait State = 3ws (Slowest possible)
19.3 fps

I didn't even get close to a 2-fold change in the frame rate. Even if I turn off L2 cache and use 1 ws, I still get 18.6 fps.


Reply 109 of 145, by mrau

Rank Oldbie
feipoa wrote:

I didn't even get close to a 2-fold change in the frame rate.

quite frankly, this is not even intel pentium, your cpu is probably too slow for this to make a visible difference; i bet a ppro in this scenario is just waiting for the main memory most of the time, however i do not understand why, since it has a gigantastic cache;

Reply 110 of 145, by luckybob

Rank l33t

So, in the name of science, I decided to take a closer look at the BIOS settings. For the record, the 128mb SIMMs are KM44C16104BS-6 (16 chips) and the 64mb SIMMs are IBM FRU 42L0225 with KM44V16104BK-4 (8 chips). The only discernible difference is the 64mb ones are 3.3V. I had NOT realized they were 45ns 3.3V chips until I got the UP8 board, so I've been running them at 5V. Something I plan on not doing anymore.

bios defaults: 6.8 fps

changes:
ide hdd block mode sectors > enabled
mps 1.4 support > enabled
memory auto config > enabled 60ns
cpu-to-ide posting > enabled
uswc write posting > enabled
cpu-to pci write post > enabled
pci-to-dram pipeline > enabled
pci burst write combining > enabled
read-around-write > enabled
onboard serial and parallel disabled

new fps: 15.6

changes:
pci vga palette snoop > enabled

new fps: 15.5

changes:
video memory cache mode: UC > USWC

new fps: 15.5

changes:
16-bit i/o recovery time > 1 busclk (from 4)
8-bit > 1 (from 8)

new fps: 15.6

changes:
dram auto > disabled
dram read 2/3/4 > 2/2/3
dram write 3/3/3 > 2/2/3
ras precharge 3t > 3t
ras to cas delay 1t > 0t
ma wait state 1 w/s > 0 w/s

new fps: 16.2

changes:
disable onboard S3 video, install Matrox g200

new fps: 29.1

changes:
revert to bios defaults AGAIN.

new fps: 5.1
here on out I made the same changes in the same order
1: 6.2 fps
2: 6.5 fps
3: 26.1 fps AH-HA!
4: 26.1 fps
5: 29.1 fps

Ok so the memory timing made a ~10% difference, but the real star of the show is USWC mode. Thanks to a bit of google:

"This is yet another BIOS feature with a misleading name. It does not cache the video memory or even graphics data (such data is uncacheable anyway). This BIOS feature allows you to control the USWC (Uncached Speculative Write Combining) write combine buffers.

When set to USWC, the write combine buffers will accumulate and combine partial or smaller graphics writes from the processor and write them to the graphics card as burst writes. When set to UC, the write combine buffers will be disabled. All graphics writes from the processor will be written to the graphics card directly.

It is highly recommended that you set this feature to USWC for improved graphics and processor performance. However, if you are using an older graphics card, it may not be compatible with this feature. Enabling this feature with such graphics cards will cause a host of problems like graphics artifacts, system crashes and even the inability to boot up properly. If you face such problems, you should set this BIOS feature to UC immediately."

Obviously the Matrox 200 is compatible with USWC where my onboard S3 Virge is not.
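To illustrate what that buffering buys, here's a toy model of write combining - the buffer size matches the 32-byte line of P6-era WC buffers, but the transaction counting is a simplification of mine, not how any specific chipset actually works:

```python
# Toy model (illustrative only) of why USWC helps: a write-combining buffer
# accumulates small sequential framebuffer writes and flushes them as one
# burst, instead of one bus transaction per write.

BUFFER_BYTES = 32  # P6-era WC buffers hold one 32-byte line

def bus_transactions(writes, write_combining):
    """Count bus transactions for a list of (address, size) writes."""
    if not write_combining:
        return len(writes)  # UC: every write goes straight to the card
    transactions = 0
    buf_start, buf_len = None, 0
    for addr, size in writes:
        if buf_start is not None and addr == buf_start + buf_len \
                and buf_len + size <= BUFFER_BYTES:
            buf_len += size  # write combines into the open buffer
        else:
            if buf_start is not None:
                transactions += 1  # flush the previous buffer as one burst
            buf_start, buf_len = addr, size
    if buf_start is not None:
        transactions += 1  # final flush
    return transactions

# 640 sequential 1-byte pixel writes (one 8 bpp scanline)
scanline = [(addr, 1) for addr in range(640)]
print(bus_transactions(scanline, False))  # 640 transactions uncached
print(bus_transactions(scanline, True))   # 20 burst transactions with WC
```

A software renderer writing pixels left to right is close to the ideal case for this, which fits the doubling seen in Quake; scattered small writes (or a card that mishandles bursts, like the onboard ViRGE apparently does) see far less benefit.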


Reply 111 of 145, by clueless1

Rank l33t

So USWC is disabled by default on the motherboard. Is there an Optimized Settings mode in the BIOS that turns it on, or are you left to figure it out on your own to gain such astounding benefits?


Reply 112 of 145, by mrau

Rank Oldbie

i have never seen this setting in any of my biosen, can it be done if bios does not offer the setting? is it the same thing those nifty mtrr/svga tools trigger when accelerating LFB access?

Reply 113 of 145, by luckybob

Rank l33t

Like everything else, it's up to you to figure it out yourself. It is disabled by default in the BIOS. Online I see claims that NT4 doesn't play well with USWC (or anything else for that matter).

@mrau
I don't know about your BIOS. I never really pay much attention to the settings. I have a habit of just turning everything on and going for it. Supposedly it's only useful for software rendering, so it became rather useless as time went on.


Reply 114 of 145, by feipoa

Rank l33t++

So USWC gives you the 100% performance boost? I don't think I've ever seen USWC on a motherboard as old as a Socket 8. Was this common on Socket 7/8 boards and my sample set is just too small?

I have always left USWC disabled because of internet comments of instability, and possible (marginal) slow downs in Windows under certain conditions. Now I am considering turning it on again.

What else I found very interesting is that your PPro-PIIOD-333 scores 5.1-6.8 fps with BIOS defaults enabled. My IBM 5x86-133, with the same graphics card, scores 6.2 fps at 640x480.


Reply 115 of 145, by mrau

Rank Oldbie

@feipoa yes, slowdowns are reported, even massive ones; this trigger allegedly was introduced with ppro, so im not sure a socket7 would have that
i would love to know if this works by default in mobos where its not a trigger in the setup program

Reply 116 of 145, by noop

Rank Member
kool kitty89 wrote:

don't think I've heard of any using MMX based geometry engines

I had a working one for DirectX 6, in the early 2000s, but never did anything with it 🙁 Very basic - transformations and normal-based diffuse lighting. It performed rather well. And at that time it was hardly useful for anything, even if it actually had some advantage over SSE math (but my particular implementation used a bit of SSE as well). Motivation was the absence of T&L support in my video card (Kyro 2)

Reply 117 of 145, by alvaro84

Rank Member
feipoa wrote:

So USWC gives you the 100% performance boost? I don't think I've ever seen USWC on a motherboard as old as a socket 8. Was this common on socket 7/8 boards and my supply set is just too small?

I have always left USWC disabled because of internet comments of instability, and possible (marginal) slow downs in Windows under certain conditions. Now I am considering turning it on again.

Sorry for the following thread necromancy.

USWC is present in my Asus P3B-F's BIOS too, but even if it weren't, I could still achieve the same (or even bigger, I don't know why!) speedup with the Fastvid utility. It can do the trick on most (every?) P6-based system. It even worked on my ISA P4 board; I guess the Northwood has the same memory type range registers as the PPro/2/3.

Video access speed greatly affects gaming benchmarks, and its effect grows bigger as frame rates increase. It can easily double a several-hundred-fps 3DBench result. On a strong P3 system it sends it through the roof and makes it guesswork, because 3DBench can't properly display frame rates over 999.9 🤣

Shame on us, doomed from the start
May God have mercy on our dirty little hearts

Reply 118 of 145, by ruthan

Rank Oldbie

Great work, i love these long graphs in picture format.

So details to make it more ultimate 😀

  1. It would be nice to compare more Super Socket 7 chipsets; there could be a big difference. I remember a Quake III benchmark (it would be nice to add that to the suite too) from Anandtech from this period where the difference between the slowest and fastest chipset was 30%.
    There are only 3 chipsets as far as I know: VIA MVP3, ALi Aladdin V (probably for the admiral general:), and SiS 530/5595 (I never saw one with AGP - because some low-end AGP video card is already integrated - but it could manage 1.5 GB of RAM, 2x more than VIA/ALi). There should also be some SiS chipsets with AGP - SiS 5581/5582 (ISA, PCI, AGP) and SiS 5591/5595 (ISA, PCI, AGP) - but I never met a MB with them.
  2. How about memory? I didn't find details about its settings in the first post; maybe they are somewhere in the Excel sheet. Was the same memory speed used for all CPUs, or is memory speed always the same as FSB speed? I don't remember if it was possible on some SS7 MBs to set memory speed higher than FSB.. but if not, you can still use some PC-133 SDRAM, which at 100 MHz would probably give some nice timings, much better than the Pentium's EDO RAM.
  3. I know that this is a CPU benchmark, but I really liked that "world's fastest 486" attitude and point of view. It's nice to know which CPUs are best, but most of us are searching for the best CPU + video card combo, and different CPUs and video cards scale differently. So it would be nice to see these CPUs combined with something other than the Matrox G200.. at least with the other major players - Voodoo 3 / TNT2 / GeForce 1/2/3/4, maybe Kyro and Savage. There are only 3 games in the benchmarks, and we don't need to benchmark every CPU and frequency, only a few of the most typical configs.
    BTW, just for fun, there are very fast AGP cards, like the Radeons 1300/1600/1950 and Radeons HD 2400/2600/3650/3850 - do these work with Super Socket 7? I think Phil already tested a K6-III + GeForce 5500. I don't know about AGP voltage, whether these MBs have the right voltage for these more modern cards; maybe there is some voltage mod.
    But if they have a Primary Video Card slot option in the BIOS, you can have 2 cards: 1 PCI for DOS or maybe DOS+Win98, and one for XP/Linux etc.
  4. Is there a big difference between PCI and AGP with faster cards on these CPUs and chipsets?
  5. Fastvid / RayeR's enhancer - it would be nice to know that these at least work with these chipsets, because that's not guaranteed; there are quite a few incompatible combinations.
  6. Quake 2's 3DNow! was very tuned for 3dfx cards; I think it was much faster for Voodoo 3 too - with my K6-2 300 there was almost double the framerate back in the day.. day and night difference.

    Here is a list of games which support 3DNow!; it's not short:
    https://web.archive.org/web/20001109071400/ht … dnow/optimized/

Im old goal oriented goatman, i care about facts and freedom, not about egos+prejudices. Hoarding=sickness. If you want respect, gain it by your behavior. I hate stupid SW limits, SW=virtual world, everything should be possible if you have enough raw HW.

Reply 119 of 145, by kool kitty89

Rank Member

Even aside from the weirdly high Socket 8 Overdrive performance in Quake 1, does anyone have any idea why the Athlon seems to do so poorly with Quake? The PII, III, and Pro scale about as expected vs the P5 family, but the Athlon does really poorly for what should be a very Pentium-friendly FPU and very fast ALU core.

Could it be a chipset related issue?

I have a pair of old dual CPU Socket 462 ATX server/workstation boards with AMD's own 760 chipset if that might test differently than a VIA based one. That and DDR vs SDR VIA chipsets might shed light on this and/or NForce/Nforce2 based ones. (I have some NForce 2 boards left over from my and my dad's old Athlon XP builds that we never sold off ... and Dad got working again a couple years ago as one or both had corrupt BIOS chips: I think we actually have 1 spare to help with hot-swap flashing those and/or he got a USB PLLC socketed flash programmer to do that the proper way)

Not as interesting for retro builds that demand good Sound Blaster and OPL3 compatibility, but certainly solid early-2000s-era hardware. (at some point I also realized I made the mistake of not building up that NForce2 machine more ... it had only an Athlon XP 1600+ at stock 1.4 GHz/266 FSB installed, 768 MB of DDR, a Radeon 9600SE and a dying 60 GB HDD when I retired it for a desktop-replacement Turion dual core GeForce 7150M based HP 9000 series ... and maxing out that old board with a Barton Athlon/Sempron, 333 or overclocked FSB, more RAM, a decent GPU and a new HDD would've outstripped that laptop for almost everything but really multi-core-demanding/CPU-bottlenecked stuff ... ie almost anything/everything that was actually playable on that 7150M ... also a more portable and less expensive notebook would've been nice, but that 9000-whatever had a decent keyboard by laptop standards at least)

Oh, also:

The 45 vs 60 ns EDO RAM timing difference might (mostly) not be in random-access wait states or timings, but in burst/page-mode cycle timing.

If it works down to single-cycle page-mode timing, that would be 2x as fast as typical EDO RAM timing and the same as typical BEDO timing, though a board/chipset could potentially support a 1-clock EDO/page-mode cycle time without explicit BEDO support, or could work with non-BEDO RAM that just tolerates the timing.

With enough tweak options, you could potentially get EDO latency and throughput to typical SDRAM levels. (ie 5-1-1-1 or even 3-1-1-1 burst times, though probably not 2-1-1-1 except maybe at 50 MHz FSB)
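As a back-of-the-envelope check on those burst figures (my own arithmetic, assuming a 66.6 MHz FSB and a 32-byte cache line filled in four 64-bit transfers):

```python
# Rough line-fill bandwidth for the x-y-y-y burst timings mentioned above.
# Assumes a 66.6 MHz FSB and 32-byte lines filled in four 64-bit transfers;
# purely illustrative arithmetic, not measured figures.

def line_fill_mb_s(timing, fsb_mhz=66.6):
    """MB/s for back-to-back 32-byte line fills with the given burst timing."""
    cycle_ns = 1000.0 / fsb_mhz
    ns_per_line = sum(timing) * cycle_ns
    return 32 / (ns_per_line / 1000.0)  # bytes per microsecond == MB/s

for name, timing in [("typical EDO   7-2-2-2", (7, 2, 2, 2)),
                     ("fast EDO/BEDO 5-1-1-1", (5, 1, 1, 1)),
                     ("SDRAM-like    3-1-1-1", (3, 1, 1, 1))]:
    print(f"{name}: {line_fill_mb_s(timing):.0f} MB/s")
```

So going from a typical 7-2-2-2 EDO burst to 5-1-1-1 is roughly a 60% jump in line-fill bandwidth at the same FSB clock, which is consistent with tight timings mattering far more than the nominal ns speed grade of the modules.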

And EDO RAM was (or maybe still is) manufactured well faster than 45 ns, at least down to 35 ns, but it wasn't typically used for SIMMs or EDO DIMMs but for video card RAM or some embedded system use (soldered-on RAM). And by the time it was common to have high yields of the fast timings, I think SDRAM had totally displaced it in the consumer PC market.

(I wonder if late production model Playstation game consoles have unnecessarily fast EDO RAM as their main memory just due to cheap supplies of such, like Sega's Mega Drive used faster than necessary SRAM and PSRAM for its later models ... though actually switched to embedded SRAM for the embedded Z80 and a unified SDRAM bus with a 2-bank 128kx16bit SDRAM chip in place of the 8-bit VRAM and 16-bit PSRAM for the final couple hardware revisions)

Additionally, some older chipsets (especially 486 and 386 chipsets) have limited or no support for page-mode read/writes at all, at least judging by the memory bandwidth benchmarks I see. (potentially, a 'smart' memory controller could even take advantage of page-mode cycles for sequential reads/writes on as far back as 8088s, 286s, and 386s, and for fast prefetch fill and speeding up 16-bit bus cycles on the 8088 or 32-bit ones on the 386SX: you'd need wide latches or FIFOs in the chipset or as TTL chips to handle that)

The bandwidth and read and write cycle times I'm seeing on my OPTi 495 with a 40 MHz FSB and '0 wait state' DRAM read/write settings look quite poor, really. Writes are within the specs for non-page-mode cycle times (RC times) for 60 or even 70 ns DRAM at about 160 ns, and reads are twice as long at over 320 ns. This seems really, really slow for reads and is significantly slower than the read/write cycle times the Atari ST and Amiga did back in 1985 with 150 ns NMOS DRAM. (just under 280 ns for the Amiga and under 250 ns for the ST, though actual CPU bus cycles were 2x that long and the chipset does 2 reads and/or writes in 560 or 500 ns with even/odd cycles split between video DMA and CPU: the Amiga can also use 'spare' video DMA slots for blitter operations while the ST/STe does all of its disk DMA and blitter ops on the CPU cycle slots)

OTOH this may also be an artifact of the way the benchmarks I'm using do their memory tests. (they don't handle special cases like burst cache-line fills or such)

I have no idea if there are any fast/smart 16-bit chipsets out there, but you could have arbitrary-length page-mode bursts for RAM on a 286 or 386SX (or 8086/V30) simply by keeping the DRAM row held open so long as addresses remained within its range (so not just linear sequential reads or writes, but any random reads or writes that landed within that open row of the DRAM array). Even a 286 would then get zero-wait-state operation for page-mode cycles, taking a wait state here or there only on a page break.
The logic for that isn't very complicated, and while it would screw up some cycle-timed code, it should also be simple to disable in the BIOS.

There's also potential optimization for bank-interleaved DRAM timing, with or without page-mode use on top of that. (so you get some overlap in read/write cycles and, better yet, each bank can hold its own page open, so you get page-mode burst timing so long as the code keeps requesting data from the same page in each bank: super, super useful for texture-mapped software renderers, for example, where you'd want to put the textures and framebuffer space in separate banks)

Quake may be written such that it takes advantage of RAM organization to maximize use of page-mode reads and writes and (potentially) bank interleaving as well.

I'm not sure how its textures are formatted internally (ie as loaded into the active game engine), but having the texel arrays packed into long contiguous lines of pixels would let chipsets with good page-mode support gain a ton of performance over random reads/writes. (you'd also see a bigger jump from FPM to EDO timing on those systems)

Quake might also optimize for multi-bank DRAM controllers by organizing texture storage and framebuffer regions along likely bank boundaries (or might query the OS or BIOS for bank address boundaries), but even on a system that only holds a single page open, you have tons of buffering via the caches and FPU registers (which Quake renders texture spans through), so you should get close to peak page-mode bandwidth even with just a single bank available.

Optimizing for page-mode burst cycles would be about as important as optimizing for the full 64-bit data bus width on the P5 platform, and making sure to pipeline/buffer rendering operations to make as much use of that as possible. (incidentally, the same is true for the Atari Jaguar version of Quake Carmack was working on, though it lacks any hardware caches and would exploit registers and embedded scratchpad RAM or 'spare' line buffer area to work around that; unbuffered/uncached textures take 11 cycles per texel for the blitter to render, but peak blitter texture-mapping throughput is 5 cycles according to some tests Kskunk did, though it was speculated to be faster before that; that's the same bottleneck as for scaled/rotated blitter objects, since that's all its texture-mapping feature does: affine line rendering or 2D bitmap rendering, much slower than scaled sprites using the object processor)

Incidentally, the Jag chipset actually has a dual-bank DRAM controller with two 2 MB address regions mapped to it (the ROM and external I/O area counts as a third, separate bank), but only a single 64-bit-wide bank was populated (to a fairly generous 2 MB) and the other was left unused. (the arcade CoJag unit populated the second bank with dual-port VRAM, I believe, though I forget why ... or if it used that for a second video controller's framebuffer with genlock: the few games in development for it had HDD-streamed animated/FMV backgrounds, so that would make sense)

For that matter, even Doom gets a multi-bank-style benefit, since it draws directly to VGA RAM (or draws 2-pixel lines using VGA fill commands): texel reads from system RAM can potentially stay in page mode while it's bottlenecked mostly by VGA write wait states. (it has the disadvantage of still writing one pixel at a time, so a faster VGA bus helps but a wider one does little to nothing: I think a 16-bit bus might speed up VGA register writes for fills in the high-detail mode, but otherwise the wider bus of ISA/VLB/PCI would just help with the limitations of 386 word-aligned addressing: ie Doom just uses an 8-bit pixel pipeline for its rendering)

Well, the high-color Doom renderers (Jaguar, 32X, and I think 3DO) use 16-bit pixel pipelines, but the speed would be the same regardless. (just that an 8-bit-wide VGA card would take a much more dramatic hit if the PC version supported 16-bit pixels ... or if it ran in a linear mode 13h and did block copies from a back buffer in system RAM)

The spans (floor and ceiling) rendered in Doom might also benefit from page-mode operation if the VGA card happens to support it. (relevant to VRAM and DRAM alike, though not oddball cards using SRAM/PSRAM ... some of ATi's CGA/Plantronics cards used that, not sure if anything VGA did ... maybe some low-cost 64kB VGA cards used PSRAM; I'd think the cost benefits of board/chip complexity vs RAM cost would nix that for 256kB or larger VGA cards)