Even aside from the weirdly high Socket 8 Overdrive performance of Quake 1, has anyone any idea why the Athlon seems to do so poorly with Quake? The PII, III, and Pro might scale quite as expected vs P5 family, but the Athlon does really poorly for what should be a very Pentium-friendly FPU and very fast ALU core.
Could it be a chipset related issue?
I have a pair of old dual CPU Socket 462 ATX server/workstation boards with AMD's own 760 chipset if that might test differently than a VIA based one. That and DDR vs SDR VIA chipsets might shed light on this and/or NForce/Nforce2 based ones. (I have some NForce 2 boards left over from my and my dad's old Athlon XP builds that we never sold off ... and Dad got working again a couple years ago as one or both had corrupt BIOS chips: I think we actually have 1 spare to help with hot-swap flashing those and/or he got a USB PLCC socketed flash programmer to do that the proper way)
Not as interesting for retro builds that demand good Sound Blaster and OPL3 compatibility, but certainly solid early 2000s era hardware. (at some point I also realized I made the mistake of not building up that NForce2 machine more ... it had only an Athlon XP-1600+ at stock 1.4GHz 266 fsb installed, 768 MB of DDR, a Radeon 9600SE and a dying 60 GB HDD when I retired it for a desktop replacement Turion dual core GeForce 7150m based HP 9000 series ... and maxing out that old board with a Barton Athlon/Sempron, 333 or overclocked FSB, more RAM, decent GPU and a new HDD would've outstripped that laptop for almost everything but really multi-core demanding/cpu bottlenecked stuff ... ie almost anything/everything that was actually playable on that 7150M ... also a more portable and less expensive notebook would've been nice, but that 9000 whatever had a decent keyboard by laptop standards at least)
The 45 vs 60 ns EDO RAM difference might (mostly) show up not in random access wait states or timing, but in burst/page-mode cycle timing.
If it works down to single-cycle page mode timing, that would be 2x as fast as typical EDO RAM timing and the same as typical BEDO timing, though a board/chipset could potentially support a 1-clock EDO/page-mode cycle time without explicit BEDO support or could work with non-BEDO RAM that just tolerates the timing.
With enough tweak options, you could potentially get EDO latency and throughput to typical SDRAM levels. (ie 5-1-1-1 or even 3-1-1-1 burst times, though probably not 2-1-1-1 except maybe at 50 MHz FSB)
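To put rough numbers on those burst timings, here's a quick back-of-the-envelope sketch. It assumes a 64-bit (8-byte) P5-style data bus, a 4-transfer cache line burst, a 66 MHz FSB, and 5-2-2-2 as the "typical" EDO timing; the function name and those defaults are just mine for illustration, and it only gives peak burst figures, not anything a benchmark would actually report.

```python
def burst_bandwidth_mb_s(fsb_mhz, timing, bus_bytes=8):
    """Peak bandwidth (MB/s, 10^6 bytes) over one burst.

    timing    -- clocks per transfer in the burst, e.g. (5, 2, 2, 2)
    bus_bytes -- data bus width in bytes (8 = 64-bit bus, an assumption)
    """
    clocks = sum(timing)                    # total clocks for the whole burst
    bytes_moved = bus_bytes * len(timing)   # one bus-width transfer per entry
    return bytes_moved * fsb_mhz / clocks   # bytes per microsecond == MB/s


if __name__ == "__main__":
    for label, timing in [
        ("EDO 5-2-2-2 (assumed typical)", (5, 2, 2, 2)),
        ("tweaked 5-1-1-1", (5, 1, 1, 1)),
        ("tweaked 3-1-1-1", (3, 1, 1, 1)),
    ]:
        print(f"{label}: {burst_bandwidth_mb_s(66, timing):.0f} MB/s peak")
```

Note that going from x-2-2-2 to x-1-1-1 doubles the page-mode rate itself, but the lead-off cycle still dominates a 4-transfer burst, so the burst as a whole gains less than 2x.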
And EDO RAM was (or maybe still is) manufactured well faster than 45 ns, at least down to 35 ns, but it wasn't typically used for SIMMs or EDO DIMMs but for video card RAM or some embedded system use (soldered-on RAM). And by the time it was common to have high yields of the fast timings, I think SDRAM had totally displaced it in the consumer PC market.
(I wonder if late production model Playstation game consoles have unnecessarily fast EDO RAM as their main memory just due to cheap supplies of such, like Sega's Mega Drive used faster than necessary SRAM and PSRAM for its later models ... though actually switched to embedded SRAM for the embedded Z80 and a unified SDRAM bus with a 2-bank 128kx16bit SDRAM chip in place of the 8-bit VRAM and 16-bit PSRAM for the final couple hardware revisions)
Additionally, some older chipsets (especially 486 and 386 chipsets) have limited or no support for page-mode read/writes at all, at least judging by the memory bandwidth benchmarks I see. (potentially, a 'smart' memory controller could even take advantage of page-mode cycles for sequential reads/writes on as far back as 8088s, 286s, and 386s, and for fast prefetch fill and speeding up 16-bit bus cycles on the 8088 or 32-bit ones on the 386SX: you'd need wide latches or FIFOs in the chipset or as TTL chips to handle that)
The bandwidth and read and write cycle times I'm seeing on my Opti 495 with 40 MHz FSB and '0 wait state' DRAM read/write settings look quite poor, really. Writes are within the specs for non page mode cycle times (RC times) for 60 or even 70 ns DRAM at about 160 ns and reads are twice as long at over 320 ns. This seems really, really slow for reads and is significantly slower than the read/write cycle times the Atari ST and Amiga did back in 1985 with 150 ns NMOS DRAM. (just under 280 ns for the Amiga and under 250 ns for the ST, though actual CPU bus cycles were 2x that long and the chipset does 2 reads and/or writes in 560 or 500 ns with even/odd cycles split between video DMA and CPU: the Amiga can also use 'spare' video DMA slots for blitter operations while the ST/STe does all of its disk DMA and blitter ops on the CPU cycle slots)
OTOH this may also be an artifact of the way the benchmarks I'm using do their memory tests. (and they may not exercise special cases like cache fill burst reads or such)
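For reference, this is roughly how those cycle times translate into bus clocks and throughput. It's only a sketch of the arithmetic, assuming each access moves one 32-bit (4-byte) word and that the 40 MHz clock gives 25 ns per cycle; the function name and parameters are made up for the example.

```python
def cycle_stats(cycle_ns, fsb_mhz=40.0, bytes_per_access=4):
    """Convert an access cycle time into (bus clocks, MB/s).

    bytes_per_access=4 assumes one 32-bit word per access (an assumption,
    the benchmark may well be doing something else).
    """
    clock_ns = 1000.0 / fsb_mhz                    # 25 ns at 40 MHz
    clocks = cycle_ns / clock_ns                   # bus clocks per access
    mb_per_s = bytes_per_access * 1000.0 / cycle_ns  # bytes/us == MB/s
    return clocks, mb_per_s


if __name__ == "__main__":
    for label, ns in [("write", 160.0), ("read", 320.0)]:
        clocks, mbs = cycle_stats(ns)
        print(f"{label}: {ns:.0f} ns = {clocks:.1f} clocks, {mbs:.1f} MB/s")
```

So under those assumptions the writes work out to a bit over 6 clocks per access and the reads to nearly 13, which is a long way from what a '0 wait state' setting would suggest.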
I have no idea if there are any fast/smart 16-bit chipsets out there, but you could have arbitrary linear-burst times for RAM on a 286 or 386SX (or 8086/V30) that simply kept the DRAM row held open so long as addresses remained within that range (so not just linear sequential reads or writes, but any random reads or writes that took place within that row of the DRAM array). So even on a 286 you'd gain the advantage of zero wait state operations for page-mode cycles and getting a wait here or there when you saw a page break.
The logic associated with that isn't very complicated and while it would screw up some cycle-timed code, it also should be simple to disable in the BIOS.
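Something like the following toy model shows the idea of holding a row open. It's not based on any real chipset: the 2 kB row size and the 2-clock penalty on a page break are made-up numbers, and the function name is just for illustration.

```python
def count_wait_states(addresses, row_bytes=2048, miss_penalty=2):
    """Model a controller that keeps one DRAM row open.

    Returns (page_hits, page_misses, total_wait_clocks).
    row_bytes and miss_penalty are hypothetical values, not from any
    real chipset.
    """
    open_row = None
    hits = misses = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1          # same row: page-mode cycle, no waits
        else:
            misses += 1        # page break: precharge + new row address
            open_row = row
    return hits, misses, misses * miss_penalty


if __name__ == "__main__":
    # 16-bit sequential accesses across 8 kB: one page break per 2 kB row.
    print(count_wait_states(range(0, 8192, 2)))
    # Ping-ponging between two distant regions: every access breaks the page.
    thrash = [a for i in range(100) for a in (i * 2, 0x10000 + i * 2)]
    print(count_wait_states(thrash))
```

The second case is the worst one for a single open row: code that alternates between two far-apart buffers gets no benefit at all.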
There's also potential optimization for bank-interleaved DRAM timing with or without page-mode use on top of that. (so you get some overlap in read-write cycles and better yet: each bank can have its own page held open, so you get page-mode burst timing so long as the code keeps requesting data from the same page in each bank: super, super useful for texture mapped software renderers, for example, where you'd want to organize the textures and framebuffer space in separate banks)
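As a rough illustration of why a per-bank open page helps a texture mapper, here's a toy model where each bank remembers its own open row. The bank size, row size, and the idea that banks are split by contiguous address region (rather than interleaved on low address bits) are all assumptions for the sake of the example.

```python
def page_hits_per_bank(addresses, bank_bytes=0x10000, row_bytes=2048):
    """Count (page_hits, page_misses) with one open row tracked per bank.

    bank_bytes and row_bytes are hypothetical; banks are assumed to be
    contiguous address regions.
    """
    open_rows = {}             # bank number -> currently open row
    hits = misses = 0
    for addr in addresses:
        bank = addr // bank_bytes
        row = (addr % bank_bytes) // row_bytes
        if open_rows.get(bank) == row:
            hits += 1
        else:
            misses += 1
            open_rows[bank] = row
    return hits, misses


if __name__ == "__main__":
    # Alternate texel reads (low region) with framebuffer writes (high region).
    trace = [a for i in range(100) for a in (i, 0x10000 + i)]
    print("two banks:", page_hits_per_bank(trace))
    # Same trace, but everything lands in one big bank.
    print("one bank: ", page_hits_per_bank(trace, bank_bytes=0x100000))
```

With the texture and framebuffer in separate banks, only the first access to each one breaks the page; with a single bank, the same read/write alternation breaks the page on every access.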
Quake may be written such that it takes advantage of RAM organization to maximize use of page-mode reads and writes and (potentially) bank interleaving as well.
I'm not sure how its textures are formatted internally (ie as installed, loaded game files in the active game engine), but having the texel arrays packed into long lines of pixels would make chipsets with good page-mode support gain a ton of performance over random reads/writes. (you'd also see a bigger jump from FPM to EDO timing in those systems)
Quake might also optimize for multi-bank DRAM controllers by organizing the texture storage and framebuffer regions along likely bank boundaries (or might check the OS or BIOS for bank address boundaries) but even in a single-bank (single page) handling system, you have tons of buffering via the caches and FPU registers (that Quake renders texture spans to) so you should get close to the peak page-mode bandwidth even with just a single bank available.
Optimizing for page-mode burst cycles would be about as important as optimizing for the full 64-bit data bus width on the P5 platform, and making sure to pipeline/buffer rendering operations to make as much use of that as possible. (incidentally, the same is true for the Atari Jaguar version of Quake Carmack was working on, though it lacks any hardware caches and would exploit registers and embedded scratchpad RAM or 'spare' line buffer area to work around that; unbuffered/uncached textures take 11 cycles per texel for the blitter to render, but peak blitter texture mapping throughput is 5 cycles according to some tests Kskunk did, though it was speculated to be faster prior to that; that's the same bottleneck as for scaled/rotated blitter objects as that's all its texture mapping feature does: affine line rendering or 2D bitmap rendering, much slower than scaled sprites using the object processor)
Incidentally, the Jag chipset actually has a dual-bank DRAM controller with 2 2MB address regions mapped to those (also the ROM and external I/O area counts as a 3rd, separate bank), but only a single 64-bit wide bank was populated (to a fairly generous 2MB) and the other was left unused. (the arcade CoJag unit populated the second bank with dual-port VRAM I believe, I forget why ... or if it used that for a second video controller's framebuffer with genlock: the few games in development for it had HDD-streamed animated/FMV backgrounds, so it would make sense)
For that matter, even Doom takes advantage of multi-bank rendering as it draws directly to VGA RAM (or draws 2-pixel lines using VGA fill commands) so can potentially do page-mode reads of texels and be bottlenecked mostly by VGA write wait states. (it has the disadvantage of still writing one pixel at a time, so a faster VGA bus helps, but a wider one does little to nothing: I think a 16-bit bus might speed up VGA register writes for fills, but for the high-detail mode the wider bus of ISA/VLB/PCI would otherwise just help with the limitations of 386 word-aligned addressing: ie Doom just uses an 8-bit pixel pipeline for its rendering)
Well, the highcolor Doom renderers (Jaguar, 32x, and I think 3DO) do 16-bit pixel pipelines, but the speed would be the same regardless. (just that an 8-bit wide VGA card would see a much more dramatic hit if the PC version supported 16-bit pixels ... or if it ran in a linear mode 13h and did block copies from a back buffer in system RAM)
The spans (floor and ceiling) rendered in Doom might also benefit from page-mode operation if the VGA card happens to support it. (relevant to VRAM and DRAM alike, though not oddball cards using SRAM/PSRAM ... some of ATi's CGA/Plantronics cards used that, not sure if anything VGA did ... maybe some low-cost 64kB VGA cards used PSRAM; I'd think the cost benefits of board/chip complexity vs RAM cost would nix that for 256kB or larger VGA cards)