As I got into over at the Cyrix Appreciation Thread:
viewtopic.php?f=25&t=31111&p=483045#p483045
This article:
http://www.azillionmonkeys.com/qed/cpuwar.html
points out that the CXT revision K6-2 added 0-cycle (superscalar) Fxch execution like the P5 and P6, so this would probably be the main gain in FPU performance over the K6 in the benchmarks. (the CXT also fixes the earlier models' lack of pipelined stores, addressing some of the poor memory bandwidth seen in those models) On the whole, though, the K6 seems more optimized for low latency than for high bandwidth/throughput compared to the P6, which might be part of the reason it performs relatively well in spite of slow memory bandwidth/throughput scores (and SS7 chipset slowness compared to S370/Slot1), a trend that widened with the Netburst architecture compared to K7. (the K7 was more bandwidth-friendly/intensive than the K6, but vastly more latency-optimized over bandwidth than Netburst)
It's not just a matter of the K6's short pipeline, but of generally fast operation across the board.
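To put the Fxch point above in rough numbers, here's a toy cost model. The x87 is stack-based, so interleaving two independent chains of FPU work means swapping operands to the top of the stack with fxch; if each swap costs cycles, the interleaving overhead eats into the gain. Every cycle count below is a made-up placeholder (not a measured K6, P5 or P6 figure), and the function and variable names are hypothetical.

```python
# Toy cost model: all cycle counts here are hypothetical, not measured values.
HYPOTHETICAL_ISSUE_CYCLES = {"fadd": 1, "fmul": 1}


def total_cycles(ops, fxch_cycles):
    """Sum issue cycles for a list of x87 mnemonics, charging fxch_cycles per fxch."""
    total = 0
    for op in ops:
        if op == "fxch":
            total += fxch_cycles
        else:
            total += HYPOTHETICAL_ISSUE_CYCLES[op]
    return total


# Two interleaved dependency chains; each fxch brings the other chain to st(0).
interleaved = ["fmul", "fxch", "fadd", "fxch", "fmul", "fxch", "fadd"]

free_fxch = total_cycles(interleaved, fxch_cycles=0)    # fxch is "free"
costly_fxch = total_cycles(interleaved, fxch_cycles=2)  # fxch costs real cycles
print(free_fxch, costly_fxch)
```

The absolute numbers mean nothing; the point is only that the same instruction stream gets disproportionately more expensive once the swaps stop being free.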
There are also quite a few areas where the 6x86 and K6 are faster, more advanced, more efficient and just better-designed parts on paper, but that failed to produce real-world gains when fed code from P5 or P6-optimized compilers. (from FPU scheduling to complete omission of LOOP instruction use in Intel compilers, it's a serious problem) Then again, Intel continues that trend to this day, designing compilers that not only favor their own processors, but intentionally cripple competing ones or even disable functionality. (which is technically legal, unless you fail to make developers aware you're doing so -as happened with several lawsuits some years back regarding multimedia extensions being disabled on non-Intel CPUs, I think some of the vector processing instructions added in the Phenom)
AMD and Cyrix could/should have promoted their own optimized compilers to compete with this (it would be fairly quick/painless on the developer end to re-compile and offer CPU-specific operating modes for various drivers -mostly on the OS end I'd think, outside proprietary multimedia or video editing programs -and games). Optimized FPU scheduling may very well be why later revisions of Quake II's miniGL perform so much better on the K6-2 even with 3DNow! disabled. (that and possibly better revision of integer operations to favor the K6-2 as well)
These are also the sort of issues that likely wouldn't have materialized had said processors been used in more closed-box devices (like consoles or non-Windows/DOS home computers -Macintosh/Amiga/Atari ST style/etc), as all programs would be oriented towards the single architecture.
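As a rough illustration of the compiler dispatch issue mentioned above, here's a minimal sketch of the difference between picking a code path by vendor string and picking it by reported feature flags. This is not Intel's actual dispatcher: the `Cpu` class, both functions and the path names are hypothetical, and the feature sets are just example data. Only the two CPUID vendor strings are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    vendor: str           # CPUID vendor string
    features: frozenset   # feature flags the CPU reports (example data only)


def pick_by_vendor(cpu):
    """Hypothetical dispatcher that gates the fast path on the vendor string."""
    if cpu.vendor == "GenuineIntel" and "sse" in cpu.features:
        return "sse_path"
    return "generic_x87_path"


def pick_by_feature(cpu):
    """Hypothetical dispatcher that only looks at what the CPU says it supports."""
    if "sse" in cpu.features:
        return "sse_path"
    return "generic_x87_path"


intel = Cpu("GenuineIntel", frozenset({"mmx", "sse"}))
amd = Cpu("AuthenticAMD", frozenset({"mmx", "sse", "3dnow"}))

for cpu in (intel, amd):
    print(cpu.vendor, pick_by_vendor(cpu), pick_by_feature(cpu))
```

With the vendor check, the non-Intel part falls back to the generic path even though it reports the same extension; with the feature check, both get the fast path.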
Additionally, the 6x86 and K6 had much better legacy support for 16-bit code than the P5 or P6 (obviously more so the PPro but even PII with its enhanced 16-bit code operation) and particularly dramatically so for code not specifically made with Intel compilers or hand-coded using P5 or P6 scheduling rules. (or in short, Cyrix and AMD made better 386/486 code accelerators than Intel did)
For that matter, the 6x86's balanced ALU and FPU execution performance is much better at accelerating code optimized for 486 performance than the P5 is. (as in the 6x86's superscalar integer execution increased roughly proportionally to the FPU execution but the P5 improved FPU performance vastly disproportionately almost to the point of 2:1 disparity -more than that on paper, but roughly so in real world operation)
It did make the 5x86 and Media GX's operation more balanced by comparison. (would've been interesting if they'd made a low-cost gaming/multimedia-oriented companion to the MII out of the MediaGX's core -cut out the DRAM controller and VDC and mate it with the 6x86MX's big cache and S7 FSB and it might have made for a good Winchip-sized core with better ALU and vastly better FPU performance -or ... more like an earlier Winchip2 without 3DNow! and with higher max clock speeds)
I did mistakenly assume the '33 MHz' FSB was a huge bottleneck on the MediaGX, but that's rather misleading given that 'bus' is more like the PCI/DMA/external I/O interface and NOT a memory interface. The memory latency and throughput figures are rather good compared to S7/SS7 6x86/MII performance or several other CPUs, and memory performance scales up really well at higher CPU clocks for the Media GX. (the onboard memory controller seems to do rather well on the whole) As such, I'd assume the poor overall performance scaling at higher clock speeds is due to the small (12kB) L1 cache and lack of L2. Adding an L2 cache controller and optional board-level cache probably could've pushed it more into S7 or PII/Celeron-level performance and made integrated AT/ATX implementations of the Media GX more competitive in the mainstream. (obviously the bottom-end set-top boxes wouldn't use that cache, but for the lower-end mainstream it would've been necessary -that or expanding the L1 to 64 kB when they moved to 250 nm)
Given the memory controller performance and decent integrated video, it seems even more like Cyrix missed the boat by going with a system-on-a-chip rather than an integrated chipset design (which might have made serious competition for SiS, especially with their relatively modest S7 memory performance) sold in CPU+motherboard combinations of various sorts, possibly some surface-mounted. (and they could have had a standalone S7 MediaGX CPU alongside the 6x86 -with both matched very well to Cyrix's own chipset ... maybe even beating VIA's performance) An in-house chipset certainly would've given more flexibility for oddball FSB speeds too, rather than having to coordinate with chipset and motherboard manufacturers.
For that matter, given IBM had continued to manufacture the old 5x86C into the late 90s, having a low-end embedded 32-bit/486 bus chipset would've made sense too. (not sure if IBM ever die-shrunk the 5x86 or kept it running on the old .65 micron process that whole time ... given the large die-size and relatively low cost of a straight -non optimized- die-shrink, and ability to safely run 350 nm parts at 3.3-3.6V, I could see them more likely spinning off late models to that process rather than wasting the silicon on the old fab -of course, that'd be at the point when .350 was aging a bit and .250 micron was mainstream, around 1998)
Anyway, on the Athlon again, I'm still a bit baffled by its Quake software performance and some of the other benchmarks, including the Sandra ones. They don't match up well with the period benchmarks/reviews here:
http://www.pcstats.com/articleview.cfm? ... 441&page=2
http://www.xbitlabs.com/articles/cpu/di ... thlon.html
(granted the former is a Duron 700, but it should still be in the ballpark and wouldn't account for the vastly poorer Athlon 600 scores in Sandra)
That xbitlabs review has some neat details on 3DNow! performance of the Athlon, though, both standard (K6-2 compatible) 3DNow! and the Enhanced extensions the K7 added. Vanilla FPU usage is definitely far slower on the Athlon. (and with the Enhanced 3DNow! enabled, it's nearly double the CPU 3DMarks of the raw FPU -and superior to PIII SSE performance at the same clock speed, faster than a 650 MHz PIII for that matter even with the slower standard 3DNow! set)
Edit:
I wonder if Quake at 320x200 would shed any more light on this ... probably not for the Athlon, but perhaps for the P5 vs everything else. On paper, the only FPU operation consistently faster on the P5 family than on the P6 is Fmul, which should have a bigger impact at low res than high res: Quake's perspective correction is Fdiv-bound and gets more CPU intensive at higher resolutions, whereas at low res Fmul (used for vertex computation) is more significant. That should also show a bigger dive on Cyrix CPUs, as their Fdiv is fast but Fmul (and add and sub and xch) is slow. It probably would've favored the K5 a lot more too, given its Fmul is fast and its Fdiv is very slow compared to all the others. The Media GX and 486/5x86 probably would've shown better at low res too, given the 32-bit bus is less of a bottleneck there.
It also would've been a better 1:1 comparison for Doom.
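A back-of-the-envelope sketch of the Fmul-vs-Fdiv reasoning above: the per-vertex multiply work doesn't change with resolution, while the perspective-correction divides scale with the number of pixels drawn. All of the parameters are assumptions for illustration only: the vertex count, the 9 multiplies per vertex (a plain 3x3 transform), one divide per 16-pixel run, and exactly one screen's worth of pixels with no overdraw. The function name is hypothetical too.

```python
def fpu_op_counts(width, height, vertices=1000, muls_per_vertex=9,
                  pixels_per_divide=16):
    """Return (multiplies, divides) per frame under the assumptions above."""
    muls = vertices * muls_per_vertex              # independent of resolution
    divs = (width * height) // pixels_per_divide   # grows with pixel count
    return muls, divs


for w, h in [(320, 200), (640, 480)]:
    muls, divs = fpu_op_counts(w, h)
    print(f"{w}x{h}: {muls} muls, {divs} divs, mul/div ratio {muls / divs:.2f}")
```

Under those assumptions the multiply count stays put while the divide count grows with the pixel count, so the mix shifts toward Fdiv as resolution goes up. That's the only thing the sketch shows; real per-frame numbers would depend on the scene and the actual renderer.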
Unreal's software renderer would've been neat too, but that's probably more worth including in a different benchmark compilation. (one of the best examples of period MMX performance -I don't think the software renderer uses 3DNow! ... but it might; the K6-2's strong MMX performance should come into play though)