VOGONS


Cyrix DX2-80 slower than DX2-66?


Reply 40 of 48, by Anonymous Coward

Rank l33t++

That's a pretty interesting article. It explains why the Cyrix DX-50 was a thing. Cyrix was hoping VLB 2.0 would take off...

"Will the highways on the internets become more few?" -Gee Dubya
V'Ger XT|Upgraded AT|Ultimate 386|Super VL/EISA 486|SMP VL/EISA Pentium

Reply 41 of 48, by Deunan

Rank l33t
fsinan wrote on 2025-05-01, 09:32:

A rumour that is not mentioned in this analysis is that Cyrix has slower internal cache speeds, equal to the bus speed, as opposed to Intel.

I've dug up some CACHECHK results I did for Cyrix and Intel, both at 66 (33x2) MHz. So here's what it looks like:

Cyrix at 66MHz:

       1    2    4    8   16   32   64  128  256  512 1024 2048 4096 <-- KB
0:    20   20   20   20   22   22   22   22   22   30   --   --   -- us/KB
1:    20   20   20   20   22   22   22   22   22   30   30   30   30 us/KB

Intel at 66MHz:

       1    2    4    8   16   32   64  128  256  512 1024 2048 4096 <-- KB
0:    16   16   16   16   24   24   24   24   24   32   --   --   -- us/KB
1:    16   16   16   16   24   24   24   24   24   32   32   32   32 us/KB

It would seem I misremembered L1 being faster on Cyrix. It's actually the memory bus that's faster, and the L1 that's slower. So depending on the size of the code/data, Cyrix might end up slower or faster. Doom, which was apparently written to make good use of the 486 cache, ends up being a tad faster on Intel: 1085 tics vs 1177 (Trident 8900D; I didn't test the Cyrix with a VLB card as it was not stable at 40MHz FSB). But SpeedSys gives Intel a 25.20 score while Cyrix gets 27.02.
Well, frankly the results are close enough not to care, and Cyrix was cheaper and (eventually) offered an 80MHz version, which runs hot and so has that nice green heatsink on most chips.

fsinan wrote on 2025-05-01, 09:32:

The FPU cannot be considered "slower" in every instance; in fact, it is called faster in this report's analysis. That could be wrong.

Yes, it's a pity they didn't post the actual results. The Whetstone benchmark is not that great, as it depends heavily on the compiler used. If the code was compiled for a 286/287-compatible path then Cyrix would probably end up on top, as in my own tests. And that's not a bad thing as such; you make the chips run fast for the code that's already out there. But as I've said, it really depends, and 387+-optimized code seems to be faster on Intel, at least for me.

Reply 42 of 48, by Anonymous Coward

Rank l33t++
Deunan wrote on 2025-05-01, 22:56:

It would seem I misremembered L1 being faster on Cyrix. It's actually the memory bus that's faster, and the L1 that's slower. So depending on the size of the code/data, Cyrix might end up slower or faster. Doom, which was apparently written to make good use of the 486 cache, ends up being a tad faster on Intel: 1085 tics vs 1177 (Trident 8900D; I didn't test the Cyrix with a VLB card as it was not stable at 40MHz FSB). But SpeedSys gives Intel a 25.20 score while Cyrix gets 27.02.
Well, frankly the results are close enough not to care, and Cyrix was cheaper and (eventually) offered an 80MHz version, which runs hot and so has that nice green heatsink on most chips.

There were at least two versions of the 80MHz part. The 5V was probably pretty toasty. The 3.3V one was probably okay.

"Will the highways on the internets become more few?" -Gee Dubya
V'Ger XT|Upgraded AT|Ultimate 386|Super VL/EISA 486|SMP VL/EISA Pentium

Reply 43 of 48, by fsinan

Rank Member
Anonymous Coward wrote on 2025-05-01, 23:48:
Deunan wrote on 2025-05-01, 22:56:

It would seem I misremembered L1 being faster on Cyrix. It's actually the memory bus that's faster, and the L1 that's slower. So depending on the size of the code/data, Cyrix might end up slower or faster. Doom, which was apparently written to make good use of the 486 cache, ends up being a tad faster on Intel: 1085 tics vs 1177 (Trident 8900D; I didn't test the Cyrix with a VLB card as it was not stable at 40MHz FSB). But SpeedSys gives Intel a 25.20 score while Cyrix gets 27.02.
Well, frankly the results are close enough not to care, and Cyrix was cheaper and (eventually) offered an 80MHz version, which runs hot and so has that nice green heatsink on most chips.

There were at least two versions of the 80MHz part. The 5V was probably pretty toasty. The 3.3V one was probably okay.

To add a note: mine is the normal 5V version with the green heatsink, but it works perfectly at the Cx486DX2-V80 edition setting, which is 4V. Silicon lottery, I think; this drops the temperature from 54-55°C to 41-42°C max.

System:1
Cyrix 5x86-120GP & X5-160ADZ
Lucky Star LS-486E
System:2
Intel DX4-WB & AMDDX4-120
PcChips M912 V1.7
System:3
AMD K6-2-475 & Cyrix 6x86MX PR-233
Asus P5A-B
System:4
UMC U5S-40
486UL-P101
System:5
P3 Coppermine 800EB
Gigabyte GA-6BX7

Reply 44 of 48, by fsinan

Rank Member
Deunan wrote on 2025-05-01, 22:56:
fsinan wrote on 2025-05-01, 09:32:

A rumour that is not mentioned in this analysis is that Cyrix has slower internal cache speeds, equal to the bus speed, as opposed to Intel.

I've dug up some CACHECHK results I did for Cyrix and Intel, both at 66 (33x2) MHz. So here's what it looks like:

Cyrix at 66MHz:

       1    2    4    8   16   32   64  128  256  512 1024 2048 4096 <-- KB
0:    20   20   20   20   22   22   22   22   22   30   --   --   -- us/KB
1:    20   20   20   20   22   22   22   22   22   30   30   30   30 us/KB

Intel at 66MHz:

       1    2    4    8   16   32   64  128  256  512 1024 2048 4096 <-- KB
0:    16   16   16   16   24   24   24   24   24   32   --   --   -- us/KB
1:    16   16   16   16   24   24   24   24   24   32   32   32   32 us/KB

It would seem I misremembered L1 being faster on Cyrix. It's actually the memory bus that's faster, and the L1 that's slower. So depending on the size of the code/data, Cyrix might end up slower or faster. Doom, which was apparently written to make good use of the 486 cache, ends up being a tad faster on Intel: 1085 tics vs 1177 (Trident 8900D; I didn't test the Cyrix with a VLB card as it was not stable at 40MHz FSB). But SpeedSys gives Intel a 25.20 score while Cyrix gets 27.02.
Well, frankly the results are close enough not to care, and Cyrix was cheaper and (eventually) offered an 80MHz version, which runs hot and so has that nice green heatsink on most chips.

fsinan wrote on 2025-05-01, 09:32:

The FPU cannot be considered "slower" in every instance; in fact, it is called faster in this report's analysis. That could be wrong.

Yes, it's a pity they didn't post the actual results. The Whetstone benchmark is not that great, as it depends heavily on the compiler used. If the code was compiled for a 286/287-compatible path then Cyrix would probably end up on top, as in my own tests. And that's not a bad thing as such; you make the chips run fast for the code that's already out there. But as I've said, it really depends, and 387+-optimized code seems to be faster on Intel, at least for me.

My results confirm this; take a look at the SpeedSys results for the Cx486DX2-80 vs the AMD 486DX2-80 with the same board and the same memory settings on the M912 V1.7 board.

Cyrix is considerably slower in L1 cache speed but a little faster for L2 and memory access, though the differences are very small. AMD, being a copy of Intel, is far better at L1 cache speed: 58 MB/s vs 63 MB/s.

System:1
Cyrix 5x86-120GP & X5-160ADZ
Lucky Star LS-486E
System:2
Intel DX4-WB & AMDDX4-120
PcChips M912 V1.7
System:3
AMD K6-2-475 & Cyrix 6x86MX PR-233
Asus P5A-B
System:4
UMC U5S-40
486UL-P101
System:5
P3 Coppermine 800EB
Gigabyte GA-6BX7

Reply 45 of 48, by BitWrangler

Rank l33t++

Interesting stuff.

Back in the day I was sticking the cheapest new CPUs, Cyrix DX2s, on the cheapest used 486 boards, 5V boards with only the 4-pin SX/DX jumper for CPU configuration, and so I thought that Cyrix were "always" 5-10% slower than Intel. In that situation it was highly unlikely the boards were Cyrix-aware, and they wouldn't have been using write-back. They probably weren't even that good as boards for an Intel DX2: first-gen 486 stuff, made for when the DX33 was the top end.

Anyway, I was somewhat surprised on the retro-revisit that Cyrix DX2s could sometimes be a tad faster, with a heaping tablespoon of "it depends". There also seemed to be situations in which the Intel 486 was "mysteriously" faster. By that I mean the performance delta between a previous CPU like the 386 and a subsequent CPU like the Pentium seemed to move.

There's a tendency to think of the i486 as a tidied-up 386 and the Pentium as a big leap forward, but what if it's not quite like that? In the Wikipedia article and other references you see the i486 described as single-pipe, tightly pipelined. However, a couple of months back I was reading a deep dive on the i486 that seemed to be saying that for some simple instructions, parallel execution was possible on the i486. This is very limited compared to the Pentium, and the contrived "best case scenario" for it may not have a whole heap of real-world uses. But in something like Quake, Pentium-optimised code, what if it's catching a dual-capable instruction pair every now and again? Not hitting them regularly like the dual-pipelined Pentium, but just every few instructions it gets a pair it can do. Then you can also imagine Intel "strongly encouraging" benchmarks to throw a few of these in, maybe smokescreening it as "necessary for future Pentium compatibility" but also giving the i486 a little boost.

OTOH, the Cyrix 486 may have been more of the expected "tidied and tightened up" 386, which had a visible step in its evolution in the 486SLC/DLC-class versions. Therefore their ALU may have been working at higher efficiency as a single unit, without the help of occasional parallelism, which would explain how, when they tuned it further and put two together in the dual-pipelined 6x86, it took a leap in integer performance past Intel's.

However, unless someone else knows what I'm talking about, I guess I'll have to retrace my adventures through the technical analyses and deep dives I read some months back and try to dig up where it was spelled out.

Edit: Not the thing I was looking for, but an interesting paper on x87 access differences that goes on to CPU/FPU parallelism in the 486: https://dl.acm.org/doi/pdf/10.1145/111048.111052

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 46 of 48, by mkarcher

Rank l33t
BitWrangler wrote on 2025-05-02, 13:39:

However, a couple of months back I was reading a deep dive on the i486 that seemed to be saying that for some simple instructions, parallel execution was possible on the i486.

That's not entirely wrong, but I consider it a stretch. The i486 still executes instructions sequentially, one after the other, in its single pipeline (putting x87 instructions aside for now). But the i486 does have a two-stage execution pipeline, and it can process the first pipeline step of a new instruction while it is still finishing the second step of the previous instruction, so this is kind of "executing two instructions in parallel". The first pipeline step calculates the address of a memory operand (if present), while the second pipeline step actually executes the instruction, using the address calculated by the first step. A consequence of this design is that you get a penalty if an instruction uses a register for address calculation that was modified by the directly preceding instruction. This is called the "address generation interlock" (AGI), at least in some 486 optimization guides.
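
A minimal sketch of the AGI case (register choice is just for illustration, assuming the usual 486 timing rules):

MOV EAX, ESI      ; hold on to the old pointer (just so the example does something)
MOV ESI, EBX      ; this instruction modifies ESI in its execution step
MOV ECX, [ESI+8]  ; the very next instruction needs ESI for address generation -> AGI stall

Put an unrelated instruction between the write to ESI and the load that addresses through it and the stall goes away.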

And thinking about the "address calculation step", you can in fact construct an example in which the 486 computes two additions in parallel: Something like

ADD EAX, [ESI]                 ; goes through the execution step (memory read plus add)
LEA EBX, [ECX + EDX*4 +123h]   ; its addition is done by the address-calculation step

might actually be able to perform the "address calculation", which here is used purely for its arithmetic effect, while the execution unit is still performing the preceding ADD, possibly even while the ADD instruction is stalled on the read from [ESI] if that one is a cache miss.

Reply 47 of 48, by mkarcher

Rank l33t
Deunan wrote on 2025-05-01, 09:04:

Quake is also a special case. It not only uses the FPU for math but also for moving data around, since it was coded for the Pentium and this chip could do 64-bit loads and stores, but only via the FPU, since the integer registers were still 32-bit (I guess various memory-combining tricks were not yet advanced enough to make up that difference).

"Citation required". The Quake source code has been published by id Software by now, and the use of the Pentium FPU to copy data using 64-bit stores to video memory has been claimed by PC enthusiasts basically since Quake started existing, but the last time I used that claim I got challenged and was unable to find any kind of FPU data copying in Quake. OTOH, it is true that the Pentium does not yet support any kind of "USWC" (uncached speculative write combining), which is what all these "MTRR video performance optimizers" like FASTVID enable, so using the FPU was the only way to generate a 64-bit cycle on the FSB; the integer core only generated 64-bit cycles when writing back a dirty cache line from L1. As video memory is uncached, you never got 64-bit cycles to video memory from the integer unit. And as PCI has multiplexed address/data lines and some delay for device selection, PCI relies heavily on bursting for good performance. It depends on the chipset whether it is able to combine multiple 32-bit FSB writes into a PCI burst cycle, or whether it only bursts to PCI video memory on a 64-bit store.

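For reference, this is the kind of FPU-assisted copy loop the claim refers to; a hypothetical sketch (labels and register use made up for illustration), not something found in the Quake sources:

COPY_LOOP:
    FILD  QWORD PTR [ESI]   ; load 8 bytes through the FPU as one 64-bit integer
    FISTP QWORD PTR [EDI]   ; store them again, producing a single 64-bit bus cycle
    ADD   ESI, 8
    ADD   EDI, 8
    DEC   ECX               ; ECX = number of 8-byte blocks to move
    JNZ   COPY_LOOP

FILD/FISTP rather than FLD/FSTP, so the data is treated as integers and cannot be altered by floating-point conversion.
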
I'm gonna search for the thread where I got challenged on the FPU copy claim, and will edit a link into this post when I find it: Got it!

Reply 48 of 48, by Deunan

Rank l33t

I admit I have not verified these claims against the Quake source; I based my comment on "common knowledge" from YT videos and the like. So if that's just another Internet rumour, I stand corrected. That being said, I gave the sources a look, and while Q_memcpy is only optimized for 32-bit bus cycles when the memory regions are aligned, the standard memcpy is also used a lot (including in vid_svgalib.c). In theory that could be coded to use 64-bit FPU transactions; it's a long shot, but a possibility. The sources do not contain such a C stdlib replacement.

math.asm (and other files) make heavy use of FXCH though. That's a very fast operation on the Pentium, and coupled with the pipelined FPU it can probably keep the FPU well busy with the FMUL/FXCH/FLD interleaving they use. So that's always going to be a special case and very Pentium-friendly code, while performing very differently on other FPUs with their own quirks.
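
A rough sketch of that kind of interleaving (made-up operands, not code taken from the Quake sources), assuming FXCH is effectively free when paired on the Pentium:

FLD  DWORD PTR [A0]    ; ST0 = a0
FMUL DWORD PTR [B0]    ; first multiply starts, result takes a few cycles
FLD  DWORD PTR [A1]    ; push the next operand while the first multiply is in flight
FMUL DWORD PTR [B1]    ; second multiply starts, overlapping the first
FXCH ST(1)             ; nearly free swap on Pentium: first result back on top
FSTP DWORD PTR [R0]    ; store first result (a0*b0)
FSTP DWORD PTR [R1]    ; store second result (a1*b1)

On a 486 or a Cyrix FPU the FXCH is not free, so the same sequence behaves quite differently.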

As for the ADD/LEA code example: first of all, ADD with a memory operand is 2 cycles, was that intended? The LEA too, due to the index register usage. In the end the front end can basically feed only one instruction into the execution unit per cycle, assuming no other issues (bus stalls, cache fetches, etc). So how is that different from several ADD reg,reg in sequence? Each would still take 1 cycle. Now, if the 486 could bypass one pipeline step and the ADD reg,reg could skip the address generation completely, that would allow the LEA addition to execute one clock faster, but then again both need to go through the final execution/retire step in order. So it's not even a latency win; either one pipeline step sits idle or we have to overcome the front-end and back-end bottlenecks somehow. So I don't really see it as useful - if it does happen that way at all.