VOGONS


Branch Prediction on the Cyrix 5x86 S1R3


First post, by feipoa


I have been playing around with a set of four Cyrix 5x86-120 QFP CPUs of Stepping 1, Revision 3. I had two CPUs I thought would be good candidates for 2x66 MHz operation with branch prediction enabled. CPU-A excelled at 3.55 V, while CPU-B was solid at 3.70 V. With branch prediction enabled at 133 MHz, software like IE 5.5, K-Meleon-TLS, Acrobat, Office, Cambridge dictionary, etc., and Outlaws in D3D mode would run fine on these two CPUs.

However, with GLQuake, the system would freeze after about 5 minutes of auto-play running at 2x66 or 2x60 MHz. At 2x50 MHz, GLQuake would usually freeze at about 10 minutes into auto-play. I noticed that about 70% of the time, the game would hang up at around the same frame. If I disabled branch prediction, the game would not freeze at all.

For 2x66, PCI was run at 33 MHz. For 2x60, PCI = 30 MHz. For 2x50, PCI = 33 MHz. I tried increasing/decreasing CPU voltage but it did not change the outcome. I used conservative memory and cache timings.

On the other hand, at 3x33 or 3x40 MHz, GLQuake would not hang up with branch prediction enabled. Is anyone able to provide an explanation for these outcomes, particularly at 3x40 vs. 2x60? Does it make sense that GLQuake, in particular, would be sensitive to branch prediction at higher FSBs, yet not at lower FSBs?

On the testbed, I was using a Biostar MB-8433UUD and I tried versions 1, 2a, 2b, 3, and 3.1. I tested with 256K cache, 1024K cache, EDO, and FPM. I am running the CPUs with LSSER=off, FPU_FAST=on, DTE_E=on, MEM_BYP=on, BWRT=off, LINBRST=on. I tried 133 MHz with LINBRST, MEM_BYP, and DTE_E all set to disabled, but it did not change the outcome when branch prediction was enabled in GLQuake. I did not check with FPU_FAST off, or LSSER on, as doing so would negate the benefit of using BTB.

For the 2D card and D3D card, I was using a Matrox G200 16 MB. Sound was ESS 1868. Ethernet was 3C515TX with XT-IDE. I used the onboard IDE port with an 80 GB late-gen Maxtor HDD and a CD-ROM drive.

The attachment QFP_Cyrix_5x86-120_S1R3.JPG is no longer available

Plan your life wisely, you'll be dead before you know it.

Reply 1 of 36, by pshipkov


I don't own such processors, but your problems are quite the classic.
What happens if you physically remove the L2 cache chips?

retro bits and bytes | DOS media library

Reply 2 of 36, by feipoa

pshipkov wrote on 2025-11-30, 07:02:

I don't own such processors, but your problems are quite the classic.
What happens if you physically remove the L2 cache chips?

Physically removing the L2 cache changed nothing. I tested this a few days ago, but tested it again just now.

If FSB = 50/60/66 MHz, then branch prediction cannot be used w/GLQuake, regardless of the CPU's final operating frequency. Like you, I am wanting the issue to somehow be system related, but I have been unable to make this connection. All 4 CPUs have the same reaction to branch prediction at high FSB.


Reply 4 of 36, by mkarcher

feipoa wrote on 2025-11-30, 08:02:

If FSB = 50/60/66 MHz, then branch prediction cannot be used w/GLQuake, regardless of the CPU's final operating frequency. Like you, I am wanting the issue to somehow be system related, but I have been unable to make this connection. All 4 CPUs have the same reaction to branch prediction at high FSB.

By the way, thanks for reporting back that your 1MB cache module works.

If I am reading this thread correctly, you are using x2 at 50/60/66, but x3 at 33 and 40. So (disregarding the "why could it be that way?" part), the issue might also depend on the multiplier instead of the FSB clock. I'd recommend to test 2*40MHz to distinguish between a multiplier-related malfunction and an FSB-clock related malfunction.

Now to the "why" part: choosing a different multiplier will change the timing relations between the bus interface unit (or however you call that part on a 486) and the execution unit. Also, having branch prediction enabled will change the timing patterns of bus accesses by the execution unit. If a certain instruction sequence at a certain multiplier happens to generate a pattern the bus interface unit cannot reliably handle (I'm thinking about something like a cache line fill being complete exactly at the time that very cache line is determined to need replacement), this could explain a multiplier-dependent crash.

Reply 5 of 36, by MikeSG


I want to say lack of capacitance around the CPU... Pentiums had dozens of 1-2uF capacitors in the socket.

What could be a busier/noisier thing to do than 60/66 MHz branch prediction GLQuake?

Reply 6 of 36, by feipoa


pshipkov, I did try a peltier, but not to the extent that the CPU gets nearly 0 C. I tried an 8.2 W 23x23mm peltier, for which the CPU's surface temperature was around 17 C. Sadly, it did not help the situation.

mkarcher, that's an astute observation about the consistency of failure at 2x CLKMUL. I went ahead and tested 2x40 with branch prediction, but there was no crash in GLQuake. I ran it for about 35 minutes. Thus, whether at 2x or 3x CLKMUL, failure occurs when FSB >= 50 MHz and branch prediction is enabled.

Is there any way to circumvent this? I was hoping to keep 2x66 going, but things aren't looking too promising.

I was considering swapping the PLL's 14.31818 MHz crystal so I can get FSBs in the 45 MHz range. I did this experiment about 15 years ago and specifically recall that the diskette controller would not work properly if the crystal was higher than 15.36 MHz. I dug out my old notes, which say that I cannot have more than a 1.04 MHz increase or decrease in the nominal value of the crystal, that is, 14.318 +- 1.04 MHz. However, the notes also make reference to the keyboard controller having issues, with no mention of the diskette controller. Did I remember wrong? Perhaps the crystal tests are worth revisiting?

Looking in my bin, I have 15.36, 16.0, 16.257, and 16.384 MHz. Using the 40 MHz PLL setting (2.7937x), then:

16.0 MHz crystal -----> 44.7 MHz FSB -----> 134.1 MHz CPU
16.257 MHz crystal -----> 45.42 MHz FSB -----> 136.2 MHz CPU
16.384 MHz crystal -----> 45.77 MHz FSB -----> 137.3 MHz CPU
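
For anyone wanting to redo this arithmetic with other crystals, here is a minimal Python sketch. It assumes the 2.7937x ratio quoted above for the 40 MHz PLL setting and a 3x CPU multiplier (which is what the CPU column works out to); the function and constant names are made up for illustration.

```python
# Ratio of the PLL's "40 MHz" setting, as quoted in the post above.
PLL_RATIO_40MHZ_SETTING = 2.7937
# Assumption: the CPU column in the table corresponds to a 3x multiplier.
CPU_MULTIPLIER = 3


def fsb_and_cpu_mhz(crystal_mhz, pll_ratio=PLL_RATIO_40MHZ_SETTING,
                    multiplier=CPU_MULTIPLIER):
    """Return (FSB, CPU) clocks in MHz for a given reference crystal."""
    fsb = crystal_mhz * pll_ratio
    return fsb, fsb * multiplier


# The stock crystal first, then the crystals listed in the post.
for crystal in (14.31818, 15.36, 16.0, 16.257, 16.384):
    fsb, cpu = fsb_and_cpu_mhz(crystal)
    print(f"{crystal:9.5f} MHz crystal -> {fsb:5.2f} MHz FSB -> {cpu:6.1f} MHz CPU")
```

The stock 14.31818 MHz crystal comes out at almost exactly 40 MHz FSB with this ratio, which is a quick sanity check on the 2.7937x figure.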

Aside from the possible failure of the floppy controller, I have doubts that I can achieve 2-1-1-1 L2 timings at 45 MHz FSB with 1024K cache. If I must use 3-2-2-2 at 45 MHz, then it might be better to keep 2x66 without branch prediction.

Last edited by feipoa on 2025-12-02, 12:25. Edited 1 time in total.


Reply 7 of 36, by feipoa

MikeSG wrote on 2025-11-30, 10:47:

I want to say lack of capacitance around the CPU... Pentiums had dozens of 1-2uF capacitors in the socket.

What could be a busier/noisier thing to do than 60/66 MHz branch prediction GLQuake?

I count 4 ceramic caps inside the Socket 3, and there are another 2 ceramics on the outside which connect to the 3.6 V rail. There are also 4x 100uF on the outside of the socket. The interposer itself has either 4x 100nF or 4x 1uF, plus 2x 100uF tantalum. Is this deficient enough to make branch prediction fail at 50-66 MHz?

If yes, then there's ample space inside the socket for another 11 ceramic caps, see:

The attachment UUD_socket_caps.JPG is no longer available


Reply 8 of 36, by MikeSG


Capacitors are an option if nothing else works.

I once changed the capacitors on a DX4-100 ODPR to modern ceramic and the speedsys score changed from ~148, to 198. No other test improved. This is on a 386 era motherboard with fewer capacitors around the CPU. On all 486 motherboards I tested, prior to the capacitor upgrade, it would report 198. The capacitors replaced were 3x 3.3uF on the CPU.

Modern capacitors are fairly low ESR, so they can make 1-3uF work in a wider range than before. IMO 0.1uF capacitors could be replaced with 1uF.

This is a socket 7 motherboard with 19 caps... Pentiums had more going on, but they did operate at 50-66Mhz FSB as well and you have to wonder about noise.

Reply 9 of 36, by Jasin Natael


Fascinating stuff. I only have one 5x86. It's a 100 MHz chip. I never played much with bus speeds on it.
I've also never tried enabling the branch prediction stuff. I should sometime.

Reply 10 of 36, by mkarcher

feipoa wrote on 2025-11-30, 11:19:

mkarcher, that's an astute observation about the consistency of failure at 2x CLKMUL. I went ahead and tested 2x40 with branch prediction, but there was no crash in GLQuake. I ran it for about 35 minutes. Thus, whether at 2x or 3x CLKMUL, failure occurs when FSB >= 50 MHz and branch prediction is enabled.

OK, so any theory based on "2x multiplier in GLQuake causes internal operation patterns that fail" seems to be wrong, and the FSB clock being the primary issue is confirmed. I wonder whether enabling branch prediction may cause a certain type of bus request that is internally "late". On the 486 frontside bus, the rising edge of the clock signal is meant to "kick off some operations that may depend on the current state of the bus signals". The operation started by the rising edge of the clock then may cause output to the bus, which will appear a certain number of nanoseconds after the operation was kicked off by the rising clock edge. This output is meant to reach the target component on the bus in time before the next rising edge. So as soon as the processing time, the propagation time and the time required to stabilize the receiver add up to more than a clock period, the system will get unstable.

My suspicion is that with branch prediction enabled, the bus signals sometimes stabilize later than without branch prediction (maybe a hastily initiated code fetch when the prediction turns out false?), and this pushes the processing time over the edge so that the sum of processing time, propagation time and receiving time exceeds the clock period. If this is the correct explanation, I'm afraid there is nothing you can easily do to improve things.
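
To put rough numbers on that budget argument, here is a minimal Python sketch. The clock periods are plain arithmetic, but every delay value in it is HYPOTHETICAL: nothing here was measured, and the figures were picked only so that the "late" path fits at 45.8 MHz and misses at 50 MHz, while the normal path still fits at 66 MHz, matching what has been reported in this thread.

```python
def clock_period_ns(fsb_mhz):
    """Length of one bus clock period in nanoseconds."""
    return 1000.0 / fsb_mhz


def timing_margin_ns(fsb_mhz, processing_ns, propagation_ns, receiver_setup_ns):
    """Slack left in one clock period; negative means the budget is exceeded."""
    used = processing_ns + propagation_ns + receiver_setup_ns
    return clock_period_ns(fsb_mhz) - used


# HYPOTHETICAL delays, chosen only to illustrate the argument (not measured).
NORMAL = dict(processing_ns=7.0, propagation_ns=2.0, receiver_setup_ns=5.0)
# Same path, but with the output stabilizing 7 ns later (e.g. a hasty code fetch).
LATE = dict(processing_ns=14.0, propagation_ns=2.0, receiver_setup_ns=5.0)

for fsb in (33.3, 40.0, 44.7, 45.8, 50.0, 60.0, 66.7):
    print(f"{fsb:5.1f} MHz: period {clock_period_ns(fsb):5.1f} ns, "
          f"normal margin {timing_margin_ns(fsb, **NORMAL):+5.1f} ns, "
          f"late margin {timing_margin_ns(fsb, **LATE):+5.1f} ns")
```

Under this toy model the whole budget is 20 ns at 50 MHz and about 21.8 ns at 45.8 MHz, so a late path would only need to land inside a window of roughly 2 ns to produce a failure boundary between those two bus speeds.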

As you said, the system will freeze at a similar frame most of the time, which indicates that there seems to be a pattern to the crashes. Furthermore, you observed that it takes some time until a crash happens. If you reboot immediately after a crash, will it run for the same time again, or will it crash again within a minute or so? The reason I'm asking is to find out whether the "5 minutes till crash" is just where the crash statistically tends to land, or whether the crash is actually related to thermal effects. In particular, I'm considering that the north bridge (UM8881) temperature may influence the time required by the chip to properly receive signals from the Cx5x86. If the time-to-crash is longer on a cold system than on a hot system, consider adding a heatsink to the UM8881 (if it gets warm).

feipoa wrote on 2025-11-30, 11:19:

However, the notes also draw reference to the keyboard controller having issues, no mention of the diskette controller. Did I remember wrong? Perhaps the crystal tests are worth revisiting?

It is quite likely that changing the crystal frequency will interfere with floppy controller operation of the onboard Super-I/O chip. A classic floppy controller used an 8MHz clock for DD disks (250kBit/s), a 9.6MHz clock for the special case of 360K floppies in a 1.2M drive, and a 16MHz clock for HD disks. More modern floppy controllers are able to synthesize the clock from a 24MHz input signal. The UM9515 clock synthesizer chip on your board has a 24MHz output for that very purpose ("peripheral clock", PCK on pin 5). I just traced it, and that output connects via R34 to pin 100 of the UM8663B Super I/O chip. If I remember correctly, that 24MHz input is not only used to clock the FDC, but also to clock the serial ports. So if you "adjust" the 14.318MHz crystal, you will lose the ability to write floppy disks at the correct data rate (reading will work with wider variation, because the floppy controller chip can "capture" the actual data rate from the signal read by the drive if it is close enough to the expected rate), and you will lose the serial ports. Both can be avoided if you inject 24MHz at pin 5 of the UM9515 socket instead of using pin 5 of that chip.

Also, pin 4 of the UM9515 outputs 14.318 MHz, which is connected via R33 to pin 13 of U7 (a 74F08 fast quad AND gate, which buffers that signal), which will reach the southbridge on pin 148 and pin B30 ("OSC") on the ISA slots. Some cards may be picky about that frequency as well, e.g. ultra cheap VGA cards that run their clocksynth chip off the OSC line, and obviously the original IBM CGA. So for maximum compatibility, you would need to inject "correct" 14.318MHz at pin 4 of the UM9515 socket if you run that chip at a modified clock.

In the end, it might be easier to leave the 14.318 alone, and just inject a different processor clock into pin 8 of the UM9515 socket (the processor clock output). You might have a 4-pin oscillator can at 44.9MHz scavenged from some old VGA card.
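
To get a feel for how far off the floppy clock ends up with a swapped crystal, here is a minimal Python sketch. It assumes the UM9515 derives its 24 MHz peripheral clock from the crystal by a fixed ratio, so the output scales linearly with the crystal, and it uses the standard 500 kbit/s HD data rate as the example; the names are illustrative and nothing here was measured on the board.

```python
NOMINAL_CRYSTAL_MHZ = 14.31818  # stock reference crystal
NOMINAL_PCK_MHZ = 24.0          # UM9515 peripheral clock with the stock crystal
HD_DATA_RATE_KBIT = 500.0       # standard HD floppy data rate


def scale(nominal_value, crystal_mhz):
    """Scale a crystal-derived quantity linearly with the crystal frequency.

    Assumption: the synthesizer uses a fixed ratio for this output.
    """
    return nominal_value * crystal_mhz / NOMINAL_CRYSTAL_MHZ


for crystal in (14.31818, 15.36, 16.0, 16.257, 16.384):
    pck = scale(NOMINAL_PCK_MHZ, crystal)
    rate = scale(HD_DATA_RATE_KBIT, crystal)
    error_pct = (crystal / NOMINAL_CRYSTAL_MHZ - 1.0) * 100.0
    print(f"{crystal:9.5f} MHz crystal -> PCK {pck:5.2f} MHz, "
          f"HD write rate {rate:5.1f} kbit/s ({error_pct:+.1f}%)")
```

If the linear-scaling assumption holds, a 16.0 MHz crystal would push the write data rate more than 10% above nominal, which would fit the "reads work, writes don't" behaviour described above.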

The keyboard controller is either clocked from the ISA bus (and it should handle 8MHz just fine), or from the 14.318 MHz oscillator divided by 2 (i.e. 7.16MHz). I would not expect the keyboard controller to fail up to 11MHz ISA clock or 16 MHz reference clock.

Reply 11 of 36, by feipoa


Whether I run GLQuake straight from a cold state or from a warmed-up system, it did not change the time to crash. I tested this again now, and also added a heatsink to the UM8881F; however, it did not alter the time to crash.

Thanks for your comments. Now that you mention it, YES, when replacing the 14.3 MHz crystal with higher values, I could normally read from the floppy drive, but not write to it. As such, I think I'll avoid this trap again.

When it comes to lifting up either pin 5 or 8 of the UM9515 (in order to replace the CLK signal), which path would be more susceptible to the noise of an externally dangling device? Meaning, is it less noisy to lift pin 5 and have an external 24 MHz clock, or lift pin 8 and have an external 45 MHz clock? It seems to me that having an external dangle on the 24 MHz pin would be the least problematic from a noise and operational perspective.

I can generate 45 MHz with either:
386 Clock Generator Replacement
or
Re: Project: Full Can Clock Oscillator Replacement
or
https://www.ebay.ca/itm/365753917138

For 24.0 MHz (replacing UM9515 pin 5), I'd need to order something similar to:
https://www.ebay.ca/itm/365754020031
[the brand name of "GREAT ONE" doesn't instil a lot of confidence in such a purchase]


Reply 12 of 36, by ph4nt0m


What build/version of GLQuake do you use? It's open source and there are literally hundreds of them.

I have a Biostar MB-8433UUD, I just don't seem to like it enough to use it much. A decent scope would help very much in your case, but if I have to guess, it's either the SMD MLCC capacitors or a bug inside the UMC chipset. If you use an interposer, the capacitors inside the socket on the mainboard are of little importance. 486 boards usually make use of 100nF MLCCs, not sure about voltage rating or dielectric type. I routinely replace them with 10uF/10V X5R, which gives about 5uF effective when adjusted for DC bias. Makes a lot of difference anyway.

I would probably start with other 66MHz bus capable CPUs such as AMD 486/586 in 2x66 mode or Pentium OverDrive in 1x66 mode to rule out chipset issues.

My Active Sales on CPU-World

Reply 13 of 36, by feipoa


I'm using GLQuake (0.97) 1.09. Is there a sub version I should look out for? Do you recommend another version for this test? My mini GL version is 1.4, but I also tested with 1.46 and 1.1. Should I play with different mini GL versions? I have saved 1.0, 1.1, 1.4, 1.45, 1.46, 1.47, 1.48, 1.49.

The issue is only with branch prediction enabled at 50/60/66 MHz with a Cyrix 5x86. Am5x86 CPU's don't have branch prediction. I'm confused about how testing an Am5x86 at 66 MHz will provide insight for issues with Cx5x86 + branch prediction. The board runs fine at 2x66 MHz without branch prediction, or fine at 3x40 MHz with branch prediction.

I recall you had a Cyrix running at 2x66 with branch prediction. Would you be willing to test it at 2x66 w/branch prediction +Voodoo2 and let it run in loop for 30-60 minutes? If it runs well, could you relay which motherboard and GLQuake variant? Did you also have LSSER=off, FPU_Fast=on? Thanks!

I re-ran the tests with NT4 and Branch Prediction at 2x60, but it too hangs around the 5 min. mark.

For 10 uF 0805 ceramics, I have 16 volt pieces in my bin: X6J and X6S. There are 4 spaces on the interposer for 0805. Do you think replacing those four 100 nF ceramic caps will help with branch prediction at 50-66 MHz FSB? Should I also change the 100 uF tantalums? On the interposer now, some have 2x 100nF and 2x 1uF, others have 4x 100nF.

Last edited by feipoa on 2025-12-02, 04:45. Edited 1 time in total.


Reply 14 of 36, by BitWrangler


Funny, I always regarded Quake as a bit insensitive to overclock glitches, due to numerous configurations of 486 through socket 7 class and beyond "at the limit" for a given board, tending to run it for hours if cooling was adequate, and wig out on doom, windows, or something else. So I never held it in high regard as a stability or burn in test.

As far as I recall, I had branch prediction on for quake with no probs on my umc chipset board with early step 100GP at 2x60... and as mentioned, doom and windows didn't like that setting. Think I was on DOS 6.20 at the time.

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 15 of 36, by feipoa

BitWrangler wrote on 2025-12-02, 04:43:

Funny, I always regarded Quake as a bit insensitive to overclock glitches, due to numerous configurations of 486 through socket 7 class and beyond "at the limit" for a given board, tending to run it for hours if cooling was adequate, and wig out on doom, windows, or something else. So I never held it in high regard as a stability or burn in test.

As far as I recall, I had branch prediction on for quake with no probs on my umc chipset board with early step 100GP at 2x60... and as mentioned, doom and windows didn't like that setting. Think I was on DOS 6.20 at the time.

The FSB-dependency of Branch Prediction is only evident in Windows at 50-66 MHz FSB. I have no issue in DOS Quake at 2x66 w/branch prediction.

I'm not only testing Quake, but other standard software from 2005 and earlier. Outlaws in D3D mode had no issue with branch prediction at 2x66, so I wasn't expecting this hiccup in GLQuake. It might boil down to keeping a list of which Windows games don't like BTB. I recall some games don't like FP_FAST - I think it was Turok.


Reply 16 of 36, by feipoa


Quake2 runs fine at 2x66 with branch prediction in Win95. Branch prediction at 2x66, 2x60, and 2x50 still no go with GLQuake, but 3x40 is fine.

I will try to get something put together to try 3x45 with GLQuake and BTB. For a quick test, I think I will just swap the crystal. I don't need the floppy for a quick test.


Reply 17 of 36, by feipoa


deleted post made in error

Last edited by feipoa on 2025-12-02, 12:24. Edited 1 time in total.


Reply 18 of 36, by MikeSG

feipoa wrote on 2025-12-02, 02:00:

For 10 uF 0805 ceramics, I have 16 volt pieces in my bin: X6J and X6S. There are 4 spaces on the interposer for 0805. Do you think replacing those four 100 nF ceramic caps will help with branch prediction at 50-66 MHz FSB? Should I also change the 100 uF tantalums? On the interposer now, some have 2x 100nF and 2x 1uF, others have 4x 100nF.

I would try 10 to 20x 1uF capacitors, with one or two 100-220uF because this is what Pentiums have. They run at 50-66MHz with branch prediction, pipelining etc. This is only from a noise perspective.

Modern 1uF ceramics with low ESR can completely replace 100nf/0.1uF around the CPU, IMO.

Reply 19 of 36, by feipoa


I just ran the system with a 16.000 MHz crystal, such that the PLL output 44.7 MHz FSB. With the CPU running at 134 MHz, I had no issue with branch prediction using GLQuake in Windows 95. So the issue with branch prediction starts somewhere in the 45-50 MHz FSB range on this motherboard.

Unless we are thinking that sufficient noise on the 3.6 V rail is coupling into the FSB clock, I think the solution lies more readily in the FSB clock signal itself, rather than further clean-up of the 3.6 V rail.

If I'm remembering correctly, on the UUD motherboard, the FSB clock voltage swing decreases with increasing clock frequency. Not only does it decrease, but there is a positive voltage offset in the clock signal. Could this decrease in swing and/or y-axis voltage offset be the cause of these issues with branch prediction at 50+ MHz? If so, it seems to me that the solution would be to use stand-alone crystal oscillator cans rather than the built-in motherboard PLL.

EDIT: GLQuake with branch prediction also ran well with a 16.384 MHz crystal, that is, 45.8 MHz FSB and 137.3 MHz CPU. Unfortunately, the floppy drive doesn't work when there's not a 24 MHz output from pin 5 of the PLL (discussed previously).
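
For reference, the GLQuake + branch prediction results reported in this thread so far can be tabulated in a few lines of Python. This is only a bookkeeping sketch: the list contains just the combinations explicitly mentioned above (using the nominal FSB values as quoted), and the helper names are made up.

```python
# (multiplier, FSB in MHz, ran GLQuake with branch prediction without freezing?)
OBSERVATIONS = [
    (3, 33.0, True),
    (3, 40.0, True),
    (2, 40.0, True),
    (3, 44.7, True),   # 16.000 MHz crystal
    (3, 45.8, True),   # 16.384 MHz crystal
    (2, 50.0, False),
    (2, 60.0, False),
    (2, 66.0, False),
]


def threshold_bracket(observations):
    """Return (highest passing FSB, lowest failing FSB)."""
    highest_pass = max(fsb for _, fsb, ok in observations if ok)
    lowest_fail = min(fsb for _, fsb, ok in observations if not ok)
    return highest_pass, lowest_fail


def multiplier_alone_explains(observations):
    """True only if each multiplier value always gives the same outcome."""
    outcomes = {}
    for mult, _, ok in observations:
        outcomes.setdefault(mult, set()).add(ok)
    return all(len(seen) == 1 for seen in outcomes.values())


def fsb_alone_explains(observations):
    """True if every passing FSB is below every failing FSB."""
    highest_pass, lowest_fail = threshold_bracket(observations)
    return highest_pass < lowest_fail


print("Failure threshold lies between", threshold_bracket(OBSERVATIONS), "MHz FSB")
print("Multiplier alone explains results:", multiplier_alone_explains(OBSERVATIONS))
print("FSB alone explains results:", fsb_alone_explains(OBSERVATIONS))
```

The 2x40 run is what separates the two hypotheses: the 2x multiplier both passes (at 40 MHz) and fails (at 50 MHz and up), while a single FSB cut-off fits every data point.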
