VOGONS


Branch Prediction on the Cyrix 5x86 S1R3


Reply 20 of 36, by BitWrangler

Rank l33t++

Are you running the K6 patches in case it's a Windows timing loop thing?

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 21 of 36, by mkarcher

Rank l33t
feipoa wrote on 2025-12-02, 12:47:

If I'm remembering correctly, on the UUD motherboard, the FSB clock voltage swing decreases with increasing clock frequency. Not only does it decrease, but there is a positive voltage offset in the clock signal.

How was this determined? The "higher FSB clocks" are suspiciously close to the bandwidth limit of hobbyist scopes and their corresponding probes. A "60MHz bandwidth scope" will show a 60MHz signal at half the actual amplitude and very rounded, even if it is nearly square. I would be reluctant to trust the "low swing" statement unless I know for sure that it is not caused by the measurement equipment. On the other hand, bandwidth limitation should not cause DC offset. DC offset might be a symptom caused by clock asymmetry (high period longer than low period). If the clock low period is generally 4 ns shorter than the high period (which might be caused by different low-to-high and high-to-low propagation delays in a clock buffer), this would be 13 ns low / 17 ns high at 33 MHz, which likely provides sufficient high and low time, but 5.5 ns / 9.5 ns at 66 MHz, which no longer feels OK, as the duty cycle approaches 2:1.
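As a rough check of that arithmetic, here is a minimal Python sketch. It assumes an ideal clock whose low phase is a fixed 4 ns shorter than its high phase; the function name and the nominal 33.3/66.7 MHz values are illustrative, not measurements from the board.

```python
def clock_phases_ns(fsb_mhz, skew_ns=4.0):
    """Return (low_ns, high_ns, high_fraction) for a clock whose low phase
    is skew_ns shorter than its high phase (an assumed, fixed skew)."""
    period_ns = 1000.0 / fsb_mhz
    low_ns = (period_ns - skew_ns) / 2.0
    high_ns = (period_ns + skew_ns) / 2.0
    return low_ns, high_ns, high_ns / period_ns


for fsb in (100.0 / 3.0, 200.0 / 3.0):  # nominal 33 and 66 MHz
    low, high, frac = clock_phases_ns(fsb)
    print(f"{fsb:5.1f} MHz: low {low:4.1f} ns, high {high:4.1f} ns, "
          f"high fraction {frac:.2f}")
```

The high fraction matters for the offset question: the average level of a clock is roughly its swing times the fraction of time spent high, so a fixed skew raises the average more as the period shrinks.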

The processor and chipset datasheets should include specifications on the quality of the clock signal. Assuming "positive voltage offset" is a true effect and not just a measurement artifact, it sounds alarming, though. If anything on the FSB requires TTL level clocks, only the time periods where the clock is below 0.8V count as "guaranteed low", and the time below 1.5V is most likely treated as low. A positive offset might cause the "clock low period" to become too short for stable operation.
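To illustrate how an offset eats into the guaranteed-low time, here is a small sketch. Everything in it is an assumption for illustration: a trapezoidal clock with linear edges, a 3.3 V swing, 2 ns edges, and a 5.5 ns low time measured at mid-swing. None of these are measured values.

```python
def time_below_threshold_ns(low_mid_ns, edge_ns, v_low, v_high, v_thresh):
    """Time per cycle the clock spends below v_thresh.

    Assumed model: trapezoidal clock with linear rise and fall of edge_ns
    each, and low_mid_ns measured between the two mid-swing crossings.
    Valid for low_mid_ns >= edge_ns and v_thresh < v_high.
    """
    if v_thresh >= v_high:
        raise ValueError("threshold must lie below the high level")
    if v_thresh <= v_low:
        return 0.0  # the clock never drops below the threshold
    frac = (v_thresh - v_low) / (v_high - v_low)
    # Flat low time is low_mid_ns - edge_ns; each edge adds frac * edge_ns.
    return low_mid_ns - edge_ns + 2.0 * edge_ns * frac


SWING_V = 3.3  # hypothetical clock swing
for offset in (0.0, 0.3, 0.6, 0.9):
    t = time_below_threshold_ns(5.5, 2.0, offset, offset + SWING_V, 0.8)
    print(f"offset {offset:.1f} V: {t:.2f} ns below 0.8 V")
```

In this model the time below 0.8 V shrinks steadily with offset and drops to zero once the low level itself sits at or above 0.8 V.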

Getting to the root cause of this issue is likely very difficult. We don't know whether the issue is located inside the processor, or is related to interaction between the CPU and the chipset. It would be interesting to replicate the test on a SiS496 mainboard. If you still observe BTB failure in GLQuake starting at 50MHz, it likely is a design limitation of the CPU. If the limit is significantly different, I'm leaning towards FSB setup/hold timings being the issue.

Reply 22 of 36, by feipoa

Rank l33t++
mkarcher wrote on 2025-12-02, 18:47:
feipoa wrote on 2025-12-02, 12:47:

If I'm remembering correctly, on the UUD motherboard, the FSB clock voltage swing decreases with increasing clock frequency. Not only does it decrease, but there is a positive voltage offset in the clock signal.

How was this determined? The "higher FSB clocks" are suspiciously close to the bandwidth limit of hobbyist scopes and their corresponding probes. A "60MHz bandwidth scope" will show a 60MHz signal at half the actual amplitude and very rounded, even if it is nearly square. I would be reluctant to trust the "low swing" statement unless I know for sure that it is not caused by the measurement equipment. On the other hand, bandwidth limitation should not cause DC offset. DC offset might be a symptom caused by clock asymmetry (high period longer than low period). If the clock low period is generally 4 ns shorter than the high period (which might be caused by different low-to-high and high-to-low propagation delays in a clock buffer), this would be 13 ns low / 17 ns high at 33 MHz, which likely provides sufficient high and low time, but 5.5 ns / 9.5 ns at 66 MHz, which no longer feels OK, as the duty cycle approaches 2:1.

The processor and chipset datasheets should include specifications on the quality of the clock signal. Assuming "positive voltage offset" is a true effect and not just a measurement artifact, it sounds alarming, though. If anything on the FSB requires TTL level clocks, only the time periods where the clock is below 0.8V count as "guaranteed low", and the time below 1.5V is most likely treated as low. A positive offset might cause the "clock low period" to become too short for stable operation.

Getting to the root cause of this issue is likely very difficult. We don't know whether the issue is located inside the processor, or is related to interaction between the CPU and the chipset. It would be interesting to replicate the test on a SiS496 mainboard. If you still observe BTB failure in GLQuake starting at 50MHz, it likely is a design limitation of the CPU. If the limit is significantly different, I'm leaning towards FSB setup/hold timings being the issue.

I don't recall; I only briefly looked into this 15 years ago and would need to run the tests again. However, based on your comments, I feel there's little to gain from this effort. My next steps are to test 3 different PLLs of the same pinout to see if there's any improvement in 50-66 MHz FSB w/GLQuake and BTB. I should have MX8315, CM-something, and another UM9515. Following this, I would hack on my push-button DIP-14 SI5351 to serve as the clock signal for the CPU, ensure it works w/floppy, and check the FSB-BTB limit.

Depending on the outcome of the SI5351 tests, I was planning to test the FSB-BTB issue on an LSD board. The Lucky Star LS-486E rev.D (known locally as "LSD") is the only SiS 496 based board I know of which works well at 2x66, the caveat being that the floppy drive controller won't work at this FSB.

Plan your life wisely, you'll be dead before you know it.

Reply 23 of 36, by feipoa

Rank l33t++

I tested the MX8315PC, CMA8815(B), and two UM9515-01 chips with differing datecodes, but the BTB issue at 66 MHz remained unchanged.

Next, I worked on injecting a new clock for pin 8 of the UM9515. Below are some images of the PLL hack I incorporated on the MB-8433UUD. It required cutting off pin 8 from the UM9515 PLL, but conveniently all the pins from the new clock generator lined up perfectly. I have extra UM9515 chips so I did not mind sacrificing one. The original UM9515 is still used on pin 5 for the 24 MHz peripheral clock.

The attachment UM9515_adjustable_PLL_hack_1_.JPG is no longer available
The attachment UM9515_adjustable_PLL_hack_2.JPG is no longer available
The attachment UM9515_adjustable_PLL_hack_3.JPG is no longer available

CHKCPU reports a nearly correct CLKMUL and FSB, although the motherboard's POST screen showed a Cyrix 5x86 at 120 MHz. The actual FSB is 45.0 MHz and can be increased or decreased in 1 MHz steps.

The attachment UM9515_adjustable_PLL_hack_chkcpu.JPG is no longer available

If I can manage to get 2-1-1-1, 0/0 ws stable, then cachechk looks like this:

The attachment UM9515_adjustable_PLL_hack_cachechk.JPG is no longer available

It ran fine in DOS Quake with 256K at 2-1-1-1, 0/0 ws, and PCI = 45 MHz. DOS Quake ran at 19.2 fps. Booting to Windows, there was an error, so some timing isn't perfected - probably L2 cache. For now, I set L2 to 3-2-2-2 and EDO to 1/0 ws. Booted to Windows 95 with branch prediction and ran GLQuake without incident. Read/wrote to a floppy beautifully.

One drawback of 45 MHz PCI is that PIO-4 won't function. This is an issue I've noticed on some MB-8433UUD boards, even at 40 MHz. Half of my UUD boards are OK with PIO-4 at 40 MHz PCI, whereas the others need PIO-3. I'm currently using a board that needs PIO-3 at 40 MHz. On my list is to figure out why this discrepancy exists and how to correct for it.
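One possible way to think about the PIO limit is sketched below in Python. The 120 ns (PIO-4) and 180 ns (PIO-3) minimum cycle times come from the ATA specification. The rest is an assumption, not something confirmed for the UM8886BF: that the IDE cycle is programmed as a fixed number of PCI clocks chosen for a 33.3 MHz bus. The function names are hypothetical.

```python
import math

# Minimum PIO cycle times from the ATA specification, in nanoseconds.
PIO_MIN_CYCLE_NS = {3: 180.0, 4: 120.0}


def clocks_programmed(mode, design_pci_mhz=100.0 / 3.0):
    """PCI clocks per IDE cycle if timings were chosen for design_pci_mhz
    (assumed behaviour of the IDE controller)."""
    exact = PIO_MIN_CYCLE_NS[mode] * design_pci_mhz / 1000.0
    return math.ceil(exact - 1e-9)  # small tolerance against float round-up


def actual_cycle_ns(mode, pci_mhz):
    """Resulting IDE cycle time when the same clock count runs at pci_mhz."""
    return clocks_programmed(mode) * 1000.0 / pci_mhz


for pci in (100.0 / 3.0, 40.0, 45.0):
    for mode in (4, 3):
        print(f"PCI {pci:4.1f} MHz, PIO-{mode}: "
              f"{actual_cycle_ns(mode, pci):5.1f} ns per cycle")
```

Under this assumed model, a PIO-3 setting at 45 MHz PCI still gives a cycle of about 133 ns, which is longer than the 120 ns PIO-4 minimum, while a PIO-4 setting falls well below it. That would be consistent with PIO-3 working where PIO-4 does not, but it does not explain the board-to-board differences at 40 MHz.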

Another issue which exists at 45 MHz PCI on the UUD - Jake's Master & FIFO update for Windows 95 doesn't appear to work at 40+ MHz, at least not with my larger 50 GB partitions (XT-IDE).

Finally, it isn't clear if running the Voodoo2 at 45 MHz PCI is damaging. The alternative is to run PCI = 2/3 FSB = 30 MHz, albeit with a clock waveform whose high and low periods have different widths. I forget what issues this caused, but mkarcher documented this in another thread.

Next up is to determine optimal L2, DRAM, and PCI speeds with the 1024K module. If it requires 3-2-2-2 and 1/0 ws, then 2x66 is the preferred route, even if BTB doesn't function in select games. After this, I will pull out the LSD motherboard to check for the BTB-FSB dependency on the SiS 496 chipset.

Plan your life wisely, you'll be dead before you know it.

Reply 24 of 36, by MikeSG

Rank Oldbie
feipoa wrote on 2025-12-02, 12:47:

Unless we are thinking that sufficient noise on the 3.6 V rail is coupling into the FSB clock, I think the solution lies more readily in the FSB clock signal itself, rather than further clean-up of the 3.6 V rail.

You said it's stable at 66 MHz w/o branch prediction, so noise related to the FSB clock must be fine... it's only noise generated by the branch prediction function, combined with the furiousness of GLQuake.

If you could record the 3.6v rail right as it crashes then you could see for sure.

Reply 25 of 36, by feipoa

Rank l33t++

When you say 'record', you mean with a logic analyser? The possible solution you've proposed is to add 10 uF caps to the interposer and socket, or just the interposer? There's only 4 SMD spots on the interposer. It would be faster just to try the proposed remedy than for me to fiddle around with a logic analyser (I have one, but never used it).

Plan your life wisely, you'll be dead before you know it.

Reply 26 of 36, by MikeSG

Rank Oldbie

I don't have a logic analyser. I thought you might have something that records noise and 'zoom out' to a ~10 second length of time, then when it crashes pause it and take a photo...

10 to 20 in total, all 1uF ceramic low ESR. On the interposer and/or in the socket. Stacked vertically if no room...

Is it worthwhile to have branch prediction on at all? The 66 MHz FSB seems like the majority of the gain.

Reply 27 of 36, by feipoa

Rank l33t++

MikeSG, I'll have to revisit this issue in a bit. Something broke on the motherboard.

I had been testing to see how far I can push the Cyrix 5x86.

3x45 = 135 MHz, 3-1-1-1, 0/0 ws, BTB ----> OK
3x46 = 138 MHz, 3-1-1-1, 0/0 ws, BTB ----> OK
3x47 = 141 MHz, 3-1-1-1, 0/0 ws, BTB ----> Win95 errors

swap to CPU-A, retest 3x47 -----> fewer errors, but not quite stable
increase voltage to CPU-A -----> test runs OK for a while, then hangs.
Reboot system -----> Primary IDE port won't work.

Maybe I stressed the system too hard somewhere. I had been running PCI at only 47*2/3 = 31.3 MHz. PIO-4.

The Secondary IDE port works fine, just not the Primary. I followed the traces from the Primary IDE port header to two '245' bus transceiver chips. These traces then go to the UM8886BF Southbridge. I desoldered the '245' chips and soldered DIP sockets in their place. I tried several other '245' chips, but it didn't help.

Next, I replaced the UM8886BF Southbridge. The BIOS still won't detect the CF or HDD on the Primary port (Secondary still OK). It feels like the issue resides where the BIOS communicates with the Southbridge. I grabbed a fresh EEPROM and programmed it with the UUD BIOS. Then I performed an NVRAM reset, but the issue with Primary IDE detection was not resolved. I tested all the SMD resistors. They are fine.

Does the Super I/O or Northbridge facilitate communication between the BIOS and the IDE ports? Any ideas for what to check? I have extra UM8663BF (Super I/O) chips, but my extra Northbridge chips (UM8881F) are of the older variety without EDO or LINBRST support.

Plan your life wisely, you'll be dead before you know it.

Reply 28 of 36, by MikeSG

Rank Oldbie

If it was due to voltage increase, check all components with a low max voltage... what else is the Primary IDE related to...

Caps, resistors, coils should be fine. Is there an IDE 'fuse' on anything...

Reply 29 of 36, by feipoa

Rank l33t++

I forgot to update this. The culprit was a barely severed trace which ran under the motherboard's VRM. It was the IDE_CS0 trace. This trace goes to pin 45 on the 8886BF, which appears to be labelled as IDE1FX (AS0#*). The cut wasn't readily visible unless the VRM was removed. I only found it after tracing out all the IDE pins.

How did the trace get cut? I must have been too rough with the jumper cable I use for measuring the VRM's voltage. I re-tinned the trace, and all is back to normal.

The attachment IDE_CS0_trace_cut.JPG is no longer available
The attachment IDE_CS0_trace_re-tin.JPG is no longer available

As far as running the Cyrix at intermediate bus frequencies, 47 MHz was the last stable option. For peace of mind, it might be wise to use 46 MHz instead. These were benchtop tests, not in-chassis tests.

Option A)
3x47 = 141 MHz w/BTB
PCI = 2/3*47 = 31.3 MHz
PIO-4
ISA = 10.4 MHz
L2/DRAM = 3-1-1-1, 0/0 ws
CPU A at 3.70 V
Windows ESDI mod works, meaning IDE = 11.31 MB/s (multi-sector w/FIFO)
DOS Quake = 20.1 fps
GLQuake = 26.1 fps
Outlaws = 14.4 fps

Option B)
3x47 = 141 MHz w/BTB
PCI = 1*47 = 47 MHz
PIO-3
ISA = 11.75 MHz
L2/DRAM = 3-1-1-1, 0/0 ws
CPU A at 3.70 V
no Windows ESDI mod, meaning IDE = 6.08 MB/s (original ESDI file)
DOS Quake = 20.1 fps
GLQuake = 27.0 fps
Outlaws = 14.8 fps

Option C)
3x45 = 135 MHz w/BTB
PCI = 2/3*45 = 30.0 MHz
PIO-4
ISA = 10.0 MHz
L2/DRAM = 2-1-1-1, 0/0 ws
CPU A at 3.60 V or CPU B at 3.70 V
Windows ESDI mod works, meaning IDE = 10.84 MB/s (multi-sector w/FIFO)
DOS Quake = 20.0 fps
GLQuake = 25.6 fps
Outlaws = 15.0 fps

Option D)
2x66 = 133 MHz
DOS BTB probably OK, but need to keep a list of which Windows games don't work with BTB
PCI = 1/2*66 = 33.3 MHz
PIO-4
ISA = 11.1 MHz
L2/DRAM = 3-2-2-2, 1/0 ws
CPU A at 3.60 V or CPU B at 3.70 V
Windows ESDI mod works, meaning IDE = 13.27 MB/s (multi-sector w/FIFO)
DOS Quake = 19.8 fps
GLQuake = 25.5 fps (no BTB)
Outlaws = 14.7 fps

Unfortunately, the ESDI mod doesn't work when PCI >= 40 MHz.
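The clock arithmetic behind the four options can be tabulated with a short Python sketch. The FSB values, multipliers, and PCI ratios are taken from the list above; the ISA divisors (3 or 4) are an assumption inferred from the listed ISA speeds, and 2x66 is treated as a nominal 66.67 MHz FSB.

```python
def derived_clocks(fsb_mhz, multiplier, pci_ratio, isa_divisor):
    """Return (core, pci, isa) clocks in MHz for one configuration."""
    core = fsb_mhz * multiplier
    pci = fsb_mhz * pci_ratio
    isa = pci / isa_divisor  # divisor inferred from the listed ISA speeds
    return core, pci, isa


# option: (FSB MHz, CPU multiplier, PCI/FSB ratio, assumed ISA divisor)
OPTIONS = {
    "A": (47.0, 3, 2.0 / 3.0, 3),
    "B": (47.0, 3, 1.0, 4),
    "C": (45.0, 3, 2.0 / 3.0, 3),
    "D": (200.0 / 3.0, 2, 0.5, 3),
}

for name, cfg in OPTIONS.items():
    core, pci, isa = derived_clocks(*cfg)
    print(f"Option {name}: core {core:6.1f} MHz, PCI {pci:5.2f} MHz, "
          f"ISA {isa:5.2f} MHz")
```

Laid out this way, Option B is the only one that runs PCI well above the nominal 33 MHz; the other three keep PCI between 30 and 33.3 MHz.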

Which option would you go with?

Plan your life wisely, you'll be dead before you know it.

Reply 30 of 36, by feipoa

Rank l33t++
MikeSG wrote on 2025-12-02, 12:21:
feipoa wrote on 2025-12-02, 02:00:

For 10 uF 0805 ceramics, I have 16 volt pieces in my bin: X6J and X6S. There are 4 spaces on the interposer for 0805. You think replacing those four 100 nF ceramic caps will help with branch prediction at 50-66 MHz FSB? Should I also change the 100 uF tantalums? On the interposers now, some have 2x 100 nF and 2x 1 uF, others have 4x 100 nF.

I would try 10 to 20x 1uF capacitors, with one or two 100-220uF because this is what Pentiums have. They run at 50-66MHz with branch prediction, pipelining etc. This is only from a noise perspective.

Modern 1uF ceramics with low ESR can completely replace 100nf/0.1uF around the CPU, IMO.

The tallest I could go was three 0805 MLCC capacitors. I only had six 1 uF 0805 caps in my bin. As such, I went with 4 towers of 0.1 uF, 1 uF, 10 uF, for a total of 12 MLCC capacitors. In addition, there were 2 tantalums of: 100 uF, 75 m-ohm. Shown here:

The attachment Cyrix_QFP-PGA_5x86_S1R3_BTB_MLCC_towers_1.JPG is no longer available
The attachment Cyrix_QFP-PGA_5x86_S1R3_BTB_MLCC_towers_2.JPG is no longer available
The attachment Cyrix_QFP-PGA_5x86_S1R3_BTB_MLCC_towers_3.JPG is no longer available

Unfortunately, there was no improvement when using BTB at 66 MHz FSB in GLQuake; the system hangs hard around the 5 minute mark. With 45 MHz FSB, there is no issue. Before wrapping this up, I will get a setup going with a SiS496 based system at 2x66.
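For reference, the nominal capacitance of this arrangement adds up as in the sketch below, assuming each stacked tower is electrically a parallel combination (so the values simply sum). These are nominal figures only; the effective value of MLCCs under DC bias will be lower.

```python
# Nominal values in microfarads; each tower stacks one of each MLCC.
TOWER_UF = [0.1, 1.0, 10.0]
NUM_TOWERS = 4
TANTALUM_UF = 100.0
NUM_TANTALUMS = 2

mlcc_total_uf = NUM_TOWERS * sum(TOWER_UF)    # parallel capacitors add
bulk_total_uf = NUM_TANTALUMS * TANTALUM_UF
total_uf = mlcc_total_uf + bulk_total_uf

print(f"MLCC: {mlcc_total_uf:.1f} uF, tantalum: {bulk_total_uf:.1f} uF, "
      f"total: {total_uf:.1f} uF")
```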

Plan your life wisely, you'll be dead before you know it.

Reply 31 of 36, by feipoa

Rank l33t++

I wanted to show that the MLCC towers did add some benefit for noise reduction. I measured at the point of the MLCC with a low inductance probe.

Here's the towers:

The attachment MLCC_towers_noise_measurement_1.JPG is no longer available
The attachment MLCC_towers_noise_measurement_2.JPG is no longer available

And here's its twin with only four 0.1 uF ceramics and the same 100 uF tantalum:

The attachment MLCC_no_tower_noise_measurement_1.JPG is no longer available
The attachment MLCC_no_tower_noise_measurement_2.JPG is no longer available

The towers cut the noise in half, but didn't change the end objective. The noise level is already acceptable.
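Expressed in decibels, halving a noise amplitude is roughly a 6 dB reduction. A minimal sketch follows; the 2.0 and 1.0 inputs are placeholders standing in for "before" and "after" amplitudes in the same units, not readings taken from the scope captures.

```python
import math


def noise_change_db(before, after):
    """Change in noise amplitude, in dB (negative means a reduction)."""
    return 20.0 * math.log10(after / before)


print(f"{noise_change_db(2.0, 1.0):.2f} dB")  # amplitude cut in half
```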

Plan your life wisely, you'll be dead before you know it.

Reply 32 of 36, by feipoa

Rank l33t++

By way of comparison, this is what the noise on an IBM-produced ThinkPad 5x86c interposer looks like:

The attachment MLCC_Thinkpad_IBM_5x86c_noise_1.JPG is no longer available
The attachment MLCC_Thinkpad_IBM_5x86c_noise_2.JPG is no longer available
The attachment MLCC_Thinkpad_IBM_5x86c_noise_3.JPG is no longer available

I didn't measure the noise on the QFP208 leads. A short would be almost certain.

Plan your life wisely, you'll be dead before you know it.

Reply 33 of 36, by amadeus777999

Rank Oldbie

Great work!
Screenshots carry a strong "Dr. Frankenstein vs the incredible MB8433" vibe.

Reply 34 of 36, by bertrammatrix

Rank Member
feipoa wrote on 2025-12-21, 12:17:
MikeSG wrote on 2025-12-02, 12:21:
feipoa wrote on 2025-12-02, 02:00:

For 10 uF 0805 ceramics, I have 16 volt pieces in my bin: X6J and X6S. There are 4 spaces on the interposer for 0805. You think replacing those four 100 nF ceramic caps will help with branch prediction at 50-66 MHz FSB? Should I also change the 100 uF tantalums? On the interposers now, some have 2x 100 nF and 2x 1 uF, others have 4x 100 nF.

I would try 10 to 20x 1uF capacitors, with one or two 100-220uF because this is what Pentiums have. They run at 50-66MHz with branch prediction, pipelining etc. This is only from a noise perspective.

Modern 1uF ceramics with low ESR can completely replace 100nf/0.1uF around the CPU, IMO.

The tallest I could go was three 0805 MLCC capacitors. I only had six 1 uF 0805 caps in my bin. As such, I went with 4 towers of 0.1 uF, 1 uF, 10 uF, for a total of 12 MLCC capacitors. In addition, there were 2 tantalums of: 100 uF, 75 m-ohm. Shown here:

The attachment Cyrix_QFP-PGA_5x86_S1R3_BTB_MLCC_towers_1.JPG is no longer available
The attachment Cyrix_QFP-PGA_5x86_S1R3_BTB_MLCC_towers_2.JPG is no longer available
The attachment Cyrix_QFP-PGA_5x86_S1R3_BTB_MLCC_towers_3.JPG is no longer available

Unfortunately, there was no improvement when using BTB at 66 MHz FSB in GLQuake; the system hangs hard around the 5 minute mark. With 45 MHz FSB, there is no issue. Before wrapping this up, I will get a setup going with a SiS496 based system at 2x66.

You should try it on the M918 as well, since BobocoCz had good results with it. Sure, the cache performance isn't great, but it was still faster than any of my LS486s were in that regard with the Cyrix. I keep staring at mine; I wish I had one of those interposers so I could give my QFP 120 a whirl on it. Sadly, the seller I spoke of ended up ghosting our conversation.

Reply 35 of 36, by feipoa

Rank l33t++
bertrammatrix wrote on 2025-12-21, 23:58:

You should try it on the M918 as well, since BobocoCz had good results with it. Sure, the cache performance isn't great, but it was still faster than any of my LS486s were in that regard with the Cyrix.

Ya, I'll run it on the M918 before disassembling the towers. Or should I keep the towers? There's a risk of shorting against the heatsink, but I have put some Kapton tape on top of the towers to mitigate this. Regarding tests at 150 MHz, this is CPU-B, so it's not likely to reach 150 MHz. CPU-A is the stellar unit. CPU-B can usually do 141 MHz.

bertrammatrix wrote on 2025-12-21, 23:58:

I keep staring at mine; I wish I had one of those interposers so I could give my QFP 120 a whirl on it. Sadly, the seller I spoke of ended up ghosting our conversation.

I have an extra QFP208-PGA168 I can send you. It is missing the N/C pad, but that shouldn't matter any. Alternately, you can send me your CPU and I can solder it on for you. The primary disadvantage of this type of interposer is the height. In the photo below, you can see the pins on my secondary interposers are on a soldered socket, which adds to the seated height of the CPU. This added height will make it difficult or impossible to use a Z-clip, depending on how flexible your Z-clip is. A Z-clip is used for clipping on heatsinks to a socket 3, one which has the Z-clip tabs (not all of them do).

The attachment taller_QFP208_to_PGA168_interposer_1.JPG is no longer available
The attachment taller_QFP208_to_PGA168_interposer_2.JPG is no longer available

Plan your life wisely, you'll be dead before you know it.

Reply 36 of 36, by feipoa

Rank l33t++

I checked whether the FSB dependence of branch prediction (BTB) was motherboard specific. Recall that BTB worked in GLQuake on the UUD motherboard at 45 MHz FSB (3x45), but not at 50-66 MHz. I have set up a Lucky Star LS-486E Rev:D (aka "LSD") with mostly the same hardware used in the UUD board. This board:

The attachment LSD_60-66_MHz_FSB_test1.JPG is no longer available
The attachment LSD_60-66_MHz_FSB_test2.JPG is no longer available

Hardware common to the UUD and LSD motherboards was: Cyrix 5x86-120-CPU-B, 64 MB EDO stick, Matrox G200, Voodoo2-12 MB, and a 3C515TX ethernet card. On the LSD, I could not get ESS ES1868 sound working in Windows 95, so I had to use a different sound card (YMF-719 based). I am using a CF card on the LSD board, whereas on the UUD board, I used a mechanical HDD. The LSD used 8 ns 256K cache, whereas the UUD used 10 ns 1024K cache.

On the LSD, I could not get ISA sound to function when the FSB was set to 66 MHz. As such, I ran GLQuake w/out sound. With BTB enabled, GLQuake ran fine. I tested it for 1 hr. Next, I tested the LSD at 2x60 with sound working; GLQuake ran fine for an hour with BTB enabled.

These results suggest there is a motherboard, chipset, or BIOS (timing) factor which is limiting Branch Prediction from working on the UUD board at 50-66 MHz. Whatever this factor is, it was only apparent with FSB >=50 MHz. Considering that the time it takes GLQuake to crash at 50 MHz is much longer than at 66 MHz, does this point to a GLQuake-specific timing issue? I'm not sure how to further diagnose the issue without investing much time and effort with a logic analyser. I did fiddle with the BIOS timings quite a bit already.

I want to point out that Quake 2, on the other hand, worked well with BTB on the UUD board at 2x66. For Quake 2, it is FP_FAST which must be disabled (at any FSB). The game Turok also will not work with FP_FAST. Not having FP_FAST is a larger blow to performance in most Windows 3D games compared to BTB.

The inability to get ISA sound working on the LSD board at 66 MHz rules out its use as a replacement board for this system. I will either settle on 2x66 with a note to disable BTB before running GLQuake, or set the system to run at 3x45 or 3x46.

Plan your life wisely, you'll be dead before you know it.