First post, by feipoa
Cyrix 5x86 Register Enhancements Revealed
Did you ever wonder how much improvement those ‘special features’ of the Cyrix 5x86 processor gave? Well, in this study we will investigate the performance enhancing effect of each user-configurable register bit setting of the Cyrix 5x86 central processing unit. Thirty-six different benchmark tests were employed to determine the average probable performance enhancement for each feature independently. Many of these register settings, or features, are disabled from the factory by default as a means to increase compatibility with a broad range of motherboards. As the Cyrix 5x86 is a partially downscaled 6x86, many of the conclusions made herein will likely hold true for the Cyrix 6x86 processor.
For the overall performance gain, enabling BTB (branch prediction) showed an 8.5% improvement, LSSER (load/store reordering) showed an 8% improvement, FP FAST (fast FPU exception handling) showed a 6% improvement, and MEM_BYP (memory read bypassing) showed a 1.5% improvement. The other features tested (IORT, LINBRST, RSTK, BWRT, LOOP, DTE, and WT) showed little to no improvement.
This study adopts the suite of benchmarking software utilised in the Ultimate 486 Benchmark Comparison as a means to estimate the performance gain per Cyrix feature. These features were enabled/disabled using the IBM M9 Register Utility Version V1.22 (20 May 1996); other popular programs of the time were the Peter N. Moss Register Bit Enabler Version C2 (12 May 1996) and ET586 Version 1.1 (28 November 1995) by Evergreen Technologies. The IBM utility was chosen because of its graphical user interface in DOS and the ease of enabling bits. Since Windows 98SE was selected for performing Windows-based benchmarks, a Cyrix NT driver was not needed; features were enabled in DOS prior to booting into Windows 98SE. If special Cyrix 5x86 features are desired in Windows NT and Windows 2000, Evergreen Technologies created ET586NT, which is an NT driver that runs as a device automatically at start-up. Below is a table containing a map of various Cyrix 5x86 register bits (CR0 is not included).
This table represents a snapshot of which Cyrix 5x86 features are set by the motherboard as default. The bolded entries are features which were later enabled using the IBM utility, that is, LSSER is to be changed to 0, LOOP_EN to 1, RSTK_EN to 1, BWRT to 1, and FP_FAST to 1. When enabled, the entirety of these settings constitute My Default Settings as noted on the chart in Appendix 2 (column A) and below. These settings are considered optimal/stable settings on the employed motherboard, a Biostar MB8433-UUD v3.0 with a Cyrix/IBM 5x86c-100HF running at 133 MHz (2 x 66 MHz) and 3.85 V. This voltage was selected as stable on this particular motherboard, CPU, and cooling environment, however other CPU/motherboard combinations may require a different core voltage for thermal/frequency stability. This stable voltage is typically in the 3.65 – 3.85 V range for 133 MHz operation.
Note that LSSER is optimal when it is set to 0, not 1. A feature is typically said to be enabled when it is set to 1, and disabled when it is set to 0 (except for LSSER, which is the opposite). WT and IORT are also theoretically optimal when set to 0; these settings exist mainly for reasons of cross-platform stability.
BIT ENABLING PROGRAMS
Since most motherboard manufacturers did not enable the special features of the Cyrix 5x86 in the BIOS, bit enabling software programs were generally required. One manufacturer (PC Chips M919), however, allowed for two such features to be enabled in the BIOS (LINBRST and LSSER). While it is not well documented why more companies didn’t follow suit, it is likely due to time constraints and the fact that the 486 was considered a low-end, low-priority item by mid-1996. Unfortunately, even the latest BIOS update for the Biostar MB8433 UUD, dated May 1996, does not include a user adjustable enabler for Cyrix features.
While the IBM utility uses a GUI as a means to enable the special features, the Peter Moss utility uses on/off flags. For example, to enable BTB_EN and RSTK_EN, type 5x86.exe /BTB_EN=on /RSTK_EN=on. The Evergreen utility is a little more cumbersome to use since you need to type in HEX values for an entire register (8 bits, or features, per register). For example, for the performance control register (PCR0), if you only want LOOP and RSTK enabled (and the other bits of this register disabled), you’d need to type ET586.exe /PCR0=5, where 5 is a hexadecimal digit. From the above chart, 00000101 in binary representation is equal to 5 in hexadecimal. Even more confusing is that the Windows NT/2000 driver from Evergreen Technologies requires its values to be in decimal. It just so happens that, in this case, 5 in hexadecimal is also 5 in decimal, though this is not always the case. Screenshots of the various bit enabler programs are shown in Appendix 1.
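To make the hex/decimal pitfall concrete, here is a minimal Python sketch (not part of any of the utilities above) that shows one register byte in hexadecimal, binary, and decimal. The function name is made up for illustration, and the decimal form is only what the Evergreen NT/2000 driver is described above as expecting; check the driver's own documentation before relying on it.

```python
def register_forms(hex_text):
    """Return one 8-bit register value as hex, binary, and decimal.

    Hypothetical helper for illustration; it only converts number bases
    and never touches the CPU.
    """
    value = int(hex_text, 16)  # parse the text as a base-16 number
    if not 0 <= value <= 0xFF:
        raise ValueError("a configuration register holds only 8 bits")
    return {
        "hex": f"{value:02X}h",
        "binary": f"{value:08b}",
        "decimal": value,
    }


if __name__ == "__main__":
    # PCR0=5 is the example from the text; D6 is a second value in which
    # the hexadecimal and decimal spellings differ.
    for name, hex_text in (("PCR0", "5"), ("CCR2", "D6")):
        print(name, register_forms(hex_text))
```

A value of 5 reads the same in both bases, but D6 in hexadecimal is 214 in decimal, so typing the hex digits into a tool that expects decimal would set the wrong bits or be rejected outright.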
There exists a fourth bit enabling program called CyrixGo, or Free5x86, however it is very limited in which features it can enable.
The testing scheme used herein is such that all features known to be stable with the employed CPU/motherboard combination were enabled (referred to as DEFAULT SETTINGS in column A of Appendix 2, and My Test Settings in the Test Settings section of this report). Then a specific feature was turned off (i.e. LSSER = 1, or LOOP = 0, or RSTK = 0, etc.), and the decrease in benchmark scores was tabulated in an adjacent column (column B, LSSER). Before testing the next feature (column C, LOOP), the previous feature was re-enabled (LSSER set back to 0).
The reason for testing this way, that is, always having all of the most optimal features enabled except for the one being tested, is that it is unknown whether one feature will greatly alter the performance of another. Such a case was discovered with LOOP, which only seemed to have a noticeable performance enhancing effect when BTB was enabled; LOOP enabled on its own did not improve performance. It is also assumed that a user would want all optimal features enabled, unless one needs to be disabled for reasons of instability.
The charts in Appendices 2 & 3 list the effects each feature had on the indicated benchmark program, while the charts in Appendices 4 & 5 normalise the results to that of the optimal/stable default settings (column A). This is done so that a relative change in performance can be established. Columns B thru I are the most common known stable features and column A contains results when all these features are enabled. Conversely, column J shows the results when B thru I are all disabled. Columns K*, L*, and M* are feature configurations which contain a performance boost from the chosen DEFAULT SETTINGS, however they are not likely to be long-term stable in Windows. IORT (column N), which controls the I/O recovery time, is generally a setting controlled by the BIOS, however the longest possible recovery time of 128 clock cycles was set to determine if this setting had any effect on performance.
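The one-feature-at-a-time scheme can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the tooling used for the tests (the bits were toggled by hand with the IBM utility). The dictionary and function names are hypothetical, and the baseline is assumed to be the eight feature bits behind columns B thru I, with the values used as the stable defaults in this study.

```python
# Assumed baseline: every stable feature in its optimal state.
# Note that LSSER is optimal at 0, unlike the other bits.
BASELINE = {
    "LSSER": 0,
    "LOOP_EN": 1,
    "RSTK_EN": 1,
    "BWRT": 1,
    "LINBRST": 1,
    "FP_FAST": 1,
    "MEM_BYP": 1,
    "DTE_EN": 1,
}


def one_at_a_time(baseline):
    """Yield (feature, config) pairs with exactly one bit flipped.

    Each config starts from a fresh copy of the baseline, which mirrors
    re-enabling the previous feature before testing the next one.
    """
    for feature in baseline:
        config = dict(baseline)  # copy so the baseline is never modified
        config[feature] ^= 1     # flip only the bit under test
        yield feature, config


if __name__ == "__main__":
    for feature, config in one_at_a_time(BASELINE):
        print(f"test with {feature} = {config[feature]}:", config)
```

Flipping from the baseline, rather than enabling each feature on top of an all-off configuration, is what lets interactions between features show up in the results.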
While not a Cyrix-specific next generation feature, benchmark results were also tabulated for cases where the CPU’s L1 cache was placed into write-through mode, and with L1 cache entirely off (columns O and P, respectively). For these latter two features, it is important to remember that the DEFAULT SETTINGS of column A were still employed. For this to occur, for example as with setting the CPU into write-through mode, the CPU must still be set to write-back mode in the BIOS initially, then later changed to write-through mode in software, otherwise the next generation features (column A) did not have the same enhancing effects. That is to say, if you set your BIOS to L1 write-through mode, it was determined that later enabling the special features had less performance improvement than when setting L1 to write-through mode in software. To set the cache to write-through mode in software, it is first necessary to set LOCK NW = 0 then set CD = 0 and NW = 0.
Once all test results were normalised to DEFAULT SETTINGS, they were averaged for ALU- and FPU-specific tasks in terms of percent increases/decreases in performance. Some tests did not show any performance boost, however those too were equally averaged in. This method of performance characterisation has been termed Average probable boost as indicated on the bar graphs to follow. Since some tests showed a very large increase in performance while others showed little or no increase, the percent boost of the best test case is also included separately on the graphs and is termed Maximum observable boost. This difference is due to the fact that some CPU features enhance only specific instructions in the software code, while others do not. The performance boosts in the charts are ordered, or ranked, by their average probable boost. Both Windows and DOS results were grouped together; however Appendices 6 & 7 contain a DOS only section for those who are interested primarily in DOS performance.
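As a rough sketch of how the two summary figures relate, the Python below computes an average probable boost and a maximum observable boost from a pair of score lists. The scores are made-up placeholders, not data from the appendices. It also assumes that every benchmark is higher-is-better and that the boost is expressed as the percentage of the column A score lost when the feature is turned off; the exact formula used for the charts may differ.

```python
from statistics import mean


def percent_boosts(default_scores, feature_off_scores):
    """Per-test boost, as a percentage of the DEFAULT SETTINGS score."""
    if len(default_scores) != len(feature_off_scores):
        raise ValueError("both columns must cover the same tests")
    return [
        (with_feature - without_feature) / with_feature * 100.0
        for with_feature, without_feature in zip(default_scores, feature_off_scores)
    ]


def summarise(default_scores, feature_off_scores):
    """Average over all tests (including the ones showing no change) and best case."""
    boosts = percent_boosts(default_scores, feature_off_scores)
    return {
        "average_probable_boost": mean(boosts),
        "maximum_observable_boost": max(boosts),
    }


if __name__ == "__main__":
    # Placeholder scores for three imaginary tests (higher is better).
    column_a = [100.0, 50.0, 20.0]    # all stable features enabled
    column_off = [90.0, 50.0, 19.0]   # one feature disabled
    print(summarise(column_a, column_off))
```

The second placeholder test shows no change at all, yet it still counts towards the average, which is why the average probable boost can sit well below the maximum observable boost.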
Chart entries bolded in Appendices 2 & 3 indicate a change of greater than 2% from DEFAULT SETTINGS.
Biostar MB8433-UUD v3.0 Motherboard - UMC 8881F/8886BF, [BIOS: UUD960326S, 03/26/1996]
IBM 5x86C - 100HF at 133 MHz (Step 0, Rev 5), FSB = 66 MHz, CLKMUL = 2X, Vcore = 3.85 V, 1:1/2 FSB:PCI
64 MB Fast-page mode RAM (60 ns) [BIOS: 1WS/0WS]
512 KB Single-banked L2 SRAM Cache (15 ns), Write-back [BIOS: 3-2-2]
PCI Slot 1 = Adaptec 2940U2W PCI SCSI Controller w/Seagate ST373307LW Ultra320 Harddrive
PCI Slot 2 = 3Com 3c905C-TX-M, 10/100Base-TX (disabled in Windows)
PCI Slot 3 = Matrox Millennium G200 PCI Graphics Card, 16 MB SDRAM
ISA Slot 4 = Creative Labs AWE64 Gold, 28 MB RAM (CT4390)
*My Default Settings* - Cyrix 5x86 Register Bits
[PCR0=5h, CCR1=2h, CCR2=D6h, CCR3=1Ch, CCR4=38h, WBE (CD=0, NW=1)]
RSTK_EN = 1 Enables the return stack so that RET instructions will speculatively execute following a CALL. [1 is optimal]
BTB_EN = 0 Invokes the branch target buffer for instruction addresses, thereby inducing branch prediction. Not used. [1 is optimal]
LOOP_EN = 1 Enables the prefetch buffer loop for destination jumps still present in the prefetch buffer (prevents buffer flushing/reloading). [1 is optimal]
LSSER = 0 If set to 0, memory reads and writes to the load/store memory management unit can be reordered for optimum performance. [LSSER=0 is optimal]
WT1 = 1 Enables write-through in region 1 (640KB-1MB). Forces all writes to region 1 that hit the L1 cache to be sent to the external bus. [WT1=0 is optimal]
BWRT = 1 Enables the use of 16-byte burst write-back cycles. [1 is optimal]
LINBRST = 1 Enables a linear address sequence while performing burst cycles (as opposed to i486 "1+4" address sequencing). [1 is optimal]
FP_FAST = 1 Enables Fast FPU exception handling. [1 is optimal]
MEM_BYP = 1 Enables memory read bypassing so that data can be read from the write buffers prior to being written to external memory. [1 is optimal]
DTE_EN = 1 Enables the directory table entry cache. [1 is optimal]
IORT = 000 Specifies the minimum number of clock cycles between I/O accesses (I/O recovery time). [000 is optimal]
USE_WBAK = 1 Enables write-back L1 cache pins. [1 is optimal]
CD = 0, NW = 1 Enables write-back L1 cache. [01 is optimal]
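For anyone cross-checking the register values in the settings line against the bit map, the following small Python sketch lists which bit positions are set in each byte. It deliberately does not name the bits; the bit-to-feature mapping should be taken from the register table or the Cyrix documentation. The helper name is hypothetical, and bit 0 is taken to be the least significant bit.

```python
# Register bytes copied from the My Default Settings line.
DEFAULT_REGISTERS = {
    "PCR0": 0x05,
    "CCR1": 0x02,
    "CCR2": 0xD6,
    "CCR3": 0x1C,
    "CCR4": 0x38,
}


def set_bits(value):
    """Return the positions (0 = least significant) of the bits set in a byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("expected an 8-bit value")
    return [bit for bit in range(8) if (value >> bit) & 1]


if __name__ == "__main__":
    for name, value in DEFAULT_REGISTERS.items():
        print(f"{name} = {value:02X}h = {value:08b}b, bits set: {set_bits(value)}")
```

This is handy when using a tool that takes whole-register values, since a single mistyped digit changes several features at once.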
For more information on how these features work and for what type of code they enhance, please refer to the Cyrix 5x86 BIOS Writers Guide, the Cyrix 5x86 Microprocessor Guide, the Cyrix 6x86 BIOS Writers Guide, the Cyrix 6x86 Data Book, the Peter Moss Utility’s documentation, and register for an introductory course in computer architecture.
RESULTS - ALU
From the graph shown above, it is clear that BTB had the largest impact for ALU-focused processes, with a 22% boost. BTB, or branch prediction, on a Cyrix 5x86 is generally not considered a stable setting in Windows except possibly with Stepping 1, Revision 3 CPUs. To get BTB working on Stepping 1, Revision 3 CPUs, it is necessary to disable LOOP, BWRT, and possibly RSTK. The CPU used in this study was Stepping 0, Revision 5. It was possible to run the noted Windows benchmark tests with LOOP, BWRT, and RSTK disabled, however the only way to boot into Windows was to first boot into DOS, then type win at the command console to enter Windows. BTB appears stable in DOS with both revisions of the CPU. To date, no other CPU revisions have been encountered. The Cyrix 5x86-80 and 5x86-100 came in Stepping 1, Revision 3 editions, whereas Stepping 0, Revision 5 CPUs came in 100, 120, and 133 MHz flavours.
It was recently discovered that Windows NT4/98SE/2000 all initially appear usable with a Stepping 0, Revision 5 CPU and all Cyrix features enabled (including BTB) except for LOOP, RSTK, BWRT, and DTE. Stability, however, had the tendency to decrease as the CPU was run longer (and began to heat up). This effect may be more a consequence of running the CPU overclocked and above thermal spec for core voltage than of a broken feature. Some Cyrix features may be more frequency and/or thermal sensitive than others.
Referring to Appendix 6, we see that BTB did not have such a magnificent impact for DOS-only ALU tests; in DOS, the performance boost dropped to only 5%. An interesting point to note from the DOS-only ALU chart is that LOOP yielded a performance gain only when used in combination with BTB, thereby bumping the results up another 1%. Also surprising was that RSTK seemed to have no effect on its own. Unfortunately, LOOP and BTB together were not very stable. They may only work together in 16-bit mode since Windows would not boot with this setting and 3Dbench, Doom, Pcpbench, and Quake wouldn’t run. It may be that BTB and LOOP work well together with Stepping 1, Revision 3 CPUs, however this configuration hasn’t been tested.
Next in line for performance was LSSER at 7.4%. It was previously thought that LSSER needed to be disabled (set to 1) for motherboards which contained in-use PCI slots, however the author has had LSSER enabled (set to 0) for years with 3 filled PCI slots and hasn’t had issues. From personal experience, the Biostar MB8433-UUD and PC Chips M919 both functioned with LSSER enabled.
Surprisingly, setting the L1 cache scheme to write-back mode instead of write-through indicated only a 4% improvement for ALU performance, though some tests showed as much as a 12% improvement. It should be noted that all other enhancements were still enabled. If write-back L1 cache is disabled in the BIOS instead of through software, it may not be possible to fully enable other Cyrix special features (although they may appear to be enabled).
LINBRST, which is noted in the Cyrix literature as improving performance, only helped by an average of 0.5%. Your motherboard’s chipset must support linear burst write cycles to enable this feature, otherwise your system will crash upon enabling it. While both the Biostar and M919 support LINBRST, there seemed to be no real performance boost, unless perhaps this feature cannot be disabled in software once the BIOS has enabled it. If this is the case, the only way to test for it would be to use a comparable motherboard which does not support LINBRST, such as the Shuttle HOT-433. Only CPUMark32 in Windows caught a 4% improvement with LINBRST.
MEM BYP had the same ALU fate as LINBRST, with only a 0.5% improvement, however some tests weighed it in at 3.3%. FP FAST also didn’t do much for ALU operations. Surprising is that IORT, which was set to 128 clock cycle delays, didn’t seem to drop the performance much. It may be that the BIOS took over this setting without the possibility for intervention, however setting IORT to 128 clock cycles did drop the frame rate in DOOM by 5 fps.
WT, when enabled, sets only the memory region from 640 KB to 1 MB into write-through mode (as opposed to write-back mode). When set to 0, you get write-back mode in this small region of memory; doing so yields an extra 2 fps in Doom but causes Quake not to run. DTE and BWRT also had little to no effect on performance.
To summarise this section, if you can get BTB working, use it with a smile. LSSER is next in line for performance boost. Don’t be too bummed if you cannot get the other features working, though if you can, adding up all the effects of the weak performers adds another 1% of boost on top of LSSER. If all stable/optimal enhancements are enabled, you should see a 13.5% ALU performance boost (B – I), though some applications may see up to a 50% boost. If you can get BTB working, you may see another 8.5% of ALU boost. From the Ultimate 486 Benchmark Comparison, the combined total boost of these Cyrix 5x86 register features weighs the ALU of a Cyrix 5x86-133 in at about the level of an AMD X5-160, or a Pentium 100 with pipeline burst cache enabled. Not too shabby for a low-cost, low-power 486.
RESULTS - FPU
The floating-point units (FPU), or co-processors, of microprocessors are used extensively in high-performance games, simulation software, mp3 conversion, modeling, etc., so depending on your intended use of the processor, these results may be more (or less) important than the ALU results. Starting with BTB, we see that, on average, enabling this feature decreased FPU performance, but only in Windows. In DOS, it enhanced performance by about 2%. This performance drop wasn’t an isolated case, since WinTune98, Sandra99, and PassMark all indicated a drop in performance.
The clear leader for FPU performance enhancement was FP FAST, which boosted the average FPU test results by 18%. Next in line was LSSER with 11.4%. MEM BYP helped by about 1% on average, while all other features had a relatively low average probable boost. It may be naive to entirely dismiss the other features since some tests indicated significant improvement. MEM BYP, BWRT, and LINBRST all had about a 5% improvement for the best case test. Even BTB, whose average is negative, improved performance by 13% in some tests.
The entire ensemble of stable/optimal features improved FPU performance by about 22%, and from the Ultimate 486 Benchmark Comparison, a Cyrix 5x86-133 rates in at about a Pentium 90. Even an AMD X5 overclocked to 200 MHz was about 4 Pentium ratings (PR points) below the Cyrix 5x86-133 in FPU-related tasks.
RESULTS - OVERALL
The overall performance graph encompasses a much larger number of tests than either the ALU- or FPU-specific graphs, so the overall results are far more comprehensive. The above graph can speak largely for itself. In line with the conclusions made in the ALU and FPU sections, BTB, LSSER, FP FAST, and perhaps MEM BYP are the most important of all the Cyrix 5x86’s special features. The Symantec (Norton) Sysinfo benchmark program really seemed to think that MEM BYP was hot stuff; it boosted the results of this test by 18%, however the average probable boost was a mere 1.4%.
Considering that LSSER had a large impact on both ALU and FPU operations, it may be considered the most important Cyrix feature to enable. While BTB looks like a big contributor overall, it had less than half the performance of LSSER for DOS-only activities (not to mention its poor FPU performance). Rated second is a toss-up between BTB and FP FAST; it is hard to look past the whopping 18% improvement offered by FP FAST. In fourth place is MEM BYP, followed perhaps by LINBRST and BWRT.
The write-back caching scheme of the Cyrix 5x86 is charted mainly out of curiosity and is not a feature specific to Cyrix 5x86 processors. Both AMD and Intel employed write-back caching in their later 486 CPU revisions.
Looking now at the raw DOS gaming scores, LSSER and FP FAST showed the most improvement in Quake, gaining about 1.3 fps each, while BTB only improved Quake by 0.3 fps. With all stable Cyrix features (B – I) considered, we see about a 2.6 fps gain in Quake and about 2 fps gain in Doom. For 3Dbench, all stable Cyrix features (B – I) improved the score by 3%, whereas BTB alone improved the score by 4%.
A list of register settings I use for a variety of other CPUs can be found here: Register settings for various CPUs