Reply 60 of 108, by mrau
imho 100, so scaling could be judged better
Keep in mind that I do intend to remake the ultimate 486 benchmark thread by using more games, e.g. early 3D and whatever else I have in my folder waiting for testing. I would like to use the MB-8433UUD instead of the crippled M919 due to its PCI bus issue at 40 MHz. Currently, the 3D data for 40 MHz CPUs is deflated because the PCI bus is run at 27 MHz. I also intend to take all non-socket3 chips up to 133 MHz so that the results bleed more into next generation data.
I was under the impression that, on the M919, if you were to switch to the 40MHz bus from 33MHz after the system was booted, you could get around the PCI bus speed restriction.
I never tried it myself as I sold my M919 board before hearing about this, but maybe it's worth a try.
I finished the i486 DX4, however I don't feel content. I've decided to make this into a mini high-end socket 3 + accelerated Voodoo comparison. Will be comparing these:
Cyrix 5x86-120 with branch prediction
Cyrix 5x86-100 with branch prediction
Pentium Overdrive 83
Pentium Overdrive 100
4 chips to go.
Very nice - finally some new data on interesting hardware. Looking forward to the results.
Dang feipoa, you are nailing this! I have a project in the works that I will be sharing details for soon, I'd love to see what you have to add on it. Soon... 🤣
Ooooo, you have me intrigued, yet at the same time, I don't want to know so that we can maintain the suspense!
Another chip which I think would be very relevant to add is the Am5x86-150. Mentally, you think it would make for a fast system, but in fact, the wait states at 50 MHz may need to be set to max, just like with 66 MHz. I'll have to do some testing to determine the fastest stable wait states for this configuration. Seems like every time I try to run at 150 MHz, I am disappointed.
You are free to perform your own tests and reach your own conclusions, so don't shit on others' results because you feel like they don't interest you. Every piece of data is important in the bigger picture.
Thanks for saying that.
thats what i'm thinking all the time 😁
hopefully he will not edit his posts now and exchange everything with a dot.
There will always be someone on this forum, or any forum, whose goal it is, aware or not, to antagonise others and to force their will upon others. This will always be the case anywhere you go in life. There were other antagonists on this forum from years past who have since disappeared. There will no doubt be more.
The best thing to do is to change the way you look at what others write. This is often easier said than done, but if you are able to subtract from the writing any ego, arrogance, anger, or inappropriate content, you are left with the underlying purpose of their message. In this case, it looks as if this individual did not care for results with an IBM 5x86c-133 and is mostly interested in absolute frame rates. While my only intent with this post was to see what percent faster a Voodoo2 might be compared to a Voodoo1 on a fast socket 3, I do not mind making this post into a 3D-accelerated CPU comparison.
I suspect some of this individual's anger stemmed from disappointment in the results. I think I recall reading a post in which he stated that the Voodoo2 would not improve upon a Voodoo1 on such a slow platform. While not the intent, I suspect he may have taken the results as a challenge to his authority. Perhaps the 13% increase in speed with the Voodoo2 was too much. Internally, some people cannot handle being wrong and will twist things to provide sufficient doubt in the results. I suspect something similar has transpired here. I also thought the Voodoo2 would not be 13% faster than a Voodoo1 on a socket 3. 5% at best. I also suspect his disapproval was due to the results being done on a platform which he does not possess.
Once I release the results, I'm sure there will be something else wrong.
The Pentium Overdrive 100 came out on top of everything, with the exception of one game. This was entirely expected. The results of the top CPUs are pretty much similar in games. You will see how 40 MHz FSB CPUs really take full advantage of the 40 MHz PCI bus speed, to the point that it largely makes up for lack of core speed. I've drawn up charts for each game, as well as made two charts of averages - one chart for the average of all games (per CPU) normalised to the P100, and another chart for the average of the raw fps (not normalised). Normalised results are more telling for relative performance. Within the next few days, I'll try to analyse the results and offer a write up.
It would be nice if I could figure out exactly what the minimum requirements are from Vogons so as to not have the charts shrunk. I had this figured out with the old version of the Vogons website, but not the new one. Concerning this, I sent a message to qbix years ago but got no response. It is probably some product of image resolution, file size, and maybe colour depth.
As it seems absolute frame rates were of greatest interest, I am only using the Voodoo2 card for this comparison. For Voodoo 1 vs Voodoo 2, I do not expect the 11-13% difference to deviate much (or any?) from an IBM 5x86c-133 to an Am5x86-160. To estimate the Voodoo1 results, subtract the percent per game tabulated previously.
Enclosed above are the gaming results for select high-end socket 3 CPUs, tested in 3dfx glide mode with a Voodoo2 in Windows95. The Voodoo2 was selected over the Voodoo1 because it was about 12% faster on this platform. For all but one minor exception, the Pentium Overdrive clocked at 100 MHz came out on top by a landslide. The next closest competition, the IBM 5x86c clocked at 133 MHz, would need to increase its performance by about 15% to reach the average performance of the POD100. The POD83, with only a 33 MHz FSB and PCI bus, drops the POD100's performance down 14%, which is right where it should be. By convention (normalisation), the POD100 has a socket 3 "Pentium rating" of 100, while the POD83 has a Pentium rating of about P84. The IBM 5x86c-133 falls in at P87, the Am5x86-160 at P83, and so forth.
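To make the normalisation arithmetic concrete, here is a minimal sketch using only the ratings quoted above (POD100 = 100 by convention). The function names are made up for illustration. It shows that "trails the POD100 by X%" and "needs a Y% speed-up to match the POD100" are two different numbers for the same rating.

```python
# Normalised "Pentium ratings" quoted in the post (POD100 = 100 by convention).
ratings = {"POD100": 100, "IBM 5x86c-133": 87, "POD83": 84, "Am5x86-160": 83}


def speedup_needed(rating, reference=100):
    """Percent increase a CPU would need to match the reference rating."""
    return (reference / rating - 1) * 100


def deficit(rating, reference=100):
    """Percent by which a CPU trails the reference rating."""
    return (1 - rating / reference) * 100


for cpu, r in ratings.items():
    print(f"{cpu:14s} P{r:<3d} trails by {deficit(r):4.1f}%, "
          f"needs +{speedup_needed(r):4.1f}% to match")
```

For the IBM 5x86c-133 at P87, the required speed-up works out to roughly 15%, which matches the figure given in the write-up.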
3D accelerated benchmark results make heavy use of floating-point operations, the graphics card hardware, and the PCI bus. This is largely contrary to the case of ordinary office, internet, and ALU-intensive applications of the mid-90's. For non-gaming use, it is expected that the performance of the Am5x86-160 would equate to or exceed that of the POD100. The speed preference for the Am5x86-160 in these applications is rather apparent from a qualitative perspective, that is to say, one could feel the effects in Windows by clicking around, opening apps, etc. It is even more apparent if you run the POD83, yet the 3D gaming performance of the POD83 bests the Am5x86-160. Perhaps such office oriented tasks do not take advantage of the superscalar nature of the Pentium?
In some games, e.g. Dark Forces 2 and Descent 1, the AMD Am5x86-160 came out ahead of the IBM 5x86c-133, while in most others the IBM 5x86c-133 prevailed. With a 66 MHz FSB, the IBM 5x86c-133/2x has a L1/L2/RAM read speed (cachechk) of 273/102/70 MB/s, while the AMD is only at 165/75/48 MB/s. How does the Am5x86-160 maintain a close second place in many of these games? The AMD has the advantage of a faster PCI bus speed, which generally doesn't matter much for office-based applications, but makes a sizeable impact in applications requiring the frames to change rapidly. It would be interesting to benchmark the Am5x86-160 with a 27 MHz PCI bus to witness the quantitative degradation in frame rate.
There are two games in particular which strongly favour the Pentium: GLQuake and GLHexen II. This is not surprising considering they use the same Quake engine and were optimised for Pentium architecture. The greatest performance divergence between the Am5x86-160 and IBM 5x86c-133 was also with these two games. In GLQuake, the IBM 5x86c-133/4x was about 9% faster than the Am5x86-160, and about 10% faster in GL Hexen2. Clock-for-clock and FSB-for-FSB, the Cyrix 5x86-133/4x comes out ahead of the Am5x86-133 by 22% and 38% in GLQuake and GL Hexen II, respectively. The benchmarks largely prefer the IBM 5x86-133/4x with these two games in particular, when compared to other games, so it seems likely that the Pentium optimisations used in the coding of the Quake engine also benefit the Cyrix 5x86 architecture over that of the AMD 5x86.
While not superscalar, the Cyrix 5x86 literature touts benefits which increase instructions per clock cycle, like a decoupled load/store unit to allow for out of order execution and running integer and floating-point operations in parallel. The Cyrix 5x86 also has a much stronger raw floating-point advantage over the Am5x86. I found it interesting that the Am5x86-160 beat the Cyrix 5x86-133/4x in all games except for GL Quake and GL Hexen II. Before this roundup of games, I had mostly compared these two CPUs with GLQuake, which seems to have skewed results in favour of the Cyrix. In GLQuake, for example, the Cyrix 5x86-133/4x was faster than the Am5x86-160 by 0.4%; and in GLHexen2 they were equal. In most other games, the Am5x86-160 was quite a bit faster than the Cyrix 5x86-133/4x, that is, by 11% in Descent1, 8% in Descent2, 2% in Tomb Raider Unfinished Business, 11% in Outlaws, 4% in Incoming, 5% in Forsaken, 14% in Dark Forces 2, 5% in Turok, 2% in Turok2, and 15% in Unreal. On average, the Am5x86-160, when run on a fully optimised system, is about 6% faster than the Cyrix 5x86-133/4x. It would be interesting to check this trend using the same games in software mode. It would also be interesting to run the tests again with the Cyrix 5x86-133/4x in a SiS 496 based motherboard and a Voodoo3 in hopes of shrinking that 6%.
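As a rough cross-check of the 6% figure, the per-game percentages quoted above can simply be averaged. This is only a sketch: it assumes a plain arithmetic mean of the quoted per-game differences, which is not necessarily how the charted averages were computed, and the dictionary name is made up for illustration.

```python
# Advantage of the Am5x86-160 over the Cyrix 5x86-133/4x, in percent, as quoted
# in the post. A negative value means the Cyrix was faster in that game.
am5x86_advantage = {
    "GLQuake": -0.4,
    "GLHexen2": 0.0,
    "Descent1": 11,
    "Descent2": 8,
    "Tomb Raider Unfinished Business": 2,
    "Outlaws": 11,
    "Incoming": 4,
    "Forsaken": 5,
    "Dark Forces 2": 14,
    "Turok": 5,
    "Turok2": 2,
    "Unreal": 15,
}

# Plain arithmetic mean over all twelve games.
mean_advantage = sum(am5x86_advantage.values()) / len(am5x86_advantage)
print(f"Mean Am5x86-160 advantage: {mean_advantage:.1f}%")
```

The mean lands a little above 6%, in line with the "about 6%" stated in the write-up.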
A Cyrix 5x86-133/4x has L1/L2/RAM read speeds of 272/93/50 MB/s, while an IBM 5x86c-133/2x has L1/L2/RAM read speeds of 273/102/70 MB/s. Doubling the FSB doesn't double memory throughput because, among other things, wait states are needed to ensure that the system remains stable. We get a 10% boost in L2 reads and a 40% increase in RAM reads. Determining the fewest RAM and cache wait states that still give a stable system is an art in itself. In the employed motherboard, a 1 ws memory read wait state was just stable. Increasing the cache to 1024K from 256K or the RAM from 64 MB to 128 MB would cause errors after sustained use unless running with 2 ws. My system originally had 1024K with 64 MB, but having to run the wait state at 2 ws hurt performance, so I dropped the cache down to 256K and haven't had a problem since. Having fewer wait states with less cache proved faster than more wait states with more cache. Unfortunately, the slowest wait state of 3-2-2-2 was needed for the L2 cache to achieve stability. The result was that the system with the IBM 5x86c-133/2x was about 4% faster than the Am5x86-160 and 10% faster than the Cyrix 5x86-133/4x.
By altering the memory read and write wait states in the system with an IBM 5x86c-133/2x, you can achieve similar memory read and write performance to that of the Cyrix 5x86-133/4x. Using cachechk as verification, I had to adjust my memory read/write wait states from 1ws/0ws to 3ws/2ws. In reality, the mark is between 3ws/2ws and 2ws/2ws, but 2.5ws is not possible. With 3ws/2ws, the Unreal results matched exactly that of the Cyrix 5x86-133/4x.
In the past, I had tested the Am5x86-150 with 1024K/128M, however with 256K/64M, I was able to get the system stable with a cache wait state of only 3-1-1-1. It would be fun to test the system with 10 ns cache (counterfeit) or 12 ns cache (real) to see if the wait states could be reduced to 2-1-1-1. Because of wait states, the L2 speed of the Am5x86-150 was the same as the Am5x86-160. The Am5x86-160 still has the benefit of the 40 MHz PCI bus, while the Am5x86-150 is running at 33 MHz. The result is that the Am5x86-160 is about 4% faster than the Am5x86-150.
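The L2 and RAM percentages above follow directly from the quoted cachechk read speeds. A minimal sketch of that arithmetic, with variable and function names made up for illustration:

```python
# cachechk read speeds in MB/s, as quoted in the post.
cyrix_133_4x = {"L1": 272, "L2": 93, "RAM": 50}   # 33 MHz FSB, 4x multiplier
ibm_133_2x = {"L1": 273, "L2": 102, "RAM": 70}    # 66 MHz FSB, 2x multiplier


def pct_gain(new, old):
    """Percent increase going from old to new."""
    return (new / old - 1) * 100


gains = {level: pct_gain(ibm_133_2x[level], cyrix_133_4x[level])
         for level in cyrix_133_4x}

for level, gain in gains.items():
    print(f"{level}: {gain:+.1f}%")
```

L1 is essentially unchanged since both chips run the core at 133 MHz, while L2 and RAM reads gain roughly 10% and 40%, far short of the 100% that a doubled FSB might suggest.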
Clock-for-clock, the Cyrix 5x86-133/4x was about 14.5% faster than the AMD 5x86-133 and the Cyrix 5x86-120 was about 13.5% faster than the Intel DX4-120. Nothing really exciting there as this was pretty much expected. For gaming, the Cyrix 5x86-133/4x about equaled the Cyrix 5x86-120, that is, assuming you are able to run the RAM and L2 cache without additional wait states. For some motherboards, this isn't the case and the Cyrix 5x86-133 will be faster than the Cyrix 5x86-120. Using double-banked cache as opposed to single-banked cache is a large factor in achieving low wait-state stability with an FSB faster than 33 MHz. Also, using the fewest RAM modules helps, as does using FPM over EDO, and sticking to 256K cache for borderline conditions. With the near equality of the Cyrix 5x86-120 and Cyrix 5x86-133 in games, finding a stepping 1, revision 3 (S1R3) Cyrix 5x86 will offer an additional boost because S1R3 chips are stable with the branch prediction feature in Windows, while S0R5 chips are only stable with branch prediction in DOS. The average gaming boost with branch prediction enabled is about 2%, and the maximum observable boost was 6% in Turok.
When considering using a POD83 for your vintage gaming system, there doesn't appear to be a lot of benefit compared to using an Am5x86-160. On average, the POD83 was about 1% faster. In some games, the Am5x86-160 is faster, in some they are equal, and in others the POD83 is faster. The largest deviation was in GLQuake and GLHexen2, in which the POD83 demonstrated a 30% advantage over the Am5x86-160. In games like Descent 2 and Tomb Raider, the Am5x86-160 was 15% faster than the POD83. There is a caveat though, a big one - many socket 3 motherboards do not work properly with the L1 cache of the Pentium Overdrive set in write-back mode. There is a 5-20% performance hit when the L1 cache is set to write-through mode, e.g. 5% in Quake (software mode), 12% in SuperPi, 22% in Winbench96's Graphics Winmark, etc. Taking 10% as an average, this reduces the speed to a normalised score of P75.
Some of the benchmark results have been recorded from instantaneous values taken at the same instant in each game, or tabulated as an average frame rate taken over a period of time. It is important to note that, for an outside viewer, these benchmark results are just numbers with limited meaning. To know what 28 fps feels like in GLQuake, for example, you must experience it yourself. In games, there are generally instances of slow down and speed ups in the frame rate which some might consider "unplayable", while others are not bothered. I thought Unreal was tolerable until I exited the spaceship and had to start firing on the enemy. On an IBM 5x86c-133/2x or Am5x86-160 system, I personally think that most people would find Descent 1&2, GLQuake, Tomb Raider, TR Unfinished Business, Outlaws, Forsaken, and Dark Forces 2 enjoyable. It might be less so with Turok and GL Hexen II, however I found that they played well enough and were enjoyable. For Incoming, Turok 2, and Unreal, I think a desperate teenager in the mid-90's would still force himself to play them if that was all he had, though I envision him slamming the keyboard with frustration from time to time.
This is my brief write-up. It is by no means perfect. I didn't spend as much time on it as I had hoped. There are certainly a lot of other comparisons which can be made. The normalised chart conveys the gist of the results. I have also attached the data in pdf format below.
Oh my, this is a lot to take in. Thank you so much for these, will definitely look further into them once I have more time!
huge - nice work!
kinda bugs me how descent 2 runs faster than descent - this never ran well on my box back then;
i understand that the voodoo absolutely required floating point representation of geometry and stuff?
the POD leaves everything behind, while i'm pretty sure the high end AMD and Cyrix/IBM stuff was faster in everyday tasks.
It would be fun to test the system with 10 ns cache (counterfeit) or 12 ns cache (real) to see if the wait states could be reduced to 2-1-1-1.
I doubt it will give noticeable boost if any at all.
Perhaps such office oriented tasks do not take advantage of the superscalar nature of the Pentium?
The uniqueness of the Pentium is that it's the only superscalar x86 CPU that does not have out-of-order-execution.
This means that instructions must be ordered correctly to make use of both pipelines at the same time (there are various 'pairing rules' for instructions. Firstly, obviously the second instruction must not be dependent on the first. Secondly, certain instructions can only be executed in the first pipeline, such as shifts).
With hand-optimized assembly or a good Pentium-optimizing compiler, you could get quite efficient use out of the second Pentium pipeline.
Running 'legacy' code (code optimized for a 486 or earlier), the use of the second pipeline is basically just dependent upon 'chance', and worst-case it is not being used at all.
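To illustrate why instruction order matters on an in-order dual-pipeline CPU, here is a toy model. It is a heavily simplified sketch and not the real Pentium pairing rules: it assumes every instruction takes one cycle, and that two adjacent instructions pair only if the second is not restricted to the first pipeline and neither reads nor overwrites the first one's result. The class, function, and register names are all made up for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str             # register written
    srcs: tuple           # registers read
    u_only: bool = False  # may only issue in the first pipe (e.g. shifts)


def can_pair(first, second):
    """Toy pairing check for two adjacent instructions."""
    if second.u_only:
        return False
    if first.dest in second.srcs:   # second needs first's result
        return False
    if first.dest == second.dest:   # both write the same register
        return False
    return True


def cycles_in_order(program):
    """Issue strictly in program order, at most two instructions per cycle."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2   # both pipes used this cycle
        else:
            i += 1   # second pipe sits idle
        cycles += 1
    return cycles


# Two independent dependency chains, written one after the other ("legacy" order).
chained = [
    Instr("a1", ("a0",)), Instr("a2", ("a1",)), Instr("a3", ("a2",)),
    Instr("b1", ("b0",)), Instr("b2", ("b1",)), Instr("b3", ("b2",)),
]
# The same work, interleaved by hand or by a Pentium-aware compiler.
interleaved = [
    Instr("a1", ("a0",)), Instr("b1", ("b0",)),
    Instr("a2", ("a1",)), Instr("b2", ("b1",)),
    Instr("a3", ("a2",)), Instr("b3", ("b2",)),
]

print(cycles_in_order(chained), cycles_in_order(interleaved))
```

In this toy model the chained ordering takes 5 cycles and the interleaved ordering takes 3, even though both contain exactly the same six instructions.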
With games and multimedia software, performance was always very important, so the developers would quickly adopt new CPUs and optimize their code for them. This was probably not that much of a priority for the average office task.
The Pentium Pro and newer aren't all that much faster than a Pentium on paper, but their ability to reorder instructions on-the-fly decouples them from the compiler output, and allows them to pair instructions dynamically at runtime, to make better use of the superscalar pipeline.
In extreme cases, optimized code can still be faster on a Pentium than on a Pentium Pro or Pentium II at the same clockspeed. But in the average case, the Pentium Pro/II will get better instruction throughput.
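The runtime reordering described above can be sketched with a similar toy model. This is only an illustration under loud assumptions: one-cycle instructions, unique destination registers (so no register renaming is needed), sources always produced earlier in program order, and a small fixed window of the oldest waiting instructions from which up to two ready ones are picked each cycle. Real out-of-order CPUs are far more involved, and the function and parameter names here are made up.

```python
def cycles_out_of_order(program, width=2, window=4):
    """Toy out-of-order issue. program is a list of (dest, srcs) tuples."""
    produced = {dest for dest, _ in program}  # registers written by the program
    done = set()            # results available from earlier cycles
    pending = list(program)  # instructions not yet issued, oldest first
    cycles = 0
    while pending:
        issued = []
        # Only the oldest `window` waiting instructions are visible.
        for instr in pending[:window]:
            _, srcs = instr
            # A source is ready if it is an initial value or already computed.
            ready = all(s in done or s not in produced for s in srcs)
            if ready and len(issued) < width:
                issued.append(instr)
        # Results become visible to dependents from the next cycle on.
        for instr in issued:
            pending.remove(instr)
            done.add(instr[0])
        cycles += 1
    return cycles


# Two independent chains in "legacy" order: all of chain a, then all of chain b.
chained = [
    ("a1", ("a0",)), ("a2", ("a1",)), ("a3", ("a2",)),
    ("b1", ("b0",)), ("b2", ("b1",)), ("b3", ("b2",)),
]

print(cycles_out_of_order(chained, window=4))  # window can see into chain b
print(cycles_out_of_order(chained, window=2))  # window too small to help
```

With a window of 4 the toy scheduler finishes the unoptimised ordering in 3 cycles, the same as hand-interleaved code on a strictly in-order dual-issue model, while a window of 2 falls back to 5 cycles because it cannot see far enough ahead to find independent work.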
That is pretty interesting. What I don't really understand is how the logic works to re-order the instructions. This logic is presumably hardware-based and on the CPU itself. I would imagine it is quite complex, but what's the crux of the algorithm? Is it like a little ASIC in the CPU die? Also, to reorder the instructions, the CPU must gather multiple instructions for the reordering. What exactly does this mean? Are there, for example, a half-dozen instructions laid out in the pipeline, probably in a shift register, which then get reordered within that shift register? And what ordering would be optimal? It is a fascinating subject that I wish I knew more about from a high-level perspective.