Oh, I forgot to mention that all those numbers are indeed FPS (frames per second). 'i486' refers to the old code, 'Pentium' to the new code, and the percentage shows the comparison between the old and new code.
After some time without posting updates here, I’ve got something I think is quite interesting. I’ve been reviewing the span rendering code (R_DrawSpan...) for the backbuffered and VBE2 direct modes, and I managed to create a new optimized version specifically for Pentium processors.
On Pentium CPUs, the SHLD instruction is quite slow (4 cycles) and cannot pair with other instructions in either the U or V pipe. So I wrote a new version that replaces those instructions with combinations of simpler ones that can pair (MOV, SHL, AND, OR, etc.). Here are the results I’ve obtained:
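To illustrate the idea behind the replacement, here is a minimal C sketch of how a double-precision shift like SHLD can be expressed with plain shifts and an OR. This is not the actual assembly from the new span renderer (which isn't shown in this post); the function names `shld_reference` and `shld_emulated` are made up for the example, and the sketch only covers shift counts from 1 to 31.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models what SHLD dst, src, n leaves in dst,
 * by treating dst:src as one 64-bit value and keeping the 32 bits
 * that end up in the upper half after shifting left by n. */
uint32_t shld_reference(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint64_t wide = ((uint64_t)dst << 32) | src;
    return (uint32_t)(wide >> (32 - n));
}

/* Hypothetical helper: the same result built only from simple
 * operations (two shifts and an OR), the kind of instructions
 * that can pair on the Pentium. */
uint32_t shld_emulated(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint32_t high = dst << n;         /* SHL: make room in the low bits  */
    uint32_t low  = src >> (32 - n);  /* SHR: top n bits of src          */
    return high | low;                /* OR:  merge the two parts        */
}
```

For example, with `dst = 0x12345678`, `src = 0x9ABCDEF0` and `n = 8`, both functions return `0x3456789A`. Whether the simple sequence is actually faster depends on how well the surrounding instructions pair, which is what the benchmarks are measuring.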
Analyzing the results, it seems that the new code performs well, but the speedup decreases as CPU frequency increases, even becoming slightly slower on the Pentium 200. This made me think “maybe I’m hitting an architectural bottleneck”, so I decided to test on the Pentium MMX:
Indeed, the MMX models seem to solve the bottleneck that affects the non-MMX models. I also ran tests on other architectures; some with good results, others not so good:
In the case of the Cyrix/IBM 6x86, the new code is only marginally faster, but I’d need to test more models at different frequencies to draw a solid conclusion (I only have this one). As for the K5, the results are slightly worse, but again, more frequencies would be needed to confirm that.
And now for the one that completely surprised me, the IDT WinChip! This new code shouldn’t be faster on that CPU, since it has a design closer to a 486 than a Pentium… but the results speak for themselves:
The conclusion I can draw from this CPU is that SHLD/SHRD instructions are extremely slow on the WinChip, and should be avoided whenever possible.
I still need to test more architectures such as the K6, K7, or Pentium II (or maybe some rarer ones like the Transmeta). If anyone can help me with that, I’d really appreciate it. In theory, it should also be faster on the K6 since it decodes SHLD/SHRD poorly and they’re not very parallelizable. I’m attaching a compiled version that includes the new code. If you want to check and compare your results with mine, please use the included CFG (Ultimate Doom 1.9 is required but not included). The commands I used are:
Yes, of course! I'm still working on this update; I just need more time to fix some issues in the newer code and add render paths for MMX-enabled CPUs. I've also found that column rendering is a major issue on AMD K6 CPUs due to cache thrashing, so I'm looking for a way to fix this problem.