VOGONS


Reply 1240 of 1244, by ViTi95

Rank Oldbie

Oh, I forgot to mention that all those numbers are indeed FPS (frames per second). 'i486' refers to the old code, 'Pentium' to the new code, and the percentage shows the comparison between the old and new code.
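For anyone who wants to recompute the percentages from their own timedemo runs, here is a minimal sketch. It assumes the percentage is the relative change from the i486 figure to the Pentium figure; the function name `percent_change` is made up for illustration, and the sample values are the Pentium 75 FDOOM13H results quoted later in the thread. The last digit may differ from the posted figures depending on how they were rounded.

```python
def percent_change(old_fps: float, new_fps: float) -> float:
    """Relative change from old_fps to new_fps, in percent.

    Positive means the new code is faster, negative means it is slower.
    """
    if old_fps <= 0:
        raise ValueError("old_fps must be positive")
    return (new_fps - old_fps) / old_fps * 100.0


# Pentium 75, FDOOM13H: i486 path vs. Pentium path (values from the thread).
gain = percent_change(67.670, 72.109)
print(f"{gain:+.2f}%")
```

A negative result, as with the Pentium 200 figures, simply means the old i486 path ran faster in that test.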

https://www.youtube.com/@viti95

Reply 1241 of 1244, by BinaryDemon

Rank Oldbie

Awesome work all around but those IDT WinChip results are ridiculous.

Reply 1242 of 1244, by appiah4

Rank l33t++
ViTi95 wrote on 2025-10-07, 09:59:

Hi everyone!

After some time without posting updates here, I’ve got something I think is quite interesting. I’ve been reviewing the span rendering code (R_DrawSpan...) for the backbuffered and VBE2 direct modes, and I managed to create a new optimized version specifically for Pentium processors.

On Pentium CPUs, the SHLD instruction is quite slow (4 cycles) and cannot be paired with other instructions in the U and V pipelines. So I created a new version that replaces those instructions with combinations of simpler ones that can pair in the U and V pipes (MOV, SHL, AND, OR, etc.). Here are the results I’ve obtained:

Pentium 75:
- FDOOM13H i486: 67.670
- FDOOM13H Pentium: 72.109 (+6.5%)
- FDOOMVBR i486: 80.050
- FDOOMVBR Pentium: 86.337 (+7.8%)

Pentium 100:
- FDOOM13H i486: 85.518
- FDOOM13H Pentium: 90.985 (+6.3%)
- FDOOMVBR i486: 111.096
- FDOOMVBR Pentium: 119.862 (+7.8%)

Pentium 133:
- FDOOM13H i486: 95.349
- FDOOM13H Pentium: 99.196 (+4.0%)
- FDOOMVBR i486: 127.431
- FDOOMVBR Pentium: 134.398 (+5.4%)

Pentium 166:
- FDOOM13H i486: 101.201
- FDOOM13H Pentium: 102.583 (+1.3%)
- FDOOMVBR i486: 139.099
- FDOOMVBR Pentium: 140.108 (+0.7%)

Pentium 200:
- FDOOM13H i486: 110.371
- FDOOM13H Pentium: 110.012 (−0.4%)
- FDOOMVBR i486: 158.691
- FDOOMVBR Pentium: 157.032 (−1.1%)

Analyzing the results, it seems that the new code performs well, but the speedup decreases as CPU frequency increases, even becoming slightly slower on the Pentium 200. This made me think “maybe I’m hitting an architectural bottleneck”, so I decided to test on the Pentium MMX:

Pentium 166 MMX:
- FDOOM13H i486: 105.628
- FDOOM13H Pentium: 110.551 (+4.6%)
- FDOOMVBR i486: 148.413
- FDOOMVBR Pentium: 156.668 (+5.5%)

Pentium 200 MMX:
- FDOOM13H i486: 115.264
- FDOOM13H Pentium: 118.393 (+2.7%)
- FDOOMVBR i486: 169.006
- FDOOMVBR Pentium: 175.136 (+3.6%)

Indeed, the MMX models seem to solve the bottleneck issue that affects the non-MMX models. I also ran tests on other architectures; some with good results, others not so good:

IBM 6x86 PR166+ (133MHz):
- FDOOM13H i486: 92.227
- FDOOM13H Pentium: 93.502 (+1.3%)
- FDOOMVBR i486: 119.018
- FDOOMVBR Pentium: 120.934 (+1.6%)

AMD K5 PR100 (SSA5, 100MHz):
- FDOOM13H i486: 74.988
- FDOOM13H Pentium: 74.822 (−0.3%)
- FDOOMVBR i486: 88.775
- FDOOMVBR Pentium: 88.311 (−0.6%)

AMD K5 PR133 (5k86, 100MHz):
- FDOOM13H i486: 96.094
- FDOOM13H Pentium: 95.822 (−0.3%)
- FDOOMVBR i486: 125.074
- FDOOMVBR Pentium: 124.383 (−0.6%)

In the case of the Cyrix/IBM 6x86, the new code is marginally faster, but I’d need to test more models at different frequencies to draw a solid conclusion (I only have this one). As for the K5, results are slightly worse, but again, more frequencies would be needed to confirm that.

And now for the one that completely surprised me, the IDT WinChip! This new code shouldn’t be faster on that CPU, since it has a design closer to a 486 than a Pentium… but the results speak for themselves:

IDT WinChip C6 200:
- FDOOM13H i486: 84.397
- FDOOM13H Pentium: 106.209 (+25.8%)
- FDOOMVBR i486: 110.371
- FDOOMVBR Pentium: 150.061 (+35.9%)

The conclusion I can draw from this CPU is that SHLD/SHRD instructions are extremely slow on the WinChip, and should be avoided whenever possible.
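To illustrate what replacing SHLD with simpler instructions means in general, here is a minimal Python model of a 32-bit SHLD and an equivalent built from separate shift and OR steps. This is only the generic identity, not the actual FastDoom span code (which is not shown in this thread); the function names and the 32-bit register width are assumptions made for the sketch.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers


def shld_reference(dst: int, src: int, count: int) -> int:
    """Model of SHLD dst, src, count for count in 0..31.

    Treats dst:src as one 64-bit value, shifts it left by count,
    and keeps the upper 32 bits as the new dst.
    """
    wide = ((dst & MASK32) << 32) | (src & MASK32)
    return ((wide << count) >> 32) & MASK32


def shld_decomposed(dst: int, src: int, count: int) -> int:
    """Same result from separate steps, each a simple pairable-style op."""
    high = (dst << count) & MASK32           # SHL dst, count
    low = (src & MASK32) >> (32 - count)     # MOV tmp, src / SHR tmp, 32-count
    return high | low                        # OR  dst, tmp


# Check that both forms agree for every shift count on sample values.
for n in range(32):
    assert shld_reference(0x12345678, 0x9ABCDEF0, n) == \
        shld_decomposed(0x12345678, 0x9ABCDEF0, n)
```

The decomposed form uses more instructions, but each one is a plain shift, move, or logical operation, which is the kind of instruction the post describes as pairable on the Pentium.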

I still need to test more architectures such as the K6, K7, or Pentium II (or maybe some rarer ones like the Transmeta). If anyone can help me with that, I’d really appreciate it. In theory, it should also be faster on the K6 since it decodes SHLD/SHRD poorly and they’re not very parallelizable. I’m attaching a compiled version that includes the new code. If you want to check and compare your results with mine, please use the included CFG (Ultimate Doom 1.9 is required but not included). The commands I used are:

fdoom13h -timedemo demo3 -i486
fdoom13h -timedemo demo3 -pentium
fdoomvbr -timedemo demo3 -i486
fdoomvbr -timedemo demo3 -pentium

Will you be releasing this update?

Reply 1243 of 1244, by ViTi95

Rank Oldbie

Yes, of course! I'm still working on this update; I just need more time to fix some issues in the newer code and add render paths for MMX-enabled CPUs. I've also found that column rendering is a major issue on AMD K6 CPUs due to cache thrashing, so I'm looking for a way to fix this problem.

Reply 1244 of 1244, by mockingbird

Rank Oldbie
ViTi95 wrote on 2025-11-14, 23:52:

Yes, of course! I'm still working on this update; I just need more time to fix some issues in the newer code and add render paths for MMX-enabled CPUs. I've also found that column rendering is a major issue on AMD K6 CPUs due to cache thrashing, so I'm looking for a way to fix this problem.

How about going one step further and adding CMOV?
