FastDoom. A new Doom port for DOS, optimized to be as fast as possible for 386/486 personal computers!

Reply 1240 of 1244, by ViTi95

Posted on 2025-10-07, 22:23

Rank: Oldbie
Posts: 560
Joined: 2017-02-14, 22:18

Oh, I forgot to mention that all those numbers are indeed FPS (frames per second). 'i486' refers to the old code, 'Pentium' to the new code, and the percentage shows the comparison between the old and new code.
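For anyone comparing their own runs, the percentage can be reproduced with a small helper. This is only a sketch: the function name `speedup_percent` is made up for illustration, and it assumes the percentage is the relative change of the new ('Pentium') FPS against the old ('i486') FPS.

```c
#include <assert.h>

/* Hypothetical helper: relative change of the new code path versus the
   old one, in percent. Positive means the new path is faster. */
double speedup_percent(double old_fps, double new_fps)
{
    assert(old_fps > 0.0);  /* a timedemo result is always positive */
    return (new_fps - old_fps) / old_fps * 100.0;
}
```

With the WinChip FDOOM13H figures (84.397 and 106.209 FPS) this gives roughly 25.8%.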

https://www.youtube.com/@viti95

Reply 1241 of 1244, by BinaryDemon

Posted on 2025-10-08, 12:18

Rank: Oldbie
Posts: 817
Joined: 2018-01-17, 00:35

Awesome work all around but those IDT WinChip results are ridiculous.

Reply 1242 of 1244, by appiah4

Posted on 2025-11-14, 07:58

Rank: l33t++
Posts: 9495
Joined: 2017-02-19, 07:36
Location: Sea of Sorrows

ViTi95 wrote on 2025-10-07, 09:59:
Hi everyone!

After some time without posting updates here, I’ve got something I think is quite interesting. I’ve been reviewing the span rendering code (R_DrawSpan...) for the backbuffered and VBE2 direct modes, and I managed to create a new optimized version specifically for Pentium processors.

On Pentium CPUs, the SHLD instruction is quite slow (4 cycles) and cannot be paired with other instructions in the U and V pipelines. So I created a new version that replaces those instructions with simpler combinations that can pair in the U and V pipes (MOV, SHL, AND, OR, etc.). Here are the results I’ve obtained:

Pentium 75:
- FDOOM13H i486: 67.670
- FDOOM13H Pentium: 72.109 (+6.5%)
- FDOOMVBR i486: 80.050
- FDOOMVBR Pentium: 86.337 (+7.8%)

Pentium 100:
- FDOOM13H i486: 85.518
- FDOOM13H Pentium: 90.985 (+6.3%)
- FDOOMVBR i486: 111.096
- FDOOMVBR Pentium: 119.862 (+7.8%)

Pentium 133:
- FDOOM13H i486: 95.349
- FDOOM13H Pentium: 99.196 (+4.0%)
- FDOOMVBR i486: 127.431
- FDOOMVBR Pentium: 134.398 (+5.4%)

Pentium 166:
- FDOOM13H i486: 101.201
- FDOOM13H Pentium: 102.583 (+1.3%)
- FDOOMVBR i486: 139.099
- FDOOMVBR Pentium: 140.108 (+0.7%)

Pentium 200:
- FDOOM13H i486: 110.371
- FDOOM13H Pentium: 110.012 (−0.4%)
- FDOOMVBR i486: 158.691
- FDOOMVBR Pentium: 157.032 (−1.1%)

Analyzing the results, it seems that the new code performs well, but the speedup decreases as CPU frequency increases, even becoming slightly slower on the Pentium 200. This made me think “maybe I’m hitting an architectural bottleneck”, so I decided to test on the Pentium MMX:

Pentium 166 MMX:
- FDOOM13H i486: 105.628
- FDOOM13H Pentium: 110.551 (+4.6%)
- FDOOMVBR i486: 148.413
- FDOOMVBR Pentium: 156.668 (+5.5%)

Pentium 200 MMX:
- FDOOM13H i486: 115.264
- FDOOM13H Pentium: 118.393 (+2.7%)
- FDOOMVBR i486: 169.006
- FDOOMVBR Pentium: 175.136 (+3.6%)

Indeed, the MMX models seem to solve the bottleneck issue that affects the non-MMX models. I also ran tests on other architectures; some with good results, others not so good:

IBM 6x86 PR166+ (133MHz):
- FDOOM13H i486: 92.227
- FDOOM13H Pentium: 93.502 (+1.3%)
- FDOOMVBR i486: 119.018
- FDOOMVBR Pentium: 120.934 (+1.6%)

AMD K5 PR100 (SSA5, 100MHz):
- FDOOM13H i486: 74.988
- FDOOM13H Pentium: 74.822 (−0.3%)
- FDOOMVBR i486: 88.775
- FDOOMVBR Pentium: 88.311 (−0.6%)

AMD K5 PR133 (5k86, 100MHz):
- FDOOM13H i486: 96.094
- FDOOM13H Pentium: 95.822 (−0.3%)
- FDOOMVBR i486: 125.074
- FDOOMVBR Pentium: 124.383 (−0.6%)

In the case of the Cyrix/IBM 6x86, the new code is only marginally faster, but I’d need to test more models at different frequencies to draw a solid conclusion (I only have this one). As for the K5, results are slightly worse, but again, more frequencies would be needed to confirm that.

And now for the one that completely surprised me, the IDT WinChip! This new code shouldn’t be faster on that CPU, since it has a design closer to a 486 than a Pentium… but the results speak for themselves:

IDT WinChip C6 200:
- FDOOM13H i486: 84.397
- FDOOM13H Pentium: 106.209 (+25.8%)
- FDOOMVBR i486: 110.371
- FDOOMVBR Pentium: 150.061 (+35.9%)

The conclusion I can draw from this CPU is that SHLD/SHRD instructions are extremely slow on the WinChip, and should be avoided whenever possible.
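To illustrate the kind of replacement described above, here is a minimal C sketch of what a 32-bit SHLD computes, written with plain shifts and an OR. It is not FastDoom's code (the real span renderers are hand-written assembly), and the name `shld32` is hypothetical; it only shows why a SHL/SHR/OR sequence can stand in for the double shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the result of SHLD dst, src, n on 32-bit operands.
   dst is shifted left by n and the vacated low bits are filled with the
   top n bits of src. */
uint32_t shld32(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n > 0 && n < 32);           /* keep both shift counts in range */
    uint32_t high = dst << n;          /* SHL */
    uint32_t low  = src >> (32 - n);   /* SHR */
    return high | low;                 /* OR  */
}
```

For example, `shld32(0x12345678, 0x9ABCDEF0, 8)` yields `0x3456789A`. Each of the three simple operations is the kind that can pair on a Pentium, whereas the single SHLD cannot.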

I still need to test more architectures such as the K6, K7, or Pentium II (or maybe some rarer ones like the Transmeta). If anyone can help me with that, I’d really appreciate it. In theory, it should also be faster on the K6 since it decodes SHLD/SHRD poorly and they’re not very parallelizable. I’m attaching a compiled version that includes the new code. If you want to check and compare your results with mine, please use the included CFG (Ultimate Doom 1.9 is required but not included). The commands I used are:
fdoom13h -timedemo demo3 -i486
fdoom13h -timedemo demo3 -pentium
fdoomvbr -timedemo demo3 -i486
fdoomvbr -timedemo demo3 -pentium

Will you be releasing this update?

Reply 1243 of 1244, by ViTi95

Posted on 2025-11-14, 23:52

Rank: Oldbie
Posts: 560
Joined: 2017-02-14, 22:18

Yes, of course! I'm still working on this update; I just need more time to fix some issues in the newer code and add render paths for MMX enabled CPUs. I've also found that column rendering is a major issue on AMD K6 CPUs due to cache thrashing, so I'm looking for a way to fix this problem.

https://www.youtube.com/@viti95

Reply 1244 of 1244, by mockingbird

Posted on Today, 00:08

Rank: Oldbie
Posts: 1473
Joined: 2013-06-17, 02:57

ViTi95 wrote on 2025-11-14, 23:52:

Yes, of course! I'm still working on this update; I just need more time to fix some issues in the newer code and add render paths for MMX enabled CPUs. I've also found that column rendering is a major issue on AMD K6 CPUs due to cache thrashing, so I'm looking for a way to fix this problem.

How about going one step further and adding CMOV?

(Decommissioned:)
