VOGONS


Reply 840 of 984, by 7F20

Rank Member
ViTi95 wrote on 2023-07-30, 16:47:

No, FastDoom doesn't use the FPU at all, so no difference between a 486SX and a 486DX.

Just curious whether using the FPU would make it faster on systems that have one, assuming that is even possible?

Reply 841 of 984, by DracoNihil

Rank Oldbie
7F20 wrote on 2023-07-31, 15:38:

Just curious whether using the FPU would make it faster on systems that have one, assuming that is even possible?

Doom doesn't use floating point arithmetic anywhere, it's all strictly integer based maths. I don't think it would be a performance benefit to re-write critical functions for x87 usage either because of how everything in the Doom Engine strived to avoid touching that sort of thing in the first place.
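
To make "strictly integer based" concrete, here is a minimal C sketch of 16.16 fixed-point arithmetic. The names fixed_t, FRACBITS, FRACUNIT, FixedMul and FixedDiv follow the released Doom source, but the bodies are a simplified illustration (no overflow or divide-by-zero checks), not the shipped code.

```c
#include <stdint.h>

typedef int32_t fixed_t;            /* 16 integer bits, 16 fraction bits */
#define FRACBITS 16
#define FRACUNIT (1 << FRACBITS)    /* 1.0 in fixed point */

/* Multiply: widen to 64 bits, then drop the 16 extra fraction bits.
   Assumes >> on a negative value is an arithmetic shift, as on
   mainstream compilers. */
static fixed_t FixedMul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}

/* Divide: scale the dividend up first so the quotient keeps
   16 fraction bits. The caller must ensure b != 0. */
static fixed_t FixedDiv(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * FRACUNIT) / b);
}
```

For example, 3.5 is stored as 3 * FRACUNIT + FRACUNIT / 2, and FixedMul of that with 2 * FRACUNIT gives 7 * FRACUNIT, using only integer multiply and shift.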

“I am the dragon without a name…”
― Κυνικός Δράκων

Reply 842 of 984, by 7F20

Rank Member
DracoNihil wrote on 2023-07-31, 15:45:

Doom doesn't use floating point arithmetic anywhere, it's all strictly integer based maths. I don't think it would be a performance benefit to re-write critical functions for x87 usage either because of how everything in the Doom Engine strived to avoid touching that sort of thing in the first place.

That whole era of transition to floating point stuff is all kind of a big mystery to me, so it's interesting to hear some of the details. Thanks for the response.

Reply 843 of 984, by BitWrangler

Rank l33t++

Yah, you'd get games listing a minimum of DX2-66, DX-50 or something for clock speed reasons rather than for the FPU, so it was a bit of a problem to untangle even at the time. Today you'd want to know because of all the weird machines that could run them faster than a slideshow: SX2s, BL2 and BL3 with no 387 onboard, and the NexGen 586, ditto.

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 844 of 984, by ViTi95

Rank Member
7F20 wrote on 2023-07-31, 17:31:
DracoNihil wrote on 2023-07-31, 15:45:

Doom doesn't use floating point arithmetic anywhere, it's all strictly integer based maths. I don't think it would be a performance benefit to re-write critical functions for x87 usage either because of how everything in the Doom Engine strived to avoid touching that sort of thing in the first place.

That whole era of transition to floating point stuff is all kind of a big mystery to me, so it's interesting to hear some of the details. Thanks for the response.

It's quite simple. The Pentium heavily optimized the FPU, so it could execute FPU instructions much faster than the 386 or the 486. It also allowed FPU instructions to execute while doing integer math. That main difference made the FPU usable in real time. Some floating-point instructions are even faster than their integer counterparts on the Pentium (FMUL vs. MUL).

For example, here is a comparison of the FMUL instruction's timings, in clock cycles, across several Intel FPUs and CPUs (up to the Pentium):

FMUL    Floating point multiply
FMULP   Floating point multiply and pop

variations/
operand        8087            287        387      486   Pentium
fmul reg s     90-105          90-105     29-52    16    3/1   FX
fmul reg       130-145         130-145    46-57    16    3/1   FX
fmul mem32     (110-125)+EA    110-125    27-35    11    3/1   FX
fmul mem64     (154-168)+EA    154-168    32-57    14    3/1   FX
fmulp reg s    94-108          94-108     29-52    16    3/1   FX
fmulp reg      134-148         134-148    29-57    16    3/1   FX

s = register with 40 trailing zeros in fraction
FX = pairable with FXCH (Pentium)

Doom didn't use the FPU at all because the main CPUs available when it was released were the 386 and 486. Neither of them can execute FPU instructions fast enough for a 3D world.

https://www.youtube.com/@viti95

Reply 845 of 984, by appiah4

Rank l33t++

I am guessing there are quite a few modern engine replacements that do floating-point math instead anyway? I cannot imagine GZDoom still using integer arithmetic, for example. That kind of replacement would probably not yield much extra performance out of the 486 FPU, though.

Retronautics: A digital gallery of my retro computers, hardware and projects.

Reply 846 of 984, by rasz_pl

Rank l33t
ViTi95 wrote on 2023-07-31, 23:10:

It's quite simple. The Pentium heavily optimized the FPU, so it could execute FPU instructions much faster than the 386 or the 486. It also allowed FPU instructions to execute while doing integer math.

All previous FPUs could do that. 386 / 387 Concurrency
486 http://bitsavers.trailing-edge.com/components … essor_Nov89.pdf page 147
FDIV 79 cycles, Concurrent Execution 70 cycles.
Perspective correction requires a slow FDIV, and Quake interleaves FDIVs with integer instructions to exploit that concurrency. Re: 2D Acceleration - first chipsets
Everyone (including me) always assumed it was perspective correction and the FDIVs that slowed it down on non-Pentium CPUs, but now I believe it's the rest of the FPU code, which is heavy in FXCH instructions. Maybe it's also that Quake's perspective correction is hard-coded to ~38 cycles of integer math per FDIV, and rewriting that to 70 could speed it up on the 486.
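
To show the structure being discussed, here is a minimal C sketch of perspective correction with one true divide per short run of pixels and linear interpolation in between. It is an illustration only: span_u and SUBSPAN are hypothetical names, the run length of 16 is an assumption, and C cannot express the actual trick, which is that the divide for the next run executes on the FPU while the integer unit draws the current run.

```c
#define SUBSPAN 16   /* pixels between true divides (assumed value) */

/* Fill u_out[0..count-1] with texture coordinates for one scanline span.
   uoz (u/z) and ooz (1/z) are linear in screen space, so they are stepped
   by duoz and dooz per pixel. ooz must stay non-zero across the span. */
static void span_u(float uoz, float ooz, float duoz, float dooz,
                   int count, float *u_out)
{
    float u_left = uoz / ooz;              /* exact u at the span start */
    int x = 0;

    while (x < count) {
        int run = (count - x < SUBSPAN) ? (count - x) : SUBSPAN;

        /* One perspective divide per run: exact u at the end of the run. */
        float uoz_r = uoz + duoz * (float)run;
        float ooz_r = ooz + dooz * (float)run;
        float u_right = uoz_r / ooz_r;

        /* Cheap per-pixel work: interpolate linearly inside the run.
           Fixed-point code would replace this divide by a shift or table. */
        float step = (u_right - u_left) / (float)run;
        for (int i = 0; i < run; i++)
            u_out[x + i] = u_left + step * (float)i;

        x += run;
        uoz = uoz_r;
        ooz = ooz_r;
        u_left = u_right;
    }
}
```

The result is exact at every run boundary and only approximate in between, which is the trade-off that makes a slow divide affordable.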

The real Pentium innovation was pipelining of the FPU, meaning it could overlap the execution of multiple FPU instructions.
https://www.agner.org/optimize/instruction_tables.pdf Page 164 "fp-ov"
"Overlap with floating point instructions. fp-ov = 2 means that the last two
clock cycles can overlap with subsequent floating point instructions.
(WAIT is considered a floating point instruction here)"
Pairability shows when you can execute a free 0-cycle FXCH instruction simultaneously with the previous one.

This one shows FPU instruction overlap in action, with multiple FXCHs in between: https://github.com/id-Software/Quake/blob/bf4 … e/d_parta.s#L55 (AFAIK Michael Abrash is responsible for all the interleaved low-level assembly optimizations)
0 cycle FXCH is the key here, and is the reason for AMD/Cyrix sucking at Quake http://www.azillionmonkeys.com/qed/cpuwar.html
"early revs of the K6 took two clocks, while later revs based on the "CXT core" can execute them in 0 clocks."

appiah4 wrote on 2023-08-01, 05:48:

I am guessing there are quite a few modern engine replacements that do floating-point math instead anyway?

Yes. Upgraded Doom engines are no longer 2.5D. They let you look up/down with no distortion, which means they need the FPU.

Open Source AT&T Globalyst/NCR/FIC 486-GAC-2 proprietary Cache Module reproduction

Reply 847 of 984, by ViTi95

Rank Member

I did some testing in FastDoom, trying to use FPU instructions alongside integer math, but it always ended up slower (using an Intel 386DX + Cyrix 387). As for Quake, @linear@nya.social has optimized some of the FPU code in 486quake and got about a 15% performance uplift on every CPU (including the Pentium). The Quake code is such a mess and so FPU-dependent that I wasn't able to help much. I was only able to test a custom build using half-resolution rendering (160x200), which made the game much more playable on my Am5x86@133 (but I took it no further, as the HUD isn't designed for that resolution and it sometimes crashes).

Regarding modern Doom engines, there is a Doom port that uses 64-bit fixed-point math and is able to render at very high resolutions just using SIMD instructions (even on Core 2 CPUs!). It doesn't use the FPU at all!

https://www.doomworld.com/forum/topic/117394- … -with-kdikdizd/

Reply 848 of 984, by drosse1meyer

Rank Member
7F20 wrote on 2023-07-31, 15:38:
ViTi95 wrote on 2023-07-30, 16:47:

No, FastDoom doesn't use the FPU at all, so no difference between a 486SX and a 486DX.

Just curious whether using the FPU would make it faster on systems that have one, assuming that is even possible?

Doom uses trig lookup tables to avoid certain operations, as they were too slow on most contemporary CPUs. There is code to compute them instead, but it's commented out.
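
As a rough illustration of the lookup-table idea, here is a minimal C sketch. The constants FINEANGLES and ANGLETOFINESHIFT follow the Doom source, while sine_table, init_sine_table and fixed_sin are hypothetical names. The table here is filled at startup by rotating a unit vector one step at a time, which is simply one way to build it without libm and is not how Doom's own tables were produced.

```c
#include <stdint.h>

#define FINEANGLES       8192            /* table entries per full circle */
#define FINEMASK         (FINEANGLES - 1)
#define ANGLETOFINESHIFT 19              /* keep the top 13 bits of an angle */
#define FRACUNIT         65536           /* 1.0 in 16.16 fixed point */

typedef uint32_t angle_t;                /* full circle = 2^32 */
typedef int32_t  fixed_t;

static fixed_t sine_table[FINEANGLES];

/* Fill the table once at startup. Floating point is used only here. */
static void init_sine_table(void)
{
    const double d  = 6.283185307179586 / FINEANGLES;   /* step in radians */
    const double sd = d - d * d * d / 6.0;               /* sin(d), Taylor  */
    const double cd = 1.0 - d * d / 2.0 + d * d * d * d / 24.0; /* cos(d)  */
    double s = 0.0, c = 1.0;                             /* sin(0), cos(0)  */

    for (int i = 0; i < FINEANGLES; i++) {
        sine_table[i] = (fixed_t)(s * FRACUNIT);
        double ns = s * cd + c * sd;                     /* sin(a + d) */
        c = c * cd - s * sd;                             /* cos(a + d) */
        s = ns;
    }
}

/* Runtime sine: one shift, one mask, one load. No FPU involved. */
static fixed_t fixed_sin(angle_t a)
{
    return sine_table[(a >> ANGLETOFINESHIFT) & FINEMASK];
}
```

After init_sine_table(), fixed_sin(0x40000000u) (90 degrees) returns a value within one unit of FRACUNIT.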

P1: Packard Bell - 233 MMX, Voodoo1, 64 MB, ALS100+
P2-V2: Dell Dimension - 400 Mhz, Voodoo2, 256 MB
P!!! Custom: 1 Ghz, GeForce2 Pro/64MB, 384 MB

Reply 849 of 984, by Darmok

Rank Newbie

Optimization for 8-bit VGA cards is a very interesting and useful feature. This is especially important for using the Herkmap mode with ISA VGA.
Due to the nature of the AT architecture, devices mapped to two adjacent 64 kB segments (A0000-BFFFF in our case) must both operate in either 8-bit or 16-bit mode. In the presence of an 8-bit Hercules card, the BIOS of the VGA card switches it to 8-bit mode, which leads to a significant drop in bus bandwidth and, consequently, FPS.
For my OTI-087 I made a couple of utilities that allow me to switch the OTI to 16-bit or 8-bit mode at any time. If I don't need dual monitor mode, I can turn on 16-bit mode and not lose speed. However, for FastDoom with herkmap I have to switch to 8-bit mode, and in that case the optimization can help.

Reply 850 of 984, by appiah4

Rank l33t++

I never thought the day would come when 8-bit cards' performance would be a consideration with regards to playing Doom, but here we are.. 🤣 Thanks for the effort @viti95

Reply 851 of 984, by ViTi95

Rank Member
Darmok wrote on 2023-08-03, 08:09:

Optimization for 8-bit VGA cards is a very interesting and useful feature. This is especially important for using the Herkmap mode with ISA VGA.
Due to the nature of the AT architecture, devices mapped to two adjacent 64 kB segments (A0000-BFFFF in our case) must both operate in either 8-bit or 16-bit mode. In the presence of an 8-bit Hercules card, the BIOS of the VGA card switches it to 8-bit mode, which leads to a significant drop in bus bandwidth and, consequently, FPS.
For my OTI-087 I made a couple of utilities that allow me to switch the OTI to 16-bit or 8-bit mode at any time. If I don't need dual monitor mode, I can turn on 16-bit mode and not lose speed. However, for FastDoom with herkmap I have to switch to 8-bit mode, and in that case the optimization can help.

Really interesting, I didn't know about that limitation. This optimization will only be available for backbuffered VGA modes (mode 13h) for now. I'll try to do the same for mode Y, but being planar makes this a bit more difficult, especially since the render code already uses all available registers.

EDIT: I'm just stupid. Mode Y uses triple buffering, which greatly reduces the chance of finding pixels already drawn with the correct color, so this optimization will only be available for mode 13h and mode VBR.
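
The thread doesn't spell out how the optimization is implemented, so the following is only a sketch of one way to skip pixels that are already drawn with the correct color: keep a shadow copy of the last presented frame in system RAM and write to VRAM only where the new frame differs. present_changed and the shadow buffer are hypothetical, not FastDoom's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy only the changed bytes of backbuffer to vram.
   shadow must mirror what vram currently holds (e.g. both cleared at
   startup). Returns how many bytes were actually written over the bus. */
static size_t present_changed(volatile uint8_t *vram, uint8_t *shadow,
                              const uint8_t *backbuffer, size_t size)
{
    size_t writes = 0;

    for (size_t i = 0; i < size; i++) {
        uint8_t px = backbuffer[i];
        if (px != shadow[i]) {   /* compare in fast system RAM */
            vram[i] = px;        /* slow bus write only when needed */
            shadow[i] = px;
            writes++;
        }
    }
    return writes;
}
```

With a single visible page the shadow copy is always one frame old, so many pixels match. With three pages in rotation, each page holds a frame from three presents ago, which is why far fewer pixels match in mode Y.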

appiah4 wrote on 2023-08-03, 08:30:

I never thought the day would come when 8-bit cards' performance would be a consideration with regards to playing Doom, but here we are.. 🤣 Thanks for the effort @viti95

I bet this can lead to insane results with really fast CPUs and any ISA card. The only limit will be the bus transfer speed 😂

Reply 852 of 984, by xcomcmdr

Rank Oldbie

Does FastDoom need XMS or EMS?

I might use it to benchmark an emulator and profile performance over releases, eventually.

(but there would be a lot of INT21H services and other stuff to code first before running other games, let alone DOOM 😁 )

Reply 853 of 984, by ViTi95

Rank Member

FastDoom uses DPMI to manage memory, so only XMS is used (himem.sys)

I also wanted to do the same and see if there is any performance regression, but I don't have enough time to do it, so any help here is welcome.

Right now I'm focusing on optimizing backbuffered modes (CGA, EGA, ColorPlus, Hercules...). I've discovered an x86 trick using the LEA instruction that speeds up the 256-color conversion by 5% to 15%, depending on the case.

Reply 855 of 984, by digger

Rank Oldbie
ViTi95 wrote on 2023-08-11, 11:52:

FastDoom uses DPMI to manage memory, so only XMS is used (himem.sys)

I also wanted to do the same and see if there is any performance regression, but I don't have enough time to do it, so any help here is welcome.

Right now I'm focusing on optimizing backbuffered modes (CGA, EGA, ColorPlus, Hercules...). I've discovered an x86 trick using the LEA instruction that speeds up the 256-color conversion by 5% to 15%, depending on the case.

I just love how you (and those helping you) keep coming up with clever tricks to squeeze every last possible drop of performance out of old hardware that was never designed to run a game like Doom. This is incredibly impressive and entertaining.

Thank you for your continued work on this! ☺️

Reply 856 of 984, by ViTi95

Rank Member

Forgot to post an update about this LEA trick: I released FastDoom 0.9.8 with these optimizations. All backbuffered non-VGA modes should be faster now; the backbuffer conversion still takes CPU time, but it's a bit better now. For the next release I'll try to add Hercules InColor support (finally!), but so far it has proven very difficult to work with. Why? It uses a non-linear, plane-based VRAM layout. This is literally the Hercules design with EGA planes added. It's the worst design I've ever seen, no wonder it failed so badly.

Reply 857 of 984, by Rav

Rank Member

FYI, FastDoom crashes when I save while playing Freedoom Phase 1, E1M5, right at the beginning of the level,
with a "Savegame buffer overrun" error.

I can replicate the bug every single time I test.
From what I understand from reading the Freedoom docs, it's a known issue with the vanilla engine.

Reply 858 of 984, by rasteri

Rank Member

I've been messing with a SuperEGA card that can emulate 640x480 16 color VGA - Re: Video Seven Vega Deluxe - SuperEGA (ish) card

Theoretically other EGA cards with the same chipset could be hacked to support it.

Might be neat to have a dithered 640x480 VGA mode to support these cards, although it would be limited to the EGA palette.

(Also just in general a dithered 640x480 VGA mode would be cool, then you could have adaptive palette or something)

Reply 859 of 984, by ViTi95

Rank Member

I had the idea of implementing it in the past, but the realtime conversion from a linear backbuffer to planes is so slow (even in ASM) that it is nowhere near usable. It also requires moving lots of data through the ISA bus (128 KB per frame maximum), which limits the framerate to 6-7 fps. That's why I removed the dithered EGA and ATI 640x200 16-color modes: they were cool but basically unusable.
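
To give an idea of why the linear-to-planar conversion is expensive, here is a naive C sketch that turns 8 chunky pixels (one 4-bit color index per byte) into the 4 plane bytes a 16-color EGA/VGA mode expects. chunky8_to_planar is a hypothetical helper for illustration, not FastDoom's routine, which would be hand-tuned assembly.

```c
#include <stdint.h>

/* Convert 8 chunky pixels into 4 plane bytes.
   planes[p] collects bit p of every pixel; bit 7 is the leftmost pixel. */
static void chunky8_to_planar(const uint8_t px[8], uint8_t planes[4])
{
    for (int p = 0; p < 4; p++) {
        uint8_t bits = 0;
        for (int i = 0; i < 8; i++)
            bits = (uint8_t)((bits << 1) | ((px[i] >> p) & 1));
        planes[p] = bits;
    }
}
```

Even in this simple form, every group of 8 pixels costs 32 bit extractions before a single byte reaches the card, and each of the 4 resulting bytes still has to go to a different plane.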

It's just too much for the 8-bit ISA bus.
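
The numbers above can be checked with simple arithmetic. This C sketch assumes every byte of every frame crosses the bus and that 128 KB means 131072 bytes; frame_bytes and bus_limited_fps are hypothetical helpers.

```c
#include <stdint.h>

/* Bytes needed for one full frame at the given size and bits per pixel. */
static uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bpp)
{
    return width * height * bpp / 8;
}

/* Frames per second if every byte of every frame crosses the bus. */
static double bus_limited_fps(double bus_bytes_per_sec, uint32_t bytes_per_frame)
{
    return bus_bytes_per_sec / (double)bytes_per_frame;
}
```

Moving 128 KB per frame at 6 to 7 fps implies roughly 768 to 896 KB/s of sustained bus throughput. At the lower figure, a full 640x480 16-color frame (153,600 bytes) would come out at about 5 fps before any conversion cost is counted.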

https://youtu.be/3m85czNLL-8?si=1_ZSY09l_1CQxuIw
