VOGONS


Reply 760 of 987, by keropi

Rank l33t++

I was messing with GZDOOM the other day and the awesome "voxeldoom" addon, which reminded me of the few voxel items in Blood and how I wished there were more of them in game... so when FastDoomVoxel release?
🤣 🤣 🤣

🎵 🎧 PCMIDI MPU , OrpheusII , Action Rewind , Megacard and 🎶GoldLib soundcard website

Reply 761 of 987, by ViTi95

Rank Member

Yeah, the voxel Doom mod is awesome, and the developer is uploading making-of videos to YouTube. Maybe someone can port gzdoom to Glide and use it with Voodoo cards 🤤

BTW Frenkel has ported Doom to DJGPP, so now it's possible to compare DJGPP and OpenWatcom builds directly. I cannot test this myself this week (as I'm on vacation right now 😅), maybe someone can do some benchmarks comparing both on 386 or 486 CPUs. https://github.com/FrenkelS/djdoom

https://www.youtube.com/@viti95

Reply 762 of 987, by keropi

Rank l33t++

nah, some voodoo gzdoom build will most likely be like the Quake2/DOS port - can't say I look forward to it
the definitive way would be to add low-res voxels support to the classic dos engine , not sure how/if this can be done or if anyone cares enough - just thinking out loud here since I was truly impressed by the voxels addon and really wished they were a thing back then 😀

Reply 763 of 987, by weedeewee

Rank l33t
keropi wrote on 2023-04-20, 10:52:

nah, some voodoo gzdoom build will most likely be like the Quake2/DOS port - can't say I look forward to it
the definitive way would be to add low-res voxels support to the classic dos engine , not sure how/if this can be done or if anyone cares enough - just thinking out loud here since I was truly impressed by the voxels addon and really wished they were a thing back then 😀

wrt voxels, they were already a thing back then, though the add on you reference likely wasn't . ... https://www.youtube.com/watch?v=MLcxgYEBAXA Comanche : Maximum Overkill 1992

Right to repair is fundamental. You own it, you're allowed to fix it.
How To Ask Questions The Smart Way
Do not ask Why !
https://www.vogonswiki.com/index.php/Serial_port

Reply 764 of 987, by rasz_pl

Rank l33t
ViTi95 wrote on 2023-04-20, 08:19:
rasz_pl wrote on 2023-04-19, 01:01:

so I was reading Cirrus Logic datasheet, as one does 😀 Apparently CL-GD5426/’28/’29 BitBLT lets you draw walls without calculating offsets:

I think this is not useable for rendering since this only copies data, it's not able to scale a column. But it's useful to accelerate copies from the backbuffer to the VRAM in FastDoom modes 13H and VBR. Will take a look in depth.

With this you wouldn't need a backbuffer at all. What makes Doom columns slow is drawing 1 byte at a time. Using Cirrus Logic BitBLT you can treat columns like spans and write to video RAM linearly, 4 bytes at a time. Add full-detail spans that store up to 4 pixels in a temp variable (on the stack if out of registers) to replace 4 ISA writes with one, and you don't need a dedicated framebuffer. This leaves only the slow transparent sprites needing readback, but maybe those could be done with the Raster Operation Register.
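A minimal sketch of the "four pixels in a temp variable" idea, in plain C rather than FastDoom's actual code. The name `draw_column_packed`, the 16.16 fixed-point stepping, the 128-texel column and the requirement that the destination is a linear DWORD buffer with a pixel count divisible by 4 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not FastDoom code: scale one texture column into a
 * LINEAR destination, emitting one 32-bit store per four pixels instead of
 * four 8-bit stores. frac and step are 16.16 fixed point; the column is
 * assumed to be 128 texels tall, hence the "& 127" wrap. */
static void draw_column_packed(uint32_t *dest, const uint8_t *texcol,
                               uint32_t frac, uint32_t step, size_t count)
{
    assert(count % 4 == 0);  /* this sketch only handles whole DWORDs */

    for (size_t i = 0; i < count; i += 4) {
        uint32_t packed = 0;
        for (int b = 0; b < 4; b++) {
            uint8_t texel = texcol[(frac >> 16) & 127];
            packed |= (uint32_t)texel << (8 * b);  /* first pixel in the low byte */
            frac += step;
        }
        *dest++ = packed;  /* one bus write carries four pixels */
    }
}
```

The packing order assumes a little-endian CPU, so the first pixel ends up at the lowest address when the DWORD is stored.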

Open Source AT&T Globalyst/NCR/FIC 486-GAC-2 proprietary Cache Module reproduction

Reply 765 of 987, by Gmlb256

Rank l33t
ViTi95 wrote on 2023-04-20, 10:18:

Maybe someone can port gzdoom to Glide and use it with Voodoo cards 🤤

Not worth the effort. It will be more like the Q2DOS source port as keropi said, using a Glide wrapper (fxMesa or something similar) to handle the OpenGL routines.

VIA C3 Nehemiah 1.2A @ 1.46 GHz | ASUS P2-99 | 256 MB PC133 SDRAM | GeForce3 Ti 200 64 MB | Voodoo2 12 MB | SBLive! | AWE64 | SBPro2 | GUS

Reply 766 of 987, by leileilol

Rank l33t++

turning doom 3d requires preprocessing maps as well, unless you try to go for the slow every-span-is-a-poly route the way those 3dfx sw/blood ports do (which famously do not support voxels) and then it'd have little benefit and lots of thrash and it's all counter to the fastdoom mission, and that's not even thinking about the bloated hx way of q2dos doing things.

Also if you backport gzdoom to something, you'll also run the risk of having newer mods broken because there's always some successor or major change in the works of the game logic (decorate, zscript, etc), and maybe certain invested v2sli users complaining their alleged 24mb card can't do hq4x textures defying the big forsaken number bar they believed in.

A Rendition port would be more realistic given how they worked 😉

long live PCem

Reply 767 of 987, by DracoNihil

Rank Oldbie
leileilol wrote on 2023-04-21, 00:47:

A Rendition port would be more realistic given how they worked 😉

I'm very curious how Rendition Verite cards work now, from what you just said. I've never owned a single one of these, since my late father bought into 3dfx early on, so we completely overlooked the other accelerators of the time.

A bit off-topic but I really hated the 3dfx Glide versions of Blood and Shadow Warrior. So I have to wonder what a Rendition Doom would end up doing to the rendering engine as a whole.

“I am the dragon without a name…”
― Κυνικός Δράκων

Reply 768 of 987, by rasz_pl

Rank l33t

continuing Cirrus Logic BitBLT.
"For system-to-screen BitBLTs, up to three bytes of the last transfer for each scanline is ignored (depending on width). The next scanline begins with the next DWORD transfer."
86Box https://github.com/86Box/86Box/blob/master/sr … eo/vid_cl54xx.c registers do not match the datasheet, so no idea what's going on. MAME has full support https://github.com/mamedev/mame/blob/cf7a75a8 … gd542x.cpp#L428 and sadly it turns out BitBLT is not as clever 🙁 Automagic address advance works only on 4-byte boundaries; every new scanline needs its own DWORD transfer. Blah, useless 😒
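The cost of that restriction can be worked out with a few lines of arithmetic. This is only a sketch of the rule as quoted from the datasheet (each scanline starts on a fresh DWORD transfer); the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit transfers needed per scanline of a system-to-screen blit, given
 * that every scanline must begin with a new DWORD transfer. */
static uint32_t dwords_per_scanline(uint32_t width_bytes)
{
    assert(width_bytes > 0);
    return (width_bytes + 3) / 4;
}

/* Bytes of the last transfer of each scanline that are ignored (0..3). */
static uint32_t ignored_bytes(uint32_t width_bytes)
{
    return dwords_per_scanline(width_bytes) * 4 - width_bytes;
}

/* Total transfers for a blit that is `height` scanlines tall. */
static uint32_t total_dwords(uint32_t width_bytes, uint32_t height)
{
    return dwords_per_scanline(width_bytes) * height;
}
```

For a wall column treated as a 1-byte-wide, 200-line blit, `total_dwords(1, 200)` is 200 transfers with three wasted bytes each, so nothing is saved compared to 200 byte writes.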

Reply 769 of 987, by ViTi95

Rank Member
rasz_pl wrote on 2023-04-21, 10:14:

continuing Cirrus Logic BitBLT.
"For system-to-screen BitBLTs, up to three bytes of the last transfer for each scanline is ignored (depending on width). The next scanline begins with the next DWORD transfer."
86box https://github.com/86Box/86Box/blob/master/sr … eo/vid_cl54xx.c registers do not match datasheet so no idea whats going on. Mame has full support https://github.com/mamedev/mame/blob/cf7a75a8 … gd542x.cpp#L428 and sadly it turns out BitBLT is not as clever 🙁 Automagic address advance works only on 4 byte boundaries, every new scanline needs its own Dword transfer. Blah useless 😒

The Linux kernel Cirrus Logic framebuffer device uses BitBLTs to accelerate screen copies and scrolling; maybe we can take a look at that code to understand a bit better how it works https://raw.githubusercontent.com/torvalds/li … bdev/cirrusfb.c

Reply 770 of 987, by rasz_pl

Rank l33t
ViTi95 wrote on 2023-04-23, 20:46:
rasz_pl wrote on 2023-04-21, 10:14:

continuing Cirrus Logic BitBLT.
"For system-to-screen BitBLTs, up to three bytes of the last transfer for each scanline is ignored (depending on width). The next scanline begins with the next DWORD transfer."
86box https://github.com/86Box/86Box/blob/master/sr … eo/vid_cl54xx.c registers do not match datasheet so no idea whats going on. Mame has full support https://github.com/mamedev/mame/blob/cf7a75a8 … gd542x.cpp#L428 and sadly it turns out BitBLT is not as clever 🙁 Automagic address advance works only on 4 byte boundaries, every new scanline needs its own Dword transfer. Blah useless 😒

The linux kernel Cirrus Logic FrameBuffer device uses BitBLTs to accelerate screen copies and scroll, maybe we can take a look to that code to understand a bit better how it works https://raw.githubusercontent.com/torvalds/li … bdev/cirrusfb.c

They are copying the screen around to scroll, VGA memory to VGA memory.
But I had another round of thinking about it.
"For system-to-screen BitBLTs, up to three bytes of the last transfer for each scanline is ignored" means we can't use it directly for RAM-to-VGA transfers. But you could modify https://github.com/viti95/FastDoom/blob/54662 … b/r_draw.c#L147 (yes, I know you rewrote it in assembly, but the C code is more readable, so I'm linking to the old version) to scale 4 pixels at a time and write them in DWORD chunks to a linear off-screen VGA RAM buffer, then fire a screen-to-screen BitBLT.

"The destination pitch and source pitch are the values that are added to the respective addresses after each width bytes of destination have been processed. Destination and source pitch are specified separately. When an area is a rasterized image, the respective pitch is the number of bytes between vertically adjacent pixels. This is the number of bytes between the (first) pixels of scanline n and scanline n+1; the number that is added to the address to get from one scanline to the next. When an area is off-screen display memory, it is often stored so that scanlines are in contiguous locations. This minimizes fragmentation. In this case, the respective pitch would be set equal to the width (+1)."

destination pitch = 320
source pitch = 1
width = 1
means a VGA-to-VGA BitBLT can do a 90-degree rotate automagically, with no alignment restrictions.
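To make the pitch trick concrete, here is a small software model of a pitch-based copy. It follows only the datasheet wording quoted above (pitches are added to the row start addresses after each `width` bytes); it is not a register-level description of the CL-GD542X, and `model_blit` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a pitch-based screen-to-screen copy: for every row,
 * `width` bytes are copied, then each pitch is added to its row start. */
static void model_blit(uint8_t *vram, size_t dst, size_t src,
                       size_t dst_pitch, size_t src_pitch,
                       size_t width, size_t height)
{
    assert(width > 0);
    for (size_t row = 0; row < height; row++) {
        for (size_t x = 0; x < width; x++)
            vram[dst + x] = vram[src + x];
        dst += dst_pitch;  /* next destination scanline */
        src += src_pitch;  /* next source "scanline"    */
    }
}
```

With `dst_pitch = 320`, `src_pitch = 1` and `width = 1`, consecutive bytes of a linear staging buffer land one below the other on screen, which is the rotation described above.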

Now the question is: can you write to VGA RAM while a BitBLT is in progress?
"While the BLT is in progress, the CL-GD542X display memory and BLT registers (except GR31) must not be accessed for read or write."
No, you can't 🙁 This means each R_DrawColumn should be interleaved with another piece of code. There is a chance that all of the precomputation between colfunc() calls in https://github.com/viti95/FastDoom/blob/20688 … M/r_segs.c#L329 is enough of a delay, and that the CL-GD542X BitBLT is fast enough to finish in time on processors slow enough to benefit from this. That would mean ALL writes to VGA RAM could be done in DWORDs without using a backbuffer. Only invisible sprites would be a problem, but maybe those could be faked with one of the blitter's ROP operations.

TLDR: R_DrawColumn with a call to BitBLT at the end, letting the GD542X do its thing in the background while you prepare the next R_DrawColumn.
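The ordering this implies (wait, fill the staging area, start the blit, return to the per-column setup) can be sketched against a simulated card. Everything here is hypothetical: `SimCard`, `blit_start` and `blit_wait` stand in for real register programming, and the "hardware" only completes its work inside `blit_wait`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 320
#define SCREEN_H 200
#define STAGING  (SCREEN_W * SCREEN_H)  /* off-screen area after the frame */

/* Simulated card: `busy` stands in for the blitter status bit. */
typedef struct {
    uint8_t vram[STAGING + SCREEN_H];
    bool busy;
    int x, y, height;  /* parameters of the pending blit */
} SimCard;

/* Simulated wait: the pending staging-to-column blit completes here. */
static void blit_wait(SimCard *c)
{
    if (!c->busy) return;
    for (int i = 0; i < c->height; i++)
        c->vram[(c->y + i) * SCREEN_W + c->x] = c->vram[STAGING + i];
    c->busy = false;
}

static void blit_start(SimCard *c, int x, int y, int height)
{
    assert(!c->busy);
    c->x = x; c->y = y; c->height = height;
    c->busy = true;
}

/* One column: VRAM is off-limits until the blitter is idle, so wait first,
 * then fill the linear staging area (memcpy stands in for DWORD writes),
 * then fire the blit and return so the caller can prepare the next column. */
static void draw_column_blit(SimCard *c, const uint8_t *pixels,
                             int x, int y, int height)
{
    blit_wait(c);
    memcpy(&c->vram[STAGING], pixels, (size_t)height);
    blit_start(c, x, y, height);
}
```

Whether the real blitter finishes within the time the renderer spends between columns is exactly the open question from the post; the model does not answer it.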

Reply 771 of 987, by ViTi95

Rank Member

Maybe I'm still not understanding how rendering could be done with a BitBLT engine. I've been reading multiple datasheets and code for BitBLT engines (Cirrus Logic 542x, S3 928, Trident 9440...), and basically all of them only make 1-to-1 ratio copies, so even if the textures are pre-loaded into VRAM we cannot do scaled blitting from VRAM to VRAM.

I only see this as usable for accelerating the 2D parts of the game, such as the status bar or the on-screen messages. The other thing that could speed up the game is using rectangle fill functions to accelerate untextured visplane rendering.

Reply 772 of 987, by rasz_pl

Rank l33t
ViTi95 wrote on 2023-04-25, 11:16:

Maybe I'm still not understanding how it could be done rendering with a BitBLT engine.

AFAIK you can't accelerate texture scaling with period-correct 2D accelerators. Some had special provisions for scaling, but in weird ways, like in the DAC with overlays etc. What you can do is turn R_DrawColumn's slow byte writes into linear DWORD writes. Instead of optimizing the number of CPU cycles in R_DrawColumn, we are optimizing for the minimum number of writes to ISA-mapped address space. Potential for 2-4 times fewer ISA transactions, depending on whether a DWORD write to ISA VGA is faster than two WORD writes. All of this is only relevant in full detail mode:

- The question is what % of time is spent writing to ISA-mapped address space. You could probably emulate potential gains by modifying DOOM to only render 1/4 of the screen directly to VGA RAM and 3/4 to a RAM buffer. Will this gain a lot? Is this gain worth implementing support for hardware blitters?
- Second question: how much time is spent inside R_DrawColumn in an average frame? Napkin math says there are a lot of walls on the screen, probably 2x more than ceilings/floors.
- Third question: what about R_DrawSpan? It also writes in bytes despite the potential for calculating 4 texels at a time and doing DWORD writes. Perhaps, instead of implementing a hardware blitter, a smaller experiment modifying R_DrawSpan would be more prudent.

Manually reengineering R_DrawColumn/R_DrawSpan to write 4 bytes at a time should give similar or better results than enabling MTRR write combining on PCI. IMO MTRR can't help with columns, but will happily work with R_DrawSpan.
I don't think this will have much impact on VLB, as at that point you usually have enough CPU for a full locked 35 fps and bus speed is not a bottleneck, but one never knows.
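The smaller R_DrawSpan experiment could look roughly like this. It is a sketch, not FastDoom's routine: `draw_span_packed` is a made-up name, the 64x64 flat with 16.16 fixed-point coordinates is an assumption about the texture layout, and the pixel count must be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span drawer: samples a 64x64 flat with 16.16 fixed-point
 * coordinates and stores four texels per 32-bit write. Spans are already
 * linear in memory, so no blitter is needed for this part. */
static void draw_span_packed(uint32_t *dest, const uint8_t *flat,
                             uint32_t xfrac, uint32_t yfrac,
                             uint32_t xstep, uint32_t ystep, size_t count)
{
    assert(count % 4 == 0);  /* sketch only: no unaligned head or tail */

    for (size_t i = 0; i < count; i += 4) {
        uint32_t packed = 0;
        for (int b = 0; b < 4; b++) {
            /* row * 64 + column, both wrapped to 0..63 */
            uint32_t spot = ((yfrac >> 10) & (63 * 64)) + ((xfrac >> 16) & 63);
            packed |= (uint32_t)flat[spot] << (8 * b);
            xfrac += xstep;
            yfrac += ystep;
        }
        *dest++ = packed;
    }
}
```

A real version would also need byte writes for span starts and ends that are not DWORD-aligned, which is where part of the gain would be lost.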

Reply 773 of 987, by ViTi95

Rank Member

Some quick testing with the latest FastDoom dev build, on a Toshiba T2130CS (486DX4-75, PCI C&T 65545 (1 MB), 12 MB of RAM and no L2 cache). I've used SciTech Display Doctor 6.53 to enable VESA modes and LFB. The screen size used is fullscreen + HUD (no border), with sound and music disabled.

[Attachment: fdoom095.png, "Quick benchmark FastDoom 0.9.5", license CC-BY-4.0]

In this case we can see that Mode 13H is faster than Mode X and Mode VBD in high detail, maybe because the 25 MHz bus is better used with 32-bit copies from the backbuffer to VRAM. We can also see that writing less data to the video card is plain faster, as the difference in potato detail is huge in Mode X compared to the other modes. I guess it's OK to think that fewer writes to VRAM are faster, and that 32-bit copies from RAM are faster than 8-bit direct writes to VRAM.

Reply 774 of 987, by rasz_pl

Rank l33t
ViTi95 wrote on 2023-04-25, 22:35:

In this case we can see that Mode 13H is faster compared to Mode X and Mode VBD in high detail, maybe because the 25 MHz bus is better used with 32-bit copies from backbuffer to VRAM.

Just to recap: in High, X and VBE do direct byte writes to VGA RAM; the only difference is in the VGA chipset's video mode handling. 13h uses a backbuffer.
Not a Pentium II, so no MTRR, so BYTE writes should hurt just as much as DWORD ones.

ViTi95 wrote on 2023-04-25, 22:35:

Also we can see that writting less data to the video card is plain faster, as the difference in potato detail is huge compared in Mode X compared to the other modes. I guess it's ok to think that less writes to the VRAM is faster, and that 32-bit copies from RAM are faster than 8-bit direct writes to VRAM.

Very difficult to get clues from this, so many unknowns and moving parts. PCI card, maybe comparable with VLB? Inapplicable to the ISA case 🙁 Can't really get any clues about ISA due to the vastly different bus throughput, and we aren't answering the question about the difference between 2 consecutive WORD ISA writes versus 1 DWORD one.
We don't know anything about the ratio of R_DrawColumn to R_DrawColumn/R_DrawSpan overall. A blitter can only potentially speed up R_DrawColumn by cutting the number of ISA transactions; it could maybe also enable a low detail 160x100 video mode by quickly doubling lines.

For convenience's sake I pretend sprites don't exist (no overdraw and no readback). Dropping the detail level cuts the number of R_DrawColumn calls and the length of each R_DrawSpan.

High:
X 64K VGA ram writes in 64K PCI transactions
13h 64K BYTE sys ram writes + 16K sys ram reads on 16 KB cache system + 16K VGA ram writes in 16K PCI transactions
VBD 64K VGA ram writes in 64K PCI transactions

Low:
X 2x less R_DrawColumn/R_DrawSpan, 32K VGA ram writes in 32K PCI transactions
13h 2x less R_DrawColumn/R_DrawSpan, 32K WORD sys ram writes + 16K sys ram reads on 16 KB cache system + 16K VGA ram writes in 16K PCI transactions
VBD 2x less R_DrawColumn/R_DrawSpan, 32K VGA ram writes in 32K PCI transactions

Potato:
X 4x less R_DrawColumn/R_DrawSpan, 16K VGA ram writes in 16K PCI transactions
13h 4x less R_DrawColumn/R_DrawSpan, 16K DWORD sys ram writes + 16K sys ram reads on 16 KB cache system + 16K VGA ram writes in 32K PCI transactions
13h 4x less R_DrawColumn/R_DrawSpan, 32K WORD sys ram writes + 16K sys ram reads on 16 KB cache system + 16K VGA ram writes in 16K PCI transactions
VBD 4x less R_DrawColumn/R_DrawSpan, 16K VGA ram writes in 16K PCI transactions
VBD 4x less R_DrawColumn/R_DrawSpan, 32K VGA ram writes in 32K PCI transactions
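The transaction counts in this list follow from simple arithmetic on a 320x200 frame. The sketch below reproduces them under the same idealisation (no sprites, no overdraw); the function names and the "pixels covered per write" parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { SCREEN_PIXELS = 320 * 200 };  /* 64000, the "64K" in the list above */

/* Bus transactions when rendering straight to VRAM, where each write
 * covers `pixels_per_write` screen pixels (1 in high detail, more when a
 * single write paints several duplicated pixels). */
static uint32_t direct_transactions(uint32_t pixels_per_write)
{
    assert(pixels_per_write > 0);
    return SCREEN_PIXELS / pixels_per_write;
}

/* Bus transactions for a backbuffered mode: the finished frame is copied
 * once in 32-bit units, whatever the detail level. */
static uint32_t backbuffer_transactions(void)
{
    return SCREEN_PIXELS / 4;
}
```

Under these assumptions high detail costs 64000 transactions when rendering directly and 16000 with a backbuffer copied in DWORDs, which matches the 64K and 16K figures above.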

What I can read from this:
- 13h vs X, high: 4x fewer PCI transactions compensate for 64K sys ram writes + 16K reads, enough to be 20% faster, suggesting PCI transactions are still a huge bottleneck 😮
- X high/low ~70% difference due to 2x less R_DrawColumn/R_DrawSpan, 2x less PCI transactions
- X low/potato ~55% difference due to 2x less R_DrawColumn/R_DrawSpan, 2x less PCI transactions
- 13h high/low ~32% difference due to 2x less R_DrawColumn/R_DrawSpan, 2x less ram writes
- 13h low/potato ~20% difference due to 2x less R_DrawColumn/R_DrawSpan
- VBD high/low ~70% difference due to 2x less R_DrawColumn/R_DrawSpan, 2x less PCI transactions
- VBD low/potato ~10% difference due to 2x less R_DrawColumn/R_DrawSpan

1 Cutting the number of R_DrawColumn/R_DrawSpan calls and PCI transactions makes a huge difference, while cutting just R_DrawColumn/R_DrawSpan is ~2x less beneficial.
2 Potato is not scaling in 13h/VBD because we only save a little CPU while retaining the same number of VGA transfers.
3 Why such a huge gap between X and 13h in potato? Both do the same amount of computation and PCI transactions, and X has to read back slow VGA RAM for invisible sprites. Are 32K WORD sys RAM backbuffer writes so costly on a quite peppy DX4-75?

Last edited by rasz_pl on 2023-04-26, 17:09. Edited 1 time in total.

Reply 775 of 987, by ViTi95

Rank Member

Some clarifications:

- Potato mode in mode 13h / VBD is using two 16-bit writes to RAM backbuffer / VRAM for each pixel drawn.
- Mode X benefits from VGA planes: in potato mode it's possible to write 4 pixels with just a single byte. This optimization is not possible in VBD modes (that's why I didn't add support for low/potato detail modes in modes 13h/VBD till now)
- Mode X and mode VBD have the problem of drawing more data to VRAM than backbuffered modes, just because sprites are drawn over already drawn scenery.
- I used timedemo demo3 from Ultimate Doom. This demo has multiple frames with invisible objects rendering. This is also slower on Mode X / Mode VBD due to reads required from the VRAM.

Maybe this answers questions 2, 3 and 4 😅
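Why Mode X gets four potato pixels out of one byte write is easier to see with a toy model of planar VGA memory. This is a simplified simulation (four byte planes plus a map mask), not real register access, and the names `PlanarVga`, `vga_write` and `vga_pixel` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLANES 4
#define PLANE_SIZE 16000  /* 320x200 pixels / 4 planes */

/* Toy model of unchained (Mode X) VGA memory: four byte planes sharing one
 * address space, with a map mask choosing which planes accept a write. */
typedef struct {
    uint8_t plane[PLANES][PLANE_SIZE];
    uint8_t map_mask;  /* bit n set => plane n is written */
} PlanarVga;

/* One CPU byte write; the mask fans it out to every enabled plane. */
static void vga_write(PlanarVga *v, size_t offset, uint8_t color)
{
    assert(offset < PLANE_SIZE);
    for (int p = 0; p < PLANES; p++)
        if (v->map_mask & (1u << p))
            v->plane[p][offset] = color;
}

/* Pixel x of scanline y lives in plane (x & 3) at offset y*80 + (x >> 2). */
static uint8_t vga_pixel(const PlanarVga *v, int x, int y)
{
    return v->plane[x & 3][(size_t)y * 80 + (size_t)(x >> 2)];
}
```

With `map_mask = 0x0F`, a single `vga_write` sets four horizontally adjacent pixels, which is the potato-mode shortcut; a linear mode such as 13h or VBD has no equivalent fan-out.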

Reply 776 of 987, by rasz_pl

Rank l33t
ViTi95 wrote on 2023-04-26, 06:22:

- Potato mode in mode 13h / VBD is using two 16-bit writes to RAM backbuffer / VRAM for each pixel drawn.
- Mode X benefits from VGA planes, in potato mode it's possible to write 4 pixels with just a single byte. This optimization is not possible on VBD modes (that's why I didn't add support for low/detail modes in modes 13h/VBD till now)

Hmm I think I might have just been remembering Re: FastDoom. A new Doom port for DOS, optimized to be as fast as possible for 386/486 personal computers!
it's 16-bit all right
https://github.com/viti95/FastDoom/blob/509cd … vbe2dp.asm#L111
https://github.com/viti95/FastDoom/blob/509cd … inearp.asm#L113
https://github.com/viti95/FastDoom/blob/509cd … vbe2dp.asm#L230
https://github.com/viti95/FastDoom/blob/509cd … inearp.asm#L233
I wonder how the M-HT change would do on PCI/VLB cards. Kinda pointless from a practical point of view, as anyone with PCI/VLB would have a computer fast enough for High detail 😀 but it's definitely possible and has the potential of being much faster. Also, as stated earlier, I'm wondering about the potential _difference between 2 consecutive WORD ISA writes versus 1 DWORD one_; maybe there is something to be won here.
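The two store patterns being compared can be written out in C. This is only an illustration of "two 16-bit writes versus one 32-bit write" for a potato pixel; the helper names are hypothetical and do not correspond to the linked assembly routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replicate one 8-bit palette index into all four bytes of a DWORD. */
static uint32_t replicate4(uint8_t color)
{
    return (uint32_t)color * 0x01010101u;
}

/* Potato pixel as two 16-bit stores (two bus transactions). */
static void put_potato_2x16(uint16_t *dest, uint8_t color)
{
    assert(dest != NULL);
    uint16_t pair = (uint16_t)(color | (color << 8));
    dest[0] = pair;
    dest[1] = pair;
}

/* The same four screen pixels as a single 32-bit store. */
static void put_potato_1x32(uint32_t *dest, uint8_t color)
{
    assert(dest != NULL);
    *dest = replicate4(color);
}
```

Both variants leave identical bytes in memory; whether the single DWORD store is actually faster on a 16-bit ISA bus is the open question raised above.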

so its Potato:
13h 4x less R_DrawColumn/R_DrawSpan, 32K WORD sys ram writes + 16K sys ram reads on 16 KB cache system + 16K VGA ram writes in 16K PCI transactions
VBD 4x less R_DrawColumn/R_DrawSpan, 32K VGA ram writes in 32K PCI transactions

I'll edit my conclusions in the previous post

ViTi95 wrote on 2023-04-26, 06:22:

- Mode X and mode VBD have the problem of drawing more data to VRAM than backbuffered modes, just because sprites are drawn over already drawn scenery.

I consciously decided to ignore sprites for simplicity's sake 😀. Better instrumentation would help a lot here in understanding how much more load is due to sprites.

Reply 777 of 987, by ViTi95

Rank Member

Time for a new release. FastDoom 0.9.5.

Changelog:
* Sigma Designs Color 400 support. 320x200 and 16 colors. Tested only in 86Box, as I cannot make my card work on my 386/486 PCs.
* Added low and potato detail for backbuffered and VBE2 direct modes
* Removed FDOOME80.EXE and FDOOMEW1.EXE, not needed with new detail modes. They were also very half-baked; the HUD was basically unreadable
* Removed FDOOMV16.EXE, didn't fit properly in the project. It's a cool hack to get VGA 160x200 and 16 colors, but Doom already supports 256-color modes, which are also much faster
* Removed FastDoom VGA vertical mode. Started as a joke, and I don't feel it matches the project spirit. If people want it back, I'll reconsider it.
* Speed up CGA, EGA, Plantronics and Hercules video modes by optimizing in ASM the backbuffer->VRAM copy routines
* Better benchmarking options, now FastDoom is able to generate CSV data (easier to import in Excel)
* Added BENCH.BAT script. This script makes it easier to benchmark FastDoom. How to use it:
--* bench.bat {type} {exe} {iwad} {lmp}
--* {type}: {phil, quick, normal, full}
----* {phil}: Same benchmarks as defined by PhilsComputerLab DOOM benchmark
----* {quick}: Full screen + HUD, tests potato, low and high detail modes
----* {normal}: Full screen + HUD, tests potato, low and high detail modes, and different visplane rendering modes
----* {full}: Multiple screen sizes, tests all detail settings (will take a long time if the system is slow)
--* {exe}: {fdoom.exe, fdoom13h.exe, ...}
--* {iwad}: {doom1.wad, doomu.wad, doom2.wad, ...}
--* {lmp}: {demo1, demo2, ...}
--* Example: bench.bat phil fdoom.exe doom1.wad demo3
--* Results are stored in the file "bench.csv"
* New invisible object rendering mode, now it's possible to use Heretic/Hexen translucency for objects. Cached tintmap files are stored in binary .TCF files. This rendering method only looks great on 256-color modes.
* Sound FX support for OPL2LPT and OPL3LPT devices.
* Enabled Ensoniq Soundscape music and sound FX devices. Not tested as I don't have one of these devices.
* Renamed multiple command line arguments. Take a look at the README.TXT to see what has changed.
* Fixed issue #139
* Removed "-simplestatusbar" command line parameter. Performance difference was pretty much none, and looked terribly bad.
* Cleanup unused Extended MIDI support from Apogee Sound System.
* Small optimizations for rendering code

Grab it here:

https://github.com/viti95/FastDoom/releases/tag/0.9.5

Reply 779 of 987, by maxtherabbit

Rank l33t
ViTi95 wrote on 2023-04-28, 10:44:
Time for a new release. FastDoom 0.9.5. […]

anything on the super quiet OPL music?