Reply 340 of 862, by Maraakate
Rank: Oldbie
Looks cool. I never really quite understood inlining ATT ASM into the C files, but a nicer example like this makes it a lot easier. Thanks!
Glad to post it. After some testing, I wonder whether it's humanly possible to replicate Abrash code. 😀
This code is almost working in R_PolysetCalcGradients() and should provide a template for other lines in that function:
// r_lstepx = (int)ceil((t1 * p01_minus_p21 - t0 * p11_minus_p21) * xstepdenominv);
__asm__ ("fld %1\n\t"
"fld %2\n\t"
"fmulp\n\t" // st1
"fld %3\n\t" // 2
"fld %4\n\t" // 3
"fmulp\n\t" // 4
"fxch %%st(4)\n\t"
"fsubrp\n\t"
"fld %5\n\t" // 5
"fmulp\n\t" // 6
"fistp %0"
: "=m" (r_lstepx)
: "m" (t1), "m" (p01_minus_p21), "m" (t0), "m" (p11_minus_p21), "m" (xstepdenominv)
);
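For reference, here is a plain-C sketch of what the commented line computes, next to what a bare fistp would store. It assumes the FPU control word is at its default round-to-nearest setting: fistp rounds by the current control word and does not do a ceil on its own. The helper names are made up for illustration, and the math is done in double, which only approximates the x87's internal precision.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; none of these names come from Quake 2. */

/* Largest integer <= x. Only valid for values that fit in an int. */
static int floor_to_int(double x)
{
    int i;
    assert(x > -2147483648.0 && x < 2147483647.0);
    i = (int)x;                         /* C truncates toward zero */
    return ((double)i > x) ? i - 1 : i;
}

/* Smallest integer >= x: the result of (int)ceil(x). */
static int ceil_to_int(double x)
{
    int i;
    assert(x > -2147483648.0 && x < 2147483647.0);
    i = (int)x;
    return ((double)i < x) ? i + 1 : i;
}

/* Round to nearest, ties to even: what fistp stores under the
   default control word (assumption: rounding mode left unchanged). */
static int nearest_even_to_int(double x)
{
    int lo = floor_to_int(x);
    double frac = x - (double)lo;
    if (frac > 0.5) return lo + 1;
    if (frac < 0.5) return lo;
    return (lo % 2 == 0) ? lo : lo + 1;  /* exact tie: pick the even neighbour */
}

/* The unrounded expression from the commented C line. */
static double lstepx_value(float t0, float t1, float p01_minus_p21,
                           float p11_minus_p21, float xstepdenominv)
{
    return ((double)t1 * p01_minus_p21 - (double)t0 * p11_minus_p21)
           * xstepdenominv;
}

/* What the C line asks for. */
int lstepx_ceil(float t0, float t1, float p01_minus_p21,
                float p11_minus_p21, float xstepdenominv)
{
    return ceil_to_int(lstepx_value(t0, t1, p01_minus_p21,
                                    p11_minus_p21, xstepdenominv));
}

/* What a bare fistp would store with default rounding. */
int lstepx_fistp_default(float t0, float t1, float p01_minus_p21,
                         float p11_minus_p21, float xstepdenominv)
{
    return nearest_even_to_int(lstepx_value(t0, t1, p01_minus_p21,
                                            p11_minus_p21, xstepdenominv));
}
```

With t0 = t1 = 1, p01_minus_p21 = 9, p11_minus_p21 = 4 and xstepdenominv = 0.25 the unrounded value is 1.25, so the ceil version gives 2 while the round-to-nearest version gives 1. A one-off step error like that could plausibly show up as flicker, so the asm may need the control word switched to round-up around the fistp.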
Almost working? What is left? We've been working on expanding the ref_soft driver into a DXE for the possibility (no breath holding, please) of adding 3dfx support, as someone ported the glide3x library to DJGPP with hardware rendering about 10 years ago. It will be quite a project since none of us have done OpenGL or Glide programming before.
I think the calculation is not correct yet because some models flicker. 🙁 It's probably an error in keeping track of the stack, but if it eventually works, then it should take advantage of the parallelism between CPU and FPU instructions on a real DOS machine (although Abrash probably profiled the changes so the parallelism was optimal).
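One way to chase a stack-tracking error is to count register depth instruction by instruction. Below is a minimal sketch of that idea: a hypothetical checker (the function and array names are mine, not from any toolchain) that models only how many x87 registers hold a value. It ignores the arithmetic itself, including the operand order of fsubp versus fsubrp.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The sequence as posted, with memory operands dropped. */
const char *const posted_ops[] = {
    "fld", "fld", "fmulp", "fld", "fld", "fmulp",
    "fxch 4", "fsubrp", "fld", "fmulp", "fistp"
};

/* The same sequence without the fxch, for comparison. */
const char *const no_fxch_ops[] = {
    "fld", "fld", "fmulp", "fld", "fld", "fmulp",
    "fsubrp", "fld", "fmulp", "fistp"
};

/* Returns the 0-based index of the first instruction that reads an
   empty register, overflows the 8-entry stack, or is not recognised.
   Returns -1 if the whole sequence is clean. */
int first_stack_fault(const char *const *ops, int count)
{
    int depth = 0;   /* number of registers currently holding a value */
    int i;

    assert(ops != NULL && count >= 0);

    for (i = 0; i < count; i++) {
        int reg = 0;

        if (strcmp(ops[i], "fld") == 0) {
            if (depth == 8) return i;        /* push onto a full stack */
            depth++;
        } else if (sscanf(ops[i], "fxch %d", &reg) == 1) {
            if (reg >= depth) return i;      /* st(reg) is empty */
        } else if (strcmp(ops[i], "fistp") == 0) {
            if (depth < 1) return i;         /* nothing to store */
            depth--;
        } else if (strcmp(ops[i], "fmulp") == 0 ||
                   strcmp(ops[i], "fsubp") == 0 ||
                   strcmp(ops[i], "fsubrp") == 0) {
            if (depth < 2) return i;         /* needs st(0) and st(1) */
            depth--;                         /* two operands in, one result out */
        } else {
            return i;                        /* unknown mnemonic */
        }
    }
    return -1;
}
```

Calling first_stack_fault(posted_ops, 11) returns 6, the fxch line: only two values are on the stack at that point, so st(4) is empty. The variant without the fxch runs clean and leaves the stack empty after the fistp. That only shows the depth bookkeeping works out; whether the subtraction then has the right sign still needs checking on real hardware.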
I enjoyed seeing your latest additions toward the ref_soft dxe! That's great work. I wonder whether that glide port is the same as Mesa3d did (perhaps version 3.1). I think that's a great idea and it's worth trying. I can advise a bit if it helps.
(Also, I think -march and -mtune are overlapping; perhaps -march is sufficient, although it requires that architecture or above, whereas -mtune just provides some of that architecture's optimizations while allowing for fallback to older CPUs.)
Regarding -march and -mtune, are you referring to the SSE issues? That's more of a small pet project for myself to squeeze a few extra frames out of my P3 550. There are no real plans right now to turn that into a true stable build. But if you know of better compiler flags that still give some extra performance, then I'm all for it.
The glide port is related to the earlier Mesa project. It seems version 6.5.0 is around when someone removed it from Mesa. Compiling the Mesa version requires glide3x.dxe and the glide3x headers from the SDK. Sezero suggested we just write our own ref_glide instead of introducing the overhead of Mesa converting the OpenGL back to Glide, etc.
I figured a good way to eventually start on it is to recompile the glide3x project (still on sourceforge) and try making a few small test programs from the SDK documents (all available on falconfly). After that, take the ref_gl, stub most of it out, work towards at least getting the 2D drawing (HUD, console, fonts, etc.) to draw, and then start adding the rest back in one piece at a time. I haven't tried compiling the glide3x project yet as it requires perl and I haven't bothered to install it.
I'm also really interested in the glide project because it should give a dramatic boost in speed and quality. To put it in perspective... Q2DOS runs at about 20-30 fps at 320x240 on my P1 200MHz with an S3 Trident. It also has one Voodoo 2 in it. Using the ref_gl in Windows I can get about 40-60 fps at 640x480 (the lowest the drivers will allow) with wickedgl, but I haven't tested the original miniGL drivers. That's still a nice improvement and it will allow newer'ish maps but not larger textures, at least not for a Voodoo 3 and earlier.
Also, you should be able to test your code in Windows XP now to get better indications on timedemos. You're limited to sw_mode 1 (320x240 fails for me and so do the banked modes) and no sound. I tried revisiting VDMSound recently, but I can't even get it to work right with wolf3d anymore, so maybe some update in Windows XP post-SP3 killed it. I really don't remember when I last had it working... but it was a few years ago, as I was playing Descent over Kali with VDMSound working properly.
I think the ntvdm sound is not very robust and that some later XP updates didn't help (as you mentioned). VDMSound wasn't much better (on newer XP versions). I also read that NT4 doesn't have sound support, so actually you have full compatibility with NT4 ntvdm. 😀 I was able to run the 640x480 banked mode in some cases, too, but not with all older q2dos versions. I actually tried a few small media players and most didn't play in ntvdm (and they all used similar auto-init DMA procedures in their source code). No wonder MS excluded ntvdm altogether in 64-bit OSes, even though there are reports that they had multiple paths toward CPU emulation (such as in their RISC OS NT builds with ntvdm).
The SSE builds are a very good feature. I've wondered whether the SSE optimized builds based on C-only code outperform the C/fpu builds on the same system. I don't think there are any additional flags which could improve upon your current ones.
I also read that Mesa had bit rot in their dos/opengl(Voodoo) compatibility with later versions (before it was removed altogether). I've heard that it's not very robust in this scenario, too. I like your idea instead of a refglide interface, especially if there are already code samples available. The increased performance potential is promising.
The 3dfx project should be fun in any case. And it may allow a backport to Q1.
It should easily allow a backport to Q1, which would be good news for my QDOS port and for leillol's ports as well.
I tried doing straight C compiles with SSE before and it was slower than compiling with the ASM in. But I don't remember specifically whether it was faster than regular q2dos in those regards.
After we separated the ref_soft into DXE I found whatever optimizations are happening... the meat is in the ref_soft compiler optimizations somewhere. It may be worth looking into "upgrading" abrash code someday for potential benefits... but I'm getting way ahead of myself here. 😀
Thank you for sharing your results. The performance numbers are interesting. I've also wondered whether there is any need of lookup tables to speed up some calculations and whether precision can be (further) relaxed in some of the program flow.
Leilei once mentioned tinygl as a shim, but it has just a subset of the necessary functions (although I recall that a project, perhaps on the Amiga?, used it successfully to port a game).
As an aside, there is an opengl software rasterizer here: Quake3 with software rasterizer. It compiles in mingw, but I have an older 6.4 version. I have to go on recollection only, but it may just require a simple code change to switch the output mode from "software" to "glide". This method sounds appealing to experiment with.
We're trying to omit any kind of library overhead for this project... but here is the glide3x project if you want to try and fix any code rot it might have in DJGPP:
http://sourceforge.net/projects/glide/
I think when it was made it required DJGPP 2.04 (especially since it uses DXE) so a transition to 2.05 might not be as terrible.
That's a very reasonable approach! I'll familiarize myself with it and try to build it, too. Do you also think it's worthwhile to work on that above asm function or that its impact is minimal? I can switch to glide3x instead (I'm not much of a multitasker, much like DOS).
How close are you on the ASM? I haven't tested any of it yet. Been doing too much binge coding lately, not enough playing the game so I've been doing some minimal touches here and there the past week and sezero has been doing a lot of clean up.
I completely understand.
I think the asm is halfway done at best, but its impact seems nil. I haven't verified that outside dosbox, but I'm not optimistic that the function has a significant impact on a P166-class system running the software renderer (much like the particle code).
I just emptied the function and it increased fps ~5% (but that's no code at all in that function). 😀
If you make a regular compile with the current code (can be hard linked for easier distribution) and a separate binary with the latest ASM I can verify it across 3 DOS machines and let you know.
Thanks! I'll attach something within a few minutes. I also downloaded glide3x and the makefile seems in good shape. It eventually hits an error with the newer gcc rules, but the error should be easily fixed.
The test binary is attached, but it only uses the presumed working asm code.
Timedemo speed is almost the same (about a 0.5 difference when run multiple times for each). I noticed particles are actually missing. Maybe the blend particle functions aren't working right? Anyway, q2sse still beats them by 6-7 frames on average on the timedemo, and it's generally a smoother, more consistent framerate overall compared to regular q2dos.
That's helpful, thanks. At least the above documents the code for later use.
I'll test the glide3 instead. 😀 If I can bypass some errors, I'll post the results for verification.