VOGONS


Reply 40 of 54, by Scali

truth5678 wrote:

The Quake renderer may not be elegant underneath the surface

Depends on what you consider 'elegant', I suppose. I think the renderer is elegant in the sense that it reaches pretty much optimal performance on the Pentium architecture by firing off FPU instructions that overlap with the texture-mapping inner loop.

truth5678 wrote:

but I think it would be difficult to improve upon, as you noted, without losing too much visual detail.

Yes, it's a difficult situation.
On the one hand we know that a Quake-ish game should be possible on a 486, since Descent has a very similar engine, but more 486-friendly.
On the other hand, Quake is designed from the ground up to target Pentium, which results in better image quality than Descent (although Descent is certainly still very acceptable).
Where is the right balance between the two? That is difficult to say. Quake levels are larger and more complex than Descent-levels. Would it be possible to have a Descent-like engine that renders Quake-levels? And would a 486 still perform about as well as it does in Descent now?
Or should the Quake level-code be used as-is, and just a more Descent-like renderer plugged in? And if so, how close can you get to Descent-performance with that?

In an ideal world, I would just write a 486-optimized BSP-renderer from scratch. Something that is compatible with the Quake level-data, but designed from the ground up to work on a 486.
But a more realistic course of action I think would be to do what I already suggested before: I think it's a safe assumption that the biggest bottleneck is in the texturemapper, since that is run for every pixel on the screen.
So replacing that with a 486-optimized one would be a good start.
Once that part is optimized, one can look at the next bottleneck, which might be in the handling of the BSP and the transform/lighting code.
So, basically you'd do a 486-rewrite, but stage by stage, going for the biggest bottleneck first.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 41 of 54, by leileilol

The second big bottleneck I know of is building the surface cache, which blends the lightmap in for every 16 texels, in r_surf.c (and surf8.s). I tried crippling the smoothing in surf8.s (as surf8fst.s) to make it a tiiiiiiny bit faster, but didn't do the same in C.
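To make the cost concrete, here is a rough C sketch of what "blending the lightmap for every 16 texels" amounts to. It is not the code from r_surf.c or surf8.s; the function name, the flat colormap layout (one 256-entry row per light level) and the 64 light levels are assumptions for illustration. The light level is interpolated between four corner samples and every texel goes through a table lookup:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 16         /* one lightmap sample per 16 texels */
#define LIGHT_LEVELS 64  /* assumed number of rows in the colormap */

/* Hypothetical helper: light one 16x16 block of an 8-bit texture.
   l00/l10 are the light levels at the top-left/top-right corners,
   l01/l11 at the bottom-left/bottom-right corners.
   colormap is assumed flat: colormap[level * 256 + texel]. */
void light_block(const uint8_t *tex, int tex_stride,
                 uint8_t *out, int out_stride,
                 const uint8_t *colormap,
                 int l00, int l10, int l01, int l11)
{
    assert(l00 >= 0 && l00 < LIGHT_LEVELS && l10 >= 0 && l10 < LIGHT_LEVELS);
    assert(l01 >= 0 && l01 < LIGHT_LEVELS && l11 >= 0 && l11 < LIGHT_LEVELS);

    for (int y = 0; y < BLOCK; y++) {
        /* Interpolate down the left and right edges (values scaled by 16). */
        int left  = l00 * (BLOCK - y) + l01 * y;
        int right = l10 * (BLOCK - y) + l11 * y;
        int light = left * BLOCK;   /* light level scaled by 256 */
        int step  = right - left;   /* per-texel step, same scale */

        for (int x = 0; x < BLOCK; x++) {
            int level = light >> 8;
            out[y * out_stride + x] =
                colormap[level * 256 + tex[y * tex_stride + x]];
            light += step;
        }
    }
}
```

The per-texel work is small, but it runs for every texel of every surface that enters the cache, which is why a cheaper or skipped blend is a tempting experiment.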

I think the model rendering code is very very fast though. Maybe the world could be drawn with the model code? You'd still have to sample the lightmap to get some lighting data for each vertex 😀

disclaimer: I have an extreme difficulty in understanding any assembly.

Last edited by leileilol on 2015-01-07, 08:31. Edited 2 times in total.

long live PCem

Reply 42 of 54, by truth_deleted

That sounds like an excellent way forward. 😀 I guess that could be prototyped from existing C code.

Also, that's very interesting that they optimized the relative use of the FPU and CPU. It makes it even more difficult to gain optimization from one without also optimizing the other unit.

Edit: is there a way to alter the use of the lightmap for testing (just in the C routine)?

Edit2: oh!

Last edited by truth_deleted on 2015-01-07, 08:32. Edited 1 time in total.

Reply 43 of 54, by leileilol

There's r_fullbright, but it still runs the light blending process, so there is no real performance improvement. It just forces all the light data to a fixed value. r_drawflat, on the other hand....... is in C and is the fastest way to 'draw' the world.

Last edited by leileilol on 2015-01-07, 08:32. Edited 1 time in total.


Reply 45 of 54, by Scali

leileilol wrote:

The second big bottleneck I know of is building the surface cache, which blends the lightmap in for every 16 texels, in r_surf.c (and surf8.s). I tried crippling the smoothing in surf8.s (as surf8fst.s) to make it a tiiiiiiny bit faster, but didn't do the same in C.

I think the model rendering code is very very fast though. Maybe the world could be drawn with the model code? You'd still have to sample the lightmap to get some lighting data for each vertex 😀

Descent seems to do all its lighting on-the-fly with a simple shademap.
I wonder how fast Quake would run on a 486 if you'd just skip the surfacecache. So just fixed lighting for all textures, no recalculation.
It would give somewhat of an idea of the balance between the lighting code and the actual rendering code on 486. So we'd know which part is most interesting to optimize/redesign 😀
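As a sketch of what lighting on-the-fly with a shademap could look like in the texture mapper itself: the span drawer below does an affine walk over a 64x64 8-bit texture and pushes every texel through one row of a shade table. The function name, the 64x64 texture size, the 16.16 fixed-point format and the single light level per span are all assumptions for illustration, not Descent's or Quake's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical span drawer with lighting applied per pixel.
   tex:       64x64 8-bit texture
   shade_row: 256-entry table for this span's light level
   u, v:      16.16 fixed-point texture coordinates
   du, dv:    16.16 fixed-point steps per pixel */
void draw_span_shaded(uint8_t *dest, int count,
                      const uint8_t *tex, const uint8_t *shade_row,
                      uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    assert(count >= 0);

    while (count-- > 0) {
        unsigned tx = (u >> 16) & 63;  /* wrap inside the texture */
        unsigned ty = (v >> 16) & 63;
        *dest++ = shade_row[tex[ty * 64 + tx]];
        u += du;
        v += dv;
    }
}
```

Compared to a surface cache, this costs one extra table lookup per pixel drawn, but nothing has to be rebuilt when the lighting changes.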


Reply 47 of 54, by Scali

truth5678 wrote:

That's a good idea, something not too difficult to test.

Yes, I was just thinking... you don't rule out the rest of the level code yet, that way...
But that could probably be done by just creating a very simple level... Say just 1 or 2 simple rooms, so that there is minimal overhead in determining what to draw.


Reply 48 of 54, by leileilol

I used to use the entitytest map to test graphical features. Fairly basic wide box map with water and a skybrush that moves, and lots of entities (which might kill a 486 now that I think about it) and a colored lighting file and watervis, which is even more irrelevant for 486ing.


Reply 49 of 54, by qbism

Regarding compiler optimization, the Intel embedded compiler gave a 10fps boost to Pocket PC Quake vs free gcc compiler at the time. Either Intel was better optimized, or they knew the secret TurboBoost opcode 😀

I tend to overuse statics, but careful choice of static variables will speed up spans. Otherwise the compiled code might look up the variable every time. Static variables will be stored in registers where possible. It might be best to do only the inner loop variables because there are only a few registers. In my current code I've made nearly all span vars static. Someday I need to check if cutting back matters for FPS.

Reply 50 of 54, by Scali

qbism wrote:

Regarding compiler optimization, the Intel embedded compiler gave a 10fps boost to Pocket PC Quake vs free gcc compiler at the time. Either Intel was better optimized, or they knew the secret TurboBoost opcode 😀

gcc is only good on x86. And only in recent years.

qbism wrote:

I tend to overuse statics, but careful choice of static variables will speed up spans. Otherwise the compiled code might look up the variable every time. Static variables will be stored in registers where possible. It might be best to do only the inner loop variables because there are only a few registers. In my current code I've made nearly all span vars static. Someday I need to check if cutting back matters for FPS.

Hum, that doesn't sound right. I suppose you mean static as in global variables? As opposed to local ones? (Making local variables static means that they are placed in the global data section, which also means their values will be persistent across multiple calls, the symbols just remain local).
The only difference is that global variables are addressed directly, where local ones are relative to the stackframe (depending on how your compiler is configured, it will address them with ebp or esp + offset).
Regardless of the type of variable, the compiler will try to use registers whenever possible.
Static variables should not have some kind of 'higher priority' when the compiler is allocating registers. So in that sense it shouldn't matter.
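A minimal C illustration of the storage difference (function names are made up for the example): the automatic local is recreated on every call, while the static local lives in the data section and keeps its value between calls. Neither form says anything about register allocation.

```c
/* Automatic local: lives in the stack frame (or a register),
   and is re-initialised on every call. */
int next_auto(void)
{
    int calls = 0;
    calls++;
    return calls;   /* always 1 */
}

/* Static local: placed in the global data section, initialised once,
   value persists across calls. Only the symbol is local. */
int next_static(void)
{
    static int calls = 0;
    calls++;
    return calls;   /* 1, 2, 3, ... on successive calls */
}
```

Calling each function twice in a row gives 1 and 1 for next_auto, but 1 and 2 for next_static.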


Reply 51 of 54, by truth_deleted

I noted your idea about r_drawflat, and so when I scanned the source code today for the drawspans, I recognized that parameter which appears to circumvent drawing surfaces across the game world. Ran a quick test for r_drawflat=1 and found a reduction from 30s to 20s in a typical 640x400 timedemo. So, if I understand the technology correctly, then the best scenario would be about a third less run time (roughly 33%) if there is no surface drawing. 😀 I verified that I can do significantly better than r_drawflat=1 by reducing the display from 640h to 320h and maintain the surfaces (timedemo 30s vs. 15s).
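Since a timedemo plays a fixed number of frames, the same pair of timings can be read either as time saved or as frame-rate gained, and the two percentages differ. A small C sketch of the arithmetic (the helper names are made up for this example):

```c
#include <assert.h>

/* Percentage of run time saved when a fixed-length demo
   drops from before_s seconds to after_s seconds. */
double time_saved_percent(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return 100.0 * (before_s - after_s) / before_s;
}

/* Percentage gain in average frame rate for the same demo:
   fps is frames / seconds, so the frame count cancels out. */
double fps_gain_percent(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return 100.0 * (before_s / after_s - 1.0);
}
```

With the timings above, 30s to 20s is about 33% less time or 50% more frames per second, and 30s to 15s is 50% less time or 100% more frames per second.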

I think what would be interesting is to compare these differences in the timedemo on a 486. It may tell us a different story than on a Pentium? Would the differences be larger?

I also built leilei's project the other day, and it doesn't scale that well, even if it's fast. Is it possible to try an intermediate resolution, something above what was used but lower than 320h? Does the current project run on a 486?

Reply 52 of 54, by qbism

The issue I observed with non-static variables is that they are reloaded into registers each time a 'for' or 'while' loop is re-entered, even if the values were not modified outside the loop. It makes a difference in span drawing because there are many variables and many repetitions of the while loop.

I don't know if locally declared statics act as any kind of hint to code optimization. If declared globally they can be reused among functions and save memory.

Reply 53 of 54, by Scali

qbism wrote:

The issue I observed with non-static variables is that they are reloaded into registers each time a 'for' or 'while' loop is re-entered, even if the values were not modified outside the loop. It makes a difference in span drawing because there are many variables and many repetitions of the while loop.

Sounds like a very compiler-specific thing... a peephole optimization that works for statics, but not for regular locals.
I doubt you'll see the same behaviour if you use a different compiler.
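If you don't want to depend on what a particular compiler does, one portable habit is to copy the loop state into plain locals before the inner loop and write it back afterwards. The sketch below uses made-up globals standing in for span state; the aliasing remark in the comment is a general C rule and only a guess at what is happening in your build, not something I have verified.

```c
#include <stdint.h>

/* Hypothetical globals standing in for span-drawing state. */
uint8_t *g_dest;
int      g_count;
uint8_t  g_color;

/* Works directly on globals. A store through a byte pointer may
   legally alias any object, so a compiler may feel obliged to
   reload g_count and g_color on every iteration. */
void fill_span_globals(void)
{
    while (g_count > 0) {
        *g_dest++ = g_color;
        g_count--;
    }
}

/* Same result, but the loop only touches locals whose address is
   never taken, so they are free to stay in registers. */
void fill_span_locals(void)
{
    uint8_t *dest = g_dest;
    int count = g_count;
    const uint8_t color = g_color;

    while (count > 0) {
        *dest++ = color;
        count--;
    }

    g_dest = dest;    /* write the state back once */
    g_count = count;
}
```

Both functions leave the buffer and the globals in the same state; any speed difference is up to the compiler, so it would still need measuring.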
