VOGONS


Reply 20 of 54, by leileilol

User metadata
Rank l33t++
Rank
l33t++

I'd suggest these improvements to be done on a blank-slate Winquake engine with a DOS makefile and tested through PCem (since you may not have a 486 to see how the FPU cycles can choke)

apsosig.png

Reply 21 of 54, by Scali

User metadata
Rank l33t
Rank
l33t
leileilol wrote:
meisterister wrote:

Second of all, I think that it would be an interesting project (that I would start if I had the time and/or skill) to try to write a quake engine specifically optimized for the 486.

I'd really like to see this. Here's some nice inspiration to get you going with someone doing an excellent job at 'porting' Quake2 to a inferior-to-486 computer 😀

Yes, the problem with Quake is the perspective-correct texture-mapping, which uses an fdiv for every 16 pixels. On a Pentium-class CPU this is fast, on a 486 it's horribly inefficient.
So to get a Quake-like game working on a 486-class computer, you shouldn't use this type of perspective-corrected texture-mapping. You should either stick to linear texturemapping, or perhaps some other clever approximation of perspective.
The closest thing to a Quake-engine I know, which runs well on a 486, is Descent: http://youtu.be/0MXc_olWk_s
I'm not sure how they did the texturemapping on that one. They did not do subpixel-correction like Quake does at least. The texturing looks like it is perspective-correct, but.. it seems that they only use quads, not triangles. Perhaps they 'get away' with bilinear texturemapping on quads, by keeping the quads small enough to avoid visible distortion.

Edit: apparently there is some info on wiki: en.wikipedia.org/wiki/Descent_(video_game)

The default engine uses a software renderer in which the perspective transformation for texture mapping is only performed once every 32 pixels, causing textures to appear to pop or shift when viewed from certain angles.

So the approach is similar to Quake, just per 32 pixels rather than 16. In that case it may be interesting to modify the Quake renderer to 32-pixel intervals, and see what it does on 486.
And apparently the source code to Descent is available as well: http://www.descent2.com/ddn/sources/descent1/
Sadly the download links are dead... but the code should be available somewhere on the net 😀 Here is a Github that seems to host it: https://github.com/drguildo

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 22 of 54, by leileilol

User metadata
Rank l33t++
Rank
l33t++

The Descent code wouldn't be very useful anyway due to a license conflict IIRC. Probably have a better chance whacking the Abrash span ASM to do 32 pixels than backporting Descent texturing code. Maybe there can be some vertical skipping too

apsosig.png

Reply 23 of 54, by Scali

User metadata
Rank l33t
Rank
l33t
leileilol wrote:

The Descent code wouldn't be very useful anyway due to a license conflict IIRC.

Well no, but the code can be studied, and might give some 'inpsiration' for making the Quake-renderer faster on 486.
But I think with two simple changes you should already get quite far:
1) Make the rasterizer perform perspective correction per 32 pixels instead of 16.
2) Reduce the texture resolution to make better use of the small L1 cache on a 486 (Pentium has 16K, most 486 have 8K).

I don't think removing the subpixel correction would gain all that much, since it's only a bit of per-poly setup, and Quake is very lowpoly. I've done subpixel-corrected renderers on 486 back in the day, and got away with it just fine.
Likewise, I would guess that Descent uses more fixedpoint arithmetic where Quake will use more float. But since that will be per-poly overhead at best, there probably isn't that much to gain, and it'd be quite a bit of work to rewrite that sort of code anyway.
So I think the above two changes will be relatively simple, and probably get you 90% of a fully optimized 486 renderer anyway.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 24 of 54, by leileilol

User metadata
Rank l33t++
Rank
l33t++

Messing around with the vanilla span C code I tried to create 32 and 64 pixel variations 😁

EDIT: quick PCem AM486DX4-120MHz "a bit cache" Fast VLB/PCI S3Trio64 -nosound 320x200 start timerefresh testing

d_subdiv16 0 (ASM): 10.543433
d_subdiv16 1 (default - ASM): 11.253286
d_subdiv16 2 (PocketQuake snippet from the end of page 1 in this thread, C): 9.981296
d_subdiv16 3 (32 pixels, C, not unrolled): 9.386160
d_subdiv16 4 (64 pixels, C, not unrolled): 9.521786

To gain a true speed advantage i'd have to tear the ASM apart 🙁

also had no luck making a 64 pixel version of the pocketquake fixedpoint span drawer

Attachments

  • subdivtest.gif
    Filename
    subdivtest.gif
    File size
    53.47 KiB
    Views
    2078 views
    File license
    Fair use/fair dealing exception

apsosig.png

Reply 25 of 54, by leileilol

User metadata
Rank l33t++
Rank
l33t++

Still can't get the assembly of the 32 pixel spans working, but I did backport the low detail screen stuff from Engoo and fixed a bug in there to align itself properly. The framerate jump's much larger with that.

"Low detail" is rather similar to the same feature in Duke3D - render some 160x100 view and stretch it to the screen. I tried to do 160x200 (ala Doom) but ran into aspect issues with the perspective and particles

I also backported the faster simpler lightmap code from Engoo which is modified assembly to skip the smoothening of the luxels, which makes performance a bit more consistent on the 486. not a huge jump but still a gain.

I may commit these experiments to this branch and only the DOS version will be maintained.

apsosig.png

Reply 26 of 54, by qbism

User metadata
Rank Newbie
Rank
Newbie

The heart of the spans loop contains (at least) two potential areas of improvement.

pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;

The first one is simple: Although it is referenced repeatedly and never changes within drawspans, cachewidth is non-static because it's used across more than one code file. It can be assigned to a static 'cachewidth_local' at the top of the function.
The second is the multiplication x cachewidth itself. The asm spans code precalculates the 'advancetable' to avoid this. If this could be efficiently replicated in C, it would bring it a step closer to asm performance.

Reply 27 of 54, by qbism

User metadata
Rank Newbie
Rank
Newbie

I haven't tried this FP version below, but it's based on a non-FP spans experiment. The idea is to pre-multiply t and tstep with cachewidth so it can be removed in the loop. I expect the precision loss is too much even in 320x200. It's really just a precursor to a more refined approach that would preserve adequate precision in tstep. If this hack doesn't result in a performance improvement, then there's no point to refine it. Optimized gcc for a P4 processor resulted in no FPS gain, so this is where I'll leave it off. It may be a different story on a 486 where multiplications are expensive.

void D_DrawSpans16_FP (espan_t *pspan)  //qbism from PocketQuake
{
static fixed16_t fp_cachewidth; //qb: static fixed-point cachewidth

int count, spancount, spancountminus1;
unsigned char *pbase, *pdest;
fixed16_t s, t, tp;
int zi, sdivz, tdivz, sstep, tstep;
int snext, tnext;
fp_cachewidth = cachewidth * 0x10000; //qb: static fixed-point cachewidth


pbase = (unsigned char *)cacheblock;
//Jacco Biker's fixed point conversion

// Recalc fixed point values
UpdateFixedPointVars16( 1 );
do
{
pdest = (unsigned char *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);

// calculate the initial s/z, t/z, 1/z, s, and t and clamp
sdivz = sdivzorig + pspan->v * sdivzstepv + pspan->u * sdivzstepu;
tdivz = tdivzorig + pspan->v * tdivzstepv + pspan->u * tdivzstepu;
zi = d_ziorigin_fxp + pspan->v * d_zistepv_fxp + pspan->u * d_zistepu_fxp;
if (zi == 0) zi = 1;
s = (((sdivz << 8) / zi) << 8) + sadjust; // 5.27 / 13.19 = 24.8 >> 8 = 16.16
if (s > bbextents) s = bbextents; else if (s < 0) s = 0;
tp = (((tdivz << 8) / zi) << 8) + tadjust;
if (t > bbextentt) t = bbextentt; else if (t < 0) t = 0;

//End Jacco Biker mod

// Manoel Kasimier - begin
count = pspan->count >> 4;
spancount = pspan->count % 16;
// Manoel Kasimier - end
//count = pspan->count;
// if (count >= 16)
// spancount = 16;
//else
//spancount = count;

while (count-- >0) // Manoel Kasimier
{
// calculate s/z, t/z, zi->fixed s and t at far end of span,
// calculate s and t steps across span by shifting
sdivz += sdivzstepu_fix;
tdivz += tdivzstepu_fix;
zi += zistepu_fix;
if (!zi) zi = 1;

snext = (((sdivz<<8)/zi)<<8)+sadjust;
if (snext > bbextents)
snext = bbextents;
else if (snext < 16)
snext = 16; // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture

tnext = (((tdivz<<8)/zi)<<8) + tadjust;
if (tnext > bbextentt)
Show last 92 lines
               tnext = bbextentt;
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps

sstep = (snext - s) >> 4;
// tstep = (tnext - t) >> 4;
tstep = ((tnext - tp) >> 20) * fp_cachewidth; //qb: premultiply cachewidth
t = (tp >> 16) * fp_cachewidth;//qb: premultiply cachewidth

pdest += 16;
pdest[-16] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep; /qb: removed: * cachewidth
pdest[-15] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[-14] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[-13] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[-12] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[-11] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[-10] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -9] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -8] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -7] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -6] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -5] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -4] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -3] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -2] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
pdest[ -1] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
// Manoel Kasimier - end

s = snext;
t = tnext;
// Manoel Kasimier - begin
}
if (spancount > 0)
{
// Manoel Kasimier - end
// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
// can't step off polygon), clamp, calculate s and t steps across
// span by division, biasing steps low so we don't run off the
// texture

spancountminus1 = spancount - 1;
sdivz += sdivzstepu * spancountminus1;
tdivz += tdivzstepu * spancountminus1;
zi += d_zistepu_fxp * spancountminus1;
//if (!zi) zi = 1;
//z = zi;//(float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (((sdivz<<8) / zi)<<8) + sadjust;
if (snext > bbextents)
snext = bbextents;
else if (snext < 16)
snext = 16; // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture

tnext = (((tdivz<<8) / zi)<<8) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps

if (spancount > 1)
{
sstep = ((snext - s)) / ((spancount - 1));
// tstep = ((tnext - t)) / ((spancount - 1));
tstep = (((tnext - t) / (spancount - 1)) >> 16) * fp_cachewidth; //qb: premultiply cachewidth
}
t = (t >> 16) * fp_cachewidth; //qb: premultiply cachewidth

pdest += spancount;
switch (spancount)
{
case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 9: pdest[ -9] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 8: pdest[ -8] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 7: pdest[ -7] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 6: pdest[ -6] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 5: pdest[ -5] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 4: pdest[ -4] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 3: pdest[ -3] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 2: pdest[ -2] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
case 1: pdest[ -1] = pbase[(s >> 16) + (t >> 16)]; s += sstep; t += tstep;
break;
}

}
} while ((pspan = pspan->pnext) != NULL);
}

Reply 28 of 54, by Scali

User metadata
Rank l33t
Rank
l33t

I wonder how well a C compiler will work with code like this.
The texture addressing can be done quite efficiently in assembly. If you don't modify the assembly version, chances are, your optimizations work, but are barely enough to win back the time wasted by the C compiler in the innerloop.
An innerloop I once wrote for 486 looks like this for a single texture fetch:

				adc		eax, [dd1]
adc edx, [dd2]
mov bh, al ; VV
mov bl, dl ; UU

mov bl, [ebx][esi]
mov [edi], bl

Instead of all that shifting and adding, I just make use of partial registers (there are no stalls for that on in-order CPUs, so Pentium and earlier).
A compiler will never come up with something like that.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 30 of 54, by Scali

User metadata
Rank l33t
Rank
l33t
leileilol wrote:

Crappy advancetable attempt with 64 pixel spans

Very nice! More importantly though... will it improve performance on a 486? 😀
Sadly I have too many projects/goals atm, but otherwise I would love to have a crack at the original Quake source, and see what I can make my 486dx2-80 do with it.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 32 of 54, by Scali

User metadata
Rank l33t
Rank
l33t
leileilol wrote:

beating d_draw16.s into 32pixel and fixed point would be something, though

This one? https://github.com/Blzut3/Engoo/blob/master/asm/d_draw16.s
Is that the actual code as written by ID, or has it been modified?
It's not immediately obvious what the advancetable does, seems like the textures are not stored linearly, but in some kind of tiled cache-friendly version?
I wonder how effective that will be on 486 in the first place... perhaps KISS is a good idea 😉
Fun code, but would take some time to fully digest and restructure for 32-pixel runs.

Perhaps it would be more useful to look at the Descent-code first, since that is our benchmark.
There's a lot of different versions here though: https://github.com/drguildo/Descent/tree/master/TEXMAP
A first quick glance seems to indicate that they are indeed considerably simpler than the Quake one.
Doesn't seem to have heavy loop unrolling, and by the looks of it, the textures are just linear 256x256 textures.

Funny enough it seems the code I pasted above from my own 486 renderer is actually tighter than this.
That renderer can be found here btw: http://youtu.be/xE9iifKXvY4
See description for download link.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 33 of 54, by truth_deleted

User metadata

I haven't looked at the specific optimizations for pocketquake, and whether they have all been ported over and tested, but I searched for past discussions on the various forums. There was also one discussion about optimizing the levels themselves, but they were not optimistic that it would help performance much.

From my limited testing, I certainly found that Abrash x86 code always wins against the C compiler. 😀 Also, once an assembly routine is removed, it has effects on the rest of the code base, so each piece does not stand by itself in terms of performance. This is probably obvious to others. It seems reasonable to reduce the cpu cost in the existing assembly routines or else lower the display resolution (or a combination of both).

The above benchmarks in the C routines suggest that further optimizations to the x86 code will have limited effect.

Reply 34 of 54, by Scali

User metadata
Rank l33t
Rank
l33t
truth5678 wrote:

From my limited testing, I certainly found that Abrash x86 code always wins against the C compiler. 😀

Well, that should be pretty obvious 😀
For Pentium Pro and newer, compilers can generate quite decent code, because the optimization rules are easy to implement algorithmically, and out-of-order-execution does the rest.
For Pentium and earlier though, nothing beats a skilled assembly programmer. And Abrash is certainly skilled. I would almost go as far as saying that the further back you go, the more a human can gain, because code size, addressing modes, using 'clever' instructions and such make a lot more difference on these old machines.
If you look at an 8088, each byte you save on code will shave off 4 CPU cycles. Huge gains can be made.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 36 of 54, by Scali

User metadata
Rank l33t
Rank
l33t
truth5678 wrote:

That's interesting. It suggests that modern CPUs were partly designed to efficiently handle compiled code. I guess I'm glad for that. 😀

Yup, they are. See also what I wrote here earlier about the Pentium Pro, the first 'modern' x86: Re: Pentium Pro and Windows 95/98 for old DOS games?
This problem rarely occurs in compiled code, since compilers are generally not that clever with partial registers.

An added problem is that compiler technology itself has improved a lot over the years, but it doesn't do older CPU architectures any good.
Each CPU architecture has its own unique optimization rules... This Quake renderer is a good example of that: for a Pentium, it is quite possibly the best perspective-correct texture mapper ever written.
On a 486 it is incredibly slow however.
And you can't really use a modern compiler to target a 486, because it doesn't know how to optimize for that CPU architecture.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 37 of 54, by truth_deleted

User metadata

Thanks for the write-up in that thread. I found that performance problem with PPro, but I didn't know the cause. I did previously compare a few of the older gcc compilers, and it was noted that the oldest of the few performed as well as any in a Pentium/DOS framework, notwithstanding compatibility trade-offs (as in the case of Quake2, where exe and dll are compiled by different gcc versions).

From my limited readings, it seems that J. Carmack was writing 3d engines on a nearly monthly basis, and he must have acquired great insight into all the problems spots, maybe not by theory, but just by practice. The Quake renderer may not be elegant underneath the surface, as far as people have said that; but I think it would be difficult to improve upon, as you noted, without losing too much visual detail.

Another avenue would be if the vQuake code was available. 😀 Although the extra video processing requires the hardware.

Reply 38 of 54, by leileilol

User metadata
Rank l33t++
Rank
l33t++
Scali wrote:

This one? https://github.com/Blzut3/Engoo/blob/master/asm/d_draw16.s
Is that the actual code as written by ID, or has it been modified?

Actual code by id. The (Badly) modified ones are d_draw32.s and d_draw64.s (which despite the name, is draw16 with some lines desperately blindly commented out)

truth5678 wrote:

From my limited readings, it seems that J. Carmack was writing 3d engines on a nearly monthly basis, and he must have acquired great insight into all the problems spots, maybe not by theory, but just by practice.

IIRC most of Quake's 1994-95 devtime was research into portals and polygons, eventually hitting the surfacecached spans drawing later in 95, with production starting then. There's still pieces of stubs and unfinished unusuable code of the old unused renderer in the source release, as well as the high color spans the old 1995 preview .tifs were showcasing, and that's not confusing with the GLQuake code...

apsosig.png

Reply 39 of 54, by truth_deleted

User metadata

That's interesting! I just assumed that his advances were ultimately from watching the effects in a complete 3d engine, but didn't know whether the individual parts of the engine were studied by themselves. I read somewhere once that much of the work (algorithms) was already available in academic circles, but perhaps that is an over-generalization (if even true).