VOGONS


Reply 20 of 43, by kool kitty89

User metadata
Rank Member

Just stumbled on this old thread yesterday:
http://www.siliconinvestor.com/readmsgs.aspx? … &batchtype=Next

There's a decent explanation of the oddities of Quake's P5-specific assembly language optimization that I hadn't seen before.

Apparently, Quake uses the FPU registers as a buffer for string handling, both increasing the parallelism of the code and making better use of the 64-bit bus width (integer string handling is limited to the normal set of 32-bit registers, at least prior to MMX's introduction). It actually sounds sort of like a software-hacked MMX-ish workaround.

This also explains why Pentium-specific speed continues to scale at higher resolutions where fillrate is a bigger bottleneck than vertex rate.

This not only speeds work up on the P5 architecture, but hinders performance on the 486, 5x86, 6x86, K5, K6, and everything else lacking a pipelined FPU. (A rather slow/weak pipelined FPU would still do well here, since it's all about I/O performance, not computational grunt - also why Quake performance doesn't match up at all with raw FPU benchmarks, aside from maybe some with very P5-specific compiled or hand-written code.) The P6 uses a smart pipelined arrangement that should be just about as fast on non-P5-optimized FPU code as it is on P5 code.

You don't need a pipelined FPU to allow parallel operation with the ALU either; decently interleaved FPU calls allow that even on the 486, though the 6x86 (and I think the 5x86) adds a FIFO buffer to speed this up a bit more, and the K5 and K6 had very fast FPUs in general, faster than the P5 (per clock) for some things. (The K6 FPU has lower latency on a lot of operations than the P6 FPU too.) I'd assume that heavy optimization (including on the compiler and API/driver end) for P5 CPUs hindered the real-world performance of non-Intel parts at least prior to the K7.

Quake optimized for 486 probably would've run better on everything but P5 pentiums, including making it more playable on 486s.

Now to wonder how games might have performed if they used the 6x86's L0 scratchpad. 😉

Edit: more on the topic at hand: does the motherboard being used support 50, 60, or 66 MHz operation? (with PCI bus divider) Dropping the multiplier to 2x and using a faster bus should speed up a lot of things (so long as the board can handle it, and the cache SRAMs). 66x2 would be ideal, not just for speeding up CPU I/O, but also PCI DMA operations (both throughput and latency should be improved due to faster/tighter DRAM timing at the very least)

I suppose 40x3 would be worth trying too. (if you get speed gains from THAT, it's obviously gains in I/O performance -and the overclocked PCI bus)

Reply 21 of 43, by Scali

User metadata
Rank l33t
kool kitty89 wrote:

Apparently, Quake uses the FPU registers as a buffer for string handling, both increasing the parallelism of the code and making better use of the 64-bit bus width (integer string handling is limited to the normal set of 32-bit registers, at least prior to MMX's introduction). It actually sounds sort of like a software-hacked MMX-ish workaround.

That is not entirely correct.
It is true that a memcpy() on a Pentium is fastest when using FPU loads and stores.
It is trivial to replace it with rep movsd, which should be fastest on 486. But that won't get you far.
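As a back-of-the-envelope illustration of that difference (a sketch with notional counts, not measured timings): the Pentium's fld/fstp qword pair moves 8 bytes per load/store over the 64-bit bus, while rep movsd moves 4 bytes per iteration.

```python
def copy_transactions(n_bytes: int, word_size: int) -> int:
    """Notional bus-transaction count for a block copy:
    one read plus one write per word moved."""
    words = (n_bytes + word_size - 1) // word_size
    return 2 * words

# Copying a 64 KB buffer: the 8-byte FPU path needs half the transactions.
movsd_cost = copy_transactions(64 * 1024, 4)  # rep movsd, 32-bit words
fpu_cost = copy_transactions(64 * 1024, 8)    # fld/fstp qword pairs
```

On a 486 the external bus is only 32 bits wide, so the 64-bit moves buy nothing there, which is part of why rep movsd is the better fit on that CPU.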

Another big issue with Quake is that it exploits the fact that the FPU is pipelined. You can fire off an fdiv instruction, and run integer code in parallel. Then you can do all sorts of work, and have your fdiv result ready by the time you need it.
They use this for the perspective correction division. A 486 FPU simply cannot do this, and most 'Pentium' clones such as AMD and Cyrix basically just use a 486-class FPU, with little or no pipelining. Aside from that, the fdiv instruction itself takes a lot longer because the circuit is less optimized than the Pentium one. So the result simply isn't ready by the time the code expects it to be, and you get stalls.
This cuts greatly into performance.
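A rough Python sketch of the span scheme being described (illustrative only - the real code is hand-scheduled Pentium assembly in which the next span's fdiv overlaps the current span's integer inner loop): one true perspective divide per 16-pixel span, affine interpolation in between.

```python
SPAN = 16  # Quake corrects perspective once per 16 pixels

def lerp(a, b, t):
    return a + (b - a) * t

def span_corrected_u(u_over_z0, u_over_z1, inv_z0, inv_z1, width):
    """Return per-pixel texture u for one scanline. u/z and 1/z
    interpolate linearly across the line (which IS perspective
    correct); the expensive divide u = (u/z) / (1/z) happens only at
    span endpoints, with plain affine stepping for the pixels between.
    On a Pentium, the divide for the *next* span is an fdiv issued
    early enough to finish while the current span's pixels are drawn."""
    out = []
    x = 0
    while x < width:
        n = min(SPAN, width - x)
        t0, t1 = x / width, (x + n) / width
        u_start = lerp(u_over_z0, u_over_z1, t0) / lerp(inv_z0, inv_z1, t0)
        u_end = lerp(u_over_z0, u_over_z1, t1) / lerp(inv_z0, inv_z1, t1)
        for i in range(n):  # divide-free inner loop, unrolled in the real thing
            out.append(u_start + (u_end - u_start) * i / n)
        x += n
    return out
```

Span endpoints land exactly on the true perspective curve; only the pixels between them are approximated, which is why 16 pixels is "good enough" visually.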

kool kitty89 wrote:

This also explains why Pentium-specific speed continues to scale at higher resolutions where fillrate is a bigger bottleneck than vertex rate.

Well no, the perspective division is done every 16th pixel, so this scales up with resolution as well.

You basically can't do this on 486. You'd need to simplify the texture-mapping. A coarser approximation of perspective is the only solution (something like Descent). This basically means rewriting the whole texturemapper for 486. By doing perspective correction only every 32nd pixel or so, and/or by using fixedpoint integer code instead of FPU code to speed things up. This has never been done because it requires someone with the same skills as Carmack/Abrash to optimize it to the same level as the Pentium-version. Not the kind of people who would work on this sort of project.

kool kitty89 wrote:

Quake optimized for 486 probably would've run better on everything but P5 pentiums, including making it more playable on 486s.

Problem is, Quake optimized for 486 isn't Quake as we know it.
What set Quake apart was its super high accuracy in the renderer, with subpixel/texel correction and pixel-perfect perspective. You simply can't do this on a 486. You have to cut corners. So either you go for something like Descent, which runs fine on a 486, but is somewhat more 'primitive' visually than Quake is, or you go all-out and you get Quake, where the software renderer is basically 'perfect'. The results are as good and accurate as any 3d accelerator (if you don't apply texture filtering of course).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 22 of 43, by kanecvr

User metadata
Rank Oldbie
kool kitty89 wrote:

Edit: more on the topic at hand: does the motherboard being used support 50, 60, or 66 MHz operation? (with PCI bus divider) Dropping the multiplier to 2x and using a faster bus should speed up a lot of things (so long as the board can handle it, and the cache SRAMs). 66x2 would be ideal, not just for speeding up CPU I/O, but also PCI DMA operations (both throughput and latency should be improved due to faster/tighter DRAM timing at the very least)

I suppose 40x3 would be worth trying too. (if you get speed gains from THAT, it's obviously gains in I/O performance -and the overclocked PCI bus)

I already run it at 40x3 = 120 MHz. The CPU in question is a Cyrix 586 100GP. I can also run it at 50x2, but performance drops in most games. The motherboard does support 60x2, but unfortunately the CPU does not; it hangs during POST at that FSB speed.

Reply 23 of 43, by kool kitty89

User metadata
Rank Member
Scali wrote:

Well no, the perspective division is done every 16th pixel, so this scales up with resolution as well.

Right, right, I'd forgotten how computationally intensive the perspective-correction aspect of Quake was. 3D vertex calculation wasn't an issue as things scale up, but texel and pixel fillrate are totally dependent on computationally heavy stuff too (and on taking advantage of FPU pipeline-specific performance boosts). Quake also texture-maps EVERYTHING, so no saving graces for large spans of solid-color regions with shading applied. I'm not entirely sure this is the problem with the 6x86, though, given the limited amount of buffering the FPU FIFO offers and the fact that the Cyrix FPU is faster (per clock) at divides than the Pentium. (Multiplication is 4x slower, though, and int/float swaps slower still due to the superscalar operation on the Pentium.)

I'd also forgotten that 16-bit integer mul/div on a 486 was faster than the FPU; I'd had 32-bit math on the brain for some reason. (Not sure if the 6x86 FPU is faster than 16-bit integers on that CPU; it's around 4x faster than a 486 FPU clock for clock - and faster than a Pentium on divides - but the 6x86 is really fast on 16-bit integers too.)

You basically can't do this on 486. You'd need to simplify the texture-mapping. A coarser approximation of perspective is the only solution (something like Descent). This basically means rewriting the whole texturemapper for 486. By doing perspective correction only every 32nd pixel or so, and/or by using fixedpoint integer code instead of FPU code to speed things up. This has never been done because it requires someone with the same skills as Carmack/Abrash to optimize it to the same level as the Pentium-version. Not the kind of people who would work on this sort of project.

Yeah, 486-class texture mappers use affine texture mapping or something similar (Doom does that for column rendering too, but you never see perspective warping since you're drawing columns, not lines, and using spans, not triangles - the worst warping is usually on triangle pairs for rectangles or larger groups of polygon fans, also something quad-based rasterizers avoided a lot more in typical rectangle-heavy models). You've got more flexibility than on the PlayStation at least, as you can do line/screen subdivision rather than polygon subdivision, but it's obviously still limited.

Tomb Raider is one of the later-generation games (released shortly after Quake) that uses that sort of rendering, and Tomb Raider 2 as well. Tomb Raider also uses a good amount of untextured surfaces and variable perspective-correction detail levels (using shorter or longer spans/subdivision - or none at all, with full PlayStation-style uncompensated texture distortion). Tomb Raider also uses nice, joint-articulated 3D animation for its models rather than the awkward frame-by-frame 'new model for each pose' route Quake took (and Elder Scrolls: Redguard, among a few others ... it looks OK if the game runs at or below the animation rate - typically 15 FPS - but otherwise quite awkward). Quake 2 upped that to 30 FPS I think, but it's still noticeable. (Tomb Raider's color use also makes lighting look a lot nicer than in Quake ... rather smooth, with far less posterization in spite of not using Tie Fighter-style dithering. Tomb Raider 2's 256-color renderer posterizes badly by comparison too ... though ironically it supports some really high resolutions; it's just not a game well suited to 256 colors. Quake II strained things a bit too, but actually not as bad as Quake I at times from what I recall ... maybe just because there were fewer super-low-light areas. Maybe they should have used more full-screen palette swapping to do room-wide lighting with a wider dynamic color/shade range in Quake I.)

I'm still not sure Tomb Raider even makes use of the FPU, given it seems to be more 486-optimized, and the PSX and Saturn versions run on 16-bit integer math for their vertex plotting too, so precision wouldn't be any worse doing that in software on a 486. (Plus I've heard anecdotes of people running Tomb Raider on a 386, presumably without a co-pro ... albeit slow and with a postage-stamp-sized screen, but still running.) I don't have any FPU-less systems to test this out on, though.

I suspect Wipeout may avoid FPU utilization too, though that was a 1995 release. (the design of the game also tends to avoid surfaces that would obviously show warping/distortion ... Tomb Raider's cavern/dungeon style maps are pretty much the absolute worst cases for that, worse than Quake's more rectangular/right-angled level/model design)

And of course the Saturn release of Quake wouldn't be able to use Pentium-style perspective correction either. (I think that had it even worse, since they didn't adapt the models to use quads, instead pre-warping all textures to fold into triangles - also wasting about half the rendering bandwidth in the process.) The unreleased PSX version of Quake supposedly had the rendering bandwidth to manage 60 FPS, but CPU performance dropped game speed to 30 FPS (with actual game logic/hit detection/AI/etc. thrown in), which does imply they had a LOT of room to sacrifice GPU speed for subdivided polygons, had they actually released that version. (I assume that work ended up rolled over into the Quake II release instead - Quake is one of those really odd exclusives for the Saturn given its struggling position in the market.)
The Saturn's dual-CPU arrangement might have allowed some Quake-like perspective correction through a software renderer, though. (Or, more likely, a software rasterizer using VDP1 to fill lines of texels with the CPUs setting up each line - hardware texture mapping with CPU-driven polygon setup. The Saturn ALSO has a DSP intended for vertex computations, though divides are slow ... so they could've had that handle the 3D math and a CPU handle the perspective correction ... lots of options if the system was exclusively optimized for by hand. The SH-2s can also halve their caches and use 2 kB as scratchpad space, so some neat possibilities there.)

Obviously, custom optimizing for the 6x86's scratchpad would be similarly proprietary and mean Id would have to re-write processor-specific renderers to account for that. (so a lot of work for a niche market ... catering to 486s/5x86s would make a lot more sense, business wise)

Problem is, Quake optimized for 486 isn't Quake as we know it.

Yes, I'm describing an entirely new game engine, and not really saying it'd be a worthwhile hobby effort. My comments were more musing on what id (or some outsourced 3rd party) might have done back in 1996/97. Most likely id themselves, and most likely written in parallel with the Pentium version and designed to use the same data, just a different renderer. (Carmack did, after all, put the effort into writing Doom for the Jaguar, writing an entire compiler and toolchain for that oddball proprietary RISC architecture - not really that odd, but quirky - and had started developing Quake for the system before support was discontinued ... so writing a 486 renderer in parallel with the Pentium one shouldn't have been THAT big a deal. A 6x86-specific scratchpad-utilizing engine seems less realistic due to the niche nature and much later release date of that CPU ... plus performance should've been pretty well adequate using the 486 engine, AND CPUID detects the 6x86 as a 486 anyway.) And again, more likely to omit FPU utilization entirely and go 16-bit for best speed on a 486.

The results are as good and accurate as any 3d accelerator (if you don't apply texture filtering of course).

Sure, with corresponding performance trade-offs taken into account, and less flexibility (I don't think Quake's method would easily allow variable perspective span widths for lower- or higher-precision rendering and variable detail levels like Tomb Raider offered). Plus potential trade-offs like using a highcolor mode with nicer lighting/shading. (Also something a 486 could more likely handle, at least at low resolutions - the GBA and Sega 32X do that all in software on 486SX-class CPUs or worse. The 32X is probably more like using a 512 kB ISA SVGA card + 486SLC-50 or -66, due to the 16-bit bus and dual 23 MHz CPUs.)

That said, there IS interest in writing an engine capable of playing a conversion of Quake on the 32x ... with further plans to adapt that engine into something more original.

Also yes, I know 16-bit RGB has problems with alpha blending that are tough to work around using LUTs, but there are workarounds allowing soft-SIMD-type arrangements or a few other things ... or using checkerboard dithering like Unreal did. (Which was a really, really odd move for 1998, especially with 32-bit color support - which makes software blending easy due to the 8-8-8-8 boundaries, and MMX support on top of that.) 24/32-bit color would've been nice, but the bandwidth on typical 1996/97 systems might've been too much to be worthwhile for Quake; that, and not all SVGA cards supported truecolor. (Though 320x240 truecolor would be pretty nice for Quake ... or Tomb Raider.) Lighting in highcolor or truecolor is easy using the same sort of 256x32 LUT system 256-color Quake does (just mapping to a 16- or 24/32-bit destination instead).
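For what it's worth, here's the classic soft-SIMD-style blend trick I mean, sketched in Python on single RGB565 pixels (the mask constant keeps each field's shifted-out low bit from bleeding into its neighbour; on a real 32-bit CPU you'd pack two pixels per register and blend both at once):

```python
MASK565 = 0xF7DE  # RGB565 with the low bit of each colour field cleared

def avg565(a: int, b: int) -> int:
    """50% blend of two RGB565 pixels without unpacking the channels.
    Clearing the low bit of each field before the shift stops a bit
    from one channel carrying into the next; the cost is that each
    channel rounds down by up to one step."""
    return ((a & MASK565) >> 1) + ((b & MASK565) >> 1)
```

For example, avg565(0xF800, 0x0000) gives 0x7800, i.e. pure red blended to half brightness against black.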

Hmm, scratch the bandwidth issue ... 320x240 truecolor would use the same or less bandwidth as 640x480 at 8 bpp, and 320x240 16-bit would use the same as 320x480 8-bit Quake. (Truecolor market share would be more of an issue, but 16 bpp modes were pretty universal in SVGA.) Then again, it seemed like EVERYONE avoided >8 bpp software renderers prior to MMX, and even then it was kinda sketchy. (Tomb Raider II's DirectX software renderer was really lazy.) Unreal's dithered effects would've been way more excusable in 1995/96 in highcolor games ... and faster than the 256x256 LUT method. (Unreal's odd pseudo texture filtering also would've been more excusable applied to a 256-color renderer - indeed that feature HAS been added to 256-color Quake or Quake II or both, I forget. In Unreal it just looks odd much of the time, and with the often fairly high-res textures, worse than no filtering at all.)
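To make the LUT point mentioned above concrete, here's a sketch of the 256x32 colormap idea retargeted at a 16 bpp destination (a hypothetical layout for illustration; Quake's real 8bpp colormap maps to palette indices, not RGB565 values):

```python
def build_shade_table(palette, levels=32):
    """table[light][index] -> RGB565 pixel. Same 256 x 32 shape as a
    Quake-style 8bpp colormap, but each entry is a final 16-bit pixel
    instead of another palette index, so the inner loop stays a
    single table lookup per texel."""
    table = []
    for light in range(levels):
        scale = light / (levels - 1)  # 0.0 (black) .. 1.0 (full bright)
        row = []
        for r, g, b in palette:  # 8-bit-per-channel palette entries
            r5 = int(r * scale) >> 3
            g6 = int(g * scale) >> 2
            b5 = int(b * scale) >> 3
            row.append((r5 << 11) | (g6 << 5) | b5)
        table.append(row)
    return table
```

The table costs 256 x 32 x 2 bytes = 16 kB, twice the 8bpp version, which is worth keeping in mind against a 486's 8 kB L1 cache.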

But as far as Quake aesthetics go, I personally can't get over how heavily posterized things get and how low-color a lot of the textures end up shading down to so often. (It's NOT a problem with the dark/dingy theme or the texture colors themselves, as things look way nicer in GL Quake using the same textures.) Descent and Doom look nicer in that respect, Tomb Raider more dramatically so (given how much more shading/lighting there is to actually compare - Doom and Descent are a bit cheat-y with their simplicity), though Tomb Raider's level/model design also makes perspective errors way more obvious than Descent. (I don't recall ever noticing it in Descent without trying really, really hard to make it happen. Of course Tomb Raider has additional difficulties with the 3rd-person camera ... which itself makes the perspective errors yet more obvious. Quake's limited PoV makes what little errors remain pretty much invisible ... unless you use mouse-look like crazy and play in 320x200.)

Oh, I nearly forgot one other suggestion that came up years ago when discussing software rendering methods: you could use Doom-style column rendering for polygons with no added perspective correction and get way, way less perspective warping in most situations (as typical movement and PoV/camera angles are much more in line with columns than lines). You also lose a lot of speed there as far as fillrate goes, given pixel lines tend to be in the same DRAM pages, while columns much more often break pages and incur waits. (You also have to render one pixel at a time unless you do some fancy buffering tricks, so no advantages of a 64-bit bus ... or 32-bit even ... or 16-bit ... as far as pixel writes go. Unless you render the entire thing sideways and then rotate it 90 degrees ... and use a lot more overhead on that ... well ... fast-rotating in 4x4 (32-bit read/write) or 8x8 (64-bit read/write) pixel blocks might be pretty fast; the latter would fill all 8 FPU registers though ...)
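The block-rotate idea at the end can be sketched like this (a transpose rather than a true 90-degree rotation, but the memory-access argument is the same; the 8x8 tile size and list-of-rows layout are assumptions for illustration):

```python
TILE = 8  # 8x8 pixels: one row = 8 bytes = one 64-bit bus write

def transpose_tile(tile):
    """Transpose an 8x8 pixel block (list of 8 rows of 8 values).
    Render columns into the tile cheaply, transpose it, then store it
    row by row: the final framebuffer writes stay within a DRAM page
    and can use the full bus width, instead of one scattered write
    per column step."""
    return [[tile[r][c] for r in range(TILE)] for c in range(TILE)]
```

A real implementation would do this in registers with shifts and masks, but the access pattern is the point here.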

Edit: might as well throw in a link to the Saturn version for example.
https://www.youtube.com/watch?v=RI0Q0VWOp3E

Some texture warping/twitching, yeah, but not really on a distracting level. (The Saturn's lighting is a bit weird; a lot of games have a gamma adjust, but that's still limited to 'too black' or 'too gray/white', as shading desaturates most colors rather than working on a proper multiplicative RGB basis. It uses a combination of ROM LUTs and addition/subtraction on the RGB values, I believe, and games not optimized for its quirks tend to suffer a bit. Tomb Raider has this problem worse than Quake, or at least the difference is more obvious compared to the PSX and PC Tomb Raider variants ... GL Quake might be as dramatic.)

Reply 25 of 43, by sirlemonhead

User metadata
Rank Member

Yeah, the Saturn games made with the Lobotomy engine haven't got anything done to fix texture warp, that I can recall - the warping just doesn't show up as much with the rendering method the Saturn uses.

Reply 28 of 43, by Scali

User metadata
Rank l33t
x86++ wrote:

For others who are optimizing quake with source code, the above theories about quake bottlenecks are unsupported by real testing, particularly the perspective correction.

Says who?
I've done a lot of software renderers back in those days, for 486, Pentium, Pentium II etc, filled with various assembly optimizations.
I am talking from personal experience, not from 'theory'.
A (subdividing) affine mapping renderer was indeed my 'weapon of choice' for 486, because perspective correct texture mapping was not possible without shortcuts (such as Doom, which iirc used table lookups for the floors, because it only had limited degrees of freedom).
This is what I developed for 486 back in the day: https://youtu.be/xE9iifKXvY4
I wanted to keep the subpixel/texel correction in a 486 renderer, because I feel it adds to the perceived quality and smoothness (Tomb Raider doesn't have that, and looks rather shaky). But I did not use the FPU for most maths, because fixedpoint was faster on 486. The matrix calculations, polygon clipping etc are also done entirely in integer.
It also uses z-sorting rather than z-buffering, because of the much lower memory bandwidth on a 486 compared to Pentium, which makes z-buffering very expensive.
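The z-sorting described here is essentially a painter's sort; a minimal sketch (the polygon representation is an assumption for illustration):

```python
def painters_order(polygons):
    """Back-to-front ordering by mean vertex depth. Unlike a
    z-buffer, there are no per-pixel depth reads/writes, which is the
    point on a 486, where memory bandwidth (not arithmetic) is the
    scarce resource. Each polygon is a list of (x, y, z) vertices."""
    def mean_z(poly):
        return sum(v[2] for v in poly) / len(poly)
    return sorted(polygons, key=mean_z, reverse=True)  # farthest first
```

Mean-depth sorting can misorder overlapping polygons in awkward cases, which is part of why engines that rely on it tend to be careful about the geometry they feed it.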

I later scaled it down even further, while still keeping the subpixel-correction, for a renderer aimed at 286: https://youtu.be/4ClrU-ne2Us

And of course, more recently, that renderer was scaled down to 8088: https://youtu.be/hNRO7lno_DM?t=5m40s

But I guess that's all 'theory'.

Last edited by Scali on 2015-12-11, 15:41. Edited 2 times in total.


Reply 30 of 43, by kanecvr

User metadata
Rank Oldbie

So... it's theoretically possible to optimize dos_quake and gl_quake for a 486? The best it can do on my 586 is 20.1 fps in gl_quake with -particles 0, gl_flashblend 0 and r_dynamic 0 + a Voodoo Rush. Using a V1 is somewhat slower at 18-19.

Reply 31 of 43, by Scali

User metadata
Rank l33t
kanecvr wrote:

So... it's theoretically possible to optimize dos_quake and gl_quake for a 486?

Software rendered Quake moreso than GLQuake.
The problem with GLQuake is that it runs on top of MiniGL, so you can't really escape using the FPU for geometry processing, because floating point is what MiniGL expects.


Reply 33 of 43, by Scali

User metadata
Rank l33t
386SX wrote:

Off topic: and what about optimizing Doom itself for the 386? Did anyone try some different coding, as in the various ports (like the Jaguar)?

I suppose Doom is already optimized for 386.
It doesn't require an FPU, and otherwise the code doesn't really have anything specifically bad for 386 or particularly good for 486. It runs very well on both, a 486 is just faster.


Reply 34 of 43, by Scali

User metadata
Rank l33t
kool kitty89 wrote:

Yeah, 486-class texture mappers use affine texture mapping or something similar (Doom does that for column rendering too, but you never see perspective warping since you're drawing columns, not lines, and using spans, not triangles - the worst warping is usually on triangle pairs for rectangles or larger groups of polygon fans, also something quad-based rasterizers avoided a lot more in typical rectangle-heavy models).

Yes, Doom uses the same trick as Wolf3D for the walls: you just raycast them, and draw scaled columns. The perspective is solved by the raycasting part.
For the ceilings and floors iirc they have precalced the perspective in a table.
So they take advantage of the limited degrees of freedom in Doom to speed up the rendering (the rendering accuracy is pretty much 'perfect', like Quake).
But then again, the biggest change from Doom to Quake is that you got full 3d.
Doom pushes a 486DX2-66 to the max, so if you want full 3D as well on such a system, you'll have to trade in performance and/or quality. Which gives you something like Descent.

kool kitty89 wrote:

Tomb Raider is one of the later-generation games (released shortly after Quake) that uses that sort of rendering, and Tomb Raider 2 as well. Tomb Raider also uses a good amount of untextured surfaces and variable perspective-correction detail levels (using shorter or longer spans/subdivision - or none at all, with full PlayStation-style uncompensated texture distortion).

Tomb Raider has a horribly inaccurate software renderer though. It's neither very fast nor very stable. It does about the same as Descent does, but I think Descent does it somewhat better.

kool kitty89 wrote:

Tomb Raider also uses nice, joint-articulated 3D animation for its models rather than the awkward frame-by-frame 'new model for each pose' route Quake took (and Elder Scrolls: Redguard, among a few others ... it looks OK if the game runs at or below the animation rate - typically 15 FPS - but otherwise quite awkward). Quake 2 upped that to 30 FPS I think, but it's still noticeable.

Mind you, Tomb Raider does not perform any skinning, so you get very 'blocky' movement of limbs. Quake is smoothed out. I think Half-Life may have been the first to perform realtime skinning, giving you the best of both worlds. But by then we were well into Pentium territory.

kool kitty89 wrote:

Yes, I'm describing an entirely new game engine, and not really saying it'd be a worthwhile hobby effort. My comments were more musing on what id (or some outsourced 3rd party) might have done back in 1996/97. Most likely id themselves, and most likely written in parallel with the Pentium version and designed to use the same data, just a different renderer.

Well, I would say that they already passed the 486 station with Doom. Apparently that's what they chose to do on a 486: prefer image quality and a smooth framerate over 'true' 3D.

kool kitty89 wrote:

(I don't think Quake's method would easily allow variable perspective span widths for lower- or higher-precision rendering and variable detail levels like Tomb Raider offered).

The problem with variable span widths is that it introduces conditional code. Quake deliberately only uses 16-pixel perspective correction because it gives good enough results, and you can unroll the loop 16 times. There's no point in doing longer spans, because the 16-pixel span is already 'perfect' for a Pentium: it does the fdiv fast enough. Less code also means that it will fit in L1-cache better.

So for 486 you'd probably only want a 32-pixel span loop unrolled. It's probably going to be quite a bit slower if you have multiple types of spans, and the logic to pick the right one.
Either that, or saying 'to hell with perspective!', and just do affine texturemapping, with triangle subdivision to somewhat limit the distortion. But you should do this with subpixel/subtexel accuracy to maintain stability (that was my aim). That is something that Tomb Raider does not do (wasn't possible on PlayStation, but they could have done better than that on PC).
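The subpixel/subtexel accuracy mentioned here boils down to prestepping the interpolants to the first covered pixel instead of truncating; a sketch (the ceil-based fill convention is one common choice, an assumption rather than any particular renderer's exact code):

```python
import math

def prestep_edge(x_left, u_left, du_dx):
    """Subpixel correction for a span's left edge: the first pixel
    drawn is ceil(x_left), and u is advanced by the fractional
    distance to it. Without this prestep, u snaps to whole-pixel
    positions and textures 'swim' as edges move - the shakiness
    visible in renderers that skip it."""
    x_first = math.ceil(x_left)
    return x_first, u_left + (x_first - x_left) * du_dx
```

In a fixed-point renderer the same adjustment is done with integer fractions, but the idea is identical: keep the texture gradient locked to the screen grid as edges move sub-pixel distances.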

Last edited by Scali on 2015-12-14, 12:48. Edited 2 times in total.


Reply 36 of 43, by kool kitty89

User metadata
Rank Member
leileilol wrote:

Neat project there, even if it didn't get all that much attention.

Though one odd (or just interesting) thing I noticed, even looking at 486 performance alone, was some pretty substantial computational execution-time differences between AMD and Intel parts, at least going by the Passmark results in the 686 benchmarks: The Ultimate 686 Benchmark Comparison

Even taking into account the faster FSB used for the Intel test (i486DX4/100 set to 2x66 vs. AM5x86 at 4x33 and 4x60), nearly all the ALU and FPU tests show proportionally similar performance, except the integer multiply on the Intel chip. Intel's integer multiply is way faster than AMD's, faster than its own FPU multiply as well, and as fast as a Pentium 90's integer multiply. I'm fairly certain those tests all work on 32-bit integers and single-precision floating-point data, so any speed differences in 16-bit integer performance aren't reflected there.

That said, it's interesting to note that integer multiplication performed worse than FPU multiplication on all the 1995 CPUs except Intel's own 486. (The Cyrix 5x86 and 6x86 both had faster floating-point multiply, and of course Intel's Pentium did as well; the K5 has extremely fast integer performance, but wasn't on the market until 1996.) I'm not sure of the NX586's performance, but given it had a separate/optional FPU, anything optimizing code for it would be pure integer anyway.

Slow FPU add/subtract/xch performance might not make using the FPU worthwhile on most of those, though (other than the Pentium). Integer divide performance is also much faster than FPU divide on all of those non-Pentium processors (486s showed similar clock-for-clock integer divide to the Pentium, while the Cyrix 5x86 and 6x86 both excel in divide performance - even FPU divide, though integer is still faster). So a perspective-correct software texture mapper should at least handle its computation on the integer end. (The 3D matrix computations - more heavily multiplication-bound - might do better on the FPU, especially with some integer parallelism in play.)

The AMD K5 is also horrible at division, FPU worse than ALU, but it's bad at both, and it's probably the processor that would benefit most from having perspective-correct rendering disabled in Quake. Its FPU is actually very fast outside of division, faster clock for clock than the P5 or K6, though its blindingly quick integer performance still marginalizes that; it's very much a stark contrast to Cyrix's performance strengths and weaknesses. (The Fadd/subtract/mul performance of a 90 MHz K5 matches that of a Pentium 100, but the Fdiv performance is worse than a 486DX-100 or Pentium 60.) Granted, this is also the later-revision K5 with boosted integer performance; the earlier K5 (with PR rating equal to clock rate) wasn't tested.

The Cyrix FPU also has a 4-deep FPU command buffer as well as a 4-deep FPU store buffer, so overlap with ALU operations should be fairly good. The generally slow (4-cycle peak throughput) add/subtract/multiply/xch operations would all be pretty big bottlenecks for a Quake-style renderer, though. (Had multiply been the only slow one, it probably would have been a great deal more competitive with the Pentium.)

And while Cyrix's scratchpad function would potentially be good for an integer pixel line/span buffer, that feature was unique to the 6x86 at the time and obviously too special-case to optimize for. Having a pixel buffer at a main-memory location that simply cached very well (a very high L1 hit rate, even on 8 kB 486s) would seem like one of the better options to target. Then again, it might not have been too difficult to offload that pixel buffer to the Cyrix scratchpad if the renderer was already integer-optimized. (Make it an optional setting with code modified to target the local scratchpad address rather than main RAM, or even better if the 6x86 was auto-detected.)

Had NexGen made a Socket 5-compatible NX586SX of sorts (an otherwise standard 32 kB-cache NX586, just with Socket 5-compatible I/O logic and pinout), it probably would've been popular enough to make a lot of the early FPU-requiring games target integer operations instead. (And it probably would have made NexGen CPUs a hell of a lot more popular than their proprietary format had done ... not that the vast majority of users even needed the FPU anyway.) Then again, I suppose 486SX-100/120/133 MHz CPUs would've had a similar impact had they existed (and been significantly cheaper - and more popular - than their DX counterparts), or K5s or 6x86s. (Or 5x86s for that matter - particularly with the FPU taking up a much larger percentage of die space on the 5x86 than the 6x86: same FPU, much smaller integer section.)

sirlemonhead wrote:

Yeah, the Saturn games made with the Lobotomy engine haven't got anything done to fix texture warp, that I can recall - the warping just doesn't show up as much with the rendering method the Saturn uses.

Yeah, and I think I'd misinterpreted some of the interviews on the development there too. They mentioned having to pre-warp textures and fold quads into triangles, but this was probably limited to models that HAD to use triangles (especially enemies and weapons) while the world models in Quake (and Duke Nukem 3D) were pretty much entirely made of quads anyway. From THAT perspective, Quake's software renderer probably could have avoided the need to perspective correct at all by rasterizing quads and triangles selectively. (Having specific rendering code for rectangular polygon strips.) Plus even when you DO distort textures on quads, it ends up less fish-eyed than on triangle strips/fans (Tomb Raider shows this a lot). The Saturn's quads had the problem of being drawn warped rather than rasterized (linear 'sprite' texture reads and nonlinear writes, with overdraw when quads are folded at all), which reduces fillrate and creates the translucency 'bug' for 3D quads (textured or not, both types have the overdraw problem where pixels get blended multiple times and don't produce proper even translucency). IIRC quads look perspective correct so long as the texture is square or rectangular when shown flat/face-on, but if it's trapezoidal, rhomboidal, or triangular, it can look fine face-on (perpendicular to the camera) if the texture is pre-warped appropriately, but distorts more and more at other angles.

Rasterizers (hardware or software) don't deal with that sort of problem, so Quake-style translucency would work just as well for quads. (the 3DO might have had the same problem with its quads, I'm not positive, but the Jaguar software rasterized everything -blitter can just draw lines- so no blending problems plus a derivative of a quake-like perspective correct texture mapper would have been possible there, particularly since bandwidth/fillrate was the bigger bottleneck than computational overhead -so some added division computations probably wouldn't have hurt actual rendering speed a lot, granted all integer stuff ... floating point emulation on the J-RISC GPU would definitely slow things down)

The Playstation only renders affine-mapped triangles (even its sprite mode uses triangle pairs, just with no vertex data required), so translucency works fine, but the only option for perspective correction is using more (smaller) polygons to minimize affine rendering errors. (This becomes a bit of a problem for Doom-style, or SNES Mode 7 style, large flat-plane floors that are easy to render as rotated quads with scaling set per line; the SNES used raster interrupts to do that. The Playstation instead needs to chop a plane up into tons of tiny polygons to approximate this, rather than just rendering one big square texture.)
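A quick way to see why the chop-it-into-small-polygons approach works: model the error of affine interpolation against true perspective and watch it shrink as the surface is subdivided. (Illustrative Python with z as view-space depth; function names are mine, not from any real renderer.)

```python
def true_u(t, z0, z1):
    """Perspective-correct texture coordinate (0..1) at screen position t (0..1)
    across a surface going from depth z0 to z1."""
    iz = (1 - t) / z0 + t / z1   # 1/z interpolates linearly on screen
    uz = t / z1                  # so does u/z (u0 = 0, u1 = 1 here)
    return uz / iz

def max_affine_error(z0, z1, pieces=1, samples=64):
    """Worst-case error of affine interpolation when the span is cut into
    `pieces` sub-polygons with exact perspective at the joints."""
    worst = 0.0
    for p in range(pieces):
        t0, t1 = p / pieces, (p + 1) / pieces
        u0, u1 = true_u(t0, z0, z1), true_u(t1, z0, z1)
        for s in range(samples + 1):
            t = t0 + (t1 - t0) * s / samples
            approx = u0 + (u1 - u0) * s / samples   # affine within the piece
            worst = max(worst, abs(approx - true_u(t, z0, z1)))
    return worst
```

With depth varying 1:4 across a surface, cutting it into four pieces drops the worst-case texture slide considerably; at constant depth affine is already exact, which is why it's the walls and floors seen edge-on that wobble worst.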

x86++ wrote:

For others who are optimizing quake with source code, the above theories about quake bottlenecks are unsupported by real testing, particularly the perspective correction.

Disabling perspective correction alone might not have a big impact in most cases on its own, given raw FPU bandwidth and throughput are the bigger bottlenecks for non-Pentiums, EXCEPT for the K5. The K5 should see a substantial gain (to the point of possibly being faster clock for clock than the P5, particularly given how close it already runs it). The K6 might gain a bit too, but it's still bound by 2-clock throughput limits on most operations. (The CXT-core K6-2 made Fxch as fast as the Pentium, but didn't speed up anything else; plus, from all I've managed to gather, it doesn't have FPU prefetch or other buffers/queues akin to what Cyrix used, other than perhaps a single-issue FIFO. FPU/ALU parallelism seems to be handled more by the RISC86 scheduler than by buffering in the K6.) The K5's FPU seems fast enough that, regardless of buffering/pipelining, it would be very fast/competitive (especially with the fast ALU minimizing latency), with the exception of the slow divide, which would seriously benefit from overlapping execution. (Or better, avoided in general, or perhaps even better, offloaded as an integer divide and converted back to a floating point number, given Fxch is quite fast on the K5.)

That said, throwing out the FPU-driven span renderer entirely should have a huge impact on every non-Pentium processor around in 1996. All of them either had slow FPUs or just really fast ALUs (or 32-bit main/L2 buses bottlenecking FPU operations more than integer ones), so switching to an optimized all-integer renderer would help a lot. (It would even help a lot if the vertex calculations were still offloaded to the FPU.)

An integer-optimized renderer would probably have been easier to partially P5-optimize than trying to non-Pentium-optimize Quake's existing renderer. (Given the P5's fast Fxch, you could detect the Pentium and use modified code that offloads multiplication and division operations to the FPU, but converts them back to integer format before final use, effectively using the FPU as an ALU accelerator.)
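For reference, the kind of integer math such a renderer would lean on is plain 16.16 fixed point; a minimal sketch of the shift-and-multiply idiom (illustrative, not code from any shipped engine):

```python
FX_SHIFT = 16
FX_ONE = 1 << FX_SHIFT        # 1.0 in 16.16 fixed point

def to_fx(x):
    """float -> 16.16 fixed point"""
    return int(round(x * FX_ONE))

def fx_float(a):
    """16.16 fixed point -> float"""
    return a / FX_ONE

def fx_mul(a, b):
    # On x86 this is one imul plus a shift (or shrd); no FPU involved.
    return (a * b) >> FX_SHIFT

def fx_div(a, b):
    # Pre-shift the dividend, then one idiv.
    return (a << FX_SHIFT) // b
```

One caveat: `>>` and `//` floor toward negative infinity on negative operands, so a real renderer has to pick its rounding convention deliberately (which ties into the sub-pixel accuracy discussion further down the thread).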

Reply 37 of 43, by kool kitty89

Scali wrote:

This is what I developed for 486 back in the day: https://youtu.be/xE9iifKXvY4
I wanted to keep the subpixel/texel correction in a 486 renderer, because I feel it adds to the perceived quality and smoothness (Tomb Raider doesn't have that, and looks rather shaky). But I did not use the FPU for most maths, because fixedpoint was faster on 486. The matrix calculations, polygon clipping etc are also done entirely in integer.
It also uses z-sorting rather than z-buffering, because of the much lower memory bandwidth on a 486 compared to Pentium, which makes z-buffering very expensive.

Per-polygon Z-sorting using a painter's algorithm (and likely ray-casting to organize that list and minimize overdraw) would tend to be the most typical and efficient arrangement. Per-pixel Z-buffering can limit overdraw even more, but at the expense of eating up a lot more bandwidth just handling the Z-buffer. (It also probably would've been nicer for hardware-accelerated detail options to include Z-buffer vs. software Z-sorting, given how many early accelerators did slow Z-buffering and/or lacked the RAM to afford it ... or drivers where Z-fighting is so prominent that painter's algorithm-type sorting would look a hell of a lot nicer. Tomb Raider II had a really nice variety of settings for its Direct3D renderer, but that doesn't seem to have been common at all.)

Neat that you managed good sub-pixel accuracy there. I've had a few discussions with Playstation homebrew developers that mention the GTE (16-bit fixed-point geometry DSP) is capable of pretty decent sub-pixel accuracy as long as you make sure to use the right rounding rules and output correctly adjusted vertex data to the GPU. Otherwise you get seams, and this is a problem that can persist even at 32-bit precision and with floating point calculations too. (A lot of it seems to be about understanding the behavior of the GPU, an issue that also crops up in a lot of Direct3D drivers for early 3D accelerators ... the Rage Pro still has noticeable seaming and jittering/sliding issues in some games.)

Some Playstation ports also seem to suffer from software Z-culling/clipping optimized more for TV style overscan, leaving visible polygon drop-out in the PC version. (or at least it seems more visible than on the playstation itself)

Scali wrote:
386SX wrote:

Off topic: and what about optimizing Doom itself for the 386? Did anyone tried some different coding as the various port (as the Jaguar)?

I suppose Doom is already optimized for 386.
It doesn't require an FPU, and otherwise the code doesn't really have anything specifically bad for 386 or particularly good for 486. It runs very well on both, a 486 is just faster.

The column renderer method might favor the 486 a bit more than polygon renderers of the same period (with or without texture mapping) due to making better use of the 32-bit-wide bus/registers (or even a 386SX's 16-bit bus to some extent, given you still have 32-bit registers to buffer into, and 16 bits is still double the width of what Doom's pixel columns use: lots and lots of 8-bit writes as far as I understand). So: compared to polygon renderers like X-Wing, Wing Commander III, and Descent (and a few 3D RPGs like The Elder Scrolls games and Ultima Underworld).

I'd think cacheless 386 systems would show a more dramatic bias here, given the bandwidth and latency issues of working in DRAM alone (and the gains from making page-mode reads/writes). A fair number of 386DX-40 boards lacked cache or used only 32 kB of cache, and most 386SX boards lacked cache. (While an SX33 or SX40 would be fast enough to at least handle X-Wing playably, and maybe even Descent with the right settings.)

Scali wrote:
kanecvr wrote:

So... it's theoretically possible to optimize dos_quake and gl_quake for a 486?

Software rendered Quake moreso than GLQuake.
The problem with GLQuake is that it runs on top of MiniGL, so you can't really escape using the FPU for geometry processing, because floating point is what MiniGL expects.

Wouldn't it work if the MiniGL had been written with integer math in mind in the first place? Otherwise there might be some fixed-point computation schemes that convert relatively quickly to 32-bit float format with less overhead than using Fxch. (aside from the K5 and late model K6-2)

Scali wrote:

Yes, Doom uses the same trick as Wolf3D for the walls: you just raycast them, and draw scaled columns. The perspective is solved by the raycasting part.

One added problem with that is you're stuck with one pixel per framebuffer write, rather than potentially buffering spans on a 32-bit register basis, or even longer spans still small enough to fit reliably into the L1 cache. Anything nonlinear (i.e. columns) written to DRAM would be particularly slow, given you'd ruin any speed gains from page mode. (Granted, an entire render buffer will often fit into the board-level cache and avoid that problem anyway, particularly at 320x200x8bpp.)

For the ceilings and floors iirc they have precalced the perspective in a table.
So they take advantage of the limited degrees of freedom in Doom to speed up the rendering (the rendering accuracy is pretty much 'perfect', like Quake).

They might be, but similar results could be achieved by rendering the floor as a rotated square (or multiple square segments) with the scaling factor set only once per line. (With a Wolf3D-style game with textured floors/ceilings added, a single square plane could be used given the lack of elevation.) Using a pre-scaled table would save time though, given the PoV angle is fixed and look-up is faster than computation on the 486 and 386. (I'd assume tables are used for that in games like Wacky Wheels too, and for handling perspective in Mode 7 on the SNES, given how slow the CPU is; the 65816 has fast/low-latency memory access and interrupts, so it's good for both the table optimization and raster-interrupt handling.)
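A per-line table like that boils down to one divide (or one table lookup) per scanline instead of per pixel. A rough Python sketch, assuming a level camera at a fixed height above the floor (parameter names and the focal length are mine, not from Doom):

```python
def floor_distance_table(screen_h, horizon, eye_height=32.0, focal=160.0):
    """One perspective divide per scanline: world distance of the floor row
    visible on screen line y, for a level camera eye_height above the floor."""
    table = []
    for y in range(screen_h):
        dy = y - horizon
        if dy <= 0:
            table.append(None)                     # at/above the horizon: no floor
        else:
            table.append(eye_height * focal / dy)  # rows near the horizon are far
    return table
```

The texture u/v step for the whole line then follows from that one distance (scaled by the view angle), which is what makes the rotated-quad / Mode 7 floor so cheap, and why precalculating the table is worthwhile when the pitch never changes.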

Quake could probably get away with setting perspective only once per line (rather than using span subdivision) for quads (i.e. most/all of the level map) and leave triangles just for the enemy/weapon models (and fire and such), which are non-corrected affine mapped anyway. (I misspoke in my post a few months back on the gun-texture warping; I realize now those models omit the perspective correction.)

But then again, the biggest change from Doom to Quake is that you got full 3d.
Doom pushes a 486dx2-66 to the max, so if you want full 3d as well on such a system, you'll have to trade in performance and/or quality. Which gives you something like Descent.

Or all the space/flight sims that have relatively little on-screen compared to 3D maze/dungeon-type stuff. (Texture-mapped dungeon crawlers like Ultima Underworld would be closer to Descent there than the likes of X-Wing, Tie Fighter, or Wing Commander III, and the full-screen polygonal stuff in those latter three is all untextured, with the exception of the carrier's flight deck in WCIII.)

Descent does manage to look a lot nicer in texture perspective and sub-pixel alignment than Tomb Raider, and without the limited PoV of Doom. (It might actually be more playable on a 386 than Doom is, especially compared to Doom in high detail mode.) Tomb Raider is way more CPU intensive, obviously, but there's a hell of a lot more being drawn on-screen there. (Both in terms of vertex count and pixel/texel drawing.)

Tomb Raider has a horribly inaccurate software renderer though. It's neither very fast nor very stable. It does about the same as Descent does, but I think Descent does it somewhat better.

It has just about the same problems as the Playstation version of Tomb Raider does. (Tomb Raider II got a bit better on the Playstation, though the PC software renderer doesn't support perspective correction at all and the palette works far worse for shading.)

Descent seems to be a better example though, and uses much more quake-like level design. (Saturn Quake seems to be a good example of what a software quad+triangle rasterizer could have done without any span subdivision -just relying on affine mapping being naturally more correct on single-piece quads than 2-piece triangle strips/fans -not sure which term is applicable to 2-triangle primitives like that)

I wonder if Descent actually used a quad renderer ... obviously you need to use triangles in the geometry calculations (and all quads will end up as 2-triangle strips on the point-plotting end, but treated as single quadrilateral primitives by the rasterizer). Doing fully warped quads is more complicated too (Saturn/3DO style), but I think there are some simplifications there as well, like limiting quads to squares/rectangles/rhomboids/trapezoids where at least 2 of the lines are parallel. (And any model primitives that don't fit the trapezoid limit can just fall back to 3-point polygons anyway, or even use the same rendering code and 'attach' two of the quad points to fold it into a triangle, like Saturn games do, but without the overdraw issue.) That's simpler on the hardware design end too, and is actually what the (unreleased) Jaguar II chipset does for its 3D primitive rasterizer. (The blitter can render trapezoids, but not free 4-point quads; I think it better matched the existing line-drawing algorithms the Jaguar blitter supported and still offered more flexibility than a fixed-function triangle rasterizer.)

Honestly, 3D projected/rotated rhomboids/trapezoids cover pretty much all the common instances where quads are used in 3D models anyway, and where they're more useful than triangles. (The most obvious distortion in Tomb Raider, and other affine-mapped triangle renderers, is on large rectangular surfaces, which when projected in 3D still end up as trapezoids anyway; unless my grasp on 3D perspective is totally off base here, or at least on approximated 3D perspective using vanishing-point alignment.) Or ... maybe it only works with a fixed PoV like Doom (which Descent has to an extent as well; I don't recall being able to roll), so everything has parallel vertical wall-edge alignment. More skewed perspectives would make simple vanishing-point-style perspective non-functional, or at least full of errors, so you'd need to resort to 4-point fully warped quads or triangles. (With the Jaguar II, triangles would be the only option, and quads like that would need 2-triangle strips, or resort to GPU-assisted line-list-style rasterization using quads, which would probably still be faster than software/GPU-subdivided texture spans. Note 'GPU' refers to the embedded RISC MPU, not the blitter or object processor.)

Mind you, Tomb Raider does not perform any skinning, so you get very 'blocky' movement of limbs. Quake is smoothed out. I think Half-Life may have been the first to perform realtime skinning, giving you the best of both worlds. But by then we were well into Pentium territory.

Ironically, you end up with animated models close to the camera a lot more in Tomb Raider vs Quake, so the trade-offs might have better matched the two games if they'd swapped their animation styles. (OTOH you really need a high-framerate running/walking animation for the player model at the very least; this ended up looking quite a bit weirder than Quake when Bethesda used it in The Elder Scrolls Adventures: Redguard. Reserving enough memory for a more fluid animation was probably a better solution than articulated models, given it would only be for the player and not other animated models. Plus the camera has a stiffer/tighter chase view on the player compared to Tomb Raider, so the choppy walking -or step climbing- is even more obvious ... especially for a 1998 game with 3DFX support that otherwise looks rather nice ... well, aside from the flat-shaded lighting in the software renderer.)

Well, I would say that they already passed the 486-station with Doom. Apparently that's what they did on a 486. Preferring image quality and smooth framerate over 'true' 3D.

In that case, Quake likely would've ended up designed around Duke Nukem 3D style portal limitations to work around the height-map level design. But column renderers are inherently slower than line renderers due to the inability to make use of multi-pixel writes (unless you render multiple columns simultaneously, which gets messy quickly). So a true 3D-space vertex plotted polygon engine COULD be faster than a Doom style column/span renderer if other trade-offs were made. (Doom on the Saturn and 3DO probably would've been a hell of a lot faster as a 3D engine than a ray-cast height map, and the Playstation port converted it to just that ... the Jaguar port might have been faster as a polygon engine for that matter -likewise Descent would have likely run faster on the Jaguar than Doom did)

Plus, 3D game aesthetics can be optimized to use textures more selectively rather than texture mapping every single thing, speeding up rendering quite a bit by avoiding the need to fetch texels. (Using ray casting to depth-sort detail levels and rendering untextured, shaded models, possibly with lower polygon counts, in the far distance could certainly have been one way to speed things up significantly, especially on more bandwidth-bound hardware.) Tomb Raider probably should have done that to speed up rendering. (It does do fade-to-black distance fogging and limited draw distance, but I don't think it cuts out texture mapping entirely.)

So for 486 you'd probably only want a 32-pixel span loop unrolled. It's probably going to be quite a bit slower if you have multiple types of spans, and the logic to pick the right one.

I suppose all-or-nothing perspective correction would be somewhat more reasonable (just doing affine rendering on everything in a low-detail mode), but aside from the K5, I'm not sure how much that would help Quake's renderer, given the other floating point operations are so much slower on a 486, 5x86, or 6x86 than on a Pentium, pipelining or no. (On top of the 32-bit bus limit on 486 systems; even if you push the bus/L2 up to 66 MHz, the FPU bandwidth isn't remotely close enough ... albeit probably more of a slow-FPU limit than a 32-bit bus one.) Multiply and divide are the operations the Cyrix FPU really does better than a 486DX (and divide is as fast as or faster than a Pentium's), but the slow Fxch doesn't even make int/float swaps a viable workaround for speeding up non-Pentiums, with the exception of the K5. (The K5 would probably be pretty damn fast at Quake if it used the FPU registers as buffers and did all the multiplies and divides on the ALU.)

Even the K6 (and probably NX586 FPU) had a 2-cycle Fxch up until the CXT revision K6-2 (which came around after Quake 2's release so there was no incentive to even consider offloading floating point operations to the integer end and swapping back and forth due to Fxch overhead ... aside from the very limited marketshare of K5 users)

Either that, or saying 'to hell with perspective!', and just do affine texturemapping, with triangle subdivision to somewhat limit the distortion. But you should do this with subpixel/subtexel accuracy to maintain stability (that was my aim). That is something that Tomb Raider does not do (wasn't possible on PlayStation, but they could have done better than that on PC).
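The subpixel/subtexel accuracy mentioned here amounts to pre-stepping the interpolants from the exact edge crossing to the first covered pixel centre. A minimal sketch of the idea (my own formulation, assuming the convention that pixel x has its centre at x + 0.5):

```python
import math

def prestep(x_edge, u_edge, du_dx):
    """Sub-pixel/sub-texel correction: start at the first pixel centre at or
    right of the edge crossing, and advance u by the same fractional amount."""
    first_px = math.ceil(x_edge - 0.5)   # index of first covered pixel
    frac = (first_px + 0.5) - x_edge     # sub-pixel distance stepped over
    return first_px, u_edge + du_dx * frac
```

Without the prestep, u snaps to the edge value wherever the edge happens to cross, so textures slide as vertices move by fractions of a pixel; with it, each pixel centre always samples the same point on the surface, which is the stability difference being described.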

The sub-pixel accuracy issue seems independent of texture mapping entirely, and also doesn't seem that common on the whole (you don't see gaps forming in big ship models in X-Wing or Tie Fighter ... aside from the camera clipping through models entirely in 3rd-person mode or during collisions), and it seems an odd coincidence that it cropped up around the time the Playstation became popular (where it was a common, but not absolutely necessary, problem as well).

Also odd that seams seem to show up quite often, but overlapping polygon edges don't (polygons clipping through each other where they meet). If it's a matter of rounding vertex data causing THAT instead of open seams, they made a bad decision, given slight single-pixel clipping/overlapping like that is far less noticeable than the seams. (Unless both happen and only the seams are noticeable.)

Reply 38 of 43, by Scali

kool kitty89 wrote:

That said, it's interesting to note that integer multiplication performed poorer than FPU multiplication on all the 1995 CPUs except Intel's own 486.

For the Pentium there is a simple explanation for that:
Intel did not implement a dedicated integer multiply circuit. Instead, a multiply was performed on the FPU. So it was the same unit doing both mul and fmul, but the integer version had some extra overhead.
I am not sure if others such as AMD and Cyrix followed Intel's approach here.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 39 of 43, by Scali

kool kitty89 wrote:

(and likely ray-casting to organize that list and minimize overdraw)

I don't see how ray-casting would work for this case actually. Ray-casting a generic 'polygon soup' list of triangles will be very expensive.
Z-sorting is very efficient, and with a simple affine texturemapper, it is basically faster to draw a pixel than to perform a z-test. So you don't care that much about overdraw.

The only thing that would work on low-end systems like a 486 is what Quake does: build static acceleration structures for the static geometry (such as BSPs), and then you can ray-cast those efficiently, to get a potentially visible list of polygons, in proper depth-order (not requiring any z-sorting at all).
For the dynamic geometry, Quake uses z-buffering (the BSP-polygons perform write-only zbuffering, so that dynamic geometry can get occluded by static geometry properly).
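For comparison, the Z-sorting half of that is tiny: sort the polygon list back-to-front on depth and draw in order. A toy Python version (sorting on mean vertex depth, which is one of several possible sort keys):

```python
def painters_sort(polys):
    """Back-to-front z-sort: draw farthest polygons first so nearer ones
    overwrite them (no per-pixel z-test needed)."""
    return sorted(polys,
                  key=lambda poly: sum(v[2] for v in poly) / len(poly),
                  reverse=True)
```

The per-polygon cost is just the sort, which is why on a 486 it beats touching a z-buffer for every pixel; the price is the overdraw and the occasional wrong ordering that mean-depth sorting can't resolve (interpenetrating or cyclically overlapping polygons).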

kool kitty89 wrote:

The column renderer method might favor the 486 a bit more than polygon renderers of the same period (with or without texture mapping) due to making better use of the 32-bit wide bus/registers

What do you mean by that? The columns are just 1 pixel wide, which amounts to 1 byte in 256-colour mode. Since you draw vertically, you can't do 16-bit or 32-bit writes. So there isn't much difference between polygon rendering and column rendering, bus-wise: in both cases you do one byte at a time.
I would say that a polygon renderer can actually make better use of the 32-bit registers, because you can do more accurate/efficient fixedpoint interpolation with 32-bit registers. For column rendering that doesn't apply.

kool kitty89 wrote:

Wouldn't it work if the MiniGL had been written with integer math in mind in the first place?

Well yes, ultimately there will be fixedpoint coordinates sent to the accelerator, I would think (especially these early ones).
So if you have a lowlevel interface to the accelerator, where you can send screenspace polygons in fixedpoint-format, this can greatly benefit the 486. You could write an entire integer-only renderer on top of such an interface, avoiding the slow FPU.

kool kitty89 wrote:

Quake could probably get away with setting perspective only once per line

I doubt it, since Quake generally renders very large polygons for walls and floors. They would distort heavily over a long run of pixels.

kool kitty89 wrote:

I wonder if Descent actually used a quad renderer

The sources were released a while ago, and are available online: https://github.com/videogamepreservation/descent
Sadly it's a bit of a mess, with a bunch of alternative routines, not entirely sure which ones went into the released version.
But it appears to be a generic convex N-gon renderer, just like Quake. So it can do triangles, quads and more. There is a constant MAX_POINTS_IN_POLY, set to 100: https://github.com/videogamepreservation/desc … aster/3D/3D.INC
This is mainly interesting when you use (approximate) perspective texturemapping. Namely, with proper perspective, the interpolation of gradients is the same across the entire N-gon's surface. If you were to render with affine texturemapping, it would distort too much, and you'd want to treat the N-gon as a triangle-fan, where you render each triangle separately, performing setup for interpolation for every triangle in the N-gon.
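The constant-gradient property is easy to check numerically: any attribute that varies linearly in screen space (as u/z, v/z, and 1/z do over a planar polygon) has the same x-gradient no matter which triangle of the N-gon you compute it from. A small sketch using the standard plane-gradient formula (my own code, not Descent's):

```python
def grad_x(p0, p1, p2):
    """d(attr)/d(screen x) over a triangle; each point is (x, y, attr)."""
    (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    return ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
```

Since the gradient is shared, the setup division happens once per N-gon instead of once per triangle, which is exactly why the N-gon formulation pays off for (approximate) perspective mapping.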

kool kitty89 wrote:

Also odd that seams seem to show up quite often, but overlapping polygon edges don't (polygons clipping through each other where they meet). If it's a matter of rounding vertex data causing THAT instead of open seams, they made a bad decision, given slight single-pixel clipping/overlapping like that is far less noticeable than the seams. (Unless both happen and only the seams are noticeable.)

The problem on the PlayStation is that texture coordinates had limited precision.
This was fine when you rendered triangles as-is, because you could model things in such a way that your textures always fit to the proper coordinates on vertices.
But when you introduce clipping, you run into a problem: You are cutting off a part of the triangle, and introducing new vertices. You have to cut off the texture accordingly. But if you do not have enough precision for your texture coordinates, you cannot fit the texture properly, and it will move around a bit.
The same goes for that other problem: lack of perspective. In order to get some perspective-correction, PS1 renderers would attempt to subdivide large polygons into sets of smaller ones. But again, you have the problem that you can't fit the texture properly on the new vertices.
A combination of these two factors is probably what you're seeing when the textures are 'shifting'/'wobbling' on screen.
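The precision problem can be modelled directly: interpolate the exact texture coordinate at a clip-introduced vertex, then snap it to the hardware's representable grid. With integer texcoords the new vertex's texture can land up to half a texel off, and as the clip position shifts a little each frame, the snapped value jumps in whole-texel steps, which reads as crawling/wobbling textures. (Illustrative Python; parameter names are mine.)

```python
def clipped_texcoord(u0, u1, t, frac_bits=0):
    """Texture coordinate at a clip-introduced vertex, snapped to the hardware
    grid. frac_bits=0 models integer-only texcoords (PS1-style); more
    fractional bits shrink the snap error. Returns (quantized, exact)."""
    exact = u0 + (u1 - u0) * t       # where the texture *should* be cut
    step = 1.0 / (1 << frac_bits)    # smallest representable texcoord increment
    return round(exact / step) * step, exact
```

The same model covers the subdivision case: every subdivision vertex is a new interpolated coordinate that has to be snapped, so more subdivision means more places for the texture to shift.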

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/