I'm still trying to work out whether the Unreal Engine relies on MMX for the software renderer or not, or rather, how much it relies on it. Since it does make heavy use of MMX for the sound processing, mixing in FPU operations in parallel would at least be unappealing (MMX aliases the x87 register file, so frequent FPU/MMX switches mean costly EMMS transitions) — especially a Quake-like pipelined, FP-heavy, FXCH/register-swap-heavy mechanism. It could instead simply be an ALU-optimized renderer (possibly Pentium-specific, or maybe generalized across the range of AMD, Cyrix, and multi-generational Intel CPUs out there: though mostly the P55C and PII/Celeron given their MMX target).
They could use MMX for 3D matrix math and such while the actual pixel/texel rasterizer runs on the straight ALU, but I'd think the hardware SIMD functionality of MMX would be appealing for its 16bpp-friendly packed RGB handling, given the pack and unpack instructions and the fact that 16-bit (or 15-bit) RGB doesn't align with bytes or nybbles, so you need a lot of 5- and 6-bit shifts and masks to pack and unpack RGB components for shading, lighting, reflection, and alpha-blending effects. (And only some alpha is dithered by default, namely 3D alpha/translucency effects; plus that can be disabled on the command line with little or no significant performance loss when I tried it.)
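To make the shift-and-mask cost concrete, here's a sketch (not Unreal's actual code, just the general technique) of what a plain-ALU per-pixel lighting step looks like in RGB565 — every pixel pays for three field extractions, three multiplies, and a repack:

```c
#include <stdint.h>

/* Unpack a 5:6:5 pixel into separate 8-bit channels. */
void unpack565(uint16_t p, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (uint8_t)(((p >> 11) & 0x1F) << 3);  /* 5-bit red into a byte's high bits  */
    *g = (uint8_t)(((p >>  5) & 0x3F) << 2);  /* 6-bit green                        */
    *b = (uint8_t)(( p        & 0x1F) << 3);  /* 5-bit blue                         */
}

/* Repack three byte channels into 5:6:5, dropping the low bits again. */
uint16_t pack565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Scale a pixel's brightness by light/256 -- the sort of per-pixel lighting
   step that costs a pile of shifts and masks without MMX's packed multiplies
   and PACKUSWB/PUNPCK-style conversions. */
uint16_t shade565(uint16_t p, unsigned light)
{
    uint8_t r, g, b;
    unpack565(p, &r, &g, &b);
    return pack565((uint8_t)(r * light >> 8),
                   (uint8_t)(g * light >> 8),
                   (uint8_t)(b * light >> 8));
}
```

For example, `shade565(0xFFFF, 128)` (half-bright white) comes out as `0x7BEF`. With MMX, four such channel multiplies happen in one PMULLW, which is presumably why 16bpp is where the MMX path would pay off most.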
Plus, the software renderer appears to disable the dithered texture-filtering effect on non-MMX CPUs and seems to have minor rendering errors (my attempt with a Cyrix 6x86L showed mipmap-error-ish random red/green/blue pixel-dot shimmering artifacts). Though that may just have been an overall attempt to boost performance on presumed-slow processors. It may also resort to 16-bit integer math to try to speed things up. (I'm not sure about the P5 vs P6 vs Cyrix vs AMD speed advantages of actual 16-bit integer math over 32-bit, but if nothing else you'd be able to make use of the 16-bit split registers and gain some speed from the larger number of usable registers ... a bigger deal for the P5 family than for CPUs with extended register sets and renaming ... and a big deal for 486-class CPUs, though I don't think DX4/5x86-class chips were ever intended as realistic platforms for Unreal's software renderer ... maybe they had the WinChip C6 or WinChip 2 in mind, especially since those at least supported MMX and the WinChip 2 had pretty decent MMX performance.)
I haven't done Unreal-based framerate/performance comparisons between CPUs and haven't seen that come up in other testing/benchmark compilations so far, so I can't really glean anything beyond anecdotal testing I did a few years back: a P55C, K6 (3.2V), and Cyrix MII, all at 2.5x100 MHz, all seemed pretty similar in both the software renderer and in Glide with a Voodoo 3 (2500 I think), and a K6-2 seemed pretty close as well — but I didn't have a frame counter/indicator enabled for any of that. (Given the huge MMX performance boost of the K6-2 over the K6, there at least should be some difference there.)
Additionally, I don't think there'd be much if any incentive to actually use 3DNow! on the K6-2 if the engine was already ALU+MMX optimized around 16/32-bit integer ops, since the K6-2's MMX integer performance seems to be consistently better than its 3DNow! FP performance, and MMX was the more widely supported feature at the time. (Honestly, it would've made way more sense for Quake/Quake II to get an MMX patch than a 3DNow! one for the same reason, even if just for the OpenGL renderers, but I assume they didn't want to bother reworking other code around integers in their GL renderer implementation ... though OpenGL itself supports integer as well as floating-point vertex formats, so that wouldn't have been the reason. Beyond that, they could've supported integer math in GLQuake/Quake II from the start as an option for CPUs with faster integer performance, or even tweaked the software renderer to offload certain computations to the ALU: say, fixed-point vertex math while leaving the span rasterizer itself, and its associated FP perspective computation, unchanged. That's something that would've helped the K5 and 6x86 a good deal, as both could overlap FPU and integer execution via FPU pipelining or instruction queues — as could the Cyrix 5x86, I believe — allowing parallel execution like the P5, just with very different biases in actual execution times. 486-class CPUs and the K6 family didn't support that same sort of parallelism AFAIK, even though the K6 FPU executed many operations significantly faster than the Cyrix one, and with even faster or lower latency than the P5 or P6 for some operations, particularly multiplication.)
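The fixed-point-vertex-math idea above can be sketched roughly like this — this is my own illustration of the general 16.16 fixed-point scheme, not anything from id's or Epic's source; the transform runs entirely on the ALU and only the final perspective divide stays in floating point:

```c
#include <stdint.h>

typedef int32_t fx;                   /* 16.16 fixed point */
#define FX(x) ((fx)((x) * 65536.0))   /* float constant -> 16.16 */

/* 16.16 multiply: widen to 64 bits, then shift back down.
   On a P5-class CPU this maps onto IMUL's EDX:EAX result. */
fx fxmul(fx a, fx b)
{
    return (fx)(((int64_t)a * b) >> 16);
}

/* Rotate one vertex with integer math (2x3 matrix for brevity), then do the
   perspective divide in floating point -- the split described above, where
   the ALU carries the vertex work and the FPU only handles the divide. */
void project(fx x, fx y, fx z, fx m[2][3], float *sx, float *sy)
{
    fx tx = fxmul(m[0][0], x) + fxmul(m[0][1], y) + fxmul(m[0][2], z);
    fx ty = fxmul(m[1][0], x) + fxmul(m[1][1], y) + fxmul(m[1][2], z);
    float fz = z / 65536.0f;
    *sx = (tx / 65536.0f) / fz;
    *sy = (ty / 65536.0f) / fz;
}
```

On CPUs where the FPU can't overlap with integer work (or is just slow), everything up to the divide costs only integer ops; on a P5 or 6x86 the divide could additionally overlap with the next vertex's integer math.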
Plus, Unreal could potentially be quite playable on a fast 486-class CPU in Glide (or maybe MeTaL) mode, maybe Direct3D too, so using integer math would help a good deal there as well, and much more so on K5 and 6x86 based systems. (The K5 had particularly fast integer mul/div performance, which I'm pretty sure is why it also scores so high in Sandra's integer multimedia test without MMX, with good DSP-like multiply-accumulate throughput. That also means it should be among the best non-MMX CPUs for running the sound driver in high-quality mode, even if Epic didn't specifically consider that, or at least didn't auto-detect it.)
8-bit and 24/32-bit color modes are much friendlier to ALU rendering, as they pack/unpack along byte boundaries and pixels can be written as single 8-bit or 32-bit words (assuming an unpacked 24-bit format). 8bpp doesn't really pack/unpack at all, instead relying on tables for indexed shading/lighting/blending effects (usually, though technically a packed 4+4-bit format could be used with a logical 16x16 color/shade array for shading, or a 16x16 color array for blending with shading done via lookup). But Unreal uses 16-bit hicolor by default, doesn't support 8-bit rendering, and doesn't seem to have an affinity for its 32-bit software rendering mode. (I should compare framerates with dithered blending enabled and disabled at 16- and 32-bit color to be more sure of this, but haven't.) I assume it makes significant use of MMX's pack/unpack SIMD functionality, at least for certain lighting and blending effects, to make 16-bit rendering fast. And even blending aside, a pixel-by-pixel rasterizer would lose little speed going from 16- to 32-bit color depth, since it wouldn't be packing pixels into an output buffer for faster burst writes anyway.
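The 8bpp table-lookup approach is worth illustrating, since it's why paletted rendering needs no packing at all. This is a generic sketch of the technique (the table here is built for a hypothetical grayscale ramp palette where index equals brightness, just so the math is checkable; a real engine would search its actual 256-color palette for the nearest darkened match):

```c
#include <stdint.h>

#define SHADES 64

/* Maps (light level, color index) -> palette index of the darkened color.
   Built once at load time; per-pixel shading is then a single table read,
   completely independent of the palette's RGB bit layout. */
uint8_t light_table[SHADES][256];

void build_light_table(void)
{
    for (unsigned s = 0; s < SHADES; s++)
        for (unsigned c = 0; c < 256; c++)
            /* Grayscale-ramp assumption: darkening is just index scaling. */
            light_table[s][c] = (uint8_t)(c * s / (SHADES - 1));
}

/* The entire per-pixel lighting cost in 8bpp: one indexed load. */
uint8_t shade8(uint8_t color, unsigned light)
{
    return light_table[light][color];
}
```

No shifts, no masks, no multiplies per pixel — which is exactly why 8bpp software renderers of that era leaned on lookup tables, and why hicolor modes without MMX have to pay for the 5/6-bit field juggling instead.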
The exception might still be non-MMX CPU rendering, depending on whether they set up some sort of software-SIMD routine for packing pixels or a more direct, brute-force pixel-by-pixel bit-shifting method to pack/unpack RGB components. (In the latter case, there should be a real speed gain in 32-bit rendering mode: bandwidth usage would be higher, but that would be largely irrelevant for a pixel-by-pixel rasterizer, since it's still going to be doing individual texel reads and pixel writes rather than packing and buffering lines of pixels. The performance boost from cached read/write operations, plus a smaller framebuffer, would be the main advantage of 16-bit data there.)
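By "software-SIMD" I mean SWAR-style tricks ("SIMD within a register") — whether Epic used any is exactly what I don't know, but this classic one shows the flavor: a 50/50 blend of two RGB565 pixel pairs in a single 32-bit operation, by masking off each channel's low bit before the shift so the 5- and 6-bit fields can't bleed into each other:

```c
#include <stdint.h>

/* Average two pairs of RGB565 pixels at once, SWAR-style.
   0xF7DE clears the low bit of each of the three channels, so a single
   shift-right-by-one halves all six channels (two pixels) in parallel,
   at the cost of one bit of precision per channel. */
uint32_t blend565_pair(uint32_t a, uint32_t b)
{
    return ((a & 0xF7DEF7DEu) >> 1) + ((b & 0xF7DEF7DEu) >> 1);
}
```

Blending a pair of white pixels (`0xFFFFFFFF`) with black yields `0x7BEF7BEF` — mid-gray in both halves. Tricks like this are why 16bpp ALU rendering isn't hopeless without MMX, but they only cover special cases (50/25/75% blends); arbitrary light levels still need the full unpack/multiply/repack, which is where MMX wins.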