Disruptor wrote on 2023-02-15, 07:47:
Just 2 bytes had to be changed. One byte for detection of MMX presence when starting the programme, and the other one in Game 2.
To disable the MMX check at the start of 3DMark 2000, you need to patch 3DMark2000.exe. The byte at offset 69626 (hex) is originally 75 (hex) (in the context at that point, this byte instructs the processor to jump if MMX is present) and needs to be changed to EB (hex) (which instructs the processor to jump unconditionally). This one is easy.
On a non-MMX computer, the modified 3DMark2000 program (which of course is no longer the actual 3DMark2000) will seem to lock up when loading the data for the "game 2" benchmarks. Actually, it just displays a message box complaining about bad JPEG data, but you can't see the message box because 3DMark is running in DirectX fullscreen exclusive mode, so the box on the desktop cannot be made visible. Pressing "Enter" often enough will likely "help" by confirming these message boxes. It turns out that 3DMark damages JPEG-compressed textures during loading if the computer is not MMX capable. The details are complicated, and explained below. To work around this problem, you can patch rlmfc.dll. At offset 14107 (hex), you will find a byte containing 75 (hex), which again needs to be changed to EB (hex). While this patch definitely fixes the JPEG corruption issue, it might have the side effect of producing slightly lower scores on non-MMX Pentium systems. This way of handling the issue is definitely not endorsed by Futuremark / MadOnion / Underwriters Laboratories. As this patch only affects systems that do not support MMX, a version patched like that still produces valid results on officially supported systems.
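For anyone who would rather script the two patches than use a hex editor, here is a minimal sketch. The offsets and byte values are the ones given above; the helper name patch_byte and the PATCHES table are made up for this example, and the script assumes you run it on backup copies of the files.

```python
from pathlib import Path

# Offsets and byte values are the ones quoted in the post; patch_byte itself
# is a hypothetical helper, not part of any existing tool.
PATCHES = {
    "3DMark2000.exe": 0x69626,  # MMX presence check at start-up
    "rlmfc.dll": 0x14107,       # alignment check in the memory copy routine
}
OLD, NEW = 0x75, 0xEB  # conditional short jump -> unconditional short jump


def patch_byte(path, offset, old=OLD, new=NEW):
    """Replace one byte in place; return True if changed, False if already patched."""
    data = bytearray(Path(path).read_bytes())
    if offset >= len(data):
        raise ValueError(f"{path}: offset {offset:#x} is beyond the end of the file")
    if data[offset] == new:
        return False  # nothing to do, the byte is already EB
    if data[offset] != old:
        # Refuse to touch a file that does not look like the expected version.
        raise ValueError(
            f"{path}: expected {old:#04x} at {offset:#x}, found {data[offset]:#04x}"
        )
    data[offset] = new
    Path(path).write_bytes(data)
    return True
```

With both files in the current directory, looping over PATCHES.items() and calling patch_byte(name, offset) for each entry should apply both changes. The check for the original 75 is there so that a different build of the files does not get silently damaged.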
The technical background (you may very well skip this part of the post if you just want to run 3DMark2000 on a non-MMX system, and are not interested in what exactly went wrong and how the fix helps with it): The Pentium processor has a 64-bit bus interface, but the integer execution unit of the 80486 or Pentium doesn't support any 64-bit data types. For reads from cacheable memory, this is not an issue: Whenever a 32-bit value is requested by the code, the Pentium processor requests a complete cache line (4 * 64 bits) from the mainboard, fully utilizing the 64-bit bus interface. The same is true for writes that get handled by the L1 write-back cache: As long as the data stays in the L1 cache, it is not communicated over the frontside bus, and not being able to natively cause 64-bit cycles doesn't impede performance on the frontside bus. When the cache line gets evicted (the processor decides to cache a different area of memory in that cache line, and needs to write back the modified data into L2 cache / RAM), it writes the whole cache line as 4 * 64 bits, again fully utilizing the frontside bus.
Things look different for writes that do not hit the L1 cache, though. This might be due to an access to main memory that just happens to not be in L1 cache, but it might also be due to a write to non-cacheable memory. The latter case is extremely important if you copy data generated by a software renderer into video memory, as video memory is treated as non-cacheable. Whenever you perform a 32-bit write to video memory, the processor starts a 32-bit write cycle over the frontside bus. The processor has a queue (write buffer) that allows it to continue operation while the write is happening, but the write will be sent as a 32-bit write from the processor to the mainboard. If the queue is full (IIRC it's 4 writes that can be queued), the processor does need to wait for a write to finish. Most Pentium chipsets will generate a single PCI write transaction when the processor performs a 32-bit write. On the other hand, if the processor performed a 64-bit write, the mainboard would generate a burst cycle transferring 2 * 32 bits while selecting the target card just once. If the current task of the processor is to write a chunk of rendered 2D data into video memory, PCI bandwidth and FSB bandwidth are a serious concern, and performing 64-bit writes on the FSB and two-word burst writes on the PCI bus can tremendously improve performance.
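To put a rough number on that, here is a back-of-the-envelope sketch that only counts uncached write cycles and says nothing about their actual timing. The 320x200 frame with one byte per pixel is an assumed example, and bus_writes is a made-up helper name.

```python
def bus_writes(num_bytes, write_width_bits):
    """Number of uncached write cycles needed for num_bytes at a given write width."""
    bytes_per_write = write_width_bits // 8
    return -(-num_bytes // bytes_per_write)  # ceiling division


FRAME_BYTES = 320 * 200  # assumed example: one 256-colour frame, 1 byte per pixel

writes_32 = bus_writes(FRAME_BYTES, 32)  # integer stores, one 32-bit cycle each
writes_64 = bus_writes(FRAME_BYTES, 64)  # 64-bit stores, half as many cycles
print(writes_32, writes_64)
```

Halving the number of FSB cycles (and target selections on the PCI bus) per frame is the whole point of going for 64-bit stores.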
There is a component in the Pentium processor that is able to generate a 64-bit write cycle natively, though. This component is the FPU. The FPU has a very fast "FST" instruction that stores a floating point number from a processor register into a 64-bit memory location. So the fastest way to write lots of data into video memory (or even mainboard memory if you miss the L1 cache) is using the FST instruction. There also is a fast FLD instruction to load 64-bit data from memory into an FPU register. So if you expect a lot of write misses, using FLD to load data into the FPU registers and then storing data from the FPU registers using FST is the optimal way to copy data on a Pentium, as this is the only way to fully utilize the 64-bit bus interface. This is one of the stunts that Quake I pulled off to get its impressive software rendering performance. And while this is probably the best-known stunt, it's by no means the only one. Quake does not use the FPU "only for copying data" as some less informed sources used to claim; it also calculates 3D geometry using the FPU, and most importantly, it parallelizes the calculations for perspective correction over CPU and FPU to get perspective correction "nearly for free", whereas the naive implementation of perspective correction would require a division per pixel.
With the Pentium MMX, there is another way to properly utilize the 64-bit bus: The native data width of the MMX instructions is 64 bits. While MMX instructions are typically not used for 64-bit arithmetic, they work with 64 bits at a time, performing 8 operations of 8 bits each or 2 operations of 32 bits each in parallel. This implies that there needs to be a 64-bit memory load instruction and a 64-bit memory store instruction to load or store a complete set of values (like 8 times 8 bits or 2 times 32 bits). If the CPU is MMX capable, 3DMark2000 uses MMX instructions to copy large amounts of data from memory to memory. If the CPU is not MMX capable, 3DMark2000 falls back to using the FPU.
There is a quirk with using the FPU as a data copying device, though: The IEEE754 standard defining floating point operations reserved a certain set of bit patterns for values that need "special treatment" and cannot be handled by the processor. The specific meaning of those values is up to the operating system or application program that implements the required special treatment. The FPU can be set up in a way that it causes a floating point exception whenever a calculation is performed that uses one of these "special treatment required" bit patterns as an operand. This exception can be handled by the operating system or application program, which can then perform the required "special treatment". At that time, the handler for the special treatment can decide whether it will produce a regular floating point number as output, or whether the output is still a special thing that requires another special treatment when that value is used. For 64-bit floating point numbers, as they are used by the Intel FPU, all values starting with the 12 most significant bits set and the next bit clear are values that require "special treatment". The IEEE754 standard calls these values "signalling NaNs".
The idea of copying values through the FPU just involves loading the 64-bit values and then storing them again, without performing any calculations, so one might expect that you can copy bit patterns that represent signalling NaNs. It turns out that you can't do that. The reason is that the Intel FPU doesn't have 64-bit registers; instead it has registers of 80 bits each. Any 64-bit value is extended to 80 bits when it gets loaded using FLD, and when the data is stored using FST, the 64 "most important" bits of the 80-bit register get extracted and stored to memory. This process is completely reversible, so for anything except signalling NaNs, the same 64-bit pattern that was loaded using FLD is stored by FST, even if the intermediate format is 80 bits. In the case of signalling NaNs, though, the FPU notices that the control word tells it not to ask the operating system for help with dealing with signalling NaNs ("the exception is masked"). In that case, the FPU is supposed to return a "quiet NaN" from any calculation that involves a NaN, no matter whether the input NaN is quiet or signalling. Quiet NaNs are values that represent "not a number", and calculations that involve NaNs typically just return another NaN again, but in contrast to signalling NaNs, they are meant for cases in which there just is no valid number representing the result (like dividing zero by zero), not for cases in which software hooks can extend the apparent capabilities of the FPU. And that's what happens when 3DMark is copying large amounts of data using the FPU. Any 64-bit value that has the top 12 bits set and the 13th bit clear (the "signalling NaN" pattern of Intel FPUs) is converted to a value with the 13th bit set (the corresponding "quiet NaN"). This modification of the value happens at the time when the FLD attempts to load the signalling 64-bit NaN into the 80-bit FPU register. The conversion operation registers a NaN on input, and outputs a quiet NaN.
The FST operation then correctly stores it. This is completely in line with the specification of an FPU that works with 80-bit values internally, and is handled this way on all x87 implementations, no matter whether it is a 486, a Pentium or even an 8087.
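The bit manipulation described here can be modelled in a few lines. This is only a sketch of the masked-exception behaviour on plain integers, not real FPU code, and the function names are made up for the example. One detail on top of the description above: strictly speaking, the IEEE754 double format treats a pattern as a signalling NaN when the 11 exponent bits are all set, the topmost fraction bit is clear and at least one other fraction bit is set, regardless of the sign bit, so the model checks exactly that.

```python
EXP_MASK = 0x7FF0_0000_0000_0000   # the 11 exponent bits
FRAC_MASK = 0x000F_FFFF_FFFF_FFFF  # the 52 fraction bits
QUIET_BIT = 0x0008_0000_0000_0000  # topmost fraction bit (the "13th bit" from the top)


def is_signalling_nan(bits):
    """True if the 64-bit pattern is a signalling NaN in IEEE754 double format."""
    exponent_all_ones = (bits & EXP_MASK) == EXP_MASK
    fraction_nonzero = (bits & FRAC_MASK) != 0  # zero fraction would be infinity
    quiet = (bits & QUIET_BIT) != 0
    return exponent_all_ones and fraction_nonzero and not quiet


def copy_through_x87(bits):
    """Model of a 64-bit FLD followed by a 64-bit FST with the exception masked."""
    if is_signalling_nan(bits):
        return bits | QUIET_BIT  # the load turns the signalling NaN into a quiet NaN
    return bits                  # every other pattern survives the round trip


print(hex(copy_through_x87(0xFFF7_1234_5678_9ABC)))  # signalling NaN pattern
print(hex(copy_through_x87(0x3FF0_0000_0000_0000)))  # 1.0, an ordinary number
```

Only the single "quiet" bit ever changes, which is why the damage to copied data is so sparse and easy to overlook.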
If you use the FPU to copy rendered 256-color data, the consequence of the NaN treatment is that if a pixel in the 8th column (or the 16th, the 24th, and so on) has color 255 (all 8 bits set), and the pixel to its left has a color value of 240 to 247, that pixel to the left of the color-255 pixel is replaced by a pixel of color 248 to 255. As long as you control the kind of graphics you render (like Quake does), you can take appropriate measures so that this munging of data isn't problematic. You might, for example, decide to not use color 255. In that case, no 64-bit pattern in the rendered picture is the pattern of any kind of NaN, signalling or not. Or you might choose a palette where colors 240 and 248, colors 241 and 249 and so on are similar enough that the user doesn't notice that the color value got modified by copying the rendered image to the screen. As I'm not a Quake technician, I didn't look up how Quake specifically deals with this issue, but both ways seem plausible.
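The pixel example can be reproduced with a small model of such a copy loop. This is a sketch under assumptions: fpu_copy is a made-up name, the buffer length is taken to be a multiple of 8, the byte order is little-endian as on x86 (so the 8th pixel of each group ends up in the most significant byte), and the NaN test is the strict IEEE754 one (exponent bits all set, fraction non-zero, quiet bit clear).

```python
QUIET_BIT = 1 << 51          # topmost fraction bit of a 64-bit double
EXP_MASK = 0x7FF << 52       # the 11 exponent bits
FRAC_MASK = (1 << 52) - 1    # the 52 fraction bits


def fpu_copy(buf):
    """Copy a buffer 8 bytes at a time, modelling how FLD/FST quiets signalling NaNs.

    Assumes len(buf) is a multiple of 8 and little-endian (x86) byte order.
    """
    out = bytearray()
    for i in range(0, len(buf), 8):
        q = int.from_bytes(buf[i:i + 8], "little")
        is_snan = (
            (q & EXP_MASK) == EXP_MASK
            and (q & FRAC_MASK) != 0
            and not (q & QUIET_BIT)
        )
        if is_snan:
            q |= QUIET_BIT  # adds 8 to the 7th byte of the group
        out += q.to_bytes(8, "little")
    return bytes(out)


row = bytes([1, 2, 3, 4, 5, 6, 243, 255])  # 7th pixel = 243, 8th pixel = 255
copied = fpu_copy(row)
print(list(copied))
```

In this model the 7th pixel comes out as 251 instead of 243, while a row ending in 243, 254 passes through untouched.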
Now, back to 3DMark: The issue is that 3DMark2000 uses the FPU memory copying method for all kinds of large-amount-of-data copies, not just for rendered graphics. Specifically, FPU memory copying may be used while loading assets from the 3DMark data files. If the asset is a JPEG-compressed image, any bit modified while copying it can corrupt it and cause the result to no longer fully conform to the JPEG specification. The memory copying routine in 3DMark uses the FPU code path only if both the source and the destination address are aligned on a 64-bit boundary, because that's the only case in which FPU loads and stores are able to generate single 64-bit cycles. In all other cases, at least the load or the store cycle is broken down into two partial 64-bit cycles, negating most of the performance benefit. My patch to rlmfc.dll fakes the alignment check and causes the code path for non-aligned buffers to be executed even if the buffers are aligned. This is a good thing(TM) on a 486 processor: The 486 processor doesn't have the 64-bit bus interface that is required to make the floating point load/store stuff effective. Furthermore, the 486 FPU is not fast enough to compete with the 486 integer instructions for accessing memory. Instead, the non-aligned code path uses an unrolled loop that copies data through 32-bit integer registers, which should barely beat REP MOVSD on a 486 processor, and is one of the optimal ways to copy data.
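In other words, on a non-MMX CPU the choice of copy loop boils down to something like the following. This is a model of the behaviour as described above, not disassembled code from rlmfc.dll; the function name, the addresses and the "patched" flag are made up for illustration.

```python
def choose_copy_path(src_addr, dst_addr, patched=False):
    """Return which copy loop would run on a non-MMX CPU (behavioural model only)."""
    both_aligned = (src_addr % 8 == 0) and (dst_addr % 8 == 0)
    if both_aligned and not patched:
        # FLD/FST loop: single 64-bit bus cycles, but may quiet NaN bit patterns.
        return "fpu"
    # Unrolled 32-bit integer loop: slower on a Pentium, but bit-exact.
    return "integer"


print(choose_copy_path(0x1000, 0x2000))                # both aligned, unpatched
print(choose_copy_path(0x1004, 0x2000))                # source not 8-byte aligned
print(choose_copy_path(0x1000, 0x2000, patched=True))  # alignment check bypassed
```

With the patch applied, the bit-exact integer loop is always taken, which is why the JPEG data survives and why a non-MMX Pentium may lose a little copy speed.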