Modern graphics on a 486

Reply 300 of 371, by Disruptor

Posted on 2023-02-13, 12:16

Rank: Oldbie
Posts: 1453
Joined: 2018-03-22, 18:31
Location: European Union

There is a misunderstanding about what "always dirty" means. It does not mean that the dirty bit setting in BIOS is always active!

I would just like you to compare 32 MB 60 ns @ 180 WT+7/WT+8 with 64 MB 60 ns @ 180 WT+7/WT+8 in 3DMark 99 Max.
Please tell me all 4 results and whether each one is stable or not.

With 64 MB 60 ns @ 180 WT+7 you got:
3DMARK99MAX: 427 3DMarks and 468 CPU 3DMarks

With 64 MB 60 ns @ 180 WT+8 you got:
?

With 32 MB 60 ns @ 180 WT+7 you got:
?

With 32 MB 60 ns @ 180 WT+8 you got:
?

Reply 301 of 371, by gonzo

Posted on 2023-02-13, 17:26

Rank: Member
Posts: 129
Joined: 2021-01-03, 13:57

Disruptor wrote on 2023-02-13, 12:16:
There is a misunderstanding about what "always dirty" means. It does not mean that the dirty bit setting in BIOS is always active! […]

With 64 MB 60 ns @ 180 WT+8: 428 3DMARKS; 469 CPU-MARKS

With 32 MB 60 ns @ 180 WT+7: 427 3DMARKS; 469 CPU-MARKS

With 32 MB 60 ns @ 180 WT+8: 425 3DMARKS; 453 CPU-MARKS

In fact, the 3D scores are equal; there is no difference between 64 MB WT+8, 64 MB WT+7 and 32 MB WT+7.
That's why I generally use WT+8 (most stable) and set faster RAM timings, since I am using the 50 ns RAM module.

I ran every test only once, so the scores may vary by a few points.

Fortunately, 7 bits did work for the test.

To be sure, here again are the BIOS settings I used for 60 ns, which are the same for 32 MB/64 MB in WT mode (apart from 7 bits or 8 bits).

Attachment: ZIDA-4DPS_DX5-180_Voodoo3-2000_Tests-RAM-60ns.jpg (182 KiB; file license: fair use/fair dealing exception)

Reply 302 of 371, by Disruptor

Posted on 2023-02-13, 18:06

Well, I didn't expect the "32 MB 60 ns @ 180 WT+8: 425 3DMARKS; 453 CPU-MARKS" result (it looks like a glitch), but at least it seems you don't have a difference in cacheable area between 64 MB WT+7 and WT+8.

Reply 303 of 371, by CoffeeOne

Posted on 2023-02-13, 20:48

Rank: Oldbie
Posts: 1114
Joined: 2019-12-25, 16:12
Location: Austria

Disruptor wrote on 2023-02-13, 18:06:

Well, I didn't expect the "32 MB 60 ns @ 180 WT+8: 425 3DMARKS; 453 CPU-MARKS" result (it looks like a glitch), but at least it seems you don't have a difference in cacheable area between 64 MB WT+7 and WT+8.

Hello,

I could be wrong, but I believe that the 7bit/8bit setting does not matter in L2 WT mode. => So it is always 8bit.
The 7bit/8bit switch should only play a role in the L2 WB mode.

Reply 304 of 371, by pshipkov

Posted on 2023-02-14, 01:48

Rank: Oldbie
Posts: 1951
Joined: 2018-10-11, 05:08

Correct.

Reply 305 of 371, by gonzo

Posted on 2023-02-14, 09:47

CoffeeOne wrote on 2023-02-13, 20:48:

Hello,

I could be wrong, but I believe that the 7bit/8bit setting does not matter in L2 WT mode. => So it is always 8bit.
The 7bit/8bit switch should only play a role in the L2 WB mode.

That's what I think, too.
Is this generally the case, or does it depend on:
- the chipset?
- the configuration/composition of some L2-modules?
- any BIOS-bug?

Reply 306 of 371, by CoffeeOne

Posted on 2023-02-14, 21:49

gonzo wrote on 2023-02-14, 09:47:

That's what I think, too.
Is this generally the case, or does it depend on:
- the chipset?
- the configuration/composition of some L2-modules?
- any BIOS-bug?

It clearly depends on the chipset.
The SiS 471 chipset provides this 7-bit/8-bit switch for L2 write-back mode.
Whether it is selectable depends on the BIOS.

8 bit means the cacheable area is the same as in L2 write-through mode, but the L2 cache works without dirty tag RAM functionality.
7 bit means that 1 bit is taken from the tag RAM for the dirty bit. It is faster, but it halves the cacheable area.
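
As a rough illustration of why giving up one tag bit halves the cacheable area: in a direct-mapped cache, every tag bit doubles the address range that can be told apart, so the cacheable area is the cache size times 2 to the power of the tag width. The sketch below assumes a direct-mapped L2 and a cache size of 256 KiB; neither figure comes from this thread, and the function name is made up for the example.

```python
def cacheable_area_mib(cache_kib: int, tag_bits: int) -> int:
    """Cacheable area in MiB for a direct-mapped cache.

    cache_kib: cache size in KiB (assumed value, not taken from the thread)
    tag_bits:  number of tag RAM bits used for address comparison
    """
    return cache_kib * (2 ** tag_bits) // 1024


# Assumed 256 KiB L2 cache: all 8 tag bits used for the address,
# versus 7 address bits plus 1 dirty bit.
full = cacheable_area_mib(256, 8)
reduced = cacheable_area_mib(256, 7)
print(f"8-bit tag: {full} MiB, 7-bit tag + dirty bit: {reduced} MiB")
```

Under these assumptions the 8-bit setting covers 64 MiB and the 7-bit setting 32 MiB, which is the kind of difference the 32 MB versus 64 MB benchmark runs earlier in the thread were meant to expose.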

One can still have L2 WB and the full cacheable area, but then an additional SRAM is needed. This is the case for the famous AMI Enterprise IV mainboard: it has the SiS 471 chipset and an external dirty tag RAM.

The Asus SV2GX4 mainboard has no 7-bit/8-bit setting, so L2 WB always means without "dirty".
But the patched BIOS "L2 Fix" always switches L2 WB to 7-bit mode. => No possibility to change it.

Reply 307 of 371, by Disruptor

Posted on 2023-02-15, 00:04

Breaking News:
Benched a 486 with 3DMark 2000 for the very first time.
187 3D marks with a 486/133 and a GeForce 5200 PCI.

Reply 308 of 371, by BitWrangler

Posted on 2023-02-15, 05:46

Rank: l33t++
Posts: 6693
Joined: 2017-10-11, 00:55
Location: Ontario

Has HWbot got a 486 category for 3DMark2000? you might get 20 points 🤣

Reply 309 of 371, by noshutdown

Posted on 2023-02-15, 07:37

Rank: Oldbie
Posts: 1179
Joined: 2010-07-23, 17:04
Location: China

Disruptor wrote on 2023-02-15, 00:04:

Breaking News:
Benched a 486 with 3DMark 2000 for the very first time.
187 3D marks with a 486/133 and a GeForce 5200 PCI.

is it a hacked version of 3dmark2000? i think it refuses to run on any cpus without mmx. 3dmark2001 on the other hand would give a warning but still run.

Reply 310 of 371, by Disruptor

Posted on 2023-02-15, 07:47

noshutdown wrote on 2023-02-15, 07:37:

Disruptor wrote on 2023-02-15, 00:04:

Breaking News:
Benched a 486 with 3DMark 2000 for the very first time.
187 3D marks with a 486/133 and a GeForce 5200 PCI.

is it a hacked version of 3dmark2000? i think it refuses to run on any cpus without mmx. 3dmark2001 on the other hand would give a warning but still run.

It was patched to disable some checks for MMX presence.
Game 1 ran without any issues.
But in Game 2 there were bit faults when copying a JPG image using the FPU (for speed reasons), so support for non-MMX Pentiums was probably dropped for this reason too.
Runs with Pentium (non-MMX) and Cyrix 6x86 will follow at some point.
Just 2 bytes had to be changed. One byte for detection of MMX presence when starting the programme, and the other one in Game 2.

Last edited by Disruptor on 2023-02-15, 15:10. Edited 1 time in total.

Reply 311 of 371, by gonzo

Posted on 2023-02-15, 09:02

Disruptor wrote on 2023-02-15, 00:04:

Breaking News:
Benched a 486 with 3DMark 2000 for the very first time.
187 3D marks with a 486/133 and a GeForce 5200 PCI.

Your posts are the first time I have seen a GeForce FX 5200 PCI working well with a 486 board.
Congratulations! 😀
Would you be so kind as to upload some pictures of your board (the chipset would be very important, too) and the 5200? That would be very interesting.

BTW, on your 486-system you will need Dx7 to run 3Dmark 2000. As far as I know, Dx7 is the latest/newest Dx-version for a 486-CPU.
What version exactly are you using: Dx7 "only", or Dx7a, or Dx7.1?

Reply 312 of 371, by Disruptor

Posted on 2023-02-15, 14:21

Thanks, but most of the work has been done by mkarcher.

gonzo wrote on 2023-02-15, 09:02:

Would you be so kind as to upload some pictures of your board (the chipset would be very important, too) and the 5200? That would be very interesting.

Re: Cheap but well performing PCI 3D video cards
It's a passively cooled ZOTAC GeForce 5200 PCI.
We currently are limited to 33 MHz PCI with that card.
However, there's currently a Frankenstein solution with a soldered connection to a 3.3 V converter.

My 486 UMC8886/8881 Project (Version 2.0)
Of course now without ET6000.
Because the aim was to get a graphics card with DVI output into that 486.

The Matrox G450 PCI runs with 40 MHz PCI, but mkarcher patched the BIOS because there were problems with the initialisation of the PCI bus and the HighPoint controller when the PCI clock divider was set to 20 MHz during early initialisation. For stability reasons we installed a passive cooler on the PCI-to-AGP bridge.

Attachment: DxDiag.txt (8.53 KiB; file license: fair use/fair dealing exception)

Reply 313 of 371, by mkarcher

Posted on 2023-02-15, 19:24

Rank: l33t
Posts: 2596
Joined: 2019-01-19, 16:29
Location: Germany

Disruptor wrote on 2023-02-15, 07:47:

Just 2 bytes had to be changed. One byte for detection of MMX presence when starting the programme, and the other one in Game 2.

To disable the MMX check at the start of 3DMark 2000, you need to patch 3DMark2000.exe. The byte at offset 69626 (hex) is originally 75 (hex) (in the context at that point, this byte instructs the processor to jump if MMX is present) and needs to be changed to EB (hex) (which instructs the processor to jump unconditionally). This one is easy.

On a non-MMX computer, the modified 3DMark2000 program (which of course is no longer the actual 3DMark2000) will seem to lock up when loading the data for the "game 2" benchmarks. Actually, it just displays a message box complaining about bad JPEG data, but you can't see the message box because 3DMark is running in DirectX fullscreen exclusive mode, so the box on the desktop cannot be made visible. Pressing "Enter" often enough is likely going to "help" to confirm these message boxes. It turns out that 3DMark damages JPG compressed textures during loading if the computer is not MMX capable. The details are complicated, and explained below. To work around this problem, you can patch rlmfc.dll. At offset 14107 (hex), you will find a byte containing 75 (hex), which again needs to be changed to EB (hex). While this patch definitely fixes the JPG corruption issue, it might have the side effect of producing slightly lower scores on non-MMX Pentium systems. This way of handling the issue is definitely not endorsed by Futuremark / MadOnion / Underwriters Laboratories. As this patch only affects systems that do not support MMX, a version patched like that still produces valid results on officially supported systems.
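
A minimal sketch of how the two single-byte patches described above could be applied with a safety check. The offsets and byte values are the ones given in the post (75 hex is the x86 short JNZ opcode, EB hex the short JMP opcode); the function names and the idea of verifying the original byte first are assumptions of this example, not part of the original procedure. Work on backup copies of the files.

```python
def patch_byte(image: bytearray, offset: int, expected: int, replacement: int) -> None:
    """Replace one byte in place, but only if it currently holds the expected value."""
    if offset >= len(image):
        raise ValueError(f"offset {offset:#x} is beyond the end of the file")
    if image[offset] == replacement:
        return  # already patched, nothing to do
    if image[offset] != expected:
        raise ValueError(
            f"byte at {offset:#x} is {image[offset]:#04x}, expected {expected:#04x}; "
            "this may be a different build of the file"
        )
    image[offset] = replacement


def patch_file(path: str, offset: int, expected: int, replacement: int) -> None:
    """Read a file, patch one byte and write it back (hypothetical helper)."""
    with open(path, "rb") as handle:
        image = bytearray(handle.read())
    patch_byte(image, offset, expected, replacement)
    with open(path, "wb") as handle:
        handle.write(image)


# Offsets and values as stated in the post: 0x75 (JNZ short) -> 0xEB (JMP short).
PATCHES = {
    "3DMark2000.exe": (0x69626, 0x75, 0xEB),
    "rlmfc.dll": (0x14107, 0x75, 0xEB),
}

if __name__ == "__main__":
    # Demonstration on a synthetic buffer instead of the real files.
    demo = bytearray(b"\x90\x75\x05\x90")
    patch_byte(demo, 1, 0x75, 0xEB)
    print(demo.hex())
```

To patch the real files, one would call `patch_file(name, *PATCHES[name])` for each entry from inside the 3DMark 2000 directory; the check on the expected byte makes the call fail instead of damaging a file from a different build.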

The technical background (you may very well skip this part of the post if you just want to run 3DMark2000 on a non-MMX system, and are not interested in what exactly went wrong and how the fix helps with it): The Pentium processor has a 64-bit bus interface, but the integer execution unit of the 80486 or Pentium doesn't support any 64-bit data types. For reading of cacheable memory, this is not an issue: Whenever a 32-bit value is requested by the code, the Pentium processor requests a complete cache line (4 * 64 bits) from the mainboard, fully utilizing the 64-bit bus interface. The same is true for writes that get handled by the L1 write-back cache: As long as the data stays in the L1 cache, it is not communicated over the frontside bus, and not being able to natively cause 64-bit cycles doesn't impede performance on the frontside bus. When the cache line gets evicted (the processor decides to cache a different area of memory in that cache line, and needs to write back the modified data into L2 cache / RAM), it writes the whole cache line as 4 * 64 bits, again fully utilizing the frontside bus.

Things look different for writes that do not hit the L1 cache, though. This might be due to an access to main memory that just happens to not be in L1 cache, but this might also be due to a write to non-cacheable memory. The latter case is extremely important if you copy data generated by a software renderer into video memory, as video memory is treated as non-cacheable. Whenever you perform a 32-bit write to video memory, the processor starts a 32-bit write cycle over the frontside bus. The processor has a queue (write buffer) that allows the processor to continue operation while the write is happening, but the write will be sent as a 32-bit write from the processor to the mainboard. If the queue is full (IIRC it's 4 writes that can be queued), the processor does need to wait for a write to finish. Most Pentium chipsets will generate a single PCI write transaction when the processor performs a 32-bit write. On the other hand, if the processor performed a 64-bit write, the mainboard would generate a burst cycle transferring 2 * 32 bits while selecting the target card just once. If the current task of the processor is to write a chunk of rendered 2D data into video memory, PCI bandwidth and FSB bandwidth are a serious concern, and performing 64-bit writes on the FSB and two-word burst writes on the PCI bus can tremendously improve performance.
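
To put a rough number on the paragraph above, here is a deliberately idealized clock-count model. It assumes that every CPU write becomes exactly one PCI transaction consisting of one address phase, one data phase per 32 bits, and one idle clock before the next transaction. Those per-phase costs are assumptions for illustration; real chipsets and cards add decode latency and wait states, so only the ratio is of interest.

```python
def pci_clocks(total_bytes: int, bytes_per_cpu_write: int,
               address_clocks: int = 1, data_clocks: int = 1,
               idle_clocks: int = 1) -> int:
    """Idealized number of PCI clocks needed to write total_bytes.

    Each CPU write is modelled as one PCI transaction:
    address phase + one 32-bit data phase per 4 bytes + idle clock.
    All default clock costs are assumptions, not measured values.
    """
    if total_bytes % bytes_per_cpu_write or bytes_per_cpu_write % 4:
        raise ValueError("sizes must be multiples of the write width and of 4 bytes")
    transactions = total_bytes // bytes_per_cpu_write
    data_phases = bytes_per_cpu_write // 4
    return transactions * (address_clocks + data_phases * data_clocks + idle_clocks)


frame = 320 * 200                 # one 8-bit 320x200 frame, in bytes
clocks_32 = pci_clocks(frame, 4)  # 32-bit integer writes
clocks_64 = pci_clocks(frame, 8)  # 64-bit writes (FPU or MMX stores)
print(clocks_32, clocks_64, clocks_32 / clocks_64)
```

In this model the 64-bit writes need two thirds of the bus clocks of the 32-bit writes, because half of the address phases and idle clocks disappear. The real gain depends on the chipset and the target card.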

There is a component in the Pentium processor that is able to generate a 64-bit write cycle natively, though. This component is the FPU. The FPU has a very fast "FST" instruction that stores a floating point number from a processor register into a 64-bit memory location. So the fastest way to write lots of data into video memory (or even mainboard memory if you miss the L1 cache) is using the FST instruction. There also is a fast FLD instruction to load 64-bit data from memory into an FPU register. So if you expect a lot of write misses, using FLD to load data into the FPU registers and then storing data from the FPU registers using FST is the optimal way to copy data on a Pentium, as this is the only way to fully utilize the 64-bit bus interface. This is one of the stunts Quake pulled off to get its impressive software rendering performance. And while this is probably the best-known stunt, it's far from the only one. Quake does not use the FPU "only for copying data" as some less informed sources used to claim; Quake also calculates 3D geometry utilizing the FPU, and most importantly, Quake parallelizes calculations for perspective correction over CPU and FPU to get perspective correction "nearly for free", whereas the naive implementation of perspective correction would require a division per pixel.

With the Pentium MMX, there is another way to properly utilize the 64-bit bus: The native data width of the MMX instructions is 64 bits. While MMX instructions are typically not used for 64-bit arithmetic, they work with 64 bits at a time, performing 8 operations of 8 bits each or 2 operations of 32 bits each in parallel. This implies that there needs to be a 64-bit memory load instruction and a 64-bit memory store instruction to load or store a complete set of values (like 8 times 8 bits or 2 times 32 bits). If the CPU is MMX capable, 3DMark2000 uses MMX instructions to copy large amounts of data from memory to memory. If the CPU is not MMX capable, 3DMark2000 falls back to using the FPU.

There is a quirk with using the FPU as a data copying device, though: The IEEE754 standard defining floating point operations reserved a certain set of bit patterns for values that need "special treatment" and cannot be handled by the processor. The specific meaning of those values is up to the operating system or application program that implements the required special treatment. The FPU can be set up in a way that it causes a floating point exception whenever a calculation is performed that uses one of these "special treatment required" bit patterns as operand. This exception can be handled by the operating system or application program, and then it can perform the required "special treatment". At that time, the handler for the special treatment can decide whether it will produce a regular floating point number as output, or the output is still a special thing that requires another special treatment when that value is used. For 64-bit floating point numbers, as they are used by the Intel FPU, all values starting with the 12 most significant bits set and the next bit clear are values that require "special treatment". The IEEE754 standard calls these values "signalling NaNs".

The idea of copying values through the FPU just involves loading the 64-bit values and then storing them again, without performing any calculations, so one might expect that you can copy bit patterns that represent signalling NaNs. It turns out that you can't do that. The reason is that the Intel FPU doesn't have 64-bit registers; instead it has registers of 80 bits each. Any 64-bit value is extended to 80 bits when it gets loaded using FLD, and when the data is stored using FST, the 64 "most important" bits of the 80-bit register get extracted and stored to memory. This process is completely reversible, so for anything except signalling NaNs, the same 64-bit pattern that was loaded using FLD is stored by FST, even if the intermediate format is 80 bits. In case of signalling NaNs, though, the FPU notices that the control word tells the FPU to not ask the operating system for help with dealing with signalling NaNs ("the exception is masked"). In that case, the FPU is supposed to return a "quiet NaN" from any calculation that involves a NaN, no matter whether the input NaN is quiet or signalling. Quiet NaNs are values that represent "not a number", and calculations that involve NaNs typically just return another NaN again, but in contrast to signalling NaNs, they are meant for cases in which there just is no valid number representing the result (like dividing zero by zero), not for cases in which software hooks can extend the apparent capabilities of the FPU. And that's what happens when 3DMark is copying large amounts of data using the FPU. Any 64-bit value that has the top 12 bits set and the 13th bit clear (the "signalling NaN" pattern of Intel FPUs) is converted to a value with the 13th bit set (the corresponding "quiet NaN"). This modification of the value happens at the time when the FLD attempts to load the signalling 64-bit NaN into the 80-bit FPU register. The conversion operation registers a NaN on input, and outputs a quiet NaN. The FST operation then correctly stores it. This fully conforms to the specification of an FPU that works with 80-bit values internally, and is handled this way on all x87 implementations, no matter whether it is a 486, a Pentium or even an 8087.
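
The effect can be sketched as a pure bit manipulation. The function below is an emulation written for this illustration, not code from 3DMark, and it deliberately does not use the host FPU, because whether a real load/store round trip alters the value depends on which instructions the host uses. It applies the IEEE 754 definition of a NaN (all 11 exponent bits set, fraction non-zero), under which the sign bit is irrelevant, so patterns starting with 7FF are affected in the same way as those starting with FFF.

```python
EXPONENT_MASK = 0x7FF0_0000_0000_0000  # bits 62..52
FRACTION_MASK = 0x000F_FFFF_FFFF_FFFF  # bits 51..0
QUIET_BIT = 0x0008_0000_0000_0000      # bit 51, the "13th bit" from the top


def fld_fst_roundtrip(bits: int) -> int:
    """Emulate FLD m64 followed by FST m64 with the invalid-operation exception masked.

    Every NaN comes back with the quiet bit set; everything else is unchanged.
    """
    is_nan = (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & FRACTION_MASK) != 0
    return bits | QUIET_BIT if is_nan else bits


samples = [
    0x3FF0_0000_0000_0000,  # 1.0: ordinary number
    0xFFF0_0000_0000_0000,  # -infinity: exponent all ones, fraction zero
    0xFFF0_0000_0000_0001,  # signalling NaN
    0x7FF4_1234_5678_9ABC,  # signalling NaN with the sign bit clear
    0xFFF8_0000_0000_0001,  # already a quiet NaN
]
for value in samples:
    print(f"{value:016X} -> {fld_fst_roundtrip(value):016X}")
```

Only the two signalling NaN samples change, and in each case exactly one bit flips, which is enough to break data that is not meant to be interpreted as floating point numbers at all.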

If you use the FPU to copy rendered 256 color data, the consequence of the NaN treatment is that if a pixel in the 8th column (or the 16th, the 24th, and so on) has color 255 (all 8 bits set), and the pixel to its left has a color value of 240 to 247, that pixel to the left of the color-255 pixel is replaced by a pixel of color 248 to 255. As long as you control the kind of graphics you render (like Quake does), you can take appropriate measures so that this munging of data isn't problematic. You might, for example, decide to not use color 255. In that case, no 64-bit pattern in the rendered picture is the pattern of any kind of NaN, signalling or not. Or you might choose a palette where colors 240 and 248, colors 241 and 249 and so on are similar enough that the user doesn't notice that the color value got modified by copying the rendered image to the screen. As I'm not a Quake technician, I didn't look up how Quake specifically deals with this issue, but both ways seem plausible.
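
The pixel example can be reproduced with a small emulation that walks over a row of 8-bit pixels in 8-byte groups, the way an FLD/FST copy loop would. This is an illustrative sketch with a made-up function name, not the actual copy routine of Quake or 3DMark. Because x86 is little-endian, the 8th pixel of each group is the most significant byte of the 64-bit value and the 7th pixel holds the quiet bit.

```python
def copy_via_fpu(pixels: bytes) -> bytes:
    """Emulate copying 8-bit pixels in 8-byte groups through x87 FLD/FST.

    Assumes the invalid-operation exception is masked, so every NaN
    pattern comes out with the quiet bit (bit 51) set.
    """
    if len(pixels) % 8:
        raise ValueError("length must be a multiple of 8 bytes")
    out = bytearray(pixels)
    for base in range(0, len(out), 8):
        value = int.from_bytes(out[base:base + 8], "little")
        exponent_all_ones = ((value >> 52) & 0x7FF) == 0x7FF
        fraction_nonzero = (value & ((1 << 52) - 1)) != 0
        if exponent_all_ones and fraction_nonzero:
            out[base + 6] |= 0x08  # quiet bit lives in the 7th pixel of the group
    return bytes(out)


row = bytes([1, 2, 3, 4, 5, 6, 243, 255])
print(list(copy_via_fpu(row)))                          # 7th pixel: 243 becomes 251
print(list(copy_via_fpu(bytes([0] * 6 + [240, 255]))))  # -infinity pattern, unchanged
```

Two details fall out of the emulation that the summary above does not spell out: color 127 in the 8th column triggers the same change as color 255, since the sign bit plays no role in the NaN definition, and a 240 next to a 255 survives if the six pixels to its left are all zero, because that pattern encodes infinity and not a NaN.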

Now, back to 3DMark: The issue is that 3DMark2000 uses the FPU memory copying method for all kinds of large-amount-of-data copies, not just for rendered graphics. Specifically, FPU memory copying is possibly used while loading assets from the 3DMark data files. If the asset is a JPEG compressed image, any bit modified by copying it possibly corrupts it, and causes the result to not fully conform to the JPEG specification. The memory copying routine in 3DMark uses the FPU code path only if both the source and the destination address are aligned on a 64-bit boundary, because that's the only case in which FPU loads and stores are able to generate single 64-bit cycles. In all other cases, at least the load or the store cycle is broken down into two partial 64-bit cycles, negating most of the performance benefit. My patch to rlmfc.dll fakes the alignment check and causes the code path for non-aligned buffers to be executed even if the buffers are aligned. This is a good thing(TM) on a 486 processor: The 486 processor doesn't have the 64-bit bus interface that is required to make the floating point load/store stuff effective. Furthermore, the 486 FPU is not fast enough to compete with the 486 integer instructions for accessing memory. Instead, 3DMark uses an unrolled loop of manually copying data through 32-bit integer registers, which should barely beat REP MOVSD on a 486 processor, and is one of the optimal ways to copy data.
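
The path selection described in this post can be summarized as the decision logic below. This is a reconstruction from the prose only, not a disassembly of rlmfc.dll; the function name, the string labels and the `patched` flag are hypothetical and exist just to make the effect of the patch visible.

```python
def choose_copy_path(src_addr: int, dst_addr: int, has_mmx: bool,
                     patched: bool = False) -> str:
    """Return which copy loop would run, according to the description in the post.

    patched=True models the one-byte change in rlmfc.dll, which makes the
    alignment check always fall through to the non-aligned (integer) path.
    """
    if has_mmx:
        return "mmx"      # 64-bit MMX loads and stores
    both_aligned = src_addr % 8 == 0 and dst_addr % 8 == 0
    if both_aligned and not patched:
        return "fpu"      # FLD/FST copy, quiets signalling NaN patterns
    return "integer"      # unrolled 32-bit integer copy, bit-exact


print(choose_copy_path(0x1000, 0x2000, has_mmx=False))                # fpu
print(choose_copy_path(0x1000, 0x2004, has_mmx=False))                # integer
print(choose_copy_path(0x1000, 0x2000, has_mmx=False, patched=True))  # integer
print(choose_copy_path(0x1000, 0x2000, has_mmx=True, patched=True))   # mmx
```

The last call reflects the statement that the patch leaves MMX-capable systems untouched: the MMX path is chosen before the alignment check is ever reached.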

Reply 314 of 371, by gonzo

Posted on 2023-02-16, 08:40

mkarcher wrote on 2023-02-15, 19:24:
To disable the MMX check at the start of 3DMark 2000, you need to patch 3DMark2000.exe . The byte at offset 69626 (hex) is origi […]
Show full quote

Disruptor wrote on 2023-02-15, 07:47:

Just 2 bytes had to be changed. One byte for detection of MMX presence when starting the programme, and the other one in Game 2.

To disable the MMX check at the start of 3DMark 2000, you need to patch 3DMark2000.exe . The byte at offset 69626 (hex) is originally 75 (hex) (in the context at that point, this byte instructs the processor to jump if MMX is present) to EB (hex) (which instructs the processor to jump unconditionally. This one is easy.

On a non-MMX computer, the modified 3DMark2000 program (which of course is no longter the actual 3DMark2000) will seem lock up when loading the data for "game 2" benchmarks. Actually, it just displays a message box complaining about bad JPEG data, but you can't see the message box because 3DMark is running in DirectX fullscreen exclusive mode, so the box on the desktop can not be made visible. Pressing "Enter" often enough is likely going to "help" to confirm these message boxes. It turns out that 3DMark damages JPG compressed textures during loading if the computer is not MMX capable. The details are complicated, and explained below. To work around this problem, you can patch rlmfc.dll. At offset 14107 (hex), you will find a byte containing 75 (hex), which again needs to be changed to EB (hex). While this patch definitely fixes the JPG corruption issue, it might have the side effect to produce slightly low scores on non-MMX Pentium systems. This way of handling the issue is definitely not endorsed by Futuremark / MadOnion / Underwriter Laboratories. As this patch only affects systems that do not support MMX, a version patched like that still produces valid results on officially supported systems.

The technical background (you may very well skip this part of the post if you just want to run 3DMark2000 on a non-MMX system, and are not interested in what exactly went wrong and how the fix helps with it): The Pentium Processor has a 64-bit bus interface, but the integer execution unit of the 80486 or Pentium doesn't support any 64-bit data types. For reading of cachable memory, this is not an issue: Whenever a 32-bit value is requested by the code, the Pentium processor requests a complete cache line (4 * 64 bits) from the mainboard, fully utilizing the 64-bit bus interface. The same is true for writes that get handled by the L1 write-back cache: As long as the data stays in the L1 cache, it is not communicated over the frontside bus, and not being able to natively cause 64-bit cycles doesn't impede performance on the frontside bus. When the cache line gets evicted (the processor decides to cache a different area of memory in that cache line, and needs to write back the modified data into L2 cache / RAM), it writes the whole cache lines as 4*64 bits, again fully utilizing the frontside bus.

Things look different for writes that do not hit the L1 cache, though. This might be due to an access to main memory that just happens to not be in L1 cache, but this might also be due to a write to non-cachable memory. The latter case is extremely important if you copy data generated by a software renderer into video memory, as video memory is treated non-cacheable. Whenever you perform a 32-bit write to video memory, the processor starts a 32-bit write cycle over the frontside bus. The processor has a queue (write buffer) that allows the processor to continue operation, while the write is happening, but the write will be sent as 32-bit write from the processor to the mainboard. If the queue is full (IIRC it's 4 writes that can be queued), the processor does need to wait for a write to finish. Most Pentium chipsets will generate a single PCI write transaction when the processor performs a 32-bit write. On the other hand, if the processor would perform a 64-bit write, the mainboard would generate a burst cycle transferring 2*32 bit with just selecting the target card once. If the current task of the processor is to write a chunk of rendered 2D data into video memory, PCI bandwidth and FSB bandwidth are a serious concern, and performing 64-bit writes on the FSB and two-word burst writes on the PCI bus can tremendeously improve performance.

There is a component in the Pentium processor that is able to generate a 64-bit write cycle natively, though. This component is the FPU. The FPU has a very fast "FST" instruction that stores a floating point number from a processor register into a 64-bit memory location. So the fastest way to write lots of data into video memory (or even mainboard memory if you miss the L1 cache) is using the FST instruction. There also is a fast FLD instruction to load 64-bit data from memory into an FPU register. So if you expect a lot of write misses, using FLD to load data into the FPU registers and then storing data from the FPU registers using FST is the optimal way to copy data on a Pentium, as this is the only way to fully utilize the 64-bit bus interface. This is one of the stunts the Quake I pulled off to get the impressive software rendering performance. And while this is probably the best-known stunt, it's by far not the only one. Quake does not use the FPU "only for copying data" as some less informed sources used to claim, but Quake also calculates 3D geometry utilizing the FPU, and most importantly, Quake parallelizes calculations for perspective correction over CPU and FPU to get perspective correction "nearly for free", whereas the naive implementation of perspective correction would require a division per pixel.

With the Pentium MMX, there is another way to properly utilize the 64-bit bus: The native data width of the MMX instructions is 64 bits. While MMX instructions are typically not used for 64-bit arithmetic, they work with 64 bit at a time, performing 8 operation of 8 bit each of 2 operations of 32 bit each in parallel. This implies that there needs to be a 64-bit memory load instruction and a 64-bit memory store instruction to load or store a complete set of values (like 8 times 8 bits or 2 times 32 bits). If the CPU is MMX capable, 3DMark2000 uses MMX instructions to copy large amounts of data from memory to memory. If the CPU is not MMX capable, 3DMark2000 falls back to using the FPU.

There is a quirk with using the FPU as a data copying device, though: The IEEE754 standard defining floating point operations reserved a certain set of bit patterns for values that need "special treatment" and can not be handled by the processor. The specific meaning of those values is up to the operating system or application program that implements the required special treatment. The FPU can be set up in a way that it causes a floating point exception whenever a calculation is performed that uses one of these "special treatment required" bit patterns as operand. This exception can be handled by the operating system or application program, and then it can perform the required "special treatment". At that time, the handler for the special treatment can decide whether it will produce a regular floating point number as output, or the output is still a special thing that requires another special treatment when that value is used. For 64-bit floating point numbers, as they are used by the Intel FPU, all values starting with the 12 most significant values set and the next bit clear are values that require "special treatment". The IEEE754 standard calls these value "signalling NaNs".

The idea of copying values through the FPU just involves loading the 64-bit values and then storing them again, without performing any calculations, so one might expect that you can copy bit patterns that represent signalling NaNs. It turns out that you can't. The reason is that the Intel FPU doesn't have 64-bit registers; instead it has registers of 80 bits each. Any 64-bit value is extended to 80 bits when it gets loaded using FLD, and when the data is stored using FST, the 64 "most important" bits of the 80-bit register get extracted and stored to memory. This process is completely reversible, so for anything except signalling NaNs, the same 64-bit pattern that was loaded using FLD is stored by FST, even though the intermediate format is 80 bits. In the case of signalling NaNs, though, the FPU notices that the control word tells it not to ask the operating system for help with signalling NaNs ("the exception is masked"). In that case, the FPU is supposed to return a "quiet NaN" from any calculation that involves a NaN, no matter whether the input NaN is quiet or signalling. Quiet NaNs are values that represent "not a number", and calculations that involve NaNs typically just return another NaN, but in contrast to signalling NaNs, they are meant for cases in which there just is no valid number representing the result (like dividing zero by zero), not for cases in which software hooks can extend the apparent capabilities of the FPU. And that's what happens when 3DMark is copying large amounts of data using the FPU. Any 64-bit value that has the top 12 bits set and the 13th bit clear (the "signalling NaN" pattern of Intel FPUs) is converted to a value with the 13th bit set (the corresponding "quiet NaN"). This modification happens at the time when FLD attempts to load the signalling 64-bit NaN into the 80-bit FPU register: the conversion operation registers a NaN on input and outputs a quiet NaN. The FST operation then correctly stores it. This is completely according to the specification of an FPU that works with 80-bit values internally, and is handled this way on all x87 implementations, no matter whether it is a 486, a Pentium or even an 8087.
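The effect of the FLD/FST round trip on a single 64-bit pattern can be modelled with plain integer arithmetic. This is only a sketch of the behaviour described above, assuming the invalid-operation exception is masked; `fld_fst_copy` is a made-up name and the function does not touch a real FPU.

```python
EXP_MASK = 0x7FF0_0000_0000_0000   # the 11 exponent bits
QUIET_BIT = 0x0008_0000_0000_0000  # top mantissa bit
MANT_MASK = 0x000F_FFFF_FFFF_FFFF  # all 52 mantissa bits


def fld_fst_copy(bits: int) -> int:
    """Model one 64-bit load/store through the x87 with exceptions masked.

    Every NaN leaves with the quiet bit set; all other patterns pass
    through unchanged.
    """
    is_nan = (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0
    return bits | QUIET_BIT if is_nan else bits


print(hex(fld_fst_copy(0xFFF0_0000_0000_0001)))  # 0xfff8000000000001 (modified)
print(hex(fld_fst_copy(0x3FF0_0000_0000_0000)))  # 0x3ff0000000000000 (unchanged)
```

Only signalling NaN patterns are altered; a quiet NaN already has the bit set, so for it the operation is a no-op.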

If you use the FPU to copy rendered 256-color data, the consequence of the NaN treatment is that if a pixel in the 8th column (or the 16th, the 24th, and so on) has color 255 (all 8 bits set), and the pixel to its left has a color value of 240 to 247, that left neighbour is replaced by a pixel of color 248 to 255. As long as you control the kind of graphics you render (like Quake does), you can take appropriate measures so that this munging of data isn't problematic. You might, for example, decide not to use color 255. In that case, no 64-bit pattern in the rendered picture is the pattern of any kind of NaN, signalling or not. Or you might choose a palette where colors 240 and 248, colors 241 and 249 and so on are similar enough that the user doesn't notice that the color value got modified by copying the rendered image to the screen. As I'm not a Quake technician, I didn't look up how Quake specifically deals with this issue, but both ways seem plausible.
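Applied to a row of 8-bit pixels, the same model shows the color change directly. This is a sketch under the assumption that pixels are stored left to right in ascending addresses (so on little-endian x86 the 8th pixel is the most significant byte of the 64-bit value); `copy_pixels_via_fpu` is a hypothetical helper, not code from Quake or 3DMark.

```python
EXP_MASK = 0x7FF0_0000_0000_0000   # the 11 exponent bits
QUIET_BIT = 0x0008_0000_0000_0000  # top mantissa bit
MANT_MASK = 0x000F_FFFF_FFFF_FFFF  # all 52 mantissa bits


def copy_pixels_via_fpu(pixels: bytes) -> bytes:
    """Copy 8-bit pixels in 64-bit chunks, modelling the NaN quieting."""
    if len(pixels) % 8 != 0:
        raise ValueError("length must be a multiple of 8 pixels")
    out = bytearray()
    for i in range(0, len(pixels), 8):
        qword = int.from_bytes(pixels[i:i + 8], "little")
        if (qword & EXP_MASK) == EXP_MASK and (qword & MANT_MASK) != 0:
            qword |= QUIET_BIT      # signalling NaN becomes quiet NaN
        out += qword.to_bytes(8, "little")
    return bytes(out)


row = bytes([1, 2, 3, 4, 5, 6, 243, 255])   # 7th pixel 243, 8th pixel 255
print(list(copy_pixels_via_fpu(row)))       # [1, 2, 3, 4, 5, 6, 251, 255]
```

In this model the 7th pixel changes from 243 to 251, while a row without color 255 in the 8th column comes out unchanged.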

Now, back to 3DMark: The issue is that 3DMark2000 uses the FPU memory copying method for all kinds of large-amount-of-data copies, not just for rendered graphics. Specifically, FPU memory copying is possibly used while loading assets from the 3DMark data files. If the asset is a JPEG compressed image, any bit modified by copying it possibly corrupts it, and causes the result to not fully conform to the JPEG specification. The memory copying routine in 3DMark uses the FPU code path only if both the source and the destination address are aligned on a 64-bit boundary, because that's the only case in which FPU loads and stores are able to generate single 64-bit cycles. In all other cases, at least the load or the store cycle is broken down into two partial cycles, negating most of the performance benefit. My patch to rlmfc.dll fakes the alignment check and causes the code path for non-aligned buffers to be executed even if the buffers are aligned. This is a good thing(TM) on a 486 processor: The 486 processor doesn't have the 64-bit bus interface that is required to make the floating point load/store stuff effective. Furthermore, the 486 FPU is not fast enough to compete with the 486 integer instructions for accessing memory. Instead, 3DMark then uses an unrolled loop that copies data through 32-bit integer registers, which should barely beat REP MOVSD on a 486 processor, and is one of the optimal ways to copy data.
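The path selection described above can be sketched as follows. This is not the actual rlmfc.dll logic, just an illustration of the idea; the function names and the path labels "fpu64" and "int32" are invented for the example.

```python
def both_qword_aligned(src_addr: int, dst_addr: int) -> bool:
    """True if source and destination both sit on a 64-bit (8-byte) boundary."""
    return ((src_addr | dst_addr) & 7) == 0


def pick_copy_path(src_addr: int, dst_addr: int, patched: bool = False) -> str:
    """Return which copy loop would run for the given buffer addresses.

    Unpatched: the FPU path is taken only for 8-byte aligned buffers.
    Patched: the alignment check always "fails", so the integer loop runs.
    """
    if not patched and both_qword_aligned(src_addr, dst_addr):
        return "fpu64"
    return "int32"


print(pick_copy_path(0x1000, 0x2008))                # fpu64
print(pick_copy_path(0x1000, 0x2004))                # int32
print(pick_copy_path(0x1000, 0x2008, patched=True))  # int32
```

With the check faked, aligned buffers take the same integer loop that unaligned buffers always took, so no data passes through the FPU on this path.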

Reply 314 of 371, by gonzo

Posted on 2023-02-16, 08:40

Thank you, mkarcher!
Can you upload the patched file for the 3DMARK2000 here, please?

Two questions:
- are the scores of two systems, one of which HAS the patched version (e.g. a 486 system) and the other HAS NOT (e.g. a Pentium system), comparable to each other,
or, if the answer is "no":
- must this patch also be used on a newer system (e.g. a Pentium, which usually does not need it; that is, would it work there in general, too?) to be comparable to a 486/non-Pentium system?

Reply 315 of 371, by gonzo

Posted on 2023-02-16, 08:52

Rank: Member
Posts: 129
Joined: 2021-01-03, 13:57

feipoa wrote on 2023-02-13, 04:15:

I've not had good luck with the 4DPS for speed and stability.

Once you think you've found a stable configuration, try installing Windows 2000 to see if it finishes thru to completion without errors. Letting GLQuake run in loop for 2 hours is another telling, yet simple test.

To be honest, I am a bit scared to do a long-term test with this rare DX5 CPU running at 5.0 V (at the moment, I would use the system as a "study" case only, or for reference).

BTW, do you have problems running any IDE device on the onboard secondary IDE connector?
On my board, this results in an unstable Windows (even during startup), regardless of whether an HDD, a CD/DVD drive, or an Iomega ZIP drive is connected.
Without any IDE device connected, the second IDE controller is visible and installed correctly in the Windows device manager.
In fact, I can use the primary IDE interface only 🙁

Reply 316 of 371, by mkarcher

Posted on 2023-02-16, 18:05

Rank: l33t
Posts: 2596
Joined: 2019-01-19, 16:29
Location: Germany

gonzo wrote on 2023-02-16, 08:40:

Two questions:
- are the scores of two systems, one of which HAS the patched version (e.g. a 486 system) and the other HAS NOT (e.g. a Pentium system), comparable to each other,
or, if the answer is "no":
- must this patch also be used on a newer system (e.g. a Pentium, which usually does not need it; that is, would it work there in general, too?) to be comparable to a 486/non-Pentium system?

This patch does not affect the behaviour of 3DMark2000 on MMX-capable systems at all, so there is definitely no point in discussing whether a "new" system should run the original or the patched version.

The patched version is, as I understand it, even better suited to 486 processors than the unpatched version would be (even if that one worked as intended). In my opinion, this makes the "486-optimized" patched version a better fit for comparing 486 performance to the "Pentium-MMX-optimized" original version on Pentium MMX computers. If you are serious about getting the maximum possible performance from non-MMX Pentium systems, a more specific patch that still uses FPU-memcopy for "unimportant" data, and only falls back to integer-memcopy for sensitive data, might make sense. Creating a 3DMark2000 variant that perfectly adapts to a non-MMX Pentium is more work than I currently intend to spend on that thing, though.

So, my answer is: Just compare scores, it should be fine.

Reply 317 of 371, by mkarcher

Posted on 2023-02-16, 18:10

Rank: l33t
Posts: 2596
Joined: 2019-01-19, 16:29
Location: Germany

gonzo wrote on 2023-02-16, 08:40:

Can you upload the patched file for the 3DMARK2000 here, please?

Attachments

Filename: 3DMark2000-486.zip
File size: 641.06 KiB
Downloads: 46 downloads
File comment: patched 3DMark 2000 to run on 486 computers
File license: Fair use/fair dealing exception

Reply 318 of 371, by noshutdown

Posted on 2023-02-17, 05:37

Rank: Oldbie
Posts: 1179
Joined: 2010-07-23, 17:04
Location: China

mkarcher wrote on 2023-02-15, 19:24:

To disable the MMX check at the start of 3DMark 2000, you need to patch 3DMark2000.exe. The byte at offset 69626 (hex) is originally 75 (hex) (in the context at that point, this byte instructs the processor to jump if MMX is present) and needs to be changed to EB (hex) (which instructs the processor to jump unconditionally). This one is easy.
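For anyone who prefers a script over a hex editor, the one-byte change can be sketched in Python like this. The offset and byte values are the ones quoted above; the function name `patch_mmx_check` is made up, and the demo runs on an in-memory dummy buffer rather than on the real executable.

```python
import io

OFFSET = 0x69626   # file offset quoted in the post
OLD_BYTE = 0x75    # conditional short jump (JNE/JNZ rel8)
NEW_BYTE = 0xEB    # unconditional short jump (JMP rel8)


def patch_mmx_check(f) -> bool:
    """Patch one byte in a seekable binary file object.

    Returns True if the byte was changed, False if it was already patched.
    Raises ValueError if the byte at OFFSET is neither value, which
    suggests a different file version.
    """
    f.seek(OFFSET)
    current = f.read(1)
    if current == bytes([NEW_BYTE]):
        return False
    if current != bytes([OLD_BYTE]):
        raise ValueError(f"unexpected byte {current.hex() or 'EOF'} at offset {OFFSET:#x}")
    f.seek(OFFSET)
    f.write(bytes([NEW_BYTE]))
    return True


# Demo on a dummy buffer that only mimics the byte at the given offset.
fake_exe = io.BytesIO(bytes(OFFSET) + bytes([OLD_BYTE, 0x10]))
print(patch_mmx_check(fake_exe))          # True
print(hex(fake_exe.getvalue()[OFFSET]))   # 0xeb
```

On the real file you would open it with `open("3DMark2000.exe", "r+b")` and pass the file object in, ideally after making a backup copy. The check for the original byte is there so a different build of the executable is not modified blindly.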

Thanks for the info, that's very interesting.

I also wonder if anyone has tried Rendition cards (1000 and 2100) on a 486?

Reply 319 of 371, by Intel486dx33

Posted on 2023-02-17, 10:00

Rank: l33t
Posts: 4831
Joined: 2018-05-17, 01:17
Location: U.S.A.

What graphics card would you recommend for this build?
See my post for specs:
AMD 5x86@160mhz., Media Vision PAS16. ( Win 95 )
