AncapDude wrote on 2025-08-27, 02:07:
jakethompson1 wrote on 2025-08-27, 01:20:
This is discussed somewhere by mkarcher, I believe it's because the write access pattern by speedsys is such that it's composed entirely of cache misses, and because 486 chips and archetypal chipset cache are "write-around" (on a write miss, cache is just skipped and the write goes straight to RAM) rather than "write-allocate" (in which a cache is filled first on a write-miss just like it would be on a read-miss, then the in-cache data are modified in hopes of achieving write-back behavior), all you see is the RAM write performance.
Thanks for the explanation. So I don't need to be worried about the flatline? Otherwise I would give configuring the cache to Write-Through a try. I also tried lowering the RAM size but that didn't change anything either.
As I've been back into playing with 486es for about five years now, I can summarize this and hopefully save you a lot of hassle 😁
First, remember there are two caches: an 8K (or 16K in later 486es) cache internal to the 486 CPU, and a 0K-1024K external cache managed by the chipset, usually 256K.
Write-through means that cache and RAM are kept synchronized; cache can never become "newer" than RAM or in other words "dirty". The only caveat is if a device other than the CPU writes directly to RAM (like the floppy controller), the cache must also be updated or simply invalidated; either way the next time the CPU reads that location, it will see the new data.
Write-back tries to keep writes affecting only the cache and postpones the update to RAM as long as possible. The hope is that for frequently written locations like a loop counter, many writes to slower RAM can be eliminated entirely. But this means the RAM contents are potentially "older" than cache. So if the floppy controller tries to access RAM (not just writes now but reads) and the corresponding cache line is "dirty" the CPU or cache has to step in and block that access and update the affected area in RAM first, before the floppy controller is allowed to read it.
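If it helps to see the two policies side by side, here is a rough Python sketch of the bookkeeping (my own toy model, not cycle-accurate 486/chipset behavior; the class and counter names are just for illustration):

```python
# Toy model of the two write policies. "ram_writes" counts how often slow
# main RAM actually gets written; each cached line tracks only a dirty flag.

class ToyCache:
    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}          # line number -> dirty flag (cached iff present)
        self.ram_writes = 0

    def cpu_write_hit(self, line):
        if self.write_back:
            self.lines[line] = True   # only the cache is updated; RAM is now stale
        else:
            self.ram_writes += 1      # write-through: RAM is kept in sync every time

    def evict(self, line):
        if self.lines.pop(line):      # dirty? the postponed write finally hits RAM
            self.ram_writes += 1

    def dma_write(self, line):
        # Floppy controller wrote RAM behind the cache's back: invalidate the
        # line so the CPU's next read sees the new data (enough for either policy).
        self.lines.pop(line, None)

    def dma_read(self, line):
        # With write-back, a dirty line must be flushed to RAM first so the
        # device reads current data; with write-through RAM is already correct.
        if self.lines.get(line):
            self.ram_writes += 1
            self.lines[line] = False

# A location like a loop counter written 1000 times:
for policy in ("write-through", "write-back"):
    c = ToyCache(write_back=(policy == "write-back"))
    c.lines[42] = False               # pretend line 42 was read into cache earlier
    for _ in range(1000):
        c.cpu_write_hit(42)
    c.evict(42)
    print(policy, "-> RAM writes:", c.ram_writes)   # 1000 vs. 1
```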
To cut down on bookkeeping overhead and take advantage of burst reads (this is what 2-1-1-1, 3-1-1-1, and 3-2-2-2 are about), 486 caches only operate on "lines" of 16 bytes (the 486 internal cache line size).
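For a feel for those burst numbers: as I understand the notation, the x-y-y-y figures are bus clocks for the first and the three following 32-bit transfers of a burst, and a 16-byte line fill is four transfers, so (simple arithmetic only, ignoring any other overhead):

```python
# Clocks to fill one 16-byte line over a 32-bit bus in a 4-transfer burst.
for name, clocks in (("2-1-1-1", (2, 1, 1, 1)),
                     ("3-1-1-1", (3, 1, 1, 1)),
                     ("3-2-2-2", (3, 2, 2, 2))):
    print(name, "->", sum(clocks), "clocks per line fill")
# 2-1-1-1 -> 5, 3-1-1-1 -> 6, 3-2-2-2 -> 9
```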
In a write-back cache, each line gets a bit of metadata keeping track of whether it is still "clean" (freshly copied from RAM) or "dirty" (the CPU has written to it, even if only 1 bit of the 128-bit cache line changed). The other piece of metadata is the "tag", which for a direct-mapped external cache just stores part of the RAM address of the data currently held in that cache line; the remaining address bits are implied by the line's position in the cache.
It's the cache controller's job to set the dirty bit to a '1' if it isn't already, every time the cache line is written.
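To put the tag/dirty bookkeeping in concrete terms, here is how a direct-mapped 256K cache with 16-byte lines and an 8-bit tag would slice up an address (again my own illustration of the scheme, not any particular chipset's exact wiring):

```python
LINE_SIZE  = 16                              # bytes per cache line
CACHE_SIZE = 256 * 1024                      # 256K external cache
NUM_LINES  = CACHE_SIZE // LINE_SIZE         # 16384 lines
TAG_BITS   = 8

dirty = [False] * NUM_LINES                  # the 1-bit-per-line dirty RAM
tags  = [0] * NUM_LINES                      # the 8-bit-per-line tag RAM

def split(addr):
    offset = addr % LINE_SIZE                               # byte within the line
    index  = (addr // LINE_SIZE) % NUM_LINES                # implied by position in cache
    tag    = (addr // CACHE_SIZE) & ((1 << TAG_BITS) - 1)   # the part that must be stored
    return tag, index, offset

def cpu_write(addr):
    tag, index, _ = split(addr)
    if tags[index] == tag:                   # a hit: the controller must mark the line dirty
        dirty[index] = True

# 8 tag bits above the 18 index+offset bits => 2**26 bytes = 64 MB cacheable.
print(CACHE_SIZE * 2**TAG_BITS // 2**20, "MB cacheable")
```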
Here is where performance tuning gets complicated. Earlier 486 chipsets require 32 bits of data RAM (or 64 bits for double-bank), 8 bits of tag RAM, and 1 bit of dirty RAM for a properly working write-back cache.
For some reason, many motherboard makers omit the dirty RAM to save the cost of one chip, and instead simply wire the would-be dirty RAM's data line to a resistor so it always reads '1'. I have no idea why this cost cutting made any sense on boards that nevertheless go to the trouble of designing in and populating a Weitek 4167 socket, as is common on affected boards. Such a configuration is called "always dirty" and operates as that sounds: every cache read miss forces the prior 16 bytes of data out of the cache and writes them back to RAM, even if their contents are still completely identical to RAM--because the cache has no way of tracking whether that is still true. Overall memory performance is decreased accordingly. In write-through, of course, no dirty bit is needed or beneficial.
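To put a rough number on that, a back-of-the-envelope sketch (the 10% write ratio is made up purely for illustration):

```python
# Count how many evicted lines have to be burst-written back to RAM.
def writebacks(evictions, always_dirty):
    # evictions: list of booleans, True if the CPU ever wrote that line
    return sum(1 for was_written in evictions if always_dirty or was_written)

# 1000 evictions where only 10% of the lines were ever written:
pattern = [i % 10 == 0 for i in range(1000)]
print("real dirty bit:", writebacks(pattern, always_dirty=False))   # 100
print("always dirty  :", writebacks(pattern, always_dirty=True))    # 1000
```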
In the worst case, such as the OPTi 495SLC chipset, the chipset can only operate its external cache in write-back "always dirty" mode (or leave it disabled), and you can do nothing about it other than save up for a better board.
Other chipsets, such as the UMC 481, can operate the external cache in write-back mode with a dedicated Dirty RAM, or disabled. While it is common for motherboard makers to omit the Dirty RAM, thereby implementing "Always Dirty", fixing that could be as simple as populating an empty socket the board maker left for this RAM, or hacking the board like so: UM481/UM491 "Always Dirty" modification HOWTO
Other chipsets, such as the SiS 461, can operate their external cache in write-back mode with a dedicated Dirty RAM, in write-through mode, or disabled. If the motherboard maker omitted the Dirty RAM, this gives you three choices: use write-back in the "Always Dirty" configuration, use write-through, or modify the board as above to achieve write-back with a real dirty bit.
The final generation of chipsets--basically the last-generation VLB ones with all the bells and whistles for P24D/DX4/Am5x86 CPUs, and anything PCI--is the most flexible. They can do write-through, write-back with an external dirty bit (which 99.9% of the time is going to mean always-dirty), or what is called write-back in 7+1 mode. In 7+1 mode, one bit of tag RAM is stolen to instead track whether the line is dirty. Stealing this bit halves the cacheable area in main memory. So if you have 256K cache, 64MB is your cacheable limit in always-dirty, working-external-dirty-bit, or write-through mode, and 32MB is your cacheable limit in 7+1 write-back mode. Double those amounts for 512K and double them again for 1024K. Arbitrary chipset limitations could reduce these limits further. Those limits are obviously far more relevant now than when these decisions were made.
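The arithmetic behind those limits, if you want to check other cache sizes (a direct-mapped cache with N tag bits can cover cache size × 2^N of RAM):

```python
def cacheable_mb(cache_kb, tag_bits):
    return cache_kb * 1024 * 2**tag_bits // 2**20

for cache_kb in (256, 512, 1024):
    print(f"{cache_kb}K cache: {cacheable_mb(cache_kb, 8)} MB normally, "
          f"{cacheable_mb(cache_kb, 7)} MB in 7+1 mode")
# 256K -> 64/32 MB, 512K -> 128/64 MB, 1024K -> 256/128 MB
```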
Up through the original DX2, the internal cache is always write-through, and the external cache is write-back or nothing in common chipsets, or at least write-back by default with the possibility of switching to write-through.
This seems a little backward to me; especially as the CPU's clock multiplier becomes greater than 1, the opposite configuration makes more sense. "Write buffers" ease the pain of write-through, by allowing the CPU to continue to execute code while prior writes to RAM or external cache finish, and would be most effective at 1X multiplier. Later 486 CPUs are capable of running the internal cache in write-back, while later 486 chipsets are more likely to be able to run the external cache in write-through.
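The write-buffer idea in a nutshell, as a rough sketch (a depth of 4 is just a typical small FIFO for illustration, not a claim about any specific part):

```python
from collections import deque

class WriteBuffer:
    """Posted writes: the CPU only stalls when the FIFO is already full."""
    def __init__(self, depth=4):
        self.fifo = deque()
        self.depth = depth
        self.cpu_stalls = 0

    def cpu_write(self, addr, data):
        if len(self.fifo) >= self.depth:
            self.drain_one()              # CPU has to wait for a slot to free up
            self.cpu_stalls += 1
        self.fifo.append((addr, data))    # posted; CPU keeps executing

    def drain_one(self):
        if self.fifo:
            self.fifo.popleft()           # memory system finishes the oldest write
```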
Anyway, I recently got an SiS 496 board to try for the first time so I have benchmarks at hand. In both cases I have the Am5x86-133 running its internal cache in write-back.
doom -timedemo demo3: 2134 gametics in 1596 realtics in 256K write-through; 2134 gametics in 1592 realtics in 256K write-back; bonus: 2134 gametics in 1548 realtics in 512K write-back.
excerpted speedsys overall memory scores: L2 47.37 MB/s in write-through and 49.19 MB/s in write-back; Memory 38.56 MB/s in write-through and 37.74 MB/s in write-back.
It makes sense. Write-back makes L2 read misses more expensive, and makes L2 write hits cheaper, as compared to write-through.
And it makes so little difference in doom, I believe, because the Am5x86's write-back internal cache is providing almost all the benefit of a write-back cache, such that there isn't much harm in making the external cache write-through.
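For perspective, converting those realtics to frames per second (Doom's timedemo tics run at 35 Hz, so fps = gametics * 35 / realtics):

```python
for label, realtics in (("256K write-through", 1596),
                        ("256K write-back",    1592),
                        ("512K write-back",    1548)):
    print(label, "->", round(2134 * 35 / realtics, 1), "fps")
# about 46.8, 46.9 and 48.3 fps -- the write policy barely moves the needle
```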