VOGONS


First post, by Disruptor

I've done a small benchmark comparison on my ASUS PCI/I-486SP3G rev. 1.6 with an Intel Saturn II chipset.
Since I could not get CPUs with L1 write back cache to work, I use an AMD 486 DX4 NV8T (write through).
The board supports write back and write through L2.
It has a space for another IC socket near the L2 sockets.
mkarcher has modded the board to add a dirty tag RAM socket and modified some traces on the mainboard. We use a CY7C187-15PC (64Kx1) as dirty tag RAM.

Benchmark tools are:
SpeedSys 4.78
ctcm 1.7a from heise

write back with dirty tag:

W_DIRTY.png

Write Strategy L1 : Write Thru, no Write Allocation, linear Fill,Unknown-LRU
Write Strategy L2 : Write Back, No Write Allocation, L2 Flush (wbinvd)
Dirty Tag L2 : ok
best rate for 8K MOVSD Cache/Page Hit : 125.8 us => 65.1 MByte/s
medium rate for 8K MOVSD (Miss + Hit) : 230.7 us => 35.5 MByte/s
medium rate for 8K MOVSD (L2 clean) : 309.5 us => 26.5 MByte/s
medium rate for 8K MOVSD (L2 dirty) : 466.9 us => 17.5 MByte/s
worst rate for 8K MOVSD (misses) : 716.9 us => 11.4 MByte/s
averaged at 256 KB L2-Cache /DOS (640K) : 266.8 us => 30.7 MByte/s
averaged at 256 KB L2-Cache /Win (4M ) : 359.0 us => 22.8 MByte/s

write back without dirty tag:

WO_DIRTY.png

Write Strategy L1 : Write Thru, no Write Allocation, linear Fill,Unknown-LRU
Write Strategy L2 : Write Back, No Write Allocation, L2 Flush (wbinvd)
Dirty Tag L2 : not detected
best rate for 8K MOVSD Cache/Page Hit : 125.8 us => 65.1 MByte/s
medium rate for 8K MOVSD (Miss + Hit) : 230.3 us => 35.6 MByte/s
medium rate for 8K MOVSD (L2 clean) : 466.7 us => 17.6 MByte/s
medium rate for 8K MOVSD (L2 dirty) : 466.9 us => 17.5 MByte/s
worst rate for 8K MOVSD (misses) : 717.7 us => 11.4 MByte/s
averaged at 256 KB L2-Cache /DOS (640K) : 288.9 us => 28.4 MByte/s
averaged at 256 KB L2-Cache /Win (4M ) : 454.1 us => 18.0 MByte/s

write through:

WR_THRU.png

Write Strategy L2 : Write Thru, No Write Allocation, L2 Flush (wbinvd)
best rate for 8K MOVSD Cache/Page Hit : 189.1 us => 43.3 MByte/s
medium rate for 8K MOVSD (Miss + Hit) : 231.7 us => 35.4 MByte/s
medium rate for 8K MOVSD (L2 clean) : 498.1 us => 16.4 MByte/s
medium rate for 8K MOVSD (L2 dirty) : 495.8 us => 16.5 MByte/s
worst rate for 8K MOVSD (misses) : 496.1 us => 16.5 MByte/s
averaged at 256 KB L2-Cache /DOS (640K) : 284.8 us => 28.8 MByte/s
averaged at 256 KB L2-Cache /Win (4M ) : 382.1 us => 21.4 MByte/s
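
A quick sanity check on how to read the ctcm lines: the MByte/s column appears to be simply the 8 KB block size divided by the measured time, with 1 MByte taken as 10^6 bytes. The sketch below is my own reconstruction from the numbers above, not something taken from the ctcm documentation; the function name is made up for illustration.

```python
def mbyte_per_s(block_bytes, microseconds):
    """Throughput in MByte/s, assuming 1 MByte = 10**6 bytes.

    Bytes per microsecond is numerically the same as 10**6 bytes per second.
    """
    return block_bytes / microseconds


BLOCK = 8 * 1024  # ctcm moves 8K blocks with MOVSD

# Times taken from the "write back with dirty tag" run above.
for label, us in [("cache/page hit", 125.8), ("L2 clean", 309.5),
                  ("L2 dirty", 466.9), ("misses", 716.9)]:
    print(f"{label:15} {us:6.1f} us => {mbyte_per_s(BLOCK, us):.1f} MByte/s")
```

The same conversion reproduces the figures of the other two runs as well, so comparing the microsecond column or the MByte/s column leads to the same conclusions.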

Last edited by Disruptor on 2022-01-16, 09:33. Edited 2 times in total.

Reply 1 of 8, by mkarcher

I've taken similar measurements on a Gigabyte GA486IM. That board uses the same UMC chipset (UM8881/UM8886) as the famous Biostar 8433UUD. To highlight the differences in memory throughput between the boards, I also used an AMD 486DX4 with L1 cache set to write-through.

L2 write-back with dirty tag (the UM8881 can use one of the 8 tag RAM bits as dirty tag, using "Alt in TAG RAM: 7+1" in the advanced chipset setup)

     Transfer in 4 GByte Real Mode, no paging, via CPU Integer Unit
best rate for 8K MOVSD Cache/Page Hit : 126.4 µs => 64.8 MByte/s
medium rate for 8K MOVSD (Miss + Hit) : 229.7 µs => 35.7 MByte/s
medium rate for 8K MOVSD (L2 clean) : 321.8 µs => 25.5 MByte/s
medium rate for 8K MOVSD (L2 dirty) : 490.7 µs => 16.7 MByte/s
worst rate for 8K MOVSD (misses) : 538.5 µs => 15.2 MByte/s

averaged at 256 KB L2-Cache /DOS (640K) : 264.3 µs => 31.0 MByte/s
averaged at 256 KB L2-Cache /Win (4M ) : 318.5 µs => 25.7 MByte/s
WD_8881.png

L2 write-back without dirty tag (i.e. "always dirty", BIOS setting: "8+0")

     Transfer in 4 GByte Real Mode, no paging, via CPU Integer Unit
best rate for 8K MOVSD Cache/Page Hit : 125.7 µs => 65.2 MByte/s
medium rate for 8K MOVSD (Miss + Hit) : 229.9 µs => 35.6 MByte/s
medium rate for 8K MOVSD (L2 clean) : 491.0 µs => 16.7 MByte/s
medium rate for 8K MOVSD (L2 dirty) : 490.8 µs => 16.7 MByte/s
worst rate for 8K MOVSD (misses) : 538.3 µs => 15.2 MByte/s

averaged at 256 KB L2-Cache /DOS (640K) : 285.0 µs => 28.7 MByte/s
averaged at 256 KB L2-Cache /Win (4M ) : 400.8 µs => 20.4 MByte/s
WOD_8881.png

L2 write-through

     Transfer in 4 GByte Real Mode, no paging, via CPU Integer Unit
best rate for 8K MOVSD Cache/Page Hit : 126.7 µs => 64.7 MByte/s
medium rate for 8K MOVSD (Miss + Hit) : 260.6 µs => 31.4 MByte/s
medium rate for 8K MOVSD (L2 clean) : 368.9 µs => 22.2 MByte/s
medium rate for 8K MOVSD (L2 dirty) : 369.1 µs => 22.2 MByte/s
worst rate for 8K MOVSD (misses) : 368.9 µs => 22.2 MByte/s

averaged at 256 KB L2-Cache /DOS (640K) : 266.6 µs => 30.7 MByte/s
averaged at 256 KB L2-Cache /Win (4M ) : 314.6 µs => 26.0 MByte/s
WT_8881.png

Reply 2 of 8, by mkarcher

These posts should help you diagnose poor memory performance on 486 boards. For general use, the preferable configuration is one in which all memory is cacheable in the L2 cache and the L2 cache operates in write-back mode with a "dirty tag bit". On modern 486 chipsets (like the UMC 8881 or the SiS 496), the dirty tag bit may reside in the tag RAM, whereas older chipsets (like the Intel Saturn or the SiS 411 EISA chipset) need a dedicated chip for the dirty bit. For general-purpose use, the three cache modes are, in order of preference:

  1. Write-back using a dirty bit
  2. Write-through
  3. Write-back assuming always dirty

If you use a chipset that supports borrowing a tag bit as dirty bit, you should be aware that borrowing a tag bit means halving the cacheable area. Most 486 boards use an 8 bit wide tag and direct-mapped L2 cache. In these circumstances, the cacheable area is 256 times the cache size, i.e. 64MB for 256KB L2 cache, 128MB for 512KB cache and 256MB for 1MB L2 cache. If you reassign one of the 8 tag bits to be the dirty bit, 256KB of cache (the most common configuration) is only good for 32MB of RAM. Getting all RAM cached in the L2 cache is generally preferred to enabling write-back cache. If you have the standard configuration of 256KB cache and 8 tag bits (i.e. 9 chips with 28 pins each), use the L2 cache in write-through mode if you install 64MB RAM, unless you know exactly what you are doing, like creating a RAM disk in uncached RAM.
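
The arithmetic above can be written down as a small sketch. It assumes a direct-mapped L2 cache, as described; the function and parameter names are hypothetical and only serve the illustration.

```python
KB = 1024
MB = 1024 * KB


def cacheable_area_bytes(cache_size_bytes, tag_bits=8, dirty_bit_borrowed=False):
    """Cacheable area of a direct-mapped L2: cache size times 2**(address tag bits).

    Borrowing one tag bit as dirty bit ("7+1") leaves one address bit less
    and therefore halves the cacheable area.
    """
    address_tag_bits = tag_bits - 1 if dirty_bit_borrowed else tag_bits
    return cache_size_bytes * 2 ** address_tag_bits


for size_kb in (256, 512, 1024):
    full = cacheable_area_bytes(size_kb * KB) // MB
    borrowed = cacheable_area_bytes(size_kb * KB, dirty_bit_borrowed=True) // MB
    print(f"{size_kb:>4} KB L2: {full} MB cacheable with 8+0, {borrowed} MB with 7+1")
```

For 256KB of cache this gives 64MB with all 8 bits used as tag and 32MB in 7+1 mode, which is why 64MB of RAM and a borrowed dirty bit do not go together on such a board.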

If you use ctcm to diagnose the cache configuration, don't blindly trust the cache strategy detected by ctcm. Instead, detect the strategy yourself. Take a look at the first 5 throughput lines. If all of them are different, your board operates in L2 write-back mode with proper dirty support. If the 3rd to 5th are all equal, your L2 cache is running in write-through mode. If the 3rd and 4th are identical, but the 5th is lower, you are operating in write-back mode without dirty support. For better performance, you should find a way to either switch to write-through mode or get dirty support.
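
The rule of thumb above can be sketched as a small classifier. This is only an illustration: the 3% tolerance for "equal" is my own assumption to absorb measurement noise, and the function name is made up. It looks only at lines 3 to 5 (L2 clean, L2 dirty, misses), which is a simplification of "all five different".

```python
def classify_l2_mode(rates_mb_s, tolerance=0.03):
    """Guess the L2 write strategy from the first five ctcm throughput lines.

    rates_mb_s: MByte/s values in ctcm order:
    cache/page hit, miss + hit, L2 clean, L2 dirty, misses.
    tolerance: relative difference below which two rates count as equal
    (an assumed value, not something ctcm defines).
    """
    if len(rates_mb_s) != 5:
        raise ValueError("expected the first five ctcm throughput values")
    _, _, clean, dirty, misses = rates_mb_s

    def same(a, b):
        return abs(a - b) <= tolerance * max(a, b)

    if same(clean, dirty) and same(dirty, misses):
        return "write-through"
    if same(clean, dirty):
        return "write-back without dirty tag"
    return "write-back with dirty tag"


# Values from the Saturn II and UM8881 runs in this thread.
print(classify_l2_mode([65.1, 35.5, 26.5, 17.5, 11.4]))
print(classify_l2_mode([65.1, 35.6, 17.6, 17.5, 11.4]))
print(classify_l2_mode([64.7, 31.4, 22.2, 22.2, 22.2]))
```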

The technical background: The idea of write-back is that if the CPU writes to some memory cell, and that cell is cached in the L2 cache, the write doesn't go through to memory, but only the L2 cache is updated. This is faster than hitting memory every time. If the data in the L2 cache is updated and the copy in RAM is stale, the part of the L2 cache that has been updated is called "altered" or "dirty" (these terms are synonyms). If some other data is going to be cached where the dirty data is, the dirty data needs to be committed to memory first. I don't know of any 486 chipset that writes back dirty cache lines "in the background" when the memory is idle; instead, write-back from cache to memory happens at the last possible time: directly before the data in the cache gets replaced. The consequence is: when the CPU reads a memory cell that is not in the L2 cache, and the part of the L2 cache where that data is going to be stored is dirty, the CPU is stalled until the dirty part of the L2 cache (16 bytes) has been written to memory and the requested data has subsequently been copied into the L2 cache and forwarded to the CPU. A memory access that misses the L2 cache thus is significantly slower in a write-back system where the L2 cache is dirty than in a write-through system (in which the L2 is always clean). You can probably already estimate the performance disaster that will arise if a write-back L2 system doesn't store whether a cache part is dirty or not: It always has to assume the L2 cache is dirty, and also issues write-back cycles when the L2 cache is actually clean. And as I just discussed, L2 write-back cycles hurt because they stall the CPU as it waits for the data it tries to read. A properly operating L2 write-back system is supposed to make up for the performance loss caused by the write-back cycles by being faster on write hits into L2 cache.
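
To make the three cases concrete, here is a toy model of a direct-mapped L2 cache with 16-byte lines and no write allocation. It only counts line fills, write-back cycles and writes that reach memory; it knows nothing about timing. The class, the mode names and the one-access-per-line scan are my own simplifications, not a description of how any particular chipset is built.

```python
LINE_SIZE = 16  # bytes per L2 line, as stated above


class DirectMappedL2:
    """Toy direct-mapped L2 without write allocation.

    mode is one of "wt", "wb_dirty_bit" or "wb_always_dirty".
    """

    def __init__(self, size_bytes, mode):
        self.mode = mode
        self.num_lines = size_bytes // LINE_SIZE
        self.tags = [None] * self.num_lines   # None = nothing cached yet
        self.dirty = [False] * self.num_lines
        self.fills = 0           # lines fetched from memory into L2
        self.writebacks = 0      # lines written from L2 back to memory
        self.memory_writes = 0   # CPU writes that reach memory directly

    def _locate(self, address):
        line = address // LINE_SIZE
        return line % self.num_lines, line // self.num_lines

    def read(self, address):
        index, tag = self._locate(address)
        if self.tags[index] == tag:
            return                            # read hit
        occupied = self.tags[index] is not None
        if self.mode == "wb_always_dirty" and occupied:
            self.writebacks += 1              # cannot tell, must write back
        elif self.mode == "wb_dirty_bit" and self.dirty[index]:
            self.writebacks += 1              # only dirty lines are written back
        self.tags[index] = tag
        self.dirty[index] = False
        self.fills += 1

    def write(self, address):
        index, tag = self._locate(address)
        if self.mode != "wt" and self.tags[index] == tag:
            self.dirty[index] = True          # write hit stays in L2
        else:
            self.memory_writes += 1           # write-through, or write miss


def scan(mode, modify, cache_bytes=256 * 1024, area_bytes=512 * 1024, passes=2):
    """Walk an area larger than the cache, one access per line."""
    cache = DirectMappedL2(cache_bytes, mode)
    for _ in range(passes):
        for address in range(0, area_bytes, LINE_SIZE):
            cache.read(address)
            if modify:
                cache.write(address)          # write to the line just read
    return cache


for mode in ("wt", "wb_dirty_bit", "wb_always_dirty"):
    for modify in (False, True):
        c = scan(mode, modify)
        print(f"{mode:16} modify={modify!s:5} fills={c.fills} "
              f"writebacks={c.writebacks} memory_writes={c.memory_writes}")
```

In the read-only scan, only the "always dirty" mode produces write-back cycles, and it produces as many as the read-and-modify scan does. With a dirty bit, write-back cycles show up only when lines were actually modified.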

Knowing the background, we can take a look at the speedsys diagrams and understand why these diagrams might be misleading if you don't understand exactly what situation is measured. Most prominently, on all 486 systems, the "write" performance in speedsys is a straight horizontal line, even if write-back cache is enabled. It seems as if the write-back cache isn't helping writing at all. This is an artifact caused by the way speedsys performs the measurement: Speedsys allocates a 4MB area, and aligns all tests to the beginning. That means the 4MB read test reads the complete area from start to end. At the end of the read test, the last 256KB of the test area are cached in the L2 cache (assuming 256KB cache). The write test then writes to the beginning of the 4MB test area, which is not in the cache, so all the writes are misses. That explains why you don't see any effect of the cache in the speedsys write benchmark. This is a rare case in real-world applications, though. Real-world applications often write into memory locations that were read just before (like incrementing a variable), which would cause a write hit. Speedsys is thus unable to detect the performance improvement a write-back L2 cache can bring to write cycles.
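
The effect can be reproduced with a few lines of bookkeeping. The sketch assumes a 256KB direct-mapped cache with 16-byte lines and a strictly sequential 4MB read, as described above; the function names are hypothetical, and the actual access pattern of speedsys is only known to me from its observable behaviour.

```python
LINE = 16
CACHE_BYTES = 256 * 1024
AREA_BYTES = 4 * 1024 * 1024
NUM_SLOTS = CACHE_BYTES // LINE


def slots_after_sequential_read(area_bytes):
    """Return which memory line each cache slot holds after reading the area once."""
    slots = [None] * NUM_SLOTS
    for line in range(area_bytes // LINE):
        slots[line % NUM_SLOTS] = line        # direct-mapped: later lines replace earlier ones
    return slots


def count_write_hits(slots, start, length):
    """Count how many lines in [start, start + length) are present in the cache."""
    hits = 0
    for line in range(start // LINE, (start + length) // LINE):
        if slots[line % NUM_SLOTS] == line:
            hits += 1
    return hits


slots = slots_after_sequential_read(AREA_BYTES)
print("write hits at start of area:", count_write_hits(slots, 0, CACHE_BYTES))
print("write hits at end of area:  ",
      count_write_hits(slots, AREA_BYTES - CACHE_BYTES, CACHE_BYTES))
```

Writes to the first 256KB find nothing in the cache, while writes to the last 256KB would all be hits. A write test starting at the end of the area would therefore paint a very different picture.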

The speedsys "moving" test on the other hand seems to read the memory and copy it back to the same address. This means that all the write cycles are hits now, and as long as the test area is fully backed by the L2 cache, no write cycles to memory happen at all in the write-back scenario. This is the best case for a write-back cache, and as no write-back cycles happen, the speedsys diagram looks the same with and without dirty tag support as long as the L2 cache size is not exceeded. On the other hand, as soon as the L2 cache size is exceeded, the presence or absence of the dirty bit (per cache line) is clearly visible: If every line is assumed to be dirty, all the stuff read from main memory into the L2 cache will need to be written back to memory again, so the amount of data transferred between memory and L2 cache is identical in the "reading" and the "moving" bench. If the dirty tag bit is present, though, the chipset will omit the write-back cycles in the "reading" bench and only perform them in the "moving" bench. This explains why, in the slow part (where the L2 size is exceeded), the read performance drops to the move performance if the dirty bit is missing.

There are some further interesting insights you can take from the diagrams: The moving bench (for block size > L2 size) performs better in L2 WT mode than in L2 WB mode. This is because in L2 WT mode, the writes to main memory happen as soon as possible, so the memory bus is never idle, whereas in L2 WB mode, write-back happens when the CPU starts accessing a new cache line (stalling the CPU), then the new line is fetched into L2 cache and L1 cache. As soon as the cache line is completely filled, the CPU starts writing the new contents for the cache line to L2 cache (which are not yet written to memory, because that's the point of L2WB). The memory bus is thus idle during the time the CPU needs to write the updated data back to L2 cache. Only when the CPU requires the next line is a new write-back cycle from L2 cache to memory initiated. As the bottleneck of the "moving" bench for big blocks is the memory bus, every cycle the memory bus is idle decreases the score in this benchmark.

Finally, if your board is slow enough on memory writes that the memory writes are a bottleneck for the "moving" bench (which is the case for the Saturn II board, but not for the UM8881 board), you can see a significant improvement in the "moving" score for small blocks if L2WB is enabled (with or without dirty tag). In the "moving" bench for small blocks, all writes are hits, and all memory writes are eliminated.

Reply 3 of 8, by Disruptor

Comparing the two boards, the Saturn II (Intel 82420ZX) based ASUS PCI/I-486SP3G and the UMC8881 based Gigabyte GA486IM, I can say that the UMC8881's write performance is near the optimum for a 486, whereas the Saturn II has one wait state.
However, the Saturn II has an additional write buffer that is not covered by these benchmarks.

mkarcher has modified my Saturn II board again, so that I can run my write back processors in write through mode.
The Saturn II chipset is from the early 486 era and should have L1 write back capability, but it is not implemented on my board.
The UMC 8881 is from the late 486 era and supports all features of Socket 3 processors.

Reply 4 of 8, by mpe

Just as mkarcher says. I find it tricky to compare cache write modes using speedsys or similar microbenchmarks.

WT / WB with or without tag is all about trade-offs. With WB, the CPU is blocked less by memory writes. The cost we have to pay for this (apart from implementation complexity) is extra time for certain reads (whenever a read triggers a write-back cycle). Chances are that for typical software this is well worth it, as the CPU can execute instructions during the time it would otherwise have to wait for a write to complete. So the time saved during writes can easily make up for the "wasted" time in write-back cycles.

However, speedsys and the like don't actually do any computation; they just move bytes in RAM. Thus they can never really benefit from the write-back architecture. On the other hand, they are almost always affected by the downsides of write-back, especially when no dirty tag is used and write-back cycles happen even more often.

I'd run something real-world like Quake, which does seem to benefit a lot from great write performance and is reasonably complex for a 486 CPU. In Quake benefits of WB vs WT should be visible (despite nominally slower speedsys scores).

Blog|NexGen 586|S4

Reply 5 of 8, by pshipkov

That is the theory and you are correct.
However some boards deliver better perf in WT mode.
At the same time other ones based on the same chipset do better in WB.
It does not feel like flawed wb support since using dirty bit or not makes a difference which indicates that wb+db is handled.

retro bits and bytes

Reply 6 of 8, by mkarcher

mpe wrote on 2022-01-16, 15:05:

WT / WB with or without tag is all about trade-offs. With WB, the CPU is blocked less by memory writes. The cost we have to pay for this (apart from implementation complexity) is extra time for certain reads (whenever a read triggers a write-back cycle). Chances are that for typical software this is well worth it, as the CPU can execute instructions during the time it would otherwise have to wait for a write to complete. So the time saved during writes can easily make up for the "wasted" time in write-back cycles.

This is true. Yet you don't necessarily need write-back cache to avoid delays caused by occasional write cycles: The 486-class CPUs use write buffering. The write buffer of the 486 processor is a queue with four entries that can take writes "immediately" and pushes the writes out to the mainboard as soon as possible. The 486 processor only needs to wait for writes to complete if the write buffer is full (i.e. four write cycles are pending), or if a read that misses the L1 cache happens (the processor often flushes the write buffer before allowing the read to ensure consistency, as the 486 does not contain logic to detect whether a read cycle accesses an address that is pending in the write buffer).
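
A toy model shows why occasional writes cost nothing while bursts and read misses do. Everything numeric here is an assumption for illustration: every operation takes one CPU cycle, every buffered write needs 3 cycles on the bus, and a read that misses L1 always waits for the buffer to drain (the real processor only does so "often"). The function name and trace format are made up.

```python
from collections import deque


def stall_cycles(trace, depth=4, cycles_per_write=3):
    """Count CPU stall cycles caused by a write buffer of the given depth (>= 1).

    trace: string of operations, 'w' = write, 'r' = read that misses L1,
    'c' = one cycle of work that needs no bus access.
    cycles_per_write: assumed bus time per buffered write.
    """
    time = 0
    stalled = 0
    pending = deque()   # completion times of buffered writes, oldest first
    bus_free = 0        # time at which the bus finishes the last queued write
    for op in trace:
        while pending and pending[0] <= time:
            pending.popleft()                 # retire writes that have completed
        if op == "w":
            if len(pending) == depth:         # buffer full: wait for the oldest write
                wait = pending[0] - time
                stalled += wait
                time += wait
                pending.popleft()
            start = max(time, bus_free)
            bus_free = start + cycles_per_write
            pending.append(bus_free)
        elif op == "r" and pending:           # read miss: drain the buffer first
            wait = pending[-1] - time
            stalled += wait
            time += wait
            pending.clear()
        time += 1                             # the operation itself
    return stalled


print("occasional writes:", stall_cycles(("w" + "c" * 5) * 8))
print("burst of 8 writes:", stall_cycles("w" * 8))
print("4 writes, then read miss:", stall_cycles("wwwwr"))
```

With the assumed timings, writes spaced a few cycles apart never stall the CPU, a burst longer than the buffer does, and a read miss right after a few writes pays for draining them.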

Reply 7 of 8, by mpe

pshipkov wrote on 2022-01-16, 16:05:

However some boards deliver better perf in WT mode.
At the same time other ones based on the same chipset do better in WB.
It does not feel like flawed wb support since using dirty bit or not makes a difference which indicates that wb+db is handled.

Yes. Not all WB implementations are equal. Apart from with/without dirty tag, there are also implementations that delay the re-fill until after the write-back, and those that can make use of an external buffer and execute both in parallel (concurrent write-back). Obviously the latter is much better.

Also, since write-back generally decreases the temporal and spatial locality of bus traffic, some chipsets might be tuned to prioritise page-hit performance at the cost of a greater penalty for a page miss (such as by speculatively holding a page open for longer), and some the other way around. The former tuning might prove to be less beneficial when WB cache is used.

It is a multifaceted problem. But if everything is implemented right, WB should improve performance when running more advanced software. That's why it was created (despite all the hassle). My main point was that speedsys might not be the best tool for comparing cache architecture differences, as the one with the lower score might still be faster.


Reply 8 of 8, by pshipkov

Right on.
In addition, on the software level: compilers cover a lot of this these days, but it still pays to think about:
Vectorization
Data/cache alignment
Data caching / lazy computation
Stack allocations when possible
Stay on the cached data for as long as possible
Data prefetching
And a few others that I cannot think of at the moment ...
