VOGONS


Pentium MMX 450MHz

Reply 40 of 45, by ph4nt0m

Rank Member

Maybe Tillamook support has something to do with the base BIOS version? FIC VA-503+ as well as PA-2012 and PA-2013 use Award 4.60PG while most other Socket 7 boards run 4.51PG.

I have got a batch of Tillamooks from Chinese scrappers. The best one does 450MHz @ 2.3V which is way better than 2.8V.

My Active Sales on CPU-World

Reply 41 of 45, by mkarcher

Rank l33t
ph4nt0m wrote on 2020-04-21, 14:22:
Just for the record, screen shots of K6-2 and K6-3 running on the same mainboard with the same settings.

write performance in MB/s

            K6-2      K6-3      Tillamook
L1  MMX     2425.74   2489.26   169.60
L2  MMX      245.39   1248.84   168.95
L3  MMX        -       245.34     -
Mem MMX      110.00    144.89   169.21

BTW, why is write performance such BS on many older CPUs, including the Pentium MMX? That cannot be true with write-back caches.

Write allocation. The way the memory speed test in speedsys works causes the write test to be (nearly) 100% cache misses, so it tests write performance to memory all the time, and the caches are not used at all. Newer processors added a strategy called "write allocation" that makes the access pattern used by speedsys cacheable.

Standard cache behaviour of both write-back caches is that they just write through on a cache miss. The last thing speedsys does before the write performance tests is read a really big memory block (I think 4MB), so the caches hold the end of that 4MB block. If you have 1M L3, 256K L2 and 32K L1, the last 32K of the 4MB block used by speedsys are cached in L1, the last 256K in L2, and the last megabyte in L3. This means the first three megabytes of the 4MB memory area are not in any of the cache levels. So when speedsys starts its write performance tests at the start of its 4MB memory area, you get cache misses all the time. As the write test of speedsys does not read anything, the cached addresses do not change during the whole write test. That's why you get plain memory performance on write tests.
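
To make the "only the tail is cached" argument concrete, here is a minimal sketch in Python. It assumes a fully associative LRU cache with 32-byte lines and a contiguous 4MB block; real caches are set-associative or direct-mapped, but for a purely sequential read the surviving range should be the same. The function name `read_block` is made up for this illustration and has nothing to do with how speedsys is actually written.

```python
from collections import OrderedDict

LINE = 32  # bytes per cache line, as stated in the post
KB, MB = 1024, 1024 * 1024

def read_block(cache_bytes, block_bytes):
    """Sequentially read a block through an LRU cache (reads always allocate).

    Returns the set of line indices still resident after the read.
    """
    capacity = cache_bytes // LINE
    lru = OrderedDict()
    for line in range(block_bytes // LINE):
        if line in lru:
            lru.move_to_end(line)
        else:
            if len(lru) >= capacity:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return set(lru)

if __name__ == "__main__":
    block = 4 * MB
    for name, size in (("L1", 32 * KB), ("L2", 256 * KB), ("L3", 1 * MB)):
        resident = read_block(size, block)
        first_kb = min(resident) * LINE // KB
        print(f"{name}: holds offsets {first_kb} KB .. {block // KB} KB")
```

For the 1M cache the resident range starts at 3072 KB, so everything below the 3MB mark is outside all three levels, which is exactly where the write test begins.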

The "write allocation" strategy employed by modern processors causes the processor to load a cache line (32 bytes) from memory into the L1 cache when a write misses the L1 cache. The L1 load causes the L2 (and L3, if present) cache to be loaded too. This means that with write allocation, the beginning of the speedsys test area is in the cache, and now you are testing write-back hit performance instead of write-back miss performance.
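
The difference between the two policies can be sketched with a toy single-level cache model. Everything here is an assumption for illustration: the LRU replacement, the 8-byte store size (one MMX store), and especially the shape of the write loop (`region_bytes` written `passes` times from the start of the block), since the real speedsys loop is not documented in this thread.

```python
from collections import OrderedDict

LINE = 32   # cache line size in bytes (from the post)
STORE = 8   # assumed: one MMX store writes 8 bytes

def write_test(write_allocate, cache_bytes=32 * 1024,
               block_bytes=4 * 1024 * 1024,
               region_bytes=16 * 1024, passes=4):
    """Return the store hit rate of a hypothetical speedsys-like write test."""
    capacity = cache_bytes // LINE
    lru = OrderedDict()

    def touch(line, allocate):
        """Access one line; return True on a hit. Misses fill only if allocate."""
        if line in lru:
            lru.move_to_end(line)
            return True
        if allocate:
            if len(lru) >= capacity:
                lru.popitem(last=False)
            lru[line] = True
        return False

    # Read phase: a big sequential read, which always allocates.
    for addr in range(0, block_bytes, LINE):
        touch(addr // LINE, True)

    # Write phase: stores only, starting at the beginning of the block.
    hits = stores = 0
    for _ in range(passes):
        for addr in range(0, region_bytes, STORE):
            stores += 1
            hits += touch(addr // LINE, write_allocate)
    return hits / stores

if __name__ == "__main__":
    print("no write allocation:", write_test(False))
    print("write allocation:   ", write_test(True))
```

In this model the no-allocate case never hits, because the written lines are never brought into the cache, while the allocate case misses only on the first store to each line.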

Write allocation is a trade-off: it is based on the assumption that a write into some memory area is likely to be followed by other writes and reads into the same memory area. If this assumption is correct, the additional cost of loading a cache line on a write miss is more than paid off by the follow-up accesses being hits. If the assumption turns out to be false, on the other hand, loading the cache line that was written to does not bring any performance advantage, but you still have to pay the costs: you evict other data from the cache that might still be hot (i.e. used in the near future), and you waste FSB bandwidth by unnecessarily loading a cache line (which might stall further writes to unrelated cache lines). This is why SSE includes "non-temporal store operations". These instructions cause writes to not perform write allocation even on systems that otherwise would perform it. The programmer can thus hint the processor towards the optimal cache strategy for the access pattern the code is going to have.

It is said that for typical CPU usage, write allocation is indeed the preferable strategy, yielding a 5 to 10 percent performance advantage. But just as speedsys has a worst-case access pattern for the no-write-allocation strategy, other code might contain a pattern that performs much worse with write allocation than without it. In fact, this would happen if code consistently reads a memory area that is around the cache size, but scatters writes randomly over a much bigger memory range. I don't know of any typical example algorithms that have an access pattern like this, though.
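
That worst case for write allocation can be tried out in a toy model as well. This is only a sketch under assumptions that are not from the thread: a single fully associative LRU cache, a read working set exactly the size of the cache, and one random write per `reads_per_write` reads scattered over `scatter_bytes`. All parameter names are hypothetical.

```python
import random
from collections import OrderedDict

LINE = 32  # cache line size in bytes (from the post)

def read_hit_rate(write_allocate, cache_bytes=32 * 1024,
                  scatter_bytes=2 * 1024 * 1024, reads_per_write=16,
                  passes=8, seed=1):
    """Hit rate of cyclic reads over a cache-sized area with scattered writes."""
    capacity = cache_bytes // LINE
    scatter_lines = scatter_bytes // LINE
    rng = random.Random(seed)
    lru = OrderedDict()

    def access(line, allocate):
        """Access one line; return True on a hit. Misses fill only if allocate."""
        if line in lru:
            lru.move_to_end(line)
            return True
        if allocate:
            if len(lru) >= capacity:
                lru.popitem(last=False)
            lru[line] = True
        return False

    hits = reads = 0
    for p in range(passes + 1):           # pass 0 is warm-up and not counted
        for line in range(capacity):      # read working set == cache size
            hit = access(line, True)      # reads always allocate
            if p > 0:
                reads += 1
                hits += hit
            if line % reads_per_write == reads_per_write - 1:
                # One write to a random line in the much bigger range.
                access(rng.randrange(scatter_lines), write_allocate)
    return hits / reads

if __name__ == "__main__":
    print("read hit rate, no write allocation:", read_hit_rate(False))
    print("read hit rate, write allocation:   ", read_hit_rate(True))
```

Without write allocation the reads stay resident and always hit. With it, each scattered write pushes out a line that is about to be read again, and under LRU the misses cascade, so the read hit rate collapses in this model.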

Reply 42 of 45, by ph4nt0m

Rank Member

Seems so. It should read the allocated memory block before writing to it continuously in order to evaluate performance. That makes no difference for write-through caches, but with write-back it should hit the same dirty lines all the time.

Reply 43 of 45, by amadeus777999

Rank Oldbie

Very interesting!
I only came across the non-temporal store when I converted the upscaler of Chocolate Doom to use the 16-byte-wide registers to duplicate bytes and write the finished frame buffer scaled up by integer multiples. It was faster than wr-alloc even though I was writing consecutive 16-byte chunks... but I would have to dig this out and test again to make sure.

Reply 44 of 45, by ph4nt0m

Rank Member

Non-temporal stores are specific to MMX/SSE: there are special instructions for this purpose. They are useful for multimedia-related stuff, or for any operation on large memory arrays that would otherwise pollute the caches.

Reply 45 of 45, by Nemo1985

Rank Oldbie
meljor wrote on 2020-06-22, 19:39:

It was the 5.1 revision of the board.

I tested and can confirm that the CPU works fine with the Gigabyte GA-5AX (rev. 5.2). With the ADS#-ADSC# bridge the L2 cache is detected; with a vanilla CPU only the L1 cache works. The BIOS is Award 4.51PG.
The Asus P55T2P4 won't work: the debug card shows 0000. I use the modded BIOS for the K6-3+, and it is Award rev. 4.51PG.
I also tested the Asus P5A with the latest beta BIOS; it doesn't boot either.
So far I have found 2 motherboards where the Tillamook works fine, the Gigabyte GA-5AX (rev. 5.2) and the Fic 2003VA+, plus the Aopen AX59 PRO, where the behaviour is erratic; further tests to follow.