VOGONS


First post, by vladstamate

User metadata
Rank Oldbie

So, how is the 386 L2 (on board) cache designed to work? It seems to be WT. There are however other questions:

1) How many ways associative is it? How big is a cache block?
2) When a WT happens, does the processor (BIU/EU) continue execution, or does it stall until the BIU has written everything to main memory? I assume the latter.
3) When a read hit happens, is that effectively a 1-cycle read per...byte? dword?

I've been looking around but I cannot find anything conclusive. If anyone has any documentation pointers please post them.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 2 of 11, by superfury

User metadata
Rank l33t++

I'd assume it's a 1 cycle read for a cached 32-bit block of data? So any byte, word or dword entry that's in the same range as the cache entry (e.g. address 0x12345678-0x1234567B) gets updated by any write to that address range, with reads after such a write reading directly from the cache itself, instead of main memory? Although that would be a problem if something's written without it having been read first (e.g. writing to address 0x12345678 without having read from 0x12345679-0x1234567B)? Not too sure about that, though? Maybe some kind of 4-bit bitmask to allow parts to be written back instead of the entire dword being written to memory or read?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 3 of 11, by SarahWalker

User metadata
Rank Member

For what it's worth, the only 386 cache design I've looked at is for the OPTi495SX. That has (with 32k cache and 8k tag) the following characteristics, at 40 MHz, with default timings as provided by AMIBIOS:

  • Cache line length of 4 bytes
  • Direct mapped
  • Write back
  • Cache hit (read) takes 3 cycles
  • Cache miss (read) takes ~14 cycles. Rough guess of how this breaks down - 2 cycles tag read, 7 cycles DRAM read, 2 cycles tag write, 3 cycles cache data write. I haven't actually put a scope on this board, so this is unlikely to be exactly right
  • Cache hit (read-modify-write) takes 5 cycles
  • I didn't actually test cache hit (write). Probably 2 or 3 cycles
  • Cache miss (write) takes 7 cycles. I don't entirely understand why this is so much faster than the read - again, scope would help

This is probably fairly typical of late gen 386 boards. I've no idea which cache is being referred to in the initial post, so can't say how similar that would be.
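For anyone modelling this in an emulator, the lookup implied by those numbers could be sketched roughly like this. It is only an illustration: it assumes the 32k of data SRAM is split into 4-byte lines (8192 lines, which would match the 8k tag RAM) and that the index comes from the address bits directly above the byte offset. The names `split_address`, `LINE_SIZE` and `CACHE_SIZE` are made up for the example, and the real chipset may slice the address differently.

```python
LINE_SIZE = 4                         # bytes per cache line (from the list above)
CACHE_SIZE = 32 * 1024                # bytes of data SRAM (from the list above)
NUM_LINES = CACHE_SIZE // LINE_SIZE   # 8192 lines, one tag entry each

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 2 bits select the byte in the line
INDEX_BITS = NUM_LINES.bit_length() - 1    # 13 bits select the line


def split_address(addr):
    """Split a physical address into (tag, index, offset) for a
    direct-mapped cache. Assumed layout, not taken from a datasheet."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), offset)
```

Because the cache is direct mapped, two addresses exactly 32 KB apart land on the same index with different tags, so they evict each other.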

Reply 4 of 11, by superfury

User metadata
Rank l33t++

Does that really improve performance? AFAIK the 386 takes 2 cycles to access memory, with 1 cycle added due to wait-state RAM? So 3 cycles in total? That's the same as the minimal cache hit timing you're giving (Cache hit (read))? Then why are those caches added if they only add to RAM timings? And it's usually the reads that are done the most (variables and code)? Sounds counter-productive?


Reply 5 of 11, by vladstamate

User metadata
Rank Oldbie

I think it is more complicated than that. Here is an excerpt from Michael Abrash's book: http://www.phatcode.net/res/224/files/html/ch11/11-02.html

A few things slow down the theoretical 3 cycles per 32-bit transfer that a 386DX should be able to do:

- My guess is 1WS memory was not that common but 2-3 was.
- the DRAM refresh seems to steal cycles. I do not know how much and when exactly.
- my other guess (based on Abrash) is that the number of WS varies with this: "which interleaved bank and/or RAM column was accessed last."

I think some more knowledge of the DRAM architecture is required to fully understand this...


Reply 6 of 11, by vladstamate

User metadata
Rank Oldbie
SarahWalker wrote:

For what it's worth, the only 386 cache design I've looked at is for the OPTi495SX. That has (with 32k cache and 8k tag) the following characteristics, at 40 MHz, with default timings as provided by AMIBIOS

Also, thank you Sarah for that data. It is very useful!


Reply 7 of 11, by SarahWalker

User metadata
Rank Member
vladstamate wrote:

I think it is more complicated than that. Here is an excerpt from Michael Abrash's book: http://www.phatcode.net/res/224/files/html/ch11/11-02.html

A few things slow down the theoretical 3 cycles per 32-bit transfer that a 386DX should be able to do:

- My guess is 1WS memory was not that common but 2-3 was.
- the DRAM refresh seems to steal cycles. I do not know how much and when exactly.
- my other guess (based on Abrash) is that the number of WS varies with this: "which interleaved bank and/or RAM column was accessed last."

I think some more knowledge of the DRAM architecture is required to fully understand this...

It all gets very messy, particularly as CPU clock speeds increased!

I've just dug out the PS/2 model 70 reference manual - this is a 386DX system with 16, 20 and 25 MHz versions, with varying memory systems across the three. The 16 MHz version uses fast page mode RAM with no cache, with the performance below:

  • Memory Read (Page Hit) - 0 wait states
  • Memory Read (Page Miss) - 2 wait states
  • Memory Write (Page Hit) - 1 wait state
  • Memory Write (Page Miss) - 2 wait states

They don't state what a page is. On a 1 MB system with a 32-bit data bus the memory will probably be implemented as 256k x 32, meaning 18 address bits, split into 9 row bits and 9 column bits. That would make a page 9 bits + 2 to account for the 32-bit bus, i.e. 11 bits, giving a page size of 2 KB.
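That estimate can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning, assuming the DRAM address is multiplexed as an even row/column split (which is what the 9-bit figure implies); `page_size_bytes` is a made-up helper name and the manual itself does not confirm the organisation.

```python
def page_size_bytes(words, bus_bytes):
    """Estimate the fast-page-mode page size of a DRAM bank.

    words     -- number of bus-wide words in the bank (e.g. 256k)
    bus_bytes -- width of the data bus in bytes (4 for a 386DX)
    Assumes the address bits are split evenly between row and column.
    """
    addr_bits = (words - 1).bit_length()   # 256k words -> 18 bits
    column_bits = addr_bits // 2           # even row/column split -> 9 bits
    return (1 << column_bits) * bus_bytes  # 512 columns * 4 bytes


print(page_size_bytes(256 * 1024, 4))
```

With a different bank organisation (say 1M x 32) the same assumption would give a larger page, so the 2 KB figure only holds for the 1 MB configuration discussed here.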

The 20 MHz version uses the same timings, but presumably needs faster DRAM to achieve them.

The 25 MHz version adds a 64 KB cache, with the timings below:

  • Memory Read (Cache Hit) - 0 wait states
  • Memory Write (Cache Hit) - 0 wait states
  • Memory Read (Cache Miss, Page Hit) - 0-2 wait states
  • Memory Write (Cache Miss, Page Hit) - 1 wait state
  • Memory Read (Cache Miss, Page Miss) - 3-5 wait states
  • Memory Write (Cache Miss, Page Miss) - 0 wait states

The manual states that writes are buffered on the motherboard, hence the odd write timings. I suspect my 386DX/40 board is doing something similar. It doesn't say what the variations in the read timings are caused by, which is extremely helpful from an emulation perspective.
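To get a feel for what the read figures mean on average, one could weight them by hit rates. The sketch below does that for the 25 MHz read timings listed above; the cache and page hit rates passed in are purely hypothetical inputs (the manual gives none), and `expected_read_wait_states` is a name invented for the example.

```python
# (min, max) wait states for reads on the 25 MHz model, from the list above
READ_WAIT_STATES = {
    "cache_hit": (0, 0),
    "miss_page_hit": (0, 2),
    "miss_page_miss": (3, 5),
}


def expected_read_wait_states(cache_hit_rate, page_hit_rate):
    """Return (low, high) bounds on the average read wait states.

    Both rates are fractions in 0..1 and are assumptions supplied by
    the caller, not measured values.
    """
    miss_rate = 1.0 - cache_hit_rate
    weights = {
        "cache_hit": cache_hit_rate,
        "miss_page_hit": miss_rate * page_hit_rate,
        "miss_page_miss": miss_rate * (1.0 - page_hit_rate),
    }
    low = sum(weights[k] * READ_WAIT_STATES[k][0] for k in weights)
    high = sum(weights[k] * READ_WAIT_STATES[k][1] for k in weights)
    return low, high


# Hypothetical example: 90% cache hits, half of the misses hit the open page
print(expected_read_wait_states(0.9, 0.5))
```

Since the manual does not explain the 0-2 and 3-5 ranges, the result is only a band, not a single number.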

Refresh for all three is given as performed every 15.1us, with delays from 500 to 600 ns - this would work out as roughly 5-7 cycles.

Reply 8 of 11, by SarahWalker

User metadata
Rank Member

Another example would be a board I have lying around, an ECS 386/32. This is a 386DX/20 board using a C&T CS8230 chipset, which implements interleaved memory. From what I can tell, this gives 1 W/S in normal operation, but if two identical banks of memory are available then sequential accesses will drop to 0 W/S. I can't say how this performs in practice as I don't have enough SIPPs to test it!

On the other end of the performance scale are a couple of 386SX boards, a 386SX/25 which runs at 2 W/S and a 386SX/40 which runs at 5 W/S. Once you account for the clock speeds this gives both boards roughly the same memory speed. Unsurprisingly there's very little performance difference between the two! I believe this is typical of most 386SX boards - there are a few cached boards that would be noticeably faster but most will run fairly basic DRAM controllers with fairly conservative timings.
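The "roughly the same memory speed" point is easy to check numerically. This is a rough sketch that assumes a basic bus cycle of 2 clocks plus the stated wait states and ignores refresh and any page-mode effects; `access_time_ns` is a hypothetical helper, not something from a chipset manual.

```python
def access_time_ns(clock_mhz, wait_states, base_clocks=2):
    """Time for one memory access in nanoseconds.

    base_clocks=2 is the assumed length of a zero-wait-state bus cycle.
    """
    clock_ns = 1000.0 / clock_mhz
    return (base_clocks + wait_states) * clock_ns


print(access_time_ns(25, 2))   # 386SX/25 at 2 W/S
print(access_time_ns(40, 5))   # 386SX/40 at 5 W/S
```

Under those assumptions the two boards come out at about 160 ns and 175 ns per access, so the faster clock buys nothing on the memory side.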

Reply 9 of 11, by vladstamate

User metadata
Rank Oldbie
SarahWalker wrote:

Refresh for all three is given as performed every 15.1us, with delays from 500 to 600 ns - this would work out as roughly 5-7 cycles.

Good information. Some back of the napkin calculations (if I understand this correctly):

A cycle for a 386 at 40 MHz would last 25 ns, or 62.5 ns for a 386 at 16 MHz.

If rounded to 15 us, for a 386 at 40 MHz that would mean it needs a DRAM refresh roughly every 600 cycles. At 500 ns per refresh that is 20 cycles, during which we cannot access some (or all) of the memory.

For a 386 at 16 MHz the refresh would last about 8 cycles and happen every 242 cycles.
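The same napkin maths as a small function, in case anyone wants to plug in other clock speeds. It assumes the 15.1 us interval and a 500 ns stall from the manual figures quoted above, and that every refresh actually blocks the CPU, which is the worst case; `refresh_cost` is just an illustrative name.

```python
def refresh_cost(clock_mhz, interval_us=15.1, stall_ns=500.0):
    """Return (cycle_ns, cycles_between_refreshes, stall_cycles, fraction).

    fraction is the share of time lost if every refresh stalls the CPU.
    """
    cycle_ns = 1000.0 / clock_mhz
    cycles_between = interval_us * 1000.0 / cycle_ns
    stall_cycles = stall_ns / cycle_ns
    return cycle_ns, cycles_between, stall_cycles, stall_cycles / cycles_between


for mhz in (16, 40):
    print(mhz, refresh_cost(mhz))
```

The lost fraction is just stall time divided by interval, so it does not depend on the clock: roughly 3.3% for a 500 ns stall and about 4% for 600 ns.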


Reply 10 of 11, by SarahWalker

User metadata
Rank Member

That sounds more or less correct - and given the sheer number of 386 motherboard designs that exist, 'more or less' is about as accurate as you're going to get!

Note though, that on cached systems refresh won't stall the CPU unless it is actively accessing DRAM at that point. If it's hitting the cache at that point then refresh will be unnoticeable.

Also note that some (later) designs may be implementing hidden refresh, in which case the refresh will be unnoticeable even during a DRAM access.

Reply 11 of 11, by SarahWalker

User metadata
Rank Member
superfury wrote:

I'd assume it's a 1 cycle read for a cached 32-bit block of data? So any byte, word or dword entry that's in the same range as the cache entry (e.g. address 0x12345678-0x1234567B) gets updated by any write to that address range, with reads after such a write reading directly from the cache itself, instead of main memory? Although that would be a problem if something's written without it having been read first (e.g. writing to address 0x12345678 without having read from 0x12345679-0x1234567B)? Not too sure about that, though? Maybe some kind of 4-bit bitmask to allow parts to be written back instead of the entire dword being written to memory or read?

No, in general most (all?) motherboard-based cache designs will ignore writes to addresses that aren't already in the cache - the write will go straight to main memory.

CPUs starting from the Pentium Pro generation implement 'write-allocate', where a write to an address not already in cache will cause the relevant cache line to be read in from memory. That's not really relevant for the 386 generation though.
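The difference between the two write-miss policies can be shown with a toy model. This is a minimal sketch, not a model of any particular chipset: it tracks only tags (no data, no dirty bits), uses a direct-mapped layout with 4-byte lines as an assumption, and the class name `DirectMappedCache` and its `write_allocate` flag are invented for the example.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache (no data, no dirty bits)."""

    def __init__(self, num_lines=8192, line_size=4, write_allocate=False):
        self.num_lines = num_lines
        self.line_size = line_size
        self.write_allocate = write_allocate
        self.tags = [None] * num_lines   # None marks an invalid line

    def _locate(self, addr):
        line_addr = addr // self.line_size
        return line_addr % self.num_lines, line_addr // self.num_lines

    def read(self, addr):
        """Return True on a hit; a read miss always fills the line."""
        index, tag = self._locate(addr)
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag
        return hit

    def write(self, addr):
        """Return True on a hit; a write miss fills the line only when
        write_allocate is set, otherwise it bypasses the cache."""
        index, tag = self._locate(addr)
        hit = self.tags[index] == tag
        if not hit and self.write_allocate:
            self.tags[index] = tag
        return hit


board = DirectMappedCache(write_allocate=False)   # motherboard-style policy
board.write(0x1000)            # miss, goes straight to memory
print(board.read(0x1000))      # still a miss: the write did not allocate
print(board.read(0x1000))      # hit: the first read filled the line
```

Without allocation a write miss never creates a partially valid line, which sidesteps the need for the per-byte mask suggested in the quote.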