386 Cache Design

Postby vladstamate » 2017-10-10 @ 00:32

So, how is the 386 L2 (on-board) cache designed to work? It seems to be write-through (WT). There are other questions, however:

1) How many ways associative is it? How big is a cache block?
2) When a write-through happens, does the processor (BIU/EU) continue execution, or does it stall until the BIU has written everything to main memory? I assume the latter.
3) When a read hit happens, is that effectively a 1-cycle read per...byte? dword?

I've been looking around but I cannot find anything conclusive. If anyone has any documentation pointers please post them.

Re: 386 Cache Design

Postby SarahWalker » 2017-10-10 @ 16:44

There is no standard 386 L2 cache design - it's entirely implemented by the motherboard chipset. Ergo it's going to vary from board to board.

Re: 386 Cache Design

Postby superfury » 2017-10-12 @ 16:40

I'd assume it's a 1-cycle read for a cached 32-bit block of data? So any byte, word or dword that falls within the range of a cache entry (e.g. addresses 0x12345678-0x1234567B) gets updated by any write to that range, and reads after such a write come directly from the cache instead of main memory? That would be a problem, though, if something is written without having been read first (e.g. writing to address 0x12345678 without having read 0x12345679-0x1234567B). I'm not too sure about that. Maybe there is some kind of 4-bit mask that allows parts of the dword to be written back or read, instead of the entire dword?

Re: 386 Cache Design

Postby SarahWalker » 2017-10-12 @ 21:09

For what it's worth, the only 386 cache design I've looked at is for the OPTi495SX. That has (with 32k cache and 8k tag) the following characteristics, at 40 MHz, with default timings as provided by AMIBIOS:

  • Cache line length of 4 bytes
  • Direct mapped
  • Write back
  • Cache hit (read) takes 3 cycles
  • Cache miss (read) takes ~14 cycles. Rough guess of how this breaks down - 2 cycles tag read, 7 cycles DRAM read, 2 cycles cache write, 3 cycles cache data write. I haven't actually put a scope on this board, so this is unlikely to be exactly right
  • Cache hit (read-modify-write) takes 5 cycles
  • I didn't actually test cache hit (write). Probably 2 or 3 cycles
  • Cache miss (write) takes 7 cycles. I don't entirely understand why this is so much faster than the read - again, scope would help

This is probably fairly typical of late gen 386 boards. I've no idea which cache is being referred to in the initial post, so can't say how similar that would be.
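
The geometry in the list above pins down how an address maps onto this cache: 32 KB of data in 4-byte lines gives 8192 lines, which matches the 8k tag entries. A minimal Python sketch of that mapping and of the measured read costs follows. The function names, the list-based tag store, and the absence of a valid bit or cacheable-range check are illustrative assumptions, not details of the OPTi chipset.

```python
# Geometry from the post: 32 KB of data SRAM, 4-byte lines, direct mapped.
CACHE_BYTES = 32 * 1024
LINE_BYTES = 4
NUM_LINES = CACHE_BYTES // LINE_BYTES        # 8192 lines, one tag entry each

OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 2 bits select the byte in a line
INDEX_BITS = NUM_LINES.bit_length() - 1      # 13 bits select the line


def split_address(addr):
    """Return (tag, index, offset) for a physical address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Read costs quoted in the post, in CPU clocks (the miss figure is approximate).
READ_HIT_CYCLES = 3
READ_MISS_CYCLES = 14


def read_cycles(tags, addr):
    """Look addr up in the tag store (list of NUM_LINES entries); fill on a miss."""
    tag, index, _ = split_address(addr)
    if tags[index] == tag:
        return READ_HIT_CYCLES
    tags[index] = tag                        # direct mapped: the old line is replaced
    return READ_MISS_CYCLES
```

With `tags = [None] * NUM_LINES`, two addresses that are exactly 32 KB apart share an index and evict each other, which is the main cost of a direct-mapped design.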

Re: 386 Cache Design

Postby superfury » 2017-10-12 @ 21:43

Does that really improve performance? Afaik the 386 takes 2 cycles to access memory, with 1 cycle added for RAM with one wait state? So 3 cycles in total? That's the same as the minimum cache hit timing you're giving (cache hit, read)? Then why add those caches if they only add to the RAM timings? And it's usually reads that are done the most (variables and code)? Sounds counter-productive?

Re: 386 Cache Design

Postby vladstamate » 2017-10-13 @ 17:20

I think it is more complicated than that. Here is an excerpt from Michael Abrash's book: http://www.phatcode.net/res/224/files/h ... 11-02.html

A few things slow down the theoretical 3 cycles per 32-bit transfer that a 386DX should be able to do:

- My guess is 1WS memory was not that common but 2-3 was.
- the DRAM refresh seems to steal cycles. I do not know how much and when exactly.
- my other guess (based on Abrash) is that the number of WS varies with this: "which interleaved bank and/or RAM column was accessed last."

I think some more knowledge of the DRAM architecture is required to fully understand this...

Re: 386 Cache Design

Postby vladstamate » 2017-10-13 @ 17:22

SarahWalker wrote:For what it's worth, the only 386 cache design I've looked at is for the OPTi495SX. That has (with 32k cache and 8k tag) the following characteristics, at 40 MHz, with default timings as provided by AMIBIOS


Also, thank you Sarah for that data. It is very useful!

Re: 386 Cache Design

Postby SarahWalker » 2017-10-13 @ 18:20

vladstamate wrote:I think it is more complicated than that. Here is an excerpt from Michael Abrash's book: http://www.phatcode.net/res/224/files/h ... 11-02.html

A few things slow down the theoretical 3 cycles per 32-bit transfer that a 386DX should be able to do:

- My guess is 1WS memory was not that common but 2-3 was.
- the DRAM refresh seems to steal cycles. I do not know how much and when exactly.
- my other guess (based on Abrash) is that the number of WS varies with this: "which interleaved bank and/or RAM column was accessed last."

I think some more knowledge of the DRAM architecture is required to fully understand this...

It all gets very messy, particularly as CPU clock speeds increased!

I've just dug out the PS/2 Model 70 reference manual. This is a 386DX system with 16, 20 and 25 MHz versions, and the memory system varies across the three. The 16 MHz version uses fast page mode RAM with no cache, with the following performance:
  • Memory Read (Page Hit) - 0 wait states
  • Memory Read (Page Miss) - 2 wait states
  • Memory Write (Page Hit) - 1 wait state
  • Memory Write (Page Miss) - 2 wait states
They don't state what a page is. On a 1 MB system with a 32-bit data bus this will probably be implemented as 256k x 32, meaning 18 address bits. Split evenly into row and column addresses, that leaves 9 column bits per page, plus 2 bits to account for the 32-bit bus, giving a page size of 2^11 bytes = 2 KB.
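
The page-size estimate above can be written out as a small Python sketch. It assumes a power-of-two array depth and an even row/column split of the multiplexed DRAM address, which is the same guess the post makes; the function names are made up for illustration.

```python
def fpm_page_bytes(words, bus_bits):
    """Page size of a fast-page-mode array organised as `words` x `bus_bits`."""
    address_bits = words.bit_length() - 1                 # 256k words -> 18 bits
    column_bits = address_bits // 2                       # even row/column split -> 9 bits
    byte_select_bits = (bus_bits // 8).bit_length() - 1   # 4 bytes per word -> 2 bits
    return 1 << (column_bits + byte_select_bits)


def same_page(addr_a, addr_b, page_bytes):
    """True if two byte addresses fall in the same DRAM page (a page hit)."""
    return addr_a // page_bytes == addr_b // page_bytes
```

For the 256k x 32 case `fpm_page_bytes(256 * 1024, 32)` gives 2048, so an emulator following this guess would treat consecutive accesses within the same 2 KB block as page hits.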

The 20 MHz version uses the same timings, but presumably needs faster DRAM to achieve them.

The 25 MHz version adds a 64 KB cache, with the following timings:
  • Memory Read (Cache Hit) - 0 wait states
  • Memory Write (Cache Hit) - 0 wait states
  • Memory Read (Cache Miss, Page Hit) - 0-2 wait states
  • Memory Write (Cache Miss, Page Hit) - 1 wait state
  • Memory Read (Cache Miss, Page Miss) - 3-5 wait states
  • Memory Write (Cache Miss, Page Miss) - 0 wait states
The manual states that writes are buffered on the motherboard, hence the odd write timings. I suspect my 386DX/40 board is doing something similar. It doesn't say what the variations in the read timings are caused by, which is extremely helpful from an emulation perspective.
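
For emulation purposes the two timing tables above can be captured as plain lookup functions. This is only a sketch of the quoted figures: the function names are hypothetical, and because the manual does not explain the read ranges, the 25 MHz version returns a (min, max) pair rather than picking a value.

```python
def model70_16mhz_wait_states(is_write, page_hit):
    """Wait states for the uncached 16 MHz (and 20 MHz) Model 70, as quoted."""
    if is_write:
        return 1 if page_hit else 2
    return 0 if page_hit else 2


def model70_25mhz_wait_states(is_write, cache_hit, page_hit):
    """(min, max) wait states for the cached 25 MHz Model 70, as quoted."""
    if cache_hit:
        return (0, 0)
    if is_write:
        # Writes are buffered on the motherboard, hence the odd-looking figures.
        return (1, 1) if page_hit else (0, 0)
    return (0, 2) if page_hit else (3, 5)
```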

Refresh for all three is given as performed every 15.1us, with delays from 500 to 600 ns - this would work out as roughly 5-7 cycles.

Re: 386 Cache Design

Postby SarahWalker » 2017-10-13 @ 18:27

Another example would be a board I have lying around, an ECS 386/32. This is a 386DX/20 board using a C&T CS8230 chipset, which implements interleaved memory. From what I can tell, this gives 1 W/S in normal operation, but if two identical banks of memory are available then sequential accesses will drop to 0 W/S. I can't say how this performs in practice as I don't have enough SIPPs to test it!
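
A rough way to model that interleaving behaviour in Python is shown below. This is a guess at the general idea, not the CS8230's documented behaviour: the class name is made up, and selecting the bank with address bit 2 (alternating dwords) is an assumption, since the chipset may well interleave on a different boundary.

```python
class TwoBankInterleave:
    """Hypothetical two-bank model: the bank is chosen by address bit 2."""

    def __init__(self, banks_matched=True):
        self.banks_matched = banks_matched   # two identical banks installed?
        self.last_bank = None

    def wait_states(self, addr):
        bank = (addr >> 2) & 1
        # An access to the bank that was idle during the previous access can overlap.
        overlapped = (
            self.banks_matched
            and self.last_bank is not None
            and bank != self.last_bank
        )
        self.last_bank = bank
        return 0 if overlapped else 1
```

Under this model a run of sequential dword reads costs 1 W/S for the first access and 0 W/S afterwards, while repeated accesses to the same bank stay at 1 W/S.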

On the other end of the performance scale are a couple of 386SX boards, a 386SX/25 which runs at 2 W/S and a 386SX/40 which runs at 5 W/S. Once you account for the clock speeds this gives both boards roughly the same memory speed. Unsurprisingly there's very little performance difference between the two! I believe this is typical of most 386SX boards - there are a few cached boards that would be noticeably faster but most will run fairly basic DRAM controllers with fairly conservative timings.
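
The "roughly the same memory speed" claim is easy to check with a few lines of Python. The sketch assumes the usual 2-clock minimum bus cycle of the 386 plus the quoted wait states; `access_ns` is a made-up helper name.

```python
def access_ns(clock_mhz, wait_states, base_clocks=2):
    """Time for one memory bus cycle in nanoseconds."""
    return (base_clocks + wait_states) * 1000.0 / clock_mhz


sx25 = access_ns(25, 2)   # 4 clocks at 40 ns each = 160 ns
sx40 = access_ns(40, 5)   # 7 clocks at 25 ns each = 175 ns
```

So the 40 MHz board actually spends slightly longer on each memory access than the 25 MHz one, which fits the observation that the two perform almost identically.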

Re: 386 Cache Design

Postby vladstamate » 2017-10-13 @ 18:44

SarahWalker wrote:Refresh for all three is given as performed every 15.1us, with delays from 500 to 600 ns - this would work out as roughly 5-7 cycles.


Good information. Some back of the napkin calculations (if I understand this correctly):

A cycle for a 386 @ 40 MHz would last 25 ns, or 62.5 ns for a 386 at 16 MHz.

Rounded to 15 us, a 386 @ 40 MHz would need a DRAM refresh roughly every 600 cycles. At 500 ns per refresh, that is 20 cycles during which we cannot access some (or all) of the memory.

For a 386 at 16 MHz the refresh would last about 8 cycles and happen roughly every 240 cycles.
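
The same napkin arithmetic as a small Python helper, using the unrounded 15.1 us interval and the 500 ns lower bound from the manual figures quoted above. The function name and the returned tuple layout are just illustrative choices.

```python
def refresh_in_cycles(clock_mhz, interval_us=15.1, stall_ns=500.0):
    """Return (cycle_ns, cycles between refreshes, cycles per refresh)."""
    cycle_ns = 1000.0 / clock_mhz
    interval_cycles = interval_us * 1000.0 / cycle_ns
    stall_cycles = stall_ns / cycle_ns
    return cycle_ns, interval_cycles, stall_cycles


cycle40, every40, stall40 = refresh_in_cycles(40)   # 25 ns, about 604, 20
cycle16, every16, stall16 = refresh_in_cycles(16)   # 62.5 ns, about 241.6, 8
```

Either way the memory is busy refreshing for about 500 / 15100, or roughly 3.3%, of the time, regardless of CPU clock.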

Re: 386 Cache Design

Postby SarahWalker » 2017-10-13 @ 18:55

That sounds more or less correct - and given the sheer number of 386 motherboard designs that exist, 'more or less' is about as accurate as you're going to get!

Note though, that on cached systems refresh won't stall the CPU unless it is actively accessing DRAM at that point. If it's hitting the cache at that point then refresh will be unnoticeable.

Also note that some (later) designs may be implementing hidden refresh, in which case the refresh will be unnoticeable even during a DRAM access.
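
One way an emulator might apply those two notes is sketched below in Python. The model is an assumption for illustration only: it pretends a refresh starts at every multiple of the refresh interval and that a DRAM access arriving during it waits for the remainder. Real boards will differ, and the function and parameter names are made up.

```python
def refresh_penalty(now_cycle, access_is_dram, interval_cycles, stall_cycles,
                    hidden_refresh=False):
    """Extra cycles an access starting at now_cycle loses to DRAM refresh."""
    if not access_is_dram or hidden_refresh:
        return 0                                  # cache hits never see the refresh
    phase = now_cycle % interval_cycles           # position within the refresh period
    return stall_cycles - phase if phase < stall_cycles else 0
```

With the 40 MHz figures (a refresh about every 604 cycles, lasting 20), a DRAM access that starts 10 cycles into a refresh is delayed by the remaining 10 cycles, while a cache hit at the same moment is not delayed at all.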

Re: 386 Cache Design

Postby SarahWalker » 2017-10-13 @ 18:58

superfury wrote:I'd assume it's a 1-cycle read for a cached 32-bit block of data? So any byte, word or dword that falls within the range of a cache entry (e.g. addresses 0x12345678-0x1234567B) gets updated by any write to that range, and reads after such a write come directly from the cache instead of main memory? That would be a problem, though, if something is written without having been read first (e.g. writing to address 0x12345678 without having read 0x12345679-0x1234567B). I'm not too sure about that. Maybe there is some kind of 4-bit mask that allows parts of the dword to be written back or read, instead of the entire dword?

No, in general most (all?) motherboard-based cache designs will ignore writes to addresses that aren't already in the cache - the write will go straight to main memory.

CPUs starting from the Pentium Pro generation implement 'write-allocate', where a write to an address not already in cache will cause the relevant cache line to be read in from memory. That's not really relevant for the 386 generation though.
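
The difference between the two write-miss policies can be shown with a tag-only direct-mapped cache in Python. This is a minimal sketch, not any particular chipset: the class and method names are made up, data storage and dirty bits are left out, and the default geometry (8192 lines of 4 bytes) is simply borrowed from the OPTi example earlier in the thread.

```python
class DirectMappedCache:
    """Tag-only direct-mapped cache model; tracks which lines are present."""

    def __init__(self, num_lines=8192, line_bytes=4, write_allocate=False):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.write_allocate = write_allocate
        self.tags = [None] * num_lines

    def _locate(self, addr):
        line = addr // self.line_bytes
        return line % self.num_lines, line // self.num_lines   # (index, tag)

    def contains(self, addr):
        index, tag = self._locate(addr)
        return self.tags[index] == tag

    def read(self, addr):
        """Return True on a hit; a read miss always fills the line."""
        hit = self.contains(addr)
        if not hit:
            index, tag = self._locate(addr)
            self.tags[index] = tag
        return hit

    def write(self, addr):
        """Return True on a hit; a write miss only fills with write_allocate."""
        hit = self.contains(addr)
        if not hit and self.write_allocate:
            index, tag = self._locate(addr)
            self.tags[index] = tag
        return hit
```

With the default `write_allocate=False`, writing to 0x12345678 before it has ever been read leaves the cache untouched, so the partially-valid-line problem raised in the quoted question never arises: the write simply goes to main memory.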