386 Cache Design \ VOGONS

Reply 1 of 11, by SarahWalker

Posted on 2017-10-10, 16:44

Rank: Member
Posts: 262
Joined: 2007-08-19, 10:51

There is no standard 386 L2 cache design - it's entirely implemented by the motherboard chipset. Ergo it's going to vary from board to board.

Reply 2 of 11, by superfury

Posted on 2017-10-12, 16:40

Rank: l33t++
Posts: 5458
Joined: 2014-03-08, 11:25
Location: Netherlands

I'd assume a cached 32-bit block of data can be read in 1 cycle? So any byte, word or dword that falls within the range of a cache entry (e.g. addresses 0x12345678-0x1234567B) gets updated by any write to that range, and reads after such a write come directly from the cache instead of main memory? That would be a problem, though, if something is written without having been read first (e.g. writing to address 0x12345678 without having read 0x12345679-0x1234567B). I'm not too sure about that. Maybe there's some kind of 4-bit bitmask that lets individual bytes be written back, instead of the entire dword being written to or read from memory?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 3 of 11, by SarahWalker

Posted on 2017-10-12, 21:09

For what it's worth, the only 386 cache design I've looked at is the one for the OPTi495SX. That has (with 32k cache and 8k tag) the following characteristics, at 40 MHz, with default timings as provided by AMIBIOS:

Cache line length of 4 bytes
Direct mapped
Write back
Cache hit (read) takes 3 cycles
Cache miss (read) takes ~14 cycles. Rough guess of how this breaks down - 2 cycles tag read, 7 cycles DRAM read, 2 cycles cache write, 3 cycles cache data write. I haven't actually put a scope on this board, so this is unlikely to be exactly right
Cache hit (read-modify-write) takes 5 cycles
I didn't actually test cache hit (write). Probably 2 or 3 cycles
Cache miss (write) takes 7 cycles. I don't entirely understand why this is so much faster than the read - again, scope would help

This is probably fairly typical of late gen 386 boards. I've no idea which cache is being referred to in the initial post, so can't say how similar that would be.
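
A rough way to see how those figures fit together is a small model of a direct-mapped cache with 4-byte lines. This is only a sketch, not code from any emulator: the class and constant names are made up for illustration, the write-hit cost of 3 cycles is an assumption (the post only guesses "2 or 3"), write misses are assumed not to allocate a line, and dirty-line write-back is not modelled at all.

```python
# Hypothetical model of the cache described above; not taken from any emulator.
LINE_SIZE = 4                          # bytes per cache line (from the post)
CACHE_SIZE = 32 * 1024                 # 32k of cache data (from the post)
NUM_LINES = CACHE_SIZE // LINE_SIZE    # 8192 lines, matching the 8k tag RAM

# Cycle costs measured in the post at 40 MHz.
READ_HIT, READ_MISS, WRITE_MISS = 3, 14, 7
WRITE_HIT = 3                          # assumption: the post did not test this


class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES   # None marks an empty line

    def _split(self, addr):
        """Split a byte address into (line index, tag)."""
        line = addr // LINE_SIZE
        return line % NUM_LINES, line // NUM_LINES

    def read(self, addr):
        """Return the cycle cost of a read and fill the line on a miss."""
        index, tag = self._split(addr)
        if self.tags[index] == tag:
            return READ_HIT
        self.tags[index] = tag           # a read miss replaces whatever was there
        return READ_MISS

    def write(self, addr):
        """Return the cycle cost of a write."""
        index, tag = self._split(addr)
        # Assumption: a write miss goes to memory without allocating a line.
        return WRITE_HIT if self.tags[index] == tag else WRITE_MISS


cache = DirectMappedCache()
print(cache.read(0x1000))                # 14: first touch is a miss
print(cache.read(0x1002))                # 3: same 4-byte line, now a hit
print(cache.read(0x1000 + CACHE_SIZE))   # 14: maps to the same index, evicts it
print(cache.read(0x1000))                # 14: the original line was evicted
```

With lines this short there is no spatial prefetch to speak of, so the benefit comes almost entirely from re-reading the same dwords (loops, stack, hot variables).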

Reply 4 of 11, by superfury

Posted on 2017-10-12, 21:43

Does that really improve performance? AFAIK the 386 takes 2 cycles to access memory, with 1 cycle added for the RAM wait state, so 3 cycles in total? That's the same as the minimum cache hit timing you're giving (cache hit, read). Then why are those caches added if they only add to the RAM timings? And reads are usually the most common accesses (variables and code). Sounds counter-productive?

Reply 5 of 11, by vladstamate

Posted on 2017-10-13, 17:20

Rank: Oldbie
Posts: 967
Joined: 2015-08-23, 01:43

I think it is more complicated than that. Here is an excerpt from Michael Abrash's book: http://www.phatcode.net/res/224/files/html/ch11/11-02.html

A few things slow down the theoretical 3 cycles per 32-bit transfer that a 386DX should be able to do:

- My guess is that 1 WS memory was not that common, but 2-3 WS was.
- DRAM refresh seems to steal cycles. I do not know how many, or exactly when.
- My other guess (based on Abrash) is that the number of WS varies with this: "which interleaved bank and/or RAM column was accessed last."

I think some more knowledge of the DRAM architecture is required to fully understand this...

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 6 of 11, by vladstamate

Posted on 2017-10-13, 17:22

SarahWalker wrote:
For what it's worth, the only 386 cache design I've looked at is the one for the OPTi495SX. That has (with 32k cache and 8k tag) the following characteristics, at 40 MHz, with default timings as provided by AMIBIOS

Also, thank you Sarah for that data. It is very useful!

Reply 7 of 11, by SarahWalker

Posted on 2017-10-13, 18:20

vladstamate wrote:
I think it is more complicated than that. Here is an excerpt from Michael Abrash's book: http://www.phatcode.net/res/224/files/html/ch11/11-02.html

A few things slow down the theoretical 3 cycles per 32-bit transfer that a 386DX should be able to do:

- My guess is that 1 WS memory was not that common, but 2-3 WS was.
- DRAM refresh seems to steal cycles. I do not know how many, or exactly when.
- My other guess (based on Abrash) is that the number of WS varies with this: "which interleaved bank and/or RAM column was accessed last."

I think some more knowledge of the DRAM architecture is required to fully understand this...

It all gets very messy, particularly as CPU clock speeds increased!

I've just dug out the PS/2 Model 70 reference manual. This is a 386DX system with 16, 20 and 25 MHz versions, with varying memory systems across the three. The 16 MHz version uses fast page mode RAM with no cache, with the performance below:

Memory Read (Page Hit) - 0 wait states
Memory Read (Page Miss) - 2 wait states
Memory Write (Page Hit) - 1 wait state
Memory Write (Page Miss) - 2 wait states

They don't state what a page is. On a 1 MB system with a 32-bit data bus, memory will probably be implemented as 256k x 32, meaning 18 address bits. Split evenly into row and column addresses, that gives a 9-bit column address; adding 2 bits to account for the 32-bit bus gives 11 bits, i.e. a page size of 2 KB.
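
The page-size estimate above can be written out as a short calculation. This is only a sketch of that reasoning: the function name is made up, and the even row/column split is the same assumption the estimate itself relies on, not something the manual states.

```python
def page_size_bytes(words, bus_width_bits):
    """Estimate the DRAM page size for a bank holding `words` bus-wide locations."""
    address_bits = (words - 1).bit_length()        # 256k locations -> 18 bits
    column_bits = address_bits // 2                # assumes an even row/column split
    byte_select_bits = (bus_width_bits // 8 - 1).bit_length()   # 32-bit bus -> 2 bits
    return 1 << (column_bits + byte_select_bits)


print(page_size_bytes(256 * 1024, 32))   # 2048
```

Under that assumption, any two accesses whose addresses differ only in the low 11 bits fall in the same page and would get the page-hit timing.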

The 20 MHz version uses the same timings, but presumably needs faster DRAM to achieve them.

The 25 MHz version adds a 64 KB cache, with the timings below:

Memory Read (Cache Hit) - 0 wait states
Memory Write (Cache Hit) - 0 wait states
Memory Read (Cache Miss, Page Hit) - 0-2 wait states
Memory Write (Cache Miss, Page Hit) - 1 wait state
Memory Read (Cache Miss, Page Miss) - 3-5 wait states
Memory Write (Cache Miss, Page Miss) - 0 wait states

The manual states that writes are buffered on the motherboard, hence the odd write timings. I suspect my 386DX/40 board is doing something similar. It doesn't say what the variations in the read timings are caused by, which is extremely helpful from an emulation perspective.

Refresh for all three is given as performed every 15.1us, with delays from 500 to 600 ns - this would work out as roughly 5-7 cycles.

Reply 8 of 11, by SarahWalker

Posted on 2017-10-13, 18:27

Another example would be a board I have lying around, an ECS 386/32. This is a 386DX/20 board using a C&T CS8230 chipset, which implements interleaved memory. From what I can tell, this gives 1 W/S in normal operation, but if two identical banks of memory are available then sequential accesses will drop to 0 W/S. I can't say how this performs in practice as I don't have enough SIPPs to test it!

On the other end of the performance scale are a couple of 386SX boards, a 386SX/25 which runs at 2 W/S and a 386SX/40 which runs at 5 W/S. Once you account for the clock speeds this gives both boards roughly the same memory speed. Unsurprisingly there's very little performance difference between the two! I believe this is typical of most 386SX boards - there are a few cached boards that would be noticeably faster but most will run fairly basic DRAM controllers with fairly conservative timings.
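
To make the "roughly the same memory speed" point concrete, here is a quick calculation. It assumes the usual 2-clock base bus cycle for the 386 plus the quoted wait states; the function name is just for illustration.

```python
def access_time_ns(clock_mhz, wait_states, base_cycles=2):
    """Time for one memory access, assuming a 2-clock bus cycle plus wait states."""
    cycle_ns = 1000 / clock_mhz
    return (base_cycles + wait_states) * cycle_ns


print(access_time_ns(25, 2))   # 160.0 ns for the 386SX/25 at 2 W/S
print(access_time_ns(40, 5))   # 175.0 ns for the 386SX/40 at 5 W/S
```

So under these assumptions the 40 MHz board actually spends slightly longer on each memory access than the 25 MHz one, and only gains on work that stays inside the CPU.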

Reply 9 of 11, by vladstamate

Posted on 2017-10-13, 18:44

SarahWalker wrote:

Refresh for all three is given as performed every 15.1us, with delays from 500 to 600 ns - this would work out as roughly 5-7 cycles.

Good information. Some back of the napkin calculations (if I understand this correctly):

A cycle for a 386 at 40 MHz would last 25 ns, or 62.5 ns for a 386 at 16 MHz.

If rounded to 15 us, for a 386 at 40 MHz that would mean it needs a DRAM refresh roughly every 600 cycles. At 500 ns per refresh that is 20 cycles, during which we cannot access some (or all) of the memory.

For a 386 at 16 MHz the refresh would last about 8 cycles and happen roughly every 240 cycles.
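
The same napkin arithmetic as a small script, using the unrounded cycle time. The 15 us interval and 500 ns stall are the rounded figures from the posts above; the function name is made up for illustration.

```python
def refresh_cost(clock_mhz, interval_us=15.0, stall_ns=500.0):
    """Return (cycles between refreshes, cycles per refresh, fraction of time lost)."""
    cycle_ns = 1000 / clock_mhz
    interval_cycles = interval_us * 1000 / cycle_ns
    stall_cycles = stall_ns / cycle_ns
    return interval_cycles, stall_cycles, stall_cycles / interval_cycles


for mhz in (16, 40):
    interval, stall, fraction = refresh_cost(mhz)
    print(f"{mhz} MHz: refresh every {interval:.0f} cycles, "
          f"lasting {stall:.0f} cycles ({fraction:.1%} of the time)")
```

Because both figures are fixed in nanoseconds, the fraction of time spent on refresh (about 3.3%) does not depend on the clock speed; a faster CPU simply loses more cycles per refresh. That is a worst case, since it assumes every refresh actually stalls the CPU.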

Reply 10 of 11, by SarahWalker

Posted on 2017-10-13, 18:55

That sounds more or less correct - and given the sheer number of 386 motherboard designs that exist, 'more or less' is about as accurate as you're going to get!

Note though, that on cached systems refresh won't stall the CPU unless it is actively accessing DRAM at that point. If it's hitting the cache at that point then refresh will be unnoticeable.

Also note that some (later) designs may be implementing hidden refresh, in which case the refresh will be unnoticeable even during a DRAM access.

Reply 11 of 11, by SarahWalker

Posted on 2017-10-13, 18:58

superfury wrote:
I'd assume a cached 32-bit block of data can be read in 1 cycle? So any byte, word or dword that falls within the range of a cache entry (e.g. addresses 0x12345678-0x1234567B) gets updated by any write to that range, and reads after such a write come directly from the cache instead of main memory? That would be a problem, though, if something is written without having been read first (e.g. writing to address 0x12345678 without having read 0x12345679-0x1234567B). I'm not too sure about that. Maybe there's some kind of 4-bit bitmask that lets individual bytes be written back, instead of the entire dword being written to or read from memory?

No, in general most (all?) motherboard-based cache designs will ignore writes to addresses that aren't already in the cache - the write will go straight to main memory.

CPUs starting from the Pentium Pro generation implement 'write-allocate', where a write to an address not already in cache will cause the relevant cache line to be read in from memory. That's not really relevant for the 386 generation though.
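
The difference between the two write-miss policies can be sketched in a few lines. This is a deliberately simplified illustration, not a model of any particular chipset or CPU: the function name and return strings are made up, and the cache is reduced to a set of line addresses.

```python
def handle_write(cached_lines, line_addr, write_allocate):
    """Say where a write lands; `cached_lines` is a set of cached line addresses."""
    if line_addr in cached_lines:
        return "cache"                     # write hit: update the cached copy
    if write_allocate:
        cached_lines.add(line_addr)        # read the line in from memory first
        return "cache (after line fill)"
    return "memory"                        # no allocation: the cache is left untouched


board_cache = {0x100}
print(handle_write(board_cache, 0x200, write_allocate=False))   # memory
print(0x200 in board_cache)                                     # False

cpu_cache = {0x100}
print(handle_write(cpu_cache, 0x200, write_allocate=True))      # cache (after line fill)
print(0x200 in cpu_cache)                                       # True
```

Without write-allocate, the partially written entry that the quoted question worries about never exists: a line only enters the cache through a read, so it is always complete.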
