VOGONS


First post, by noshutdown

test system:
amd386dx-40 cpu
opti 82c495 chipset
32kb cache 2-1-1-1 in bios
8mb dram 1wait

cachechk reports:
cache speed: 40us/KB
cache missed dram speed: 110us/KB
uncached dram speed: 67us/KB

each clock is 25ns on a 386dx-40, so these translate into:
cache access: ~1600 clocks/KB
cache missed dram access: ~4400 clocks/KB
uncached dram access: ~2680 clocks/KB

since each read/write operation transfers 4 bytes, it takes ~250 operations to transfer 1KB of data. therefore the average clock count for each read/write operation is:
cache: ~6.4 clocks per read
cache missed dram: ~17.6 clocks per read
uncached dram: ~10.7 clocks per read

this seems right because if i change cache timing to 3-2-2-2 in bios, cache speed and cache missed dram speed change to 46us/KB and 117us/KB respectively.
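
for reference, here's the conversion spelled out as a tiny C program (numbers copied from above; the 250 reads/KB figure is the same assumption as in my math, nothing is measured by this code itself):

#include <stdio.h>

int main(void)
{
    const double ns_per_clock = 25.0;   /* 386dx-40: 1 / 40MHz */
    const double reads_per_kb = 250.0;  /* 4-byte dword reads per KB */
    const double us_per_kb[]  = { 40.0, 110.0, 67.0 };
    const char  *label[]      = { "cache", "cache missed dram", "uncached dram" };

    for (int i = 0; i < 3; i++) {
        double clocks_per_kb   = us_per_kb[i] * 1000.0 / ns_per_clock;
        double clocks_per_read = clocks_per_kb / reads_per_kb;
        printf("%-18s %6.0f clocks/KB  %5.1f clocks/read\n",
               label[i], clocks_per_kb, clocks_per_read);
    }
    return 0;
}

it prints 1600/4400/2680 clocks/KB and 6.4/17.6/10.7 clocks per read, matching the numbers above.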

Reply 1 of 5, by Deunan

As you can see, the OPTi 495 chipset is not exactly all that great.

The values you got are pretty much identical to mine. Cache access is 39-40us, cache miss is 109-110us. If you consider that non-cached access to RAM is some 67us in your case, then you get 40+67 ~= 110. Which means cache access always costs 40, and on a miss you pay the full RAM access on top of that. That's poor design; the chipset should be doing both in parallel, so that a miss would only need an extra 20-30 to get the data from RAM. I guess the problem was how to abort the RAM access on a cache hit without it becoming a latency penalty for the next access if that one is a miss.
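
To make the serial-vs-parallel point concrete, here's a throwaway C model (the us/KB figures are the measured ones from above; the "overlapped" case is how a better chipset might behave, not how the 495 actually works):

#include <stdio.h>

/* Serial: the tag lookup completes first, then the DRAM access starts
 * on a miss. This matches the measured 40 + 67 ~= 110 us/KB. */
static double serial_miss(double t_cache, double t_dram)
{
    return t_cache + t_dram;
}

/* Overlapped: the DRAM access starts speculatively alongside the tag
 * lookup, so a miss costs only the DRAM time (abort overhead ignored). */
static double overlapped_miss(double t_cache, double t_dram)
{
    return (t_dram > t_cache) ? t_dram : t_cache;
}

int main(void)
{
    printf("serial miss:     %.0f us/KB\n", serial_miss(40.0, 67.0));
    printf("overlapped miss: %.0f us/KB\n", overlapped_miss(40.0, 67.0));
    return 0;
}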

Fun fact: the 386DX can do a memory read/write in 2 cycles, so in theory the CPU itself could do 80MB/s at 40MHz if only the chipset/memory were fast enough. As it is, not even the cache gets close to half that value.
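
The 80MB/s figure is just this arithmetic:

#include <stdio.h>

int main(void)
{
    const double clock_hz       = 40e6; /* 386DX-40 */
    const double cycles_per_bus = 2.0;  /* best-case bus cycle */
    const double bytes_per_bus  = 4.0;  /* 32-bit data bus */
    printf("theoretical peak: %.0f MB/s\n",
           clock_hz / cycles_per_bus * bytes_per_bus / 1e6);
    return 0;
}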

Reply 2 of 5, by bakemono

386DX takes 6 cycles to do one iteration of a REP LODSD loop. If cache timing is 2-1-1-1 then the first read must take 7 cycles, subsequent reads have no wait and take 6. Maybe a little bit of time is also lost to interrupts. So average is 6.4 cycles.

Using 3-2-2-2 means one additional cycle for each read. Average becomes 7.3~7.4
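
For what it's worth, a loop in the same spirit would look something like this (a sketch, not CACHECHK's actual code; GCC inline assembly, needs a 32-bit x86 target):

#include <stdint.h>

/* Read `dwords` 32-bit values starting at `buf` with one REP LODSD.
 * On a 386 each iteration takes 6 cycles when memory keeps up; time
 * the call externally (PIT, or RDTSC on later CPUs). */
static void rep_lodsd(const uint32_t *buf, uint32_t dwords)
{
    uint32_t dummy;
    __asm__ volatile (
        "cld\n\t"
        "rep lodsl"                 /* AT&T spelling of REP LODSD */
        : "=a" (dummy), "+S" (buf), "+c" (dwords)
        :
        : "cc", "memory");
    (void)dummy;
}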

10.7 cycles for uncached DRAM. Not very fast, is it? Perhaps the memory controller does not use fast page mode access, and instead incurs 4 waits for each read, with additional time lost to DRAM refresh. I can't imagine that throughput would be this low with FPM DRAM.
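
Rough bounds, treating the 4-wait guess as given (everything here is an assumption except the measured 10.7):

#include <stdio.h>

int main(void)
{
    const double core = 6.0;        /* REP LODSD iteration on a 386 */
    const double mem  = 2.0 + 4.0;  /* bus cycle + 4 assumed wait states */
    printf("fully overlapped: %.1f clocks/read\n", core > mem ? core : mem);
    printf("fully serialized: %.1f clocks/read\n", core + mem);
    printf("measured:         10.7 clocks/read\n");
    return 0;
}

The measured value sits near the serialized end, which suggests little of the memory access is being hidden (or that the access itself takes longer than the 6 cycles assumed here).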

Cache miss is even worse. I assume that the CPU is waiting while data is transferred from DRAM to cache, and then finally to the CPU. Either that, or the cache is write-back and some data needs to be written back to DRAM first (probably not on a chipset which doesn't even do FPM).

Reply 3 of 5, by Deunan

For a 386 CPU on the 495 chipset the only cache latency number that matters is the first one. The rest are only for burst transfers on a 486. In other words it doesn't matter if it's 2-1-1-1 or 2-2-2-2, the latency is 2 cycles. Same for the 3-x settings, it's always 3.

You can't just add those cycles to the LODS latency though; the bus unit on the 386 is capable of pipelined transfers, independent of the rest of the core. Since each iteration takes 6 cycles and a bus transaction takes only 2, even including external cache latency it'll be less than 6, so the limiting factor will be the LODS dispatch rate. It'll probably end up a bit higher than 6 due to other factors like RAM refresh cycles, which kick in every 15us or so.
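
The overlap argument in miniature (just restating the dispatch-limited case, with the latencies as parameters):

#include <stdio.h>

/* Clocks per REP LODSD iteration when the 386 bus unit runs ahead of
 * the core: the slower of dispatch rate and bus transaction wins. */
static double iter_clocks(double dispatch, double bus)
{
    return dispatch > bus ? dispatch : bus;
}

int main(void)
{
    printf("2-cycle cache: %.1f clocks/iter\n", iter_clocks(6.0, 2.0));
    printf("3-cycle cache: %.1f clocks/iter\n", iter_clocks(6.0, 3.0));
    /* Both print 6.0: the dispatch rate hides the extra cache latency.
     * Refresh and similar overhead push the measured average above 6. */
    return 0;
}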

Also, I think the cache on the 495 is WT (write-through), except maybe the X chipsets, but I really can't remember the details right now.

Another fun fact: REP LODS is kinda useless except for benchmarking, and even then it has poor latency. REP MOVS can do each iteration in 4 cycles, so 2 for read and 2 for write, the perfect throughput. Obviously it requires aligned addresses and those other factors I mentioned above will still matter.
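
A MOVS equivalent of the earlier sketch, for completeness (again GCC inline assembly for a 32-bit x86 target, not anyone's actual benchmark code):

#include <stdint.h>

/* Copy `dwords` 32-bit values with one REP MOVSD. On a 386 each
 * iteration can retire in 4 clocks (2 read + 2 write) given aligned,
 * zero-wait memory. */
static void rep_movsd(uint32_t *dst, const uint32_t *src, uint32_t dwords)
{
    __asm__ volatile (
        "cld\n\t"
        "rep movsl"                 /* AT&T spelling of REP MOVSD */
        : "+D" (dst), "+S" (src), "+c" (dwords)
        :
        : "cc", "memory");
}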

Reply 4 of 5, by bakemono

That's some interesting info. Since the normal memory access on a 386 is 2 cycles, the 2-cycle cache is truly zero-wait. The 3-cycle cache would impose one wait; however, in the case of LODSD that extra latency should overlap with the CPU's internal latency and have no effect? If so, then the real latency of accessing DRAM must be even longer than it looks, if it stretches the 6-cycle instruction to 10+ cycles. Still kind of baffling.

I have a couple of 386 benchmark results recorded from back in the day and they are similar. 386DX-33 got 21MB/s for cache hit and 9.6MB/s for cache miss. 386SX-16 got 7.2MB/s for cache hit and 3.6MB/s for cache miss. Does a 386SX add two cycles for the additional memory access or can this also overlap?
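
Those MB/s figures convert to per-dword cycle counts that line up with the numbers earlier in the thread (quick conversion, assuming 4-byte reads):

#include <stdio.h>

/* Clocks per 4-byte read implied by a throughput measurement. */
static double clocks_per_dword(double cpu_mhz, double mb_per_s)
{
    return (cpu_mhz * 1e6) / (mb_per_s * 1e6 / 4.0);
}

int main(void)
{
    printf("386DX-33 hit:  %4.1f clk/dword\n", clocks_per_dword(33.0, 21.0));
    printf("386DX-33 miss: %4.1f clk/dword\n", clocks_per_dword(33.0, 9.6));
    printf("386SX-16 hit:  %4.1f clk/dword\n", clocks_per_dword(16.0, 7.2));
    printf("386SX-16 miss: %4.1f clk/dword\n", clocks_per_dword(16.0, 3.6));
    return 0;
}

The DX-33 hit case comes out at ~6.3 clocks per dword, right next to the 6.4 measured on the DX-40.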

Also interesting that MOVSD can be quite fast on a 386 (depending on memory subsystem). In practice, it's not very fast on anything else until you get to Athlon 64 with the integrated memory controller.

Reply 5 of 5, by Deunan

bakemono wrote:

Since the normal memory access on a 386 is 2 cycles, then the 2-cycle cache is truly zero-wait. The 3-cycle cache would impose one wait

I would not treat the numbers presented by the BIOS as absolute truth. The actual number of cycles to access memory on a 386 is probably something like 2 + Lconst + Lset, where 2 is the CPU limit, Lset is what you set in the BIOS, and Lconst is a constant value that's simply required by the chipset/mobo to operate correctly. And it can be non-zero. It's also not documented anywhere: why admit it's not zero, when the typical end user wouldn't need that piece of info anyway?
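
The model is trivial to tabulate (Lconst is unknown by design; 0 and 1 are shown purely for illustration):

#include <stdio.h>

int main(void)
{
    /* cycles = 2 (CPU limit) + Lconst (hidden chipset cost) + Lset (BIOS) */
    for (int lconst = 0; lconst <= 1; lconst++)
        for (int lset = 0; lset <= 1; lset++)
            printf("Lconst=%d Lset=%d -> %d clocks\n",
                   lconst, lset, 2 + lconst + lset);
    return 0;
}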

There's also the question of how LODS is microcoded, that is, how far into its 6-cycle execution the bus unit is actually told all it needs to perform the transfer. Since it has to load the accumulator it must wait for the data to arrive, so it will stall if the memory access isn't completed by that point. Data dependency (even if false, as in the case of REP LODS) is a huge problem in instruction pipelining. Meaning you might not be getting the full 6 cycles to mask the memory latency here. I no longer remember such details and I'm too lazy to look it up right now.
Counting cycles stopped being easy with the 386 and got even more difficult with each successive x86 core. For example, hardly anyone even considers the latency penalty of TLB misses.

bakemono wrote:

Does a 386SX add two cycles for the additional memory access or can this also overlap?

The 386SX bus unit will internally split a dword transfer into 2 back-to-back word transfers, so it's 4 cycles at minimum, but for each of those, since it's not a burst transfer of any kind, you have to generate an address for the chipset to use. In other words the mobo doesn't know if it's a split dword or 2 fast words being read from different memory locations. So unless the chipset internally tracks the last and current address, it might have to go through all the motions of random access, and each transfer will incur its own latency.
If, say, LODS actually gives you only 4 cycles to mask the access latency, then any wait states will become visible on the SX.
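
As a back-of-envelope model of that split (the wait-state count is a stand-in, not a measured value):

#include <stdio.h>

int main(void)
{
    const double word_cycle = 2.0;  /* minimum bus cycle per 16-bit transfer */
    const double waits      = 2.0;  /* assumed per-access wait states */
    /* SX: two word transfers, each paying full latency if the chipset
     * treats every address as random. DX: one zero-wait dword cycle. */
    printf("SX dword read: %.0f clocks, DX dword read: %.0f clocks\n",
           2.0 * (word_cycle + waits), 2.0);
    return 0;
}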

The reason for MOVS being so fast, on paper, is that it doesn't need to feed the data it reads to the core. So basically it's just the core supplying the bus unit with addresses until the transfer ends. There's no dependency other than the previous memory transfer having ended, so it's completely bus-bound.

In general, cache on a 386SX is more trouble than it's worth - sort of an obvious conclusion considering it's supposed to be a cheaper platform than the full-fat 386DX. That's my opinion; I've never owned a cache-equipped SX system, but I've seen memory results from CACHECHK and I've compared them with mine. A reasonable SX mobo will have true memory latency 2-3x better than the cached system, at least in the case of OPTi chipsets. So while the cache does end up a bit faster than RAM, true, the actual RAM latency becomes very bad. It might make more sense if you plan to have a 486SLC type of CPU there from the start and can therefore optimize the cache for bursts, but not otherwise.

EDIT: To complicate matters further, the 386 bus unit has a NA# (next address) signal, which the chipset can assert to get the transfer address/size 1 cycle earlier than it would normally be presented. This can speed up memory access and would give the chipset more time for any internal comparison to determine if it's a sequential or random location, thus allowing latency optimizations / bank switching / cache tag prefetch, etc. But using NA# properly is not that easy to pull off, and you won't always get the opportunity - it depends on whether the bus unit already knows the next address in advance or not. The mobos I've dealt with do not seem to use this feature at all.