bakemono wrote:Since the normal memory access on a 386 is 2 cycles, then the 2-cycle cache is truly zero-wait. The 3-cycle cache would impose one wait
I would not treat the numbers presented by the BIOS as absolute truth. The actual number of cycles to access memory on a 386 is probably something like 2 + Lconst + Lset, where 2 is the CPU limit, Lset is what you set in the BIOS, and Lconst is a constant value that's simply required by the chipset/mobo to operate correctly. And it can be non-zero. It's also not documented anywhere, because why admit it's not zero, and the typical end user wouldn't need that piece of info anyway.
There's also the question of how LODS is microcoded, that is, how far into its 6-cycle execution the bus unit is actually told everything it needs to perform the transfer. Since LODS has to load the accumulator it must wait for the data to arrive, so it will stall if the memory access isn't completed by that point. Data dependency (even if false in the case of REP LODS) is a huge problem in instruction pipelining. Meaning you might not be getting the full 6 cycles to mask the memory latency here. I no longer remember such details and I'm too lazy to look it up right now.
Counting cycles stopped being easy with the 386 and got even more difficult with each successive x86 core. For example, hardly anyone even considers the latency penalty of TLB misses.
bakemono wrote:Does a 386SX add two cycles for the additional memory access or can this also overlap?
The 386SX bus unit will internally split a dword transfer into 2 back-to-back word transfers, so it's 4 cycles at minimum, but for each of those, since it's not a burst transfer of any kind, an address has to be generated for the chipset to use. In other words the mobo doesn't know if it's a split dword or 2 fast words being read from different memory locations. So unless the chipset internally tracks the last and current address, it might have to go through all the motions of random access, and each transfer will incur its own latency.
If, say, LODS actually gives you only 4 cycles to mask the access latency, then any wait states will become visible on the SX.
The reason for MOVS being so fast, on paper, is that it doesn't need to feed the data it reads to the core. So basically it's just the core supplying the bus unit with addresses until the transfer ends. There's no dependency other than the previous memory transfer having ended, so it's completely bus-bound.
In general, cache on a 386SX is more trouble than it's worth - sort of an obvious conclusion considering it's supposed to be a cheaper platform than the full-fat 386DX. That's my opinion; I've never owned a cache-equipped SX system, but I've seen memory results from CACHECHK and I've compared them with mine. A reasonable SX mobo will have true memory latency 2-3x better than the cached system, at least in the case of OPTi chipsets. So while the cache does end up a bit faster than RAM, true, the actual RAM latency becomes very bad. It might make more sense if you plan to have a 486SLC type of CPU there from the start and can therefore optimize the cache for bursts, but not otherwise.
EDIT: To complicate matters further, the 386 bus unit has an NA# (next address) signal, which the chipset can assert to get the transfer address/size 1 cycle earlier than it would normally be presented. This can speed up memory access and would allow the chipset more time for any internal comparison to determine if it's a sequential or random location, thus allowing latency optimizations / bank switching / cache tag prefetch, etc. But using NA# properly is not that easy to pull off, and you won't always get that opportunity - it depends on whether the bus unit already knows the address in advance or not. The mobos I've dealt with do not use this feature at all, it seems.