Jo22 wrote on 2024-09-25, 00:31:
What about self-modifying code? As used in demoscene?
Isn't a fast 386, which doesn't have to rely on an L1 cache, less affected by cache misses here?
Self-modifying code might change just a few bytes ahead (for speed) or the entire sequence of code/data (decryption or unpacking). But even in the latter case it's not much different from the 486 encountering the code for the first time, or after a massive L1 cache spill. Yes, there will be a load penalty, but after each line load the 486 can feed the decoder with a full cache line at once (so 16 bytes, not 4 like on the 386), and it both decodes faster and executes many basic ALU instructions in 1 cycle. The minimum execute time on the 386 is 2 cycles, since that's how fast a single bus transaction can be, so there is no point in having a faster core. It's even held back by the decoder, so that would have to be upgraded first.
So in general the 486 can overcome its own limitation of stalling during each cache line load and in the end be faster than the 386 even in worst-case scenarios, and at a lower clock. Instruction latency is less predictable unless you also pay attention to memory alignment, but that only really affects loops, and even then the data will be in L1 after the first load, so the next pass will more than make up for it, and every pass after that is a pure speed benefit.
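To put rough numbers on the loop argument, here is a toy cycle model in Python. It's only a sketch: the 2-cycle and 1-cycle per-instruction figures are the best cases mentioned above, while the line-fill stall (`fill_penalty`), the example loop and the function names are my own assumptions, not measured or datasheet values. The 386 side also ignores its own fetch cost, so it is a best case for the 386.

```python
import math

LINE_BYTES = 16  # 486 L1 cache line size, as discussed above


def cycles_386(instructions, passes, cycles_per_instr=2):
    """Best case for the 386: at least 2 cycles per instruction, on every pass."""
    return instructions * cycles_per_instr * passes


def cycles_486(instructions, code_bytes, passes, fill_penalty=8, cycles_per_instr=1):
    """First pass pays one stall per 16-byte line; later passes run from L1.

    fill_penalty is an ASSUMED stall (in cycles) per line fill, purely illustrative.
    """
    lines = math.ceil(code_bytes / LINE_BYTES)
    return lines * fill_penalty + instructions * cycles_per_instr * passes


if __name__ == "__main__":
    # Hypothetical loop: 12 simple ALU instructions occupying 32 bytes of code.
    for passes in (1, 2, 10):
        print(passes, cycles_386(12, passes), cycles_486(12, 32, passes))
```

With these assumed numbers the line fills are paid once, so the 486 total grows by 12 cycles per extra pass against 24 for the 386. Change `fill_penalty` to see how quickly the first-pass cost is recovered.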
And frankly the 386 is not all that happy dealing with self-modifying code either. There are 2 queues in the instruction decoder: the bytes read and the decoded instructions (in the rare cases where you encounter a slow instruction like mul or div, which allows the decoder to move ahead of the current instruction pointer). Both need to be flushed if your modified code is just ahead and you want it properly reloaded. This is done with a jump instruction, and IIRC even the shortest jump takes 7 cycles on the 386, plus whatever the decoder needs to fully load and process the next instruction, 4 bytes at a time at best. The 486 needs only 3 cycles, plus the L1 fetch and decoding, but as I mentioned above it's not as bad as it looks.
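The flush cost can be sketched the same way. The 7-cycle and 3-cycle jump figures and the 4-bytes-per-fetch limit come from the paragraph above (the 7 is from memory, as noted). Everything else is an assumption of mine for illustration: that each 386 refetch costs a full 2-cycle bus transaction with no overlap, that the target is aligned, and the placeholder `l1_fetch_and_decode` value for the 486.

```python
import math


def flush_cost_386(next_instr_bytes, jump_cycles=7, bus_cycles=2, bytes_per_fetch=4):
    """Short jump to flush both queues, then refetch the next instruction.

    ASSUMES an aligned target and one non-overlapped 2-cycle bus
    transaction per 4 bytes fetched.
    """
    fetches = math.ceil(next_instr_bytes / bytes_per_fetch)
    return jump_cycles + fetches * bus_cycles


def flush_cost_486(jump_cycles=3, l1_fetch_and_decode=2):
    """Jump plus L1 fetch and decode; l1_fetch_and_decode is an ASSUMED placeholder."""
    return jump_cycles + l1_fetch_and_decode


if __name__ == "__main__":
    # Hypothetical 6-byte instruction right after the jump target.
    print(flush_cost_386(6), flush_cost_486())
```

The 486 figure assumes the target line is already in L1. If it isn't, the line fill from the previous sketch goes on top.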
Self-modifying or heavily branching code is, however, a case where the 386 would benefit from shorter cache lines.