CharlieFoxtrot wrote on 2026-02-09, 17:41:
And compared to 386DX the performance hit was roughly 50%, sometimes a bit less depending how memory and IO intensive the software was.
This number seems quite high if you run 16-bit software. As the 386 core does not have an internal cache, it performs every data memory cycle exactly as instructed by application software; that is, even the 386DX does not perform a data memory cycle using 32 bits at once unless software instructs it to. 16-bit software mostly does not instruct a 386DX processor to perform 32-bit cycles. There are a couple of 16-bit processor instructions that do access 32 bits of data at once (like indirect FAR jumps and calls, LDS, LES and LSS), but these instructions are typically not used often enough for the 32-bit memory bus of the 386DX to make a notable difference.
There are some advantages of the 32-bit bus, though. I deliberately wrote "data memory cycles" in the previous paragraph, because the prefetch queue of the 386DX is four 32-bit words long and is refilled using 32-bit cycles. Instruction fetch bandwidth is thus twice as high on a 386DX system as on a 386SX system. Furthermore, if EMM386 or another virtual memory manager is active (without doubt, this is one of the main selling points even of the 386SX), the processor needs to fetch memory mapping information from the page table. The page table entries are 32 bits wide and aligned, so they can be fetched in 1 bus cycle on the 386DX, but require 2 bus cycles on the 386SX.
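To put numbers on the two paragraphs above, here is a rough sketch of how many bus cycles a single access needs on a 16-bit versus a 32-bit data bus. The function name `bus_cycles` is my own, and the model is deliberately simplified: it only counts how many bus-width-aligned units an access touches and ignores wait states, address pipelining and everything else.

```python
def bus_cycles(addr: int, size: int, bus_bytes: int) -> int:
    """Count the bus cycles for one access of `size` bytes at `addr`.

    Simplified assumption: each bus cycle transfers one aligned unit of
    `bus_bytes` bytes (2 for the 386SX, 4 for the 386DX), so the cycle
    count is the number of aligned units the access touches.
    """
    first_unit = addr // bus_bytes
    last_unit = (addr + size - 1) // bus_bytes
    return last_unit - first_unit + 1


SX, DX = 2, 4  # data bus width in bytes

# Aligned 16-bit data access, typical for 16-bit software: no DX advantage.
word_sx = bus_cycles(0x1000, 2, SX)
word_dx = bus_cycles(0x1000, 2, DX)

# Aligned 32-bit access (page table entry, prefetch refill): DX needs half.
dword_sx = bus_cycles(0x1000, 4, SX)
dword_dx = bus_cycles(0x1000, 4, DX)

print(f"16-bit access: SX={word_sx} DX={word_dx}")
print(f"32-bit access: SX={dword_sx} DX={dword_dx}")
```

In this model the DX only pulls ahead when the access itself is wider than 16 bits, which is the point about 16-bit data accesses versus instruction prefetch and page table fetches.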
There are other processor-initiated operations that make use of the 32-bit bus, like fetching interrupt vectors or hardware task switching. The performance of these operations should not dominate processor performance in a sensibly designed system (and if your interrupt rate is high enough for this to matter, you had better not use EMM386 at all; a 286 in real mode will likely beat a 386DX with EMM386 at handling interrupts by a big margin).
I do believe, though, that a typical 386DX-33 system back in the day was 50% faster than a similarly clocked 386SX-33 system. The key point is that the 386DX platform was higher end, and most 386DX mainboards provided cache, whereas many 386SX mainboards were budget boards without cache. As the 386 bus protocol requires 2 clocks per bus cycle in the best case (as does the 286 protocol), running a 386 at 33MHz with 0WS requires a data transfer every 60ns. While this rate is achievable as long as accesses stay within one DRAM page, uncached 386 systems typically didn't run at 0WS back in the day.
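The 60ns figure follows directly from the clock rate. As a quick sketch (the helper names are mine, and the model assumes exactly 2 clocks per bus cycle plus one extra clock per wait state, with nothing else going on):

```python
def bus_cycle_ns(cpu_mhz: float, wait_states: int = 0) -> float:
    """Duration of one bus cycle in nanoseconds.

    Assumes 2 CPU clocks per bus cycle in the best case and one
    additional clock for every wait state.
    """
    clock_ns = 1000.0 / cpu_mhz
    return (2 + wait_states) * clock_ns


def peak_mb_per_s(bus_bytes: int, cpu_mhz: float, wait_states: int = 0) -> float:
    """Theoretical peak transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return bus_bytes * 1000.0 / bus_cycle_ns(cpu_mhz, wait_states)


MHZ = 100.0 / 3.0  # nominal "33MHz" part, i.e. 33.33MHz

for ws in (0, 1, 2):
    print(
        f"{ws}WS: {bus_cycle_ns(MHZ, ws):.0f}ns per cycle, "
        f"16-bit bus {peak_mb_per_s(2, MHZ, ws):.1f} MB/s, "
        f"32-bit bus {peak_mb_per_s(4, MHZ, ws):.1f} MB/s"
    )
```

Each wait state adds another 30ns to every cycle, so an uncached board that needs even one wait state loses a third of its peak bus bandwidth compared to 0WS.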
The importance of the cache is confirmed by a recent video by Adrian Black on Adrian's Digital Basement. In that video, Adrian showed a cached 386SX-40 that achieved 61MHz in Landmark 6.0, which is way higher than the 39 reported in this thread.