The missing information here seems to be that Pentium Pros were bad at running 16-bit code because of the register renaming that was implemented as part of the Out-of-Order-Execution logic.
The key term here is 'partial register stall'.
That is, register renaming treats all registers, including 'partial' registers (eax being the full register, ax, ah and al the partial ones, for example) as separate internal registers.
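To see why combining is needed, here is a small Python sketch of how the 8- and 16-bit names alias bits of EAX. It is a toy model with hypothetical helper names (write_al, write_ah, write_ax, read_ax), not a description of how the hardware stores anything. After writing AL, the value of EAX depends on both the new byte and the old upper 24 bits, and those are the pieces the renamer has to stitch back together when EAX is later read.

```python
# Toy model of how the x86 8/16-bit register names alias bits of EAX.
MASK32 = 0xFFFFFFFF

def write_al(eax: int, value: int) -> int:
    """Replace bits 0-7 of EAX, keeping the other 24 bits."""
    return (eax & ~0xFF & MASK32) | (value & 0xFF)

def write_ah(eax: int, value: int) -> int:
    """Replace bits 8-15 of EAX, keeping the other 24 bits."""
    return (eax & ~0xFF00 & MASK32) | ((value & 0xFF) << 8)

def write_ax(eax: int, value: int) -> int:
    """Replace bits 0-15 of EAX; the upper 16 bits survive."""
    return (eax & ~0xFFFF & MASK32) | (value & 0xFFFF)

def read_ax(eax: int) -> int:
    """AX is simply the low 16 bits of EAX."""
    return eax & 0xFFFF

eax = 0x12345678
eax = write_al(eax, 0xAA)
print(hex(eax))           # 0x123456aa
eax = write_ah(eax, 0xBB)
print(hex(eax))           # 0x1234bbaa
print(hex(read_ax(eax)))  # 0xbbaa
eax = write_ax(eax, 0xBEEF)
print(hex(eax))           # 0x1234beef
```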
However, the pieces overlap, so when a wider part of the register is read after a narrower part was written (e.g., al was modified, then ax or eax is read), multiple internal registers need to be combined. The read cannot proceed until the earlier partial write has retired and the values can be merged, which stalls the pipeline for several cycles.
Since 16-bit code always uses partial registers, this leads to frequent stalls on the Pentium Pro.
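A rough way to picture when the stall triggers is the Python sketch below. It is a simplified model built on an assumption: every write gets its own rename slot, and a read that needs bits from more than one slot must wait for a merge. It only counts stalls and ignores timing, retirement and everything else about the real pipeline; count_stalls and the trace format are made up for illustration.

```python
"""Toy model of a P6-style partial register stall (not cycle-accurate).

Assumption: each write to a register name gets its own rename slot; a read
that needs bits from more than one slot must wait for a merge (a stall).
"""
from typing import Dict, List, Tuple

# Map each register name to (family, bit mask within the 32-bit register).
REGS: Dict[str, Tuple[str, int]] = {}
for f in "abcd":
    REGS[f"{f}l"] = (f, 0x000000FF)
    REGS[f"{f}h"] = (f, 0x0000FF00)
    REGS[f"{f}x"] = (f, 0x0000FFFF)
    REGS[f"e{f}x"] = (f, 0xFFFFFFFF)

FULL = 0xFFFFFFFF

def count_stalls(trace: List[Tuple[str, str]]) -> int:
    """Trace items are ("w", reg) for a write or ("r", reg) for a read."""
    # Per family: masks of live rename slots, oldest first.
    slots: Dict[str, List[int]] = {f: [FULL] for f in "abcd"}
    stalls = 0
    for op, name in trace:
        fam, mask = REGS[name]
        live = slots[fam]
        if op == "w":
            # Drop slots the new write completely overwrites, then add it.
            live[:] = [m for m in live if m & ~mask] + [mask]
        elif op == "r":
            # The newest slot touching these bits; if it does not cover
            # all of them, the read spans several slots and must merge.
            newest = next(m for m in reversed(live) if m & mask)
            if mask & ~newest:
                stalls += 1
                slots[fam] = [FULL]  # after the merge, one full-width value
        else:
            raise ValueError(f"unknown op {op!r}")
    return stalls

sixteen_bit = [("w", "al"), ("w", "ah"), ("r", "ax")]
mixed = [("w", "ax"), ("r", "eax")]
clean32 = [("w", "eax"), ("r", "eax"), ("r", "al")]
print(count_stalls(sixteen_bit))  # 1
print(count_stalls(mixed))        # 1
print(count_stalls(clean32))      # 0
```

In this model, building AX from AL and AH and then reading AX stalls once, as does reading EAX after writing AX, while a pure 32-bit sequence (including reading AL after writing EAX) never stalls.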
I don't think there's any 'special case' for real mode, because you cannot assume that 32-bit registers go unused in real mode. Code can and will use the full 32-bit registers in real mode as well.
In the Pentium II this was mitigated by adding some extra logic: when a full register is zeroed (usually with xor eax, eax or similar), an internal flag in the register renaming logic records that the register is known to be zero. Because the upper bits are known to be zero, no recombining is required, and the stall can be skipped. Of course this still doesn't help legacy code that never explicitly zeroes its registers. But when you're in 16-bit mode, the registers start out zeroed, and the problem will not occur until you explicitly start using the full 32-bit registers.
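The effect of the zeroing trick can be sketched with a small model that tracks rename slots plus an "upper bits known zero" marker. This is a hypothetical toy model of a single register family: zero_idiom=False stands in for the Pentium Pro and zero_idiom=True for the Pentium II behavior described here. The rule that an ordinary full-width write clears the marker is an assumption for illustration, not documented behavior.

```python
"""Toy model: does `xor eax, eax` avoid the partial register stall?

Hypothetical, simplified model of one register family (EAX/AX/AH/AL).
zero_idiom=False approximates the Pentium Pro, zero_idiom=True the
Pentium II. Not cycle-accurate.
"""
from typing import List, Tuple

MASKS = {"al": 0xFF, "ah": 0xFF00, "ax": 0xFFFF, "eax": 0xFFFFFFFF}
FULL = 0xFFFFFFFF

def count_stalls(trace: List[Tuple[str, str]], zero_idiom: bool) -> int:
    """Ops: ("w", reg) write, ("r", reg) read, ("z", "eax") for xor eax, eax."""
    slots = [FULL]        # masks of live rename slots, oldest first
    known_zero = False    # internal "upper bits are zero" marker
    stalls = 0
    for op, reg in trace:
        mask = MASKS[reg]
        if op == "z":
            slots = [FULL]            # a full-width write on either CPU
            known_zero = zero_idiom   # only the newer core remembers the zeroing
        elif op == "w":
            slots = [m for m in slots if m & ~mask] + [mask]
            if mask == FULL:
                known_zero = False    # assumption: ordinary full write clears it
        elif op == "r":
            newest = next(m for m in reversed(slots) if m & mask)
            if mask & ~newest and not known_zero:
                stalls += 1           # must wait for the pieces to be merged
                slots = [FULL]
        else:
            raise ValueError(f"unknown op {op!r}")
    return stalls

# xor eax, eax / mov al, ... / read eax
zeroed = [("z", "eax"), ("w", "al"), ("r", "eax")]
print(count_stalls(zeroed, zero_idiom=False))  # 1
print(count_stalls(zeroed, zero_idiom=True))   # 0
# Legacy code without explicit zeroing stalls in both models
legacy = [("w", "al"), ("r", "eax")]
print(count_stalls(legacy, zero_idiom=True))   # 1
```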
See also: http://qcd.phys.cmu.edu/QCDcluster/intel/vtun … rtial_Stall.htm
Pentium II also added caching for segment registers to improve 16-bit code performance.
So in short, I expect a Pentium Pro to be quite bad for DOS in general. But I've only used it with Win9x and NT4 myself, so I can't be 100% sure.
The problem is mixing registers of different sizes: either 16-bit with 32-bit, or 8-bit with 16-bit. Legacy code in particular often uses partial registers, because there was no penalty for doing so before the Pentium Pro. You could often optimize things considerably with clever use of partial registers.
Mixing 16-bit OS/BIOS code with 32-bit applications, or vice versa, is a recipe for disaster on the Pentium Pro either way.
Being 32-bit is no guarantee that an application won't use partial registers, though. The only 'good' 32-bit applications for the Pentium II are those compiled with a Pentium II-aware compiler that always inserts the xor reg, reg sequence to avoid stalls. On the Pentium Pro, I believe even that doesn't really help.