VOGONS


First post, by mkarcher

User metadata
Rank l33t

I'm currently working on one of Disruptor's systems with an Intel SR440BX ("Sun River") mainboard. That mainboard has an on-board AGP graphics chip: the nVidia Riva TNT. The top PCI slot shares its IRQ with the graphics chip. IRQ sharing is not supposed to be a problem in PCI systems, because the specification says so. It is well known that in practice, IRQ sharing on the PCI bus sometimes is a problem, especially on older systems. And indeed, as soon as the nVidia TNT / TNT2 Windows 3.11 drivers are installed (search the web for TNT5W311.ZIP, you will find many sources) and the Adaptec 32-bit disk access is configured, Windows 3.11 locks up when you start it. Obviously, this problem can be worked around by not putting the SCSI card in the slot that shares the IRQ with the graphics chip, but if all slots are supposed to be used, and all slots hold cards that can generate interrupts, some card has to share the IRQ with the graphics chip, so I investigated.

It turns out that despite its age, the Windows 386 kernel, at least since Windows 3.1 and possibly even in older versions, is perfectly able to support shared interrupts! Multiple drivers can register interrupt handlers, and as long as all drivers adhere to the "sharable interrupt" protocol, the Windows kernel passes the interrupt to each driver in turn until a driver claims that it handled the interrupt. In that case, interrupt handling is finished. If a second device sharing the IRQ also has a pending interrupt request, the nature of level-triggered IRQs will cause another IRQ to be triggered just as the first IRQ is finished, and that time the first driver will (most likely) not claim the IRQ anymore, so the IRQ is delivered to the next driver. Furthermore, as soon as a "shared interrupt" handler is registered, the Windows kernel also installs a fallback handler that reflects the IRQ to DOS/BIOS drivers that were loaded before Windows was started. So the Windows kernel is fine. The reason for proper interrupt sharing support in Windows 3.1 is most likely that the IBM PS/2 with its MCA bus already was specified to support interrupt sharing.

So the next issue could be that one of the drivers does not implement the sharable interrupt protocol. But it turns out that both the Adaptec driver (FASTSCSI.386) and the nVidia driver (NV4VDD.386) register their interrupts as sharable. A sharable interrupt handler is supposed to indicate that there was an interrupt pending for this handler by clearing the carry flag before returning. If there was no interrupt on the device this handler is meant for, the handler is supposed to set the carry flag instead. This enables the interrupt dispatcher in the Windows kernel to tie the strings in the correct way and deliver interrupts where they belong. I found something funny in the graphics driver, though: It checks whether it is the first driver that registered a handler for the IRQ. This driver is the one that gets asked last by the Windows kernel to handle an interrupt - and if that driver rejects handling (returns with carry set), the Windows kernel will pass it to DOS/BIOS. In case the nVidia driver is in this position, though, it does not reject the interrupt, but it asks the interrupt manager to reflect the interrupt to DOS/BIOS code. This should be entirely unneeded.
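The protocol described so far can be sketched as a toy model in Python. This is only an illustration of the claim/decline logic, not how the Windows kernel is implemented: dispatch_irq, make_handler and the handler names are made up for this sketch, and the carry flag is represented as a plain boolean return value.

```python
# Toy model of the sharable-interrupt protocol; all names are hypothetical,
# none of them are real Windows kernel services.

CARRY_CLEAR = False  # handler claims: "this was my interrupt, I handled it"
CARRY_SET = True     # handler declines: "not my interrupt, ask the next one"


def dispatch_irq(handlers, reflect_to_dos):
    """Offer one IRQ to each handler in turn until one claims it.

    handlers is a list of (name, function) pairs in the order in which they
    are asked; each function returns the carry flag it would leave behind.
    If every handler declines, the IRQ is reflected to DOS/BIOS.
    Returns the name of whoever ended up with the interrupt.
    """
    for name, handler in handlers:
        if handler() == CARRY_CLEAR:
            return name          # claimed, interrupt handling is finished
    reflect_to_dos()             # fallback installed alongside shared handlers
    return "DOS/BIOS"


def make_handler(device_has_irq_pending):
    """Build a well-behaved handler for a device with the given IRQ state."""
    def handler():
        return CARRY_CLEAR if device_has_irq_pending else CARRY_SET
    return handler


reflected = []
chain = [("graphics", make_handler(False)), ("scsi", make_handler(True))]
print(dispatch_irq(chain, lambda: reflected.append("irq")))   # scsi

idle = [("graphics", make_handler(False)), ("scsi", make_handler(False))]
print(dispatch_irq(idle, lambda: reflected.append("irq")))    # DOS/BIOS
```

In this model, a handler that returns with carry clear although its device had nothing pending would swallow every interrupt meant for the handlers asked after it.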

So the starting point is: The system works with 32-bit disk access disabled ("WIN /D:F" can do this temporarily). The system works if the Adaptec card is in a different slot. The system crashes if the Adaptec card and the graphics chip share the interrupt with 32-bit disk access enabled.

First, I removed the seemingly unnecessary special handling of the "last in chain" case in the nVidia driver. At that point I did not yet correctly understand the order in which the drivers get called (and I understood some Internet resource to claim a different order than actually occurs), so I assumed that the "directly to DOS" implementation prevents the 32-bit SCSI driver from receiving the IRQ, but allows the 16-bit ASPI/BIOS code to receive it. The result was that neither BIOS-based nor 32-bit disk access worked anymore, as long as the IRQ was shared. So this strange "special handling" actually improves something. I was unable to find out what specifically caused these issues until I used WDEB386 to debug the Windows 3.11 kernel and trace what happens when the graphics driver is called with a SCSI IRQ pending. The graphics driver was not in its special "last in chain" mode, because the SCSI driver registered the IRQ first (thus it will receive it only if the graphics driver doesn't claim it). The graphics driver indeed tried to not claim it, by calling a mode-dependent function that just returned "1" (not handled). Then it used this assembly code to translate the value in EAX to the intended carry flag value:

    STC                     ; Initially set the carry flag
    OR      EAX, EAX        ; Set the zero flag if EAX is zero
    JNZ     leave_carry_set ; jumps if "not zero", i.e. the zero flag is clear
    CLC                     ; EAX was zero, clear the carry flag

I completely missed the bug in this instruction sequence (and I intentionally added comments that do not point out the error, but explain the most likely train of thought of the authors), because the sequence looked plausible. Only when I stepped through this sequence using the debugger did I find that the debugger suddenly reported "NC" (no carry) when the driver was supposed to indicate an unhandled interrupt by returning with the carry flag set. Looking back in the execution trace, I found that after STC, the debugger correctly indicated "CY" (carry set), but as soon as the instruction to test EAX for zero was executed, the carry flag was cleared again. And of course it is! The OR instruction does not just set the zero flag according to the result, but it also unconditionally clears the carry flag. Every x86 assembler programmer should know this. The upside: It only took me half a day to track down this bug, and I learned a lot about Windows 3.1 386 mode IRQ handling. A correct way to implement the mapping is:

    CMP     EAX, 1          ; Set the carry flag if EAX is less than 1 (i.e. it is zero)
    CMC                     ; invert the carry flag

This is shorter than the original code, and thus can be patched in place. Furthermore, I got curious and disabled the "last in chain" special case by NOPping out the conditional jump into the handler for that case. Removing the special case initially broke the 16-bit case as well, but with the carry flag fix in place, both the 16-bit and the 32-bit case work perfectly without it. It seems some programmer at nVidia had the task to solve the "system freeze when the IRQ is shared" problem (which is caused by the lost carry flag) in a situation in which the graphics driver was last-in-chain (e.g. SCSI without 32-bit disk access). Instead of finding the root cause (losing the carry flag), the programmer added the code that manually invokes the 16-bit handler and then intends to return with the carry flag clear. The better way to fix the bug would obviously have been to fix the root cause. So let's do it. Patch these bytes in %windir%\SYSTEM\NV4VDD.386 (241271 bytes, dated 10.02.99, 16:27):

006E64: F9 09 C0 75 01 F8 -> 83 F8 01 F5 90 90
038368: 74 27 -> 90 90

The first patch will correct the carry flag logic. The second patch will disable the hack that papers over the broken carry flag logic.
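To see why the first patch corrects the carry flag logic, the old and new byte sequences can be modelled in a few lines of Python. This is a deliberately simplified sketch that tracks only the carry and zero flags; the function names are made up for illustration, and the decoding of the bytes (F9 = STC, 09 C0 = OR EAX,EAX, 75 01 = JNZ +1, F8 = CLC, 83 F8 01 = CMP EAX,1, F5 = CMC) assumes 32-bit code.

```python
# Simplified model: only the carry flag (CF) and zero flag (ZF) are tracked.

MASK32 = 0xFFFFFFFF


def original_sequence(eax):
    """STC / OR EAX,EAX / JNZ +1 / CLC. Returns the final carry flag."""
    carry = True                      # STC
    result = (eax | eax) & MASK32     # OR EAX,EAX leaves EAX unchanged...
    zero = result == 0                # ...sets ZF from the result...
    carry = False                     # ...and always clears CF
    if zero:                          # JNZ skips CLC when ZF is clear
        carry = False                 # CLC
    return carry


def patched_sequence(eax):
    """CMP EAX,1 / CMC. Returns the final carry flag."""
    carry = (eax & MASK32) < 1        # CMP: CF = borrow of unsigned EAX - 1
    carry = not carry                 # CMC
    return carry


for eax in (0, 1):
    print(eax, original_sequence(eax), patched_sequence(eax))
# prints:
# 0 False False
# 1 False True
```

With EAX = 1 ("not handled"), the original sequence ends with the carry flag clear and so claims the interrupt anyway, while the patched sequence leaves it set.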

Last edited by mkarcher on 2024-01-29, 20:22. Edited 2 times in total.

Reply 2 of 3, by mkarcher

Not that surprisingly, this also applies to the Windows 3.11 driver for the previous generation of graphics chips.

Patch for NV3VDD.386 (247927 bytes, dated 04.09.98, 17:28):

0068FD: F9 09 C0 75 01 F8 -> 83 F8 01 F5 90 90
039CAB: 74 27 -> 90 90
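For anyone who prefers a script over a hex editor, here is a minimal Python sketch that applies the two patch listings above. The offsets, byte values and file sizes are copied from this thread; the function names, the table layout and the ".bak" backup naming are my own assumptions. It refuses to touch a file whose size or original bytes do not match, so a different driver build is left alone.

```python
from pathlib import Path

# (offset, original bytes, replacement bytes), copied from the listings above.
NV4VDD_SIZE = 241271
NV4VDD_PATCHES = [
    (0x006E64, bytes.fromhex("F9 09 C0 75 01 F8"), bytes.fromhex("83 F8 01 F5 90 90")),
    (0x038368, bytes.fromhex("74 27"), bytes.fromhex("90 90")),
]
NV3VDD_SIZE = 247927
NV3VDD_PATCHES = [
    (0x0068FD, bytes.fromhex("F9 09 C0 75 01 F8"), bytes.fromhex("83 F8 01 F5 90 90")),
    (0x039CAB, bytes.fromhex("74 27"), bytes.fromhex("90 90")),
]


def apply_patches(data, patches, expected_size):
    """Return a patched copy of data, or raise ValueError on any mismatch."""
    if len(data) != expected_size:
        raise ValueError(f"size is {len(data)}, expected {expected_size}")
    patched = bytearray(data)
    for offset, old, new in patches:
        if len(old) != len(new):
            raise ValueError(f"patch at {offset:06X} would change the file size")
        found = bytes(patched[offset:offset + len(old)])
        if found != old:
            raise ValueError(f"unexpected bytes at {offset:06X}: {found.hex(' ')}")
        patched[offset:offset + len(new)] = new
    return bytes(patched)


def patch_file(path, patches, expected_size):
    """Patch a driver file in place, keeping the original as <name>.bak."""
    path = Path(path)
    original = path.read_bytes()
    patched = apply_patches(original, patches, expected_size)
    path.with_name(path.name + ".bak").write_bytes(original)
    path.write_bytes(patched)


# Example call (path is a placeholder, adjust it to your installation):
# patch_file(r"C:\WINDOWS\SYSTEM\NV4VDD.386", NV4VDD_PATCHES, NV4VDD_SIZE)
```

Running it a second time on an already patched file fails with "unexpected bytes", which is intended: the check against the original bytes doubles as a guard against patching twice.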

Reply 3 of 3, by mkarcher

Trying to find further drivers affected by this bug, I downloaded Detonator v2.08 from the VOGONS driver library. While it still contains a slightly different machine encoding of the broken code, it adds a second layer that sets the carry flag correctly before returning from the IRQ handler, so the Windows 95/98 drivers should have no issues with IRQ sharing.
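A first pass over other driver files can be automated with a simple byte search; here is a small Python sketch. The first pattern is the six-byte sequence from the Windows 3.11 patches in this thread. The second pattern is only my guess at what a different encoding could look like (OR EAX,EAX can also be encoded as 0B C0) and is not taken from any actual driver. The function and variable names are made up for illustration.

```python
CANDIDATE_PATTERNS = {
    # STC / OR EAX,EAX / JNZ +1 / CLC as found in NV4VDD.386 and NV3VDD.386
    "thread pattern": bytes.fromhex("F9 09 C0 75 01 F8"),
    # Assumption: same instructions with the alternative OR encoding (0B C0)
    "alternative OR encoding (guess)": bytes.fromhex("F9 0B C0 75 01 F8"),
}


def find_all(data, pattern):
    """Return every offset at which pattern occurs in data."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


def scan(data):
    """Map each candidate pattern name to the offsets where it was found."""
    return {name: find_all(data, pattern)
            for name, pattern in CANDIDATE_PATTERNS.items()}


# Example on a real file (file name is a placeholder):
# from pathlib import Path
# for name, hits in scan(Path("NV4VDD.386").read_bytes()).items():
#     print(name, [f"{offset:06X}" for offset in hits])
```

A hit is only a candidate: the same bytes can occur in data by chance, and a driver may contain the sequence yet correct the carry flag elsewhere, so each match still needs a look in a disassembler.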