UselessSoftware wrote on 2025-03-31, 20:21:
vstrakh wrote on 2025-03-31, 09:25:
So what was the issue with PIC not relocating interrupt vectors?
I think the issue was unrelated and it wasn't even supposed to do that, rather it only services IRQs in real mode.
The PIC always services interrupts, though, in all CPU modes, as long as the CPU allows it (interrupt flag). The host OS does disable (mask) all PIC interrupts when it switches to multiprocessor mode using the local APIC and I/O APIC configuration, though. In that case, the local APIC on the CPU's internal bus and the I/O APIC (replacing or mixed with the PIC, depending on its setup, like virtual wire mode etc.) take over multiprocessor-capable interrupt handling. Newer CPUs (a 486 with external support chips, or a Pentium with its internal per-CPU APIC) can also disable the INTR line itself through APIC registers, either by masking it entirely or by handling the INTR/INTA signals manually using APIC commands. DOS-based Windows versions don't use the (I/O) APIC, though.
If the interrupt mask register stays at 0xFF, IRQ detection might fail, or an early protected mode crash might occur before interrupts are re-enabled after the switch to protected mode.
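As a minimal sketch of that masking behaviour (the names here are illustrative, not any emulator's actual code): an IRQ in the 8259's IRR only reaches the CPU when its bit in the IMR (OCW1) is clear, so an all-ones IMR silences everything.

```c
#include <stdint.h>

/* Returns nonzero when at least one requested IRQ is unmasked.
   irr = interrupt request register, imr = interrupt mask register. */
static int pic_irq_pending(uint8_t irr, uint8_t imr)
{
    return (irr & (uint8_t)~imr) != 0; /* IMR == 0xFF masks all IRQs */
}
```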
Just a simple check, though: are you implementing the descriptor caches and the paging TLB? Windows requires those two caches to work properly. For example, setting CR0.PE doesn't toggle the descriptor cache on or off; it's always on and retains the (un)real-mode-compatible values until CS is reloaded (through a far jump, exception or interrupt). Real mode, for example, simply (partly) loads the descriptor cache with some or all descriptor values, leaving the limit alone in some cases. CS behaves differently from the other registers depending on whether a Pentium or a 486-class CPU is used: newer CPUs ignore some values in real mode, while older CPUs load values into the cache that newer CPUs leave alone. Then the 286 and 386/486 have LOADALL in different formats, plus the sticky PE bit (286), and even a SAVEALL (undocumented; it hangs the CPU after saving, requiring an external reset afterwards due to missing CPU connections). HIMEM, for example, uses LOADALL on the 286 at least to implement unreal mode and access high memory locations; maybe some 386+ software uses it as well (with the BIOS emulating the 286 LOADALL using the 386/486 one).
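To illustrate the real-mode segment load described above (a sketch with made-up field and function names, not UniPCemu's actual structures): only the cached base is recomputed from the selector, while the cached limit and access rights stay whatever they were, which is exactly what makes unreal mode work after returning from protected mode with a 4 GiB limit still cached.

```c
#include <stdint.h>

/* Hypothetical per-segment descriptor cache entry. */
typedef struct {
    uint32_t base;   /* linear base address */
    uint32_t limit;  /* segment limit */
    uint8_t  access; /* access rights byte */
} desc_cache_t;

/* Real-mode segment register load: base = selector * 16.
   Limit and access rights are deliberately left untouched. */
static void realmode_load_seg(desc_cache_t *cache, uint16_t selector)
{
    cache->base = (uint32_t)selector << 4;
}
```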
Windows also requires the Paging TLB to behave properly, causing weird crashes if it doesn't. It also performs relatively odd (late) TLB invalidation in some cases.
Edit: Paging TLB info:
https://blog.stuffedcow.net/2015/08/pagewalk-coherence/
UniPCemu for example provides a 4-way 32-entry TLB, split for 4KB and 4MB/2MB entries (so 64 entries in total on Pentium CPUs and newer).
It keeps a relatively big 1MB (4KB pages) + 2KB (2/4MB pages) lookup table for each CPU (up to four) to speed up lookups, each byte specifying one real TLB entry, with zero meaning "not in the TLB". That's 4MB+8KB of fast lookup data, though, which is quite a lot on the lowest-memory device it supports (only some 20MB of RAM is available on the PSP, for example, so that leaves less than 16MB after subtracting about the same 4MB for the executable itself right now).
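A byte-per-page lookup table like that could be sketched as follows (illustrative names and encoding, assuming "entry index + 1" so that zero can mean "miss"; this is not UniPCemu's actual code):

```c
#include <stdint.h>

/* One byte per 4 KiB page of a 32-bit linear address space:
   2^20 pages -> a 1 MiB table. 0 = not in the TLB, otherwise
   the value is the TLB entry index plus one. */
#define PAGES_4K (1u << 20)

static uint8_t tlb_lookup_4k[PAGES_4K];

static void tlb_map_4k(uint32_t lin_addr, uint8_t entry_index)
{
    tlb_lookup_4k[lin_addr >> 12] = (uint8_t)(entry_index + 1);
}

/* Returns the TLB entry index, or -1 on a lookup-table miss. */
static int tlb_find_4k(uint32_t lin_addr)
{
    uint8_t v = tlb_lookup_4k[lin_addr >> 12];
    return v ? (int)(v - 1) : -1;
}
```

The point of the encoding is that a single array read resolves a lookup; the actual associative TLB only has to be walked when the byte is stale or zero.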
It also uses a doubly linked list pointing to the actual TLB entries to provide fast MRU/LRU services. Basically, MRU is updated by moving a list item to the head of the list. Invalidation is performed by moving an entry from the used (in-use) list to its corresponding free list, so there are actually 4 pointers for each way: one for the in-use (cached) head, one for the in-use (cached) tail, one for the free head and one for the free tail. A simple move is performed by unlinking the entry (updating the head/tail if the entry's previous/next pointer is zero), then adding it to the head of the destination list (used or free).
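The unlink-and-push-to-head operation described above can be sketched like this (structure and function names are my own, not UniPCemu's; each way would own one used list and one free list):

```c
#include <stddef.h>

typedef struct node {
    struct node *prev, *next;
    int payload; /* stands in for the TLB entry index */
} node_t;

typedef struct {
    node_t *head, *tail;
} list_t;

/* Remove a node from its list, fixing head/tail at the ends. */
static void list_unlink(list_t *l, node_t *n)
{
    if (n->prev) n->prev->next = n->next; else l->head = n->next;
    if (n->next) n->next->prev = n->prev; else l->tail = n->prev;
    n->prev = n->next = NULL;
}

/* Insert a node at the head of a list. */
static void list_push_head(list_t *l, node_t *n)
{
    n->prev = NULL;
    n->next = l->head;
    if (l->head) l->head->prev = n; else l->tail = n;
    l->head = n;
}

/* MRU touch: move the entry to the head of the in-use list. */
static void mru_touch(list_t *used, node_t *n)
{
    list_unlink(used, n);
    list_push_head(used, n);
}

/* Invalidation: move the entry from the in-use list to the free list. */
static void tlb_invalidate(list_t *used, list_t *freelist, node_t *n)
{
    list_unlink(used, n);
    list_push_head(freelist, n);
}
```

Both operations are O(1), which is what makes the per-way head/tail pointer pairs worthwhile.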
Its TLB is fast enough not to show up in profiling on the devkits I used (mostly Visual Studio). Most of the actual overhead is in the physical memory accesses themselves (mostly reads, since those are more common). Perhaps I'll someday add the same kind of caching for memory accesses by the BIU to improve that (it's currently basically a 1-entry TLB each for reads, writes and code fetches, so it has pretty high overhead).
Still getting over 10% speed even with the slow RAM/ROM accesses, though. Protected mode is almost just as fast, due to the TLB and descriptor caches. The only thing adding extra overhead there is mainly protected-mode stuff (interrupts, exceptions, page table walks on TLB misses), but those still pale in comparison with the memory accesses themselves, those being too random (not to mention PCI emulation overhead, and the different (split) memory spaces for certain memory ranges, UMA for example, and their behaviour). So the lookups (actually performed for RAM mapping, much like paging table walks) invalidate themselves a lot, despite 128-bit data caches (though those do speed up things like 16/32/64-bit memory reads and PIQ prefetching in DOSBox-compatible IPS clocking mode, up to max-instruction-length bytes).