Within UniPCemu, I notice that Windows 3.0 in 386-mode triple faults at/during it's very first Page fault handling VxD driver call. I see it tries to call the VxD driver using the interrupt 0x20 VxD page driver call, but before it manages to return, it crashes.
Anyone has a bit of knowledge on Windows 3.0's page fault handling and accompanying VxD driver calls can help me out on this?
Last edited by superfury on 2018-08-20, 06:48. Edited 3 times in total.
Hmmmm.... Having fixed a bug that made limit checks against a sign extended version of the windows kernel address(8000XXXX becoming offset FFFFFFFF8000XXXX instead of it's proper checking address), it once again reaches said interrupt 0x20 VxD driver call, after which it simply somehow returns to the MS-DOS 5.0 prompt instead of continuing to boot Windows? Maybe some problem in the VxD driver or it's handler? It's ending up a 0028:80005FC2, according to the debugger(flat code segment)?
It seems that handler tries to access offset 06F4011E, which isn't mapped in the virtual memory? It seems to happen at 0028:802000BD, during a 32-bit operand size MOVZX instruction with an offset pointing there?
Edit: Looking at disassembler output, it's a "movzx eax,WORD PTR [esi]", with ESI seemingly being an invalid address according to the paging unit?
This is what happens, in a simple instruction log:
See what is the value in memory of ss:[ebp-1c] (30:80013294), maybe you're loading the wrong data.
Check if SS is correctly set, maybe it should not be 0x30?
Verify that the value of EBP: 800132b0 is correct.
The possibilities are countless...
So it's written at 018A1294 on the stack. It's value is read immediately before it from address 00000002 in physical RAM. That's because it points there using ESI+02, with ESI being cleared. It looks like it's reading an IVT entry there, probably one set up during boot(The MS-DOS segment 011E of the DIV0 handler as well as the offset of the NMI handler, in one dword read operation). So ESI is cleared, while it obviously shouldn't be for some reason? It doesn't make sense to read two IVT half-entries in one DWORD read operation. So it's obvious ESI should definitely NOT be zeroed there(evidenced by it returning to a fully working MS-DOS prompt after the second Page fault). Now the question: where did ESI come from?
The instruction before that solves that:
10028:80200017 8B 75 08 mov esi,dword ss:[ebp+08]
So EBP+08 is the source of the 0 value loaded into ESI.
Looking for the stack frame being set up, I've found line 7239. There's 11 PUSH EAX(EAX being cleared) above that. That's quite odd? Above that, there's a PUSH push 00023000. Above that another 15 PUSH EAX(50h) instructions. Another odd part(corrupted memory?). Right above that is some setup of a GDT entry(entry 0018) for a LDT and loading said descriptor into the LDTR. Only directly after that, things go a bit weird(or at least it looks like weird code)? Or is it some odd setup for a stack frame? Maybe something being unfilled there(some stack-based variables not being set as supposed to(looking at simple stack frame for a c/c++ program)?
I see it then calling some subroutine that might be somehow translating an address into a physical one(I see a physical address endung up in ESI, if I remember the logs I've just read correctly). Hmmmm....
I see it's setting up a task at the same selector, then loading it for usage through ltr. Then another task is loaded. It's TSS is at virtual address 80010390.
Then it loads a NULL LDT. Then there's the earlier mentioned 15 0h pushes(using EAX), then 00023000(seems to be EFLAGS?), then another 11 EAX(=0) pushes. So 27 doubleword pushes on the stack and loading EBP with the stack base, with EBP+2C corresponding to EFLAGS in the TSS on the stack?
80013390 is the original EBP value of when setting up the TSS frame. the EFLAGS being located at 800133bc. So that's two DWORDs, then followed by the TSS at SS:[BP+02]. Eventually, the subroutine returns to the caller. The caller seems to load the TSS descriptor location into ESI.
The ebp seems to start at logical location 80013390. When adding the offset of EFLAGS within the same-address structure, I end up at 800133bc(assumed eflags location)-24(offset of eflags within the TSS)-80013390(the base address of the stack frame)=offset 8 within the generated stack frame, at ebp+08.
Looking in the middle, there's an ENTER instruction(which has the ebp mentioned as the stack frame, adding a8 bytes of data to a new stack frame). Immediately after that, it pushes ESI(at 80013200 paged) and then edi(at 800131fc).
It then loads the EDI register from 80013208(so EBP-A8), so it's pointing to the bottom of the newly generated stack frame(ebp-ac being esi, ebp-f0 being the edi register of the caller)?
So far so good. Then it clears the low 32-dwords of the stack frame, using a simple rep stosd.
Then it loads ESI from the value at [EBP+08]. But EBP contains the previous stack pointer, EBP+04 being the call return address, and EBP+08 being the vakue push esi pushed before that, at line 7462(value 16f5a). That invalid load happens at line 7689? Something's very wrong there? It reads zeroes instead? The original value was at logical address 800132B8. The addresses match! Something's wrong there!
A little fix, fixing the physical addresses, also allowing page-crossing accesses on the BIU. I've also improved logging of logical("Paged" in the log) addresses, now reporting the proper address(instead of the physical addresses it logged earlier). I've also added the missing segmented("normal" in the log) logging of BIU prefetch accesses. So now all memory accesses show up properly(the memory accesses ending on the paged layer(missing it's normal layer) in a different format being special accesses(either TSS accesses or descriptor table accesses).
Although the paged layer is always active, even in real mode, since it's always used to map(directly).
Edit: Just modified it to only report the Paging layer when it's used. Now it reports Paging accesses(Page tables) only until the physical layer(as there is no Paging layer in effect for Page tables and Page Directories.
Those two seem to be the cause of said data being lost...
Edit: Found a little bug causing the missing log of the first paged byte of multi-byte accesses requested by the BIU(e.g. the push esi and enter instructions above show that). They're still performed though(see the physical and RAM layer before the remaining three bytes).
The eventual ESI value is first loaded in EAX using logical address 800132B8. Looking further up, I see that's the value of ESP BEFORE pushing ESI, so that's something wrong. That one is written to the stack at 800132B4 properly.
Then CALL. It seems to go fine as well, writing it's return address at 800132B0.
Then ENTER. That one's going horribly wrong: EBP becomes ESP(incorrect) and ESP decreased by A8+4(the pushed EBP). So the pushes work correctly, but EBP is loaded with the value BEFORE executing the ENTER instruction, updating it to be the value of ESP BEFORE pushing EBP, instead of it's correct AFTER pushing EBP! 😖
A small update, throwing #GP(16B) etc. faults(set of three interrupts) in a loop? Added logging of double/triple faults and removed V86 interrupt logging.
@barotto: Could you make a common log format log of Windows 3.0 booting in 80386 mode? Then maybe I can compare those to my logs and find out why it's crashing(double faulting on infinite faulting causing stack overflow) my emulator(it's booting fine in real mode and with /r parameter from the EMM386 being loaded(V86-mode running real-mode Windows). 80286 and 80386 both stack&triple fault(80386 on linear memory address 80201002 for a BT modr/m,10h instruction(modrm pointing to 80201000), then more faults(on interrupts in V86 mode) until eventually another page fault and double fault/triple fault).
Edit: Yay! Wfw3.11 setup finally starts without the "invalid parameter:<linebreak>" error back to MS-DOS prompt! 😁
I manage to get Windows 3.11 through the first two disks, but when it tries to start the graphics part, it gives me an error it cannot start Windows in Standard mode and to run "win /3" instead. It's unresponsive to input.
After restarting the emulator, win.com/exe is missing from the Windows 3.11 directory?
The exact message it gives(no faults thrown in protected mode, translation from Dutch):
1Cannot start Windows in Standard mode. 2Try starting Windows in 386 ""enhanced""-mode by typing win /3.
Do you know anything about this case? I see no protected mode faults happening.
Hmmmm... Immediately after starting protected mode, it checks some byte that has the value 3(jump if less(jc), which isn't taken), then immediately drops out of protected mode and seems to abort???
Edit: Another small CPU bugfix: iret/retf to higher privilege levels(resulting CS.RPL<old CPL) now properly throws a #GP(segment|table|EXT) fault(didn't before).
Booting Windows 3.11 during setup on the 80386 somehow fails? I see some interrupt 2F requests for cannonicalizing filenames(according to Ralf Brown's interrupt list), but no actual executables being loaded? Maybe that's the issue somehow?
Can you see what's going wrong(according to the interrupts thrown?)?
Edit: Just improved my CPU identification a bit based on your article on Identifying Processors(mainly adding stuck CR0 bit on 80286-, additional DX values on reset, SGDT/SIDT ignoring operand size(knew about it for the LMSW/SMSW but somehow forgot to implement it for SGDT/SIDT, including the 80286 stuck high bits:S ).
Edit: Whoops. The ET bit wasn't stuck to 1, instead it was forcing the TS bit to 1 instead. 😖
Edit: Improved log after implementing said processor detection support(also mapping 80286 MSW writes to 80386 CR0 writes, adding improved detection support for stuck bits easier(also added CR0 stuck bits to the 80286).
Just made a little improved log(fixing a bug that incorrectly handled logging type(affecting how it's reported) of prefetcher BIU reads from physical memory(during it's accompanying T CPU-cycles).
Also got a different copy of Windows 3.11(from a youtube video describing the setup's validity) that has less disks(6 disks instead of 9), while also being the english version instead of my earlier Dutch version(maybe a bit better for cross-referencing here, since there's not many Dutch people here afaik(save for a few I know of)).
It's the "Windows 3.11 (OEM) (3.5)" version from Winworld(that was what it was called) for reference on the used version for the log.
For reference: https://www.youtube.com/watch?v=y34WhpjSLPg&t=110s at ~1:50 (when starting Windows) is the point my emulator craps out and gives me a black screen with the message Windows couldn't be started and run with /3 parameter(even though win.com/win.exe isn't copied yet:S ) instead of going into graphics mode(crashing before doing that).
Can you make a log of Windows 3.0 starting from the point it enters protected mode and onwards? Then I can compare it with my log and see what's going wrong(maybe minus the HDD/FLOPPY/CD-ROM reading part, as that's system and timing-specific). I'm running it at 3000kIPS(Dosbox-style cycles, with the BIU Prefetch Queue constantly filled after/during each instruction to perform a Dosbox-style prefetch buffer in IPS mode).
Edit: Although the memory writes look odd(being split apart by the PIQ filling from memory), the cause of that is that only after the full cycle block in IPS-mode is processed(so the memory access plus any remaining cycles that are spent on filling the PIQ FIFO) the memory writes are flushed to memory(the physical (and RAM if not a memory mapped device) layer(s)). It looks a bit odd, but that's actually the way the CPU handles memory writes atm(to prevent the writes from immediately affecting the PIQ in front of it, affecting SMC in a simple way).
So the Paged layer you see is actually writing to a simple FIFO instead of the physical layer. Said FIFO is written to the Physical layer each time the CPU finishes a block to process(either a cycle or a non-terminating cycle(a cycle that doesn't take any time, so an unfinished IPS-mode instruction still undergoing execution). Of course, any cycle(both terminating and non-terminating) can have PIQ fetches in between the processing blocks, if allowed(depending on the actual cycle timing provided by the EU in cycle-accurate mode). The Dosbox-style IPS mode is actually just a simple hack/wrapper around the cycle-accurate core(just modifying external BUS timings relative to the CPU clock), while the EU still runs in a cycle-accurate way(the IPS-clocking is applied to the hardware timing and realtime synchonization(being either 1(finished instruction) or 0(unfinished instruction) cycles). Of course, there's security preventing infinite loops there(if you were to have an infinite unfinished instruction(thus always processing 0 cycles)), which simply breaks the main loop to allow surrounding timing to update the real-time synchonization. This keeps the EU simple and portable with the IPS clocking easy to apply(for a significant speedup compared to cycle-accurate mode), while keeping maximum compatibility for the EU(which it's all about) for simple updating of code(without needing two fully implemented cores for each CPU mode).
Sounds like the stack got wrong somewhere in there?
Edit: Looking back up the code, that's the correct address after all(an INT 0Eh at 12F8:00000000).
Edit: Maybe that very interrupt isn't supposed to happen? It doesn't make that much sense to be calling it anyway?
It looks like it's just the value that's loaded earlier with some padding data, after which a block of code is located that was ran earlier in the code(if you look upwards from that point on), I believe it was still in protected mode there? So maybe the INT 0Eh was already an invalid instruction location to start with(EIP=0)? Maybe some uninitialized(probably unfinished) far pointer(or pointer data) somewhere? RET(F) going awry because of stack issues? Or maybe an invalid jump?
The DPMI server is definitely installed. It's probably that thing that switches it back to real mode? Or maybe there's a problem within krnl386.exe or kernel.exe?
Edit: Hmmmm.... Line 56704...
Edit: Hmmmm.... Line 56679. It's data is pushed almost immediately after a call through a call gate... Maybe the call gate itself is the problem somehow?
Recorded a trace of Windows 3.0 starting from the first pmode switch on my emulator, if it's any help: https://drive.google.com/file/d/1Y3fD0QtX9Nls … iew?usp=sharing (3GB uncompressed).
It's not in the same format, but if you're looking for the processor state after a specific instruction, it might give you some clue.
Thanks. Just found out one problem: call gates and retf had ss and esp reversed on the stack(pushes and pop in wrong order). That might fix that issue(since pop sp works correctly now, untested though).
Progress(No more page faults/displayed error)! But now it eventually faults on verifying a descriptor that's loaded(not looked at why yet), and after that I see lots(infinite?) of Divide by Zero exceptions?
Edit: Apparently, the cause of this is an IRET POPPing 0h values for CS&EIP(EFLAGS might be fine, though, but I doubt that(as it can be forced depending on the CPL).
Edit: The DIV0 fault happens at 0053:000013E1, apparently(continuous fault).
Log of Windows 3.11 booting from the setup.exe(from first protected mode onwards).
File license
Fair use/fair dealing exception
What does your emulator do with Windows 3.11 booting from setup.exe(until the graphics mode is started should be enough for now(getting the basic environment running))? Can you make a log of that?
Edit: A slight logging improvement on the BIU part of the logging process(it was incorrectly logging the previous byte written with the current address applied to it, instead of the current byte that's being written).
Fixed logging to log the correct byte with the used logical address.
File license
Fair use/fair dealing exception
Edit: Btw, could you make it in the common log format, if your emulator supports it? That's a bit more readable to compare to my emulation(since it's logging in common log format(with the addition of memory accesses, which haven't been made 'official'(as official as you can get from a WIP standard) yet)).
Edit: Hmmm... Looking for "0000EE24=" within the log reveals the times SS0 is actually looked up in the TSS. But I see non CPL0 loading it?
With the latest bugfixes(made based on the Haribote OS), Windows 3.1(3.11 afaik, but setup says 3.1), Windows 3.11 setup no longer hangs or triple faults, but instead returns to the MS-DOS prompt saying "Standard Mode: Fault outside of MS-DOS Extender.".
Edit: It's cause seems to be a MOVSB instruction in protected mode? It happens at 307:C1A9, which in your code succeeds without any faults?
Although said disassembly is incorrect, according to documentation? it's supposed to be REP MOVSB DS:[SI],ES:[DI] (no instruction prefixes, ES isn't allowed to be overwritten)?
Edit: Hmmmm.... ES is cleared(0000) in UniPCemu, instead of the valid value(0377) in your log?
I know only one case that's clearing the data segments outside of a program's 'control'(not directly loading it using a pop instruction): retf/iret to a lower privilege level with an invalid descriptor inside said segments.
Maybe there's a problem in validating it's contents?
1if (RETF_segmentregister) //Are we to check the segment registers for validity during a RETF? 2 { 3 for (RETF_segmentregister = 0; RETF_segmentregister < NUMITEMS(RETF_checkSegmentRegisters); ++RETF_segmentregister) //Process all we need to check! 4 { 5 RETF_whatsegment = RETF_checkSegmentRegisters[RETF_segmentregister]; //What register to check? 6 word descriptor_index; 7 descriptor_index = getDescriptorIndex(*CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment]); //What descriptor index? 8 if (descriptor_index) //Valid index(Non-NULL)? 9 { 10 if ((word)(descriptor_index | 0x7) > ((*CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment] & 4) ? CPU[activeCPU].SEG_DESCRIPTOR[CPU_SEGMENT_LDTR].PRECALCS.limit : CPU[activeCPU].registers->GDTR.limit)) //LDT/GDT limit exceeded? 11 { 12 invalidRETFsegment: 13 //Selector and Access rights are zeroed! 14 *CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment] = 0; //Zero the register! 15 if ((isJMPorCALL&0x1FF) == 3) //IRET? 16 { 17 CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment].desc.AccessRights &= 0x7F; //Clear the valid flag only with IRET! 18 } 19 else //RETF? 20 { 21 CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment].desc.AccessRights = 0; //Invalid! 22 } 23 continue; //Next register! 24 } 25 } 26 if (GENERALSEGMENT_P(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])==0) //Not present? 27 { 28 goto invalidRETFsegment; //Next register! 29 } 30 if (GENERALSEGMENT_S(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])==0) //Not data/readable code segment? 31 { 32 goto invalidRETFsegment; //Next register! 33 } 34 //We're either data or code! 35 if (EXECSEGMENT_ISEXEC(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])) //Code? 36 { 37 if (!EXECSEGMENT_C(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])) //Nonconforming? Invalid! 38 { 39 goto invalidRETFsegment; //Next register! 40 } 41 if (!EXECSEGMENT_R(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])) //Not readable? Invalid! 42 { 43 goto invalidRETFsegment; //Next register! 44 } 45 } 46 //We're either data or readable, conforming code! 47 if (GENERALSEGMENT_DPL(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])<MAX(getCPL(),getRPL(*CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment]))) //Not privileged enough to handle said segment descriptor? 48 { 49 goto invalidRETFsegment; //Next register! 50 } 51 } 52 }
Is this validation correct?
Edit: Slight improvement concerning the conforming code:
1if (RETF_segmentregister) //Are we to check the segment registers for validity during a RETF? 2 { 3 for (RETF_segmentregister = 0; RETF_segmentregister < NUMITEMS(RETF_checkSegmentRegisters); ++RETF_segmentregister) //Process all we need to check! 4 { 5 RETF_whatsegment = RETF_checkSegmentRegisters[RETF_segmentregister]; //What register to check? 6 word descriptor_index; 7 descriptor_index = getDescriptorIndex(*CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment]); //What descriptor index? 8 if (descriptor_index) //Valid index(Non-NULL)? 9 { 10 if ((word)(descriptor_index | 0x7) > ((*CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment] & 4) ? CPU[activeCPU].SEG_DESCRIPTOR[CPU_SEGMENT_LDTR].PRECALCS.limit : CPU[activeCPU].registers->GDTR.limit)) //LDT/GDT limit exceeded? 11 { 12 invalidRETFsegment: 13 //Selector and Access rights are zeroed! 14 *CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment] = 0; //Zero the register! 15 if ((isJMPorCALL&0x1FF) == 3) //IRET? 16 { 17 CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment].desc.AccessRights &= 0x7F; //Clear the valid flag only with IRET! 18 } 19 else //RETF? 20 { 21 CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment].desc.AccessRights = 0; //Invalid! 22 } 23 continue; //Next register! 24 } 25 } 26 if (GENERALSEGMENT_P(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])==0) //Not present? 27 { 28 goto invalidRETFsegment; //Next register! 29 } 30 if (GENERALSEGMENT_S(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])==0) //Not data/readable code segment? 31 { 32 goto invalidRETFsegment; //Next register! 33 } 34 //We're either data or code! 35 isnonconformingcodeordata = 0; //Default: neither! 36 if (EXECSEGMENT_ISEXEC(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])) //Code? 37 { 38 if (!EXECSEGMENT_C(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])) //Nonconforming? Invalid! 39 { 40 isnonconformingcodeordata = 1; //Next register! 41 } 42 if (!EXECSEGMENT_R(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])) //Not readable? Invalid! 43 { 44 goto invalidRETFsegment; //Next register! 45 } 46 } 47 else isnonconformingcodeordata = 1; //Data! 48 //We're either data or readable code! 49 if (isnonconformingcodeordata && (GENERALSEGMENT_DPL(CPU[activeCPU].SEG_DESCRIPTOR[RETF_whatsegment])<MAX(getCPL(),getRPL(*CPU[activeCPU].SEGMENT_REGISTERS[RETF_whatsegment])))) //Not privileged enough to handle said segment descriptor? 50 { 51 goto invalidRETFsegment; //Next register! 52 } 53 } 54 }
Just applied the fixed checks. RETF/IRET doesn't clear ES(it's still valid with the last one to a lower privilege level). So probably a pop instruction is at fault?
Edit: Searching up the chain of memory accesses, I find line 1201305, where the stack(and ebp) doesn't match your code already?
Then, at line 1201192, 0053:000002d7, ES is pushed being cleared.
Upwards, at line 1200576, 0053:00001905, ES is popped cleared(02ff in your code).
Upwards, at line , 0053:000018de, the general purpose registers(except esp) are different and ES&FS are zeroed.
Searching upwards, at line 1198067, 0053:00001905 I see ES being loaded with 33f. FS is already cleared. Otherwise same as above(ES/FS&GPregs).
Same at line 1197541, 0053:000018de.
Oddly enough, your CR0 starts with 6000?
So, this is where my log starts(already different) in protected mode:
Hmmmm.... Searching for " es" in both logs, after line 1200074(0053:0000208f 8E 06 0E 01 mov es,word ds:[010e]), I see an ES-referencing instruction that's not in your log? "0053:000020a1 26 83 7F 05 00 cmp word es:[bx+05],0000" at line 1200176? That's clearly diverging there?
Edit: Whoops, it's the same, but your log puts the segments within brackets, while the common log format uses it OUTSIDE the brackets. Also things like ptr isn't used in said format(just byte, word or dword). The same with spaces, only between instruction and the first operand(if any). Also, no h suffixes(it's assumed hex already).
In my protected-mode onoy version, I see CR0 being loaded to return to real mode 39 times after which the error message shows up. Hmmmm...
The fault ends up after the 31th time it returns to real mode... Add one during your log returning to protected mode(my log doesn't include real mode).
I can't find the error using searching that cr0 and then comparing the logs with searching " es". But looking at my own log, I see ES becoming zeroed using "mov es,di" after a ret instruction? That's odd?