VOGONS


UniPCemu 8088 cycle accuracy


Reply 60 of 122, by superfury

Rank: l33t++

Just tried Area 5150 again, and the results are.... Interesting...
https://www.dropbox.com/s/hr9nxq9zjefuifa/Uni … 2-31-40.7z?dl=0

The output of the cycle-accurate parts is sometimes less stable in retracing, but a bit more stable horizontally now?
Still lots of unfinished odd/even scanlines, though? Judging from the output frames, it looks like it's triggering horizontal retrace partway through every other scanline?

Otoh, the cycle-accurate horizontal timings seem to have improved a bit, with the images being more recognisable?

The weird 'noise' part at the left side of the image seems to be a repeating pattern?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 61 of 122, by superfury


Just fixed up the architecture waitstates to finish at the proper cycle (instead of 1 cycle late).
Now the noise of the 16/256 colors part of 8088 MPH seems to have moved to the second column of active display?


Reply 62 of 122, by superfury


I've been thinking... If 8088 MPH reports 1720 cycles (atm), what does that mean? Does it mean that the instructions executed took too many cycles (an absolute cycle count of all instructions in the test combined, i.e. a straight difference between two PIT reads, where a higher count means more cycles per instruction) or too few cycles (a speed indicator, like Hz, where a higher count means fewer cycles per instruction)?

What metric cycle count does it expect? I can't find it in previous posts.

Reenigne?


Reply 63 of 122, by reenigne

Rank: Oldbie

Trixter wrote the speed test, not me. But I believe it just runs a particular block of code (containing many different types of instructions) and counts how many PIT cycles it took to execute. The correct value should be 1678, so a report of 1720 means that the emulated CPU is slow - taking too many cycles to run the instructions.

Reply 64 of 122, by superfury

reenigne wrote on 2023-08-01, 13:50:

Trixter wrote the speed test, not me. But I believe it just runs a particular block of code (containing many different types of instructions) and counts how many PIT cycles it took to execute. The correct value should be 1678, so a report of 1720 means that the emulated CPU is slow - taking too many cycles to run the instructions.

Any idea where I can find how many cycles each CPU opcode is supposed to take? The last version of your emulator I used a while back to fix UniPCemu's instruction timings wasn't based on running the microcode in an interpreter and showed the exact cycles I implemented into my emulator for the instructions.

Atm a prefix being fetched and interpreted ticks 3 cycles in total with the current EU fetching routine: 1 cycle to fetch the prefix, 1 cycle if a prefix is detected (before jumping back to the first fetching step), and finally usually 1 more cycle because the fetch couldn't get the next opcode from the PIQ (or 1 cycle for the next prefix or opcode, which is the same as the first cycle mentioned, but for the next prefix/opcode instead).

So it's basically:
1. Fetch from PIQ (1 EU cycle on both success and failure (which aborts)).
2. If it's a prefix: 1 EU cycle, then back to step 1 to try the next byte. Otherwise: no cycles, and continue on to the next phase (0F handling for 286+, otherwise straight through to modr/m fetching, which takes no cycles).
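
The two steps above can be sketched as a small cycle counter. This is only a toy model of the loop as described in this post, not UniPCemu's actual code; the function name and the `queue_misses` parameter are made up for illustration.

```python
# Prefix bytes on the 8088: segment overrides, LOCK, REPNE and REP.
PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0xF0, 0xF2, 0xF3}


def fetch_phase_cycles(code, queue_misses=0):
    """Count EU cycles spent in the prefix/opcode fetch loop.

    code: instruction bytes, prefixes first.
    queue_misses: hypothetical number of failed PIQ fetch attempts,
                  each costing 1 EU cycle before the fetch is retried.
    Returns (cycles, index_of_opcode_byte).
    """
    cycles = queue_misses
    for index, byte in enumerate(code):
        cycles += 1  # step 1: fetch one byte from the PIQ
        if byte not in PREFIXES:
            return cycles, index  # opcode found: no extra cycle
        cycles += 1  # step 2: prefix detected, loop back to step 1
    raise ValueError("no opcode byte after the prefixes")
```

With a single prefix (for example `3E AD`, a DS-prefixed LODSW) this model charges 2 cycles for the prefix and 1 for the opcode fetch, which is the 3 cycles mentioned above.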

I should probably modify the loop to instead limit itself to wait automatically on the prefix fetching (and try again next cycle, as a real CPU would).

Edit: Just fixed BUS and BIU stalling to behave in much the same way. Now both are counted with BIU cycles in parallel (instead of adding to it). They still have their own effect on the BIU performing transfers though (still inhibiting it and giving different results inside the BIU and reporting).
8088 MPH now reports 1651 (2% deviation). So 27 cycles short now?

UniPCemu's BIU has two kinds of stalls: 'BUS stalls' (global stall of the entire BIU, disabling it entirely for a certain amount of clocks) and 'BIU stalls' (stalling the BIU except hardware like DMA, so just suppressing active BIU cycles (prefetching and memory/bus(I/O) accesses) from happening and keeping the bus idle).
It now performs those two stalls in parallel to each other and the cycles ticked, so they always keep ticking properly (instead of after each other or only ticking BIU stalls when not stalling the BUS).
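
As a rough illustration of "in parallel" versus "after each other", here is a toy model with two counters that both count down on every clock. The class and function names are hypothetical and this is not how UniPCemu's source is laid out; it only mirrors the behaviour described above (a BUS stall blocks everything, a BIU stall still lets DMA through).

```python
class StallModel:
    """Toy model of two stall counters ticking in parallel."""

    def __init__(self, bus_stall=0, biu_stall=0):
        self.bus_stall = bus_stall  # clocks during which the whole BIU is disabled
        self.biu_stall = biu_stall  # clocks during which only CPU-driven bus cycles are suppressed

    def tick(self):
        """Advance one clock and return (cpu_bus_allowed, dma_allowed)."""
        cpu_bus_allowed = self.bus_stall == 0 and self.biu_stall == 0
        dma_allowed = self.bus_stall == 0
        # Both counters count down on every clock, not one after the other.
        self.bus_stall = max(0, self.bus_stall - 1)
        self.biu_stall = max(0, self.biu_stall - 1)
        return cpu_bus_allowed, dma_allowed


def clocks_until_cpu_bus_free(bus_stall, biu_stall):
    """Number of clocks that pass before the CPU may use the bus again."""
    model = StallModel(bus_stall, biu_stall)
    clocks = 0
    while not model.tick()[0]:
        clocks += 1
    return clocks
```

In this model a 3-clock BUS stall combined with a 5-clock BIU stall blocks the CPU for 5 clocks (the longer of the two) rather than 8.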

Edit: Even with those fixes, still 1651 cycles!

Last edited by superfury on 2023-08-01, 15:11. Edited 1 time in total.


Reply 65 of 122, by reenigne

superfury wrote on 2023-08-01, 13:57:

Any idea where I can find how many cycles each CPU opcode is supposed to take?

You can use XTCE-trace ( https://www.reenigne.org/software/xtce_trace.zip ) or MartyPC or the ISA bus sniffer on XTServer. I think the timings from these should almost all be the same as those from the previous version of XTCE (I'd be interested to know if you find any differences - I know there are a few but I had to execute millions of testcases to find them so I wouldn't expect them to show up in the 8088 MPH hardware check).

Prefix opcodes (like the fastest microcoded instructions and other non-microcoded instructions) take a minimum of 2 cycles EU time (i.e. neglecting the effect of prefetch queue starvation).

Reply 66 of 122, by superfury

reenigne wrote on 2023-08-01, 14:58:
superfury wrote on 2023-08-01, 13:57:

Any idea where I can find how many cycles each CPU opcode is supposed to take?

You can use XTCE-trace ( https://www.reenigne.org/software/xtce_trace.zip ) or MartyPC or the ISA bus sniffer on XTServer. I think the timings from these should almost all be the same as those from the previous version of XTCE (I'd be interested to know if you find any differences - I know there are a few but I had to execute millions of testcases to find them so I wouldn't expect them to show up in the 8088 MPH hardware check).

Prefix opcodes (like the fastest microcoded instructions and other non-microcoded instructions) take a minimum of 2 cycles EU time (i.e. neglecting the effect of prefetch queue starvation).

I'd need the raw testcases themselves from 8088 MPH's loader to do that (the part that's being timed for the metric cycle count)?

Can your emulator also load BIOS ROMS and start those (from FFFF:0)? If it can, perhaps with a simple JMP at the start (since DMA is disabled at that point anyway) perhaps I can compare them to UniPCemu's logs that way?

Edit: I just made a simple BIOS ROM for UniPCemu that will run a ROM (the segment specified at address 0, 16-bit variable) at 2000:100. It will setup a DS/ES RAM area at 3000:0, with a stack at 1000:FFFE and below.
It will also implement some basic INT 21h function 0x4C call to terminate (as well as using simple RET in the program's main routine, which calls INT 20h from offset 0(at the start of the PSP) which does the same).
BIOS interrupts 10h-1Ah are simple carry flag error out interrupts. All other of the 256 interrupt vectors simply point to a simple IRET routine in the BIOS ROM.
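
For reference, the segment:offset pairs above map to physical addresses with the usual real-mode formula (segment * 16 + offset, wrapped to 20 bits on the 8088). A minimal sketch, with a helper name chosen just for this example:

```python
def physical_address(segment, offset):
    """Real-mode 8088 address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


# The layout described above, plus the reset vector the BIOS ROM starts from.
layout = {
    "program entry (2000:0100)": physical_address(0x2000, 0x0100),
    "data area (3000:0000)": physical_address(0x3000, 0x0000),
    "initial stack top (1000:FFFE)": physical_address(0x1000, 0xFFFE),
    "reset vector (FFFF:0000)": physical_address(0xFFFF, 0x0000),
}

for name, address in layout.items():
    print(f"{name}: {address:05X}")
```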

That terminate routine will try to detect UniPCemu (using the port E9 hack port combined with UniPCemu giving an ID to identify the emulator) and, if detected, will make the debugger in UniPCemu enter command mode (within the emulation) and give it a command to terminate the emulator (it's function number 4, which isn't documented in the official documentation yet, mainly because it isn't released yet).

Tested with your COM program in that ZIP-archive of your last post (test.com) and it runs like a charm.


Reply 67 of 122, by GloriousCow

Rank: Member
superfury wrote on 2023-08-01, 15:34:

I'd need the raw testcases themselves from 8088 MPH's loader to do that (the part that's being timed for the metric cycle count)?

I have cycle traces I can share of the CPU test via MartyPC and reenigne's xtserver, so you don't need to repeat that work, but I'm still in the process of cleaning it up. I was intending to write a blog article on the 8088MPH cpu test itself; so if you can wait a bit I can give you everything.

in the meantime, here's the raw CPU test routine, starting at mov ax, 1234h. The first four lines simply restart the DMA timer channel #1 so that you get a predictable result. You can comment lines 13 and 14 to disable DMA. There is no PIT sampling or anything, so you are on your own to measure how long this takes. This is a raw binary, if you want a com file, set org to 100h and reassemble with nasm.

with DMA off, from 'mov' to final 'pop' should take 6328 cpu cycles.
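
For comparing CPU cycle counts with PIT-based numbers: on the IBM PC/XT the PIT is clocked at a quarter of the CPU clock (about 1.193 MHz against 4.77 MHz). A minimal conversion sketch, with helper names made up for this example; it is only the raw conversion and says nothing about DMA or the demo's own bookkeeping.

```python
CPU_CLOCKS_PER_PIT_TICK = 4  # 4.77 MHz CPU clock vs 1.193 MHz PIT clock


def cpu_cycles_to_pit_ticks(cpu_cycles):
    """Convert a CPU cycle count to (possibly fractional) PIT ticks."""
    return cpu_cycles / CPU_CLOCKS_PER_PIT_TICK


def pit_ticks_to_cpu_cycles(pit_ticks):
    """Convert a PIT tick count back to CPU cycles."""
    return pit_ticks * CPU_CLOCKS_PER_PIT_TICK


# The DMA-off figure quoted above.
print(cpu_cycles_to_pit_ticks(6328))
```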

I'm working on a more heavily instrumented version of the 8088MPH cpu test that duplicates the logic that the demo does before and after this routine - the demo does an adjustment based on some measurements of sampling the timer in a loop after the routine has run, so doing a direct comparison is difficult.

Attachment: 8088mph_cpu_test_raw.zip (2.25 KiB)
File comment: 8088mph raw cpu test
File license: Fair use/fair dealing exception

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc

Reply 68 of 122, by GloriousCow

superfury wrote on 2023-08-01, 12:38:

What metric cycle count does it expect? I can't find it in previous posts.

It's an adjusted delta of PIT ticks. Timer #0 is set at the start and read at the end of the routine. the difference is your 'score'. then an adjustment is made I don't fully understand, but ends up subtracting ~100ish ticks from the result.

Passing values are between 1668 and 1688 PIT ticks. A result beneath 1668 indicates you ran the test too fast; a result above 1688 indicates you ran the test too slowly.
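
Put as code, the pass window reads like this. It is a small sketch using only the limits given above; the function name and the result strings are just for illustration.

```python
PASS_MIN = 1668  # lowest passing score in PIT ticks
PASS_MAX = 1688  # highest passing score in PIT ticks


def classify_score(pit_ticks):
    """Classify a reported score against the pass window."""
    if pit_ticks < PASS_MIN:
        return "too fast"
    if pit_ticks > PASS_MAX:
        return "too slow"
    return "pass"


# The scores reported earlier in this thread, plus the expected value.
for score in (1720, 1651, 1678):
    print(score, classify_score(score))
```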


Reply 69 of 122, by superfury

GloriousCow wrote on 2023-08-01, 20:40:
superfury wrote on 2023-08-01, 12:38:

What metric cycle count does it expect? I can't find it in previous posts.

It's an adjusted delta of PIT ticks. Timer #0 is set at the start and read at the end of the routine. the difference is your 'score'. then an adjustment is made I don't fully understand, but ends up subtracting ~100ish ticks from the result.

Passing values are between 1668 and 1688 PIT ticks. A result beneath 1668 indicates you ran the test too fast; a result above 1688 indicates you ran the test too slowly.

It should be easy to run that with the new custom BIOS ROM I've made to check those cycles out (although it's run as an MS-DOS executable at address 2000:0100, requiring the org 0 to become org 0x100 inside the executable).

And I got lucky when building my custom BIOS ROM for testing such a ROM: with the current version of biosromexec.asm (found inside UniPCemu's assembly folder), while running the BIOSerrIVT IVT routine and executing the "or word [bp+6],1" instruction, I see that the EU only reads the first 3 of its 4 bytes, crashing the whole thing!

   191 00000104 834E0601                	or word [bp+6],1 ;Set result carry flag

That final 01h byte isn't read from prefetch, crashing the ROM and making it return incorrectly almost immediately afterwards!

A log of what's happening inside UniPCemu:

	RealRAM(p):00020111=10(); RAM(p):00020111=10(); Physical(p):00020111=10(); Paged(p):00020111=10(); Normal(p):00000111=10()
2000:0000010e
2000:0000010e
2000:0000010e B0 43 mov al,43 RealRAM(p):00020112=b4(´); RAM(p):00020112=b4(´); Physical(p):00020112=b4(´); Paged(p):00020112=b4(´); Normal(p):00000112=b4(´)
Registers:
AX: 0e00 BX: 010b CX: 0000 DX: 0042
SP: fffe BP: 0000 SI: 0000 DI: 0000
CS: 2000 DS: 3000 ES: 3000 SS: 1000
IP: 010e FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
2000:00000110 RealRAM(p):00020113=09( ); RAM(p):00020113=09( ); Physical(p):00020113=09( ); Paged(p):00020113=09( ); Normal(p):00000113=09( )
2000:00000110
2000:00000110 CD 10 int 10
RealRAM(p):00020114=8c(Œ); RAM(p):00020114=8c(Œ); Physical(p):00020114=8c(Œ); Paged(p):00020114=8c(Œ); Normal(p):00000114=8c(Œ)
RealRAM(r):00000040=00( ); RAM(r):00000040=00( ); Physical(r):00000040=00( ); Paged(r):00000040=00( )
RealRAM(r):00000041=01(); RAM(r):00000041=01(); Physical(r):00000041=01(); Paged(r):00000041=01()
RealRAM(r):00000042=00( ); RAM(r):00000042=00( ); Physical(r):00000042=00( ); Paged(r):00000042=00( )
RealRAM(r):00000043=f0(ð); RAM(r):00000043=f0(ð); Physical(r):00000043=f0(ð); Paged(r):00000043=f0(ð)
RealRAM(p):00020115=c8(È); RAM(p):00020115=c8(È); Physical(p):00020115=c8(È); Paged(p):00020115=c8(È); Normal(p):00000115=c8(È)
Paged(w):0001fffc=02()
Paged(w):0001fffc=02()
Paged(w):0001fffc=02(); Physical(w):0001fffc=02(); RAM(w):0001fffc=02(); RealRAM(w):0001fffc=02()
Paged(w):0001fffd=f0(ð)
Paged(w):0001fffd=f0(ð)
Paged(w):0001fffd=f0(ð); Physical(w):0001fffd=f0(ð); RAM(w):0001fffd=f0(ð); RealRAM(w):0001fffd=f0(ð)
Paged(w):0001fffa=00( )
Paged(w):0001fffa=00( )
Paged(w):0001fffa=00( ); Physical(w):0001fffa=00( ); RAM(w):0001fffa=00( ); RealRAM(w):0001fffa=00( )
Paged(w):0001fffb=20( )
Paged(w):0001fffb=20( )
Paged(w):0001fffb=20( ); Physical(w):0001fffb=20( ); RAM(w):0001fffb=20( ); RealRAM(w):0001fffb=20( )
Paged(w):0001fff8=12()
Paged(w):0001fff8=12()
Paged(w):0001fff8=12(); Physical(w):0001fff8=12(); RAM(w):0001fff8=12(); RealRAM(w):0001fff8=12()
Paged(w):0001fff9=01()
Paged(w):0001fff9=01()
Paged(w):0001fff9=01(); Physical(w):0001fff9=01(); RAM(w):0001fff9=01(); RealRAM(w):0001fff9=01()
00:00:27:82.02238: Interrupt 10=F000:00000100@2000:0112(CD); ERRORCODE: FFFFFFFE
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fffe BP: 0000 SI: 0000 DI: 0000
CS: 2000 DS: 3000 ES: 3000 SS: 1000
IP: 0110 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
Physical(p):000f0100=50(P); Paged(p):000f0100=50(P); Normal(p):00000100=50(P)
f000:00000100 50 push ax Paged(p):000f0101=55(U); Normal(p):00000101=55(U)
Paged(p):000f0102=89(‰); Normal(p):00000102=89(‰)
Paged(w):0001fff6=43(C)
Paged(w):0001fff6=43(C)
Paged(w):0001fff6=43(C); Physical(w):0001fff6=43(C); RAM(w):0001fff6=43(C); RealRAM(w):0001fff6=43(C)
Paged(w):0001fff7=0e()
Paged(w):0001fff7=0e()
Paged(w):0001fff7=0e(); Physical(w):0001fff7=0e(); RAM(w):0001fff7=0e(); RealRAM(w):0001fff7=0e()
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff8 BP: 0000 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 0100 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
Paged(p):000f0103=e5(å); Normal(p):00000103=e5(å)
f000:00000101 55 push bp
Paged(p):000f0104=83(ƒ); Normal(p):00000104=83(ƒ)
Paged(w):0001fff4=00( )
Paged(w):0001fff4=00( )
Paged(w):0001fff4=00( ); Physical(w):0001fff4=00( ); RAM(w):0001fff4=00( ); RealRAM(w):0001fff4=00( )
Paged(w):0001fff5=00( )
Paged(w):0001fff5=00( )
Paged(w):0001fff5=00( ); Physical(w):0001fff5=00( ); RAM(w):0001fff5=00( ); RealRAM(w):0001fff5=00( )
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff6 BP: 0000 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 0101 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
Paged(p):000f0105=4e(N); Normal(p):00000105=4e(N)
f000:00000102 89 E5 mov bp,sp
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff4 BP: 0000 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 0102 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
Paged(p):000f0106=06(); Normal(p):00000106=06()
Paged(p):000f0107=01(); Normal(p):00000107=01()
Paged(p):000f0108=5d(]); Normal(p):00000108=5d(])
Paged(p):000f0109=58(X); Normal(p):00000109=58(X)
f000:00000104 83 4E 06 or word ss:[bp+06],0010
Paged(p):000f010a=cf(Ï); Normal(p):0000010a=cf(Ï)
RealRAM(r):0001fffa=00( ); RAM(r):0001fffa=00( ); Physical(r):0001fffa=00( ); Paged(r):0001fffa=00( )
RealRAM(r):0001fffb=20( ); RAM(r):0001fffb=20( ); Physical(r):0001fffb=20( ); Paged(r):0001fffb=20( )
Paged(p):000f010b=55(U); Normal(p):0000010b=55(U)
Paged(w):0001fffa=01()
Paged(w):0001fffa=01()
Paged(w):0001fffa=01(); Physical(w):0001fffa=01(); RAM(w):0001fffa=01(); RealRAM(w):0001fffa=01()
Paged(w):0001fffb=20( )
Paged(w):0001fffb=20( )
Paged(w):0001fffb=20( ); Physical(w):0001fffb=20( ); RAM(w):0001fffb=20( ); RealRAM(w):0001fffb=20( )
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff4 BP: fff4 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 0104 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
f000:00000108 5D pop bp Paged(p):000f010c=89(‰); Normal(p):0000010c=89(‰)
RealRAM(r):0001fff4=00( ); RAM(r):0001fff4=00( ); Physical(r):0001fff4=00( ); Paged(r):0001fff4=00( )
RealRAM(r):0001fff5=00( ); RAM(r):0001fff5=00( ); Physical(r):0001fff5=00( ); Paged(r):0001fff5=00( )
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff4 BP: fff4 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 0108 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
f000:00000109 58 pop ax Paged(p):000f010d=e5(å); Normal(p):0000010d=e5(å)
RealRAM(r):0001fff6=43(C); RAM(r):0001fff6=43(C); Physical(r):0001fff6=43(C); Paged(r):0001fff6=43(C)
RealRAM(r):0001fff7=0e(); RAM(r):0001fff7=0e(); Physical(r):0001fff7=0e(); Paged(r):0001fff7=0e()
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff6 BP: 0000 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 0109 FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
f000:0000010a CF iret Paged(p):000f010e=83(ƒ); Normal(p):0000010e=83(ƒ)
RealRAM(r):0001fff8=12(); RAM(r):0001fff8=12(); Physical(r):0001fff8=12(); Paged(r):0001fff8=12()
RealRAM(r):0001fff9=01(); RAM(r):0001fff9=01(); Physical(r):0001fff9=01(); Paged(r):0001fff9=01()
RealRAM(r):0001fffa=01(); RAM(r):0001fffa=01(); Physical(r):0001fffa=01(); Paged(r):0001fffa=01()
RealRAM(r):0001fffb=20( ); RAM(r):0001fffb=20( ); Physical(r):0001fffb=20( ); Paged(r):0001fffb=20( )
RealRAM(r):0001fffc=02(); RAM(r):0001fffc=02(); Physical(r):0001fffc=02(); Paged(r):0001fffc=02()
RealRAM(r):0001fffd=f0(ð); RAM(r):0001fffd=f0(ð); Physical(r):0001fffd=f0(ð); Paged(r):0001fffd=f0(ð)
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fff8 BP: 0000 SI: 0000 DI: 0000
CS: f000 DS: 3000 ES: 3000 SS: 1000
IP: 010a FLAGS: f002
FLAGSINFO: 1111oditsz0a0p1c
RealRAM(p):00020122=c0(À); RAM(p):00020122=c0(À); Physical(p):00020122=c0(À); Paged(p):00020122=c0(À); Normal(p):00000112=c0(À)
RealRAM(p):00020123=bd(½); RAM(p):00020123=bd(½); Physical(p):00020123=bd(½); Paged(p):00020123=bd(½); Normal(p):00000113=bd(½)
RealRAM(p):00020124=50(P); RAM(p):00020124=50(P); Physical(p):00020124=50(P); Paged(p):00020124=50(P); Normal(p):00000114=50(P)
RealRAM(p):00020125=01(); RAM(p):00020125=01(); Physical(p):00020125=01(); Paged(p):00020125=01(); Normal(p):00000115=01()
2001:00000112 RealRAM(p):00020126=b9(¹); RAM(p):00020126=b9(¹); Physical(p):00020126=b9(¹); Paged(p):00020126=b9(¹); Normal(p):00000116=b9(¹)
2001:00000112
2001:00000112
2001:00000112 C0 BD 50 ret 50bd
RealRAM(p):00020127=16(); RAM(p):00020127=16(); Physical(p):00020127=16(); Paged(p):00020127=16(); Normal(p):00000117=16()
RealRAM(r):0001fffe=00( ); RAM(r):0001fffe=00( ); Physical(r):0001fffe=00( ); Paged(r):0001fffe=00( )
RealRAM(r):0001ffff=00( ); RAM(r):0001ffff=00( ); Physical(r):0001ffff=00( ); Paged(r):0001ffff=00( )
Registers:
AX: 0e43 BX: 010b CX: 0000 DX: 0042
SP: fffe BP: 0000 SI: 0000 DI: 0000
CS: 2001 DS: 3000 ES: 3000 SS: 1000
IP: 0112 FLAGS: f002

Edit: Eventually managed to fix the issues.

There were some little things going wrong:
- The ROM code used the wrong pushed BP location for the base of the frame in the IVT handler.
- (Which caused the bullet point above.) The video vector was written to address 10h instead of offset 40h inside the IVT (so into the 4th instead of the 16th interrupt vector).
- The test ROM COM program (biosromexectest.asm) was setting AH and then updating DS or ES to CS (the segment the MS-DOS or video interrupt handler reads the string values from) using AX as an intermediate, thus incorrectly writing 2000h (CS) into the register.
- Finally, the INT 21h "rep scasb" was using ES (as it should), which was pointing to whatever the program was using (the program's data segment supplied by the BIOS ROM, which is 3000h), even though it used DS to supply the INT 21h string. The "rep scasb" was thus scanning through a VERY large chunk (about 64KB) of memory, causing UniPCemu to become unresponsive due to the extremely large (256KB already) string being generated for the many memory accesses to pretty much the entire 64KB area (except for the first 130h-ish bytes, because that's where the string is supposed to start according to DI).
Apart from UniPCemu apparently not being able to log such a huge chunk at once (over 32K memory addresses being logged into a single huge 256KB string at the point I noticed it happening!), it was obviously scanning the wrong memory area (because of the improper ES segment being used).
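
To make the vector mix-up concrete: each IVT entry is 4 bytes (16-bit offset, then 16-bit segment, both little-endian) at address vector * 4, so INT 10h lives at 40h while address 10h belongs to INT 4. A small sketch, with function names chosen for this example; the four bytes are the ones read at 00000040 to 00000043 in the log above.

```python
def ivt_offset(vector):
    """Address of an interrupt vector inside the real-mode IVT."""
    return vector * 4


def decode_ivt_entry(memory, vector):
    """Return (segment, offset) stored for the given interrupt vector."""
    base = ivt_offset(vector)
    offset = memory[base] | (memory[base + 1] << 8)
    segment = memory[base + 2] | (memory[base + 3] << 8)
    return segment, offset


memory = bytearray(1024)  # the 256-entry IVT
memory[0x40:0x44] = bytes([0x00, 0x01, 0x00, 0xF0])

segment, offset = decode_ivt_entry(memory, 0x10)
print(f"INT 10h -> {segment:04X}:{offset:04X}")
print(f"address 10h belongs to INT {0x10 // 4}")
```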

Having fixed that, the test ROM (that tests all BIOS function calls (biosromexectest.asm)) as well as your example ROM (not the 8088 MPH one, but the xtce_trace.zip 's test.com file) run fine with the BIOS COM ROM I've made so far (biosromexec.asm).

The assembly files I mentioned can both be found within UniPCemu's assembly folder.

The logging functionality of the INT 21h function 02/06/09 and INT 10h function 0E/13 is limited to the Bochs E9 hack though.

And when the BIOS ROM gets the command to terminate the app (INT 21h function 4Ch, INT 20h or RET, which jumps to 2000:0000, which contains an INT 20h instruction placed there according to the MS-DOS PSP header specs (that's all that's in the PSP atm though)), it will check for the presence of UniPCemu using the port E9 hack's reading functionality (if reading port E9h returns E9h, it's considered present). Then, if port E9h is present, it will read I/O port EAh (specific to UniPCemu) until the end of the identifier string (which ends with FFh) to reset it, then read it again until encountering FFh while buffering it (so it has the whole string in memory). Then it writes said string to the port E9h hack (this won't get logged, as UniPCemu recognises it as a request to enter command mode on port EAh), then uses port EAh to terminate the emulator. If an empty string is read (which happens when the port isn't present, as the bus floats), UniPCemu is considered not present. If UniPCemu isn't detected (other emulators) or the command to terminate UniPCemu errors out for whatever reason (usually the case on older builds, as they don't implement the new command; it's only just been implemented in the latest commits), it will fall back to a plain CLI HLT combination to stop CPU execution instead (which requires the user to manually close the app once they notice that it isn't doing anything anymore or that the speed has become ridiculously high because the CPU is idling permanently).
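
The detection and fallback flow above, sketched against a fake port device. Everything here is hypothetical: the class names, the placeholder ID string and especially the final step (writing function number 4 to port EAh) are assumptions made for illustration, since the exact way the command is issued isn't spelled out here.

```python
class FakeUniPCemuPorts:
    """Hypothetical stand-in for the emulator's I/O ports (not real UniPCemu code)."""

    def __init__(self, ident=b"EMU-ID", pos=0):
        self.stream = ident + b"\xff"  # ID string terminated by FFh, read cyclically
        self.pos = pos
        self.e9_written = b""
        self.terminated = False

    def inb(self, port):
        if port == 0xE9:
            return 0xE9  # port E9 hack present
        if port == 0xEA:
            value = self.stream[self.pos]
            self.pos = (self.pos + 1) % len(self.stream)
            return value
        return 0xFF

    def outb(self, port, value):
        if port == 0xE9:
            self.e9_written += bytes([value])
        elif port == 0xEA and value == 4:  # assumption: function 4 = terminate
            self.terminated = True


class FloatingBus:
    """No device present: every read floats to FFh, writes are ignored."""

    def inb(self, port):
        return 0xFF

    def outb(self, port, value):
        pass


def read_until_ff(ports, limit=256):
    """Read port EAh until FFh; return the bytes seen before it."""
    data = bytearray()
    for _ in range(limit):
        value = ports.inb(0xEA)
        if value == 0xFF:
            return bytes(data)
        data.append(value)
    return b""  # no terminator found: treat as not present


def terminate(ports):
    """Return 'terminated' or 'cli-hlt' (the fallback path)."""
    if ports.inb(0xE9) != 0xE9:
        return "cli-hlt"
    read_until_ff(ports)          # first pass: resynchronise to the string start
    ident = read_until_ff(ports)  # second pass: buffer the whole string
    if not ident:
        return "cli-hlt"
    for value in ident:
        ports.outb(0xE9, value)   # echo the ID to enter command mode
    ports.outb(0xEA, 4)           # assumption: issue terminate command
    return "terminated" if ports.terminated else "cli-hlt"
```

The first `read_until_ff` call is what makes the routine robust against starting in the middle of the ID string.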


Reply 70 of 122, by superfury


Just managed to compile your 8088MPH executable with the segment prefixes (CS: etc.) moved inside the brackets (the nasm compiler was complaining about them being "error: invalid combination of opcode and operands").

Now that I have the ROM, I should be able to run it using the BIOS ROM (with of course the org statement at the start of the assembly code changed to 100h to fix the offsets used).

Placing a breakpoint at F000:0108 inside the current BIOS COM ROM should be at the RETF statement that starts the execution of the option ROM. My BIOS COM ROM neither sets the PIT up nor the DMA though, so no DRAM refresh atm (so only emulators).

Of course, never forget to update the segment selector inside the built ROM (at offset 0) using a hex editor so that it contains the segment the emulator loads the option ROM at. Otherwise, it will copy the 64KB at segment c600:0 to 2000:0100h and run whatever is read from that location (another ROM, or nothing at all, i.e. a floating bus or garbage buffers in the case of the 8088), which might not be what you want to run.
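
Instead of a hex editor, the same patch can be scripted. A minimal sketch, assuming the selector is a plain 16-bit little-endian word at offset 0 as described above; the function name and the example segment C800h are made up for illustration.

```python
import struct


def patch_rom_segment(rom, segment):
    """Return a copy of rom with the 16-bit segment at offset 0 replaced."""
    if len(rom) < 2:
        raise ValueError("ROM image is too small to hold a segment word")
    if not 0 <= segment <= 0xFFFF:
        raise ValueError("segment must fit in 16 bits")
    patched = bytearray(rom)
    struct.pack_into("<H", patched, 0, segment)  # little-endian, as the 8088 reads it
    return bytes(patched)


# Example on an in-memory image: default C600h replaced by C800h.
image = bytes([0x00, 0xC6, 0x90, 0x90])
print(patch_rom_segment(image, 0xC800).hex())
```

To patch a real file, read it with `open(path, "rb")`, pass the bytes through the function and write the result back out.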

One nice thing about the BIOS ROM indirection (instead of only an option ROM that RETs to some emulator-specific location) is that it adds easy DOS/BIOS-compatible logging support using normal interrupts, provided the port E9 hack is implemented (on UniPCemu, Bochs etc.). It also adds a simple exit handler that can terminate UniPCemu (and compatibles that use the port E9 hack) or stop the CPU (other emulators). All of this uses its simple DOS-compatible barebones functionality: just the exit handlers and character/string writes to the E9 hack I/O port, with all other calls erroring out (11h through 1Ah) and most interrupt vectors defaulting to a simple IRET (all but interrupts 21h, 20h and 10h).

I will run the generated option ROM for the cycle logging at another time (it's currently getting a bit late into the night). It's already compiled and ready to go (as mentioned earlier in this post). I just need to run it with my BIOS COM ROM, which is adjusted to the segment the ROM is loaded at (although that's only visible within the debugger UniPCemu is run with, by looking at some of the internal variables atm). Perhaps I should add a simple log that records at what address the ROMs are loaded once the loading process completes. That way users can actually see where the ROMs are located in the emulated memory for troubleshooting etc. (like other emulators using ROMs do) instead of having to use a debugger on the emulator itself to do that job...
But that's for another time.

Also, F000:007D is the start of the terminate vector (the start of the INT 20h vector).

Edit: First results (although wrong logging mode):

Attachment: debugger_8088MPHcycles.log (346.01 KiB)
File comment: 8088 MPH cycle log - Proof of concept
File license: Fair use/fair dealing exception


Reply 71 of 122, by superfury


Managed to fix an issue with the debugger not logging T-states when requested to do so with the common log format log settings.

Although the cycle count seems weird for the first JMP instruction in that case (only 1 transfer for the entire instruction) in cycle-accurate logging of the BIOS COM ROM and extremely fast finishing after that (less than 4 cycles)? Hmmm....

BIU T1 -	
BIU T2 -
BIU T3 - Physical(p):000ffff0=ea(ê); Paged(p):000ffff0=ea(ê); Normal(p):00000000=ea(ê)
BIU T4 -
BIU T1 I
BIU T2 -
BIU T3 - ffff:00000000 Paged(p):000ffff1=02(); Normal(p):00000001=02()
BIU T4 - ffff:00000000
BIU T1 S ffff:00000000
BIU T2 - ffff:00000000
BIU T3 - ffff:00000000 Paged(p):000ffff2=00( ); Normal(p):00000002=00( )
BIU T4 - ffff:00000000
BIU T1 S ffff:00000000
BIU T2 - ffff:00000000
BIU T3 - ffff:00000000 Paged(p):000ffff3=00( ); Normal(p):00000003=00( )
BIU T4 - ffff:00000000
BIU T1 S ffff:00000000
BIU T2 - ffff:00000000
BIU T3 - ffff:00000000 Paged(p):000ffff4=f0(ð); Normal(p):00000004=f0(ð)
BIU T4 - ffff:00000000
BIU T1 S ffff:00000000
BIU T2 - ffff:00000000
BIU T3 E ffff:00000000 EA 02 00 00 F0 jmp f000:00000002 Physical(p):000f0002=b8(¸); Paged(p):000f0002=b8(¸); Normal(p):00000002=b8(¸)
Registers:
AX: 0000 BX: 0000 CX: 0000 DX: 0000
SP: 0000 BP: 0000 SI: 0000 DI: 0000
CS: ffff DS: 0000 ES: 0000 SS: 0000
IP: 0000 FLAGS: f002
Previous CS:IP: 0000:0000
FLAGSINFO: 1111oditsz0a0p1cR
BIU T4 -
BIU T1 -
BIU T2 I
BIU T3 - Paged(p):000f0003=00( ); Normal(p):00000003=00( )
BIU T4 - f000:00000002
BIU T1 S f000:00000002
BIU T2 - f000:00000002
BIU T3 - f000:00000002 Paged(p):000f0004=00( ); Normal(p):00000004=00( )
BIU T4 - f000:00000002
BIU T1 S f000:00000002
BIU T2 - f000:00000002
BIU T3 - f000:00000002 B8 00 00 mov ax,0000 Paged(p):000f0005=8e(Ž); Normal(p):00000005=8e(Ž)
Registers:
AX: 0000 BX: 0000 CX: 0000 DX: 0000
SP: 0000 BP: 0000 SI: 0000 DI: 0000
CS: f000 DS: 0000 ES: 0000 SS: 0000
IP: 0002 FLAGS: f002

Is 3 cycles for opcode EA correct?


Reply 72 of 122, by superfury


And the full cycle log of the 8088 metric batch using the BIOS COM ROM:

Attachment: debugger 8088 cycle count extracted 20230802_1031.7z (16.25 KiB)
File comment: 8088 metric cycle batch
File license: Fair use/fair dealing exception

Edit: Simplified with the extra clutter disabled in UniPCemu (advanced logging and register logs):

Attachment: debugger 8088 cycle count simplified.7z (12.99 KiB)
File comment: 8088 metric cycle batch simplified
File license: Fair use/fair dealing exception


Reply 73 of 122, by superfury


Just improved the instruction logging to:
- Log the executed instruction and debugging information once the execution finishes.
- Log the instruction address from the moment the execution starts until the execution finishes (so the entire execution phase is visible).

So the logs of a 2-cycle EU far JMP absolute address instruction would look like this, for example:
... (fetching phase, no address logged)
BIU T1 2000:0100 *first cycle of EU phase
BIU T2 2000:0100 JMP 2000:0105 *final cycle of EU phase
(next instruction starts)
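
With the address logged on every cycle of the execution phase, counting EU cycles per instruction from a log becomes a matter of counting lines. A rough sketch of such a counter follows; the regular expression is my own guess at the line shape based on the excerpts in this thread (an optional one-character status column between the T-state and the address), not an official description of the log format.

```python
import re

# "BIU T<n>", an optional one-character status column, then segment:offset.
T_STATE_LINE = re.compile(
    r"^BIU T\S+\s+(?:[-A-Z]\s+)?([0-9A-Fa-f]{4}:[0-9A-Fa-f]{4,8})\b"
)


def cycles_per_address(lines):
    """Count logged T-state lines per instruction address.

    Note: an address that is executed more than once (a loop) gets the
    cycles of all its executions added together.
    """
    counts = {}
    for line in lines:
        match = T_STATE_LINE.match(line.strip())
        if match:
            address = match.group(1).lower()
            counts[address] = counts.get(address, 0) + 1
    return counts


example = [
    "BIU T4 -",
    "BIU T1 2000:0100 *first cycle of EU phase",
    "BIU T2 2000:0100 JMP 2000:0105 *final cycle of EU phase",
]
print(cycles_per_address(example))
```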

Edit: New 8088 batch logged:

Attachment: debugger 8088 cycle count simplified v2.7z (15.28 KiB)
File comment: New log of the execution of the 8088 batch
File license: Fair use/fair dealing exception

Edit: And the fix for the instructions that were giving invalid disassembly immediate operands (because they weren't read yet before setting the debugger). The breakpoints inside UniPCemu are:
- F000:0108S (the point of the RETF instruction into the COM file RAM copy)
- F000:007DS (the point called during INT 20h)

They are mostly a bugfix for the 80-83h range of opcodes, which were generating invalid disassembly because the immediate wasn't loaded yet.

Attachment: debugger 8088 cycle count simplified v3.7z (15.54 KiB)
File comment: New log of the execution of the 8088 batch with fixed 80h-83h instruction disassembly.
File license: Fair use/fair dealing exception

Edit: Another version improvement. The final cycle of the prefetching (moved to after the REP-specific pre-EU timings if any are present) was being moved 1 cycle later than it should (ticking a cycle after the final PIQ fetch cycle or after the final cycle of the REP-specific timings when it shouldn't).

Attachment: debugger 8088 cycle count simplified v4.7z (15.19 KiB)
File comment: New log of the execution of the 8088 batch with fixed 80h-83h instruction disassembly and fixing the final cycle of fetching before the EU starts.
File license: Fair use/fair dealing exception

Though it includes the interrupt message (line 6195) due to advanced debugging, as well as the termination interrupt and its first instruction (at 2000:0000 and f000:007d).

I'm counting 6107 cycles from mov's fetching by the BIU(the fetching of the first instruction byte) through to the final cycle of pop word ds:[bx] at 2000:0441.
So it's 221 cycles short somehow?
Edit: Tried the bus sniffer method. Uploaded the original file from the zip file. Got no output.
Then thought again: perhaps it'll need some extra sniffer code?
And I was right (totally forgot about that!): https://www.reenigne.org/blog/isa-bus-sniffer-update/

Thus compiled it. It compiled without any issues right away; I just needed to remove the "../" on the include paths, replace the testRoutine code (including the ret) with an %include of the 8088raw02.asm code, and remove the org directive inside it (as it messes up the sniffer otherwise)!

Now that it's compiled, I should be able to get my sniffer log to compare! 😁

Though I didn't use the build.bat, but a simple Makefile instead (and nasm instead of yasm).

Edit: Hmmm... It only seems to execute roughly half of the code on the bus sniffer (151 out of 343 IP difference)?

Do beware however that the file is compiled in much the same way as the UniPCemu compile paths work (result being at ../projects_build/UniPCemu/8088raw02_sniffer.bin), using the Makefile.

Filename
8088raw02_snifferresults.zip
File size
25.52 KiB
Downloads
28 downloads
File comment
NASM files of the used binary on the bus sniffer and the result from the server.
File license
Fair use/fair dealing exception

It seems to stop at "CMP DL,AL" inside the batch, just before the 3E LODSW instruction (2048th cycle!).

Edit: Just fixed the BIOS COM ROM to use CS=DS=ES for the executable instead of DS=ES=3000h.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 74 of 122, by superfury

User metadata
Rank l33t++

I've been thinking... Is there a difference in bus stalls?

As in, is a BIU stall and BUS stall the same or are they different in some way internally to the BIU?

Edit: As a side note, just implemented DRAM refresh setup for the DMA and PIT (PIT channel 1 and DMA channel 0 only, others unaffected) into the startup of the BIOS COM ROM.
So that should fix it to work on real hardware as well, as well as having proper DRAM refresh when timing.
Edit: Just added support, with two extra variables, for specifying the COM ROM offset and length of the executable as well, allowing for loading from, for example, its own ROM area if required (like in the XT server for example, which can only load 1 ROM (although in RAM)).

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 75 of 122, by GloriousCow

User metadata
Rank Member
superfury wrote on 2023-08-04, 06:41:

I've been thinking... Is there a difference in bus stalls?

As in, is a BIU stall and BUS stall the same or are they different in some way internally to the BIU?

Edit: As a side note, just implemented DRAM refresh setup for the DMA and PIT (PIT channel 1 and DMA channel 0 only, others unaffected) into the startup of the BIOS COM ROM.
So that should fix it to work on real hardware as well, as well as having proper DRAM refresh when timing.
Edit: Just added support, with two extra variables, for specifying the COM ROM offset and length of the executable as well, allowing for loading from, for example, its own ROM area if required (like in the XT server for example, which can only load 1 ROM (although in RAM)).

I don't have a separate concept of a "bus stall"; I only handle BIU stalls (or suspensions, depending on your terminology).
These are considered different from fetch delays, in that BIU stalls will affect every bus operation after, not just fetches.

I'd be interested in comparing logic between you, reenigne and myself regarding stall/delay triggers and timing.
I do BIU stalls when
- a SUSP microcode operation is issued
- the prefetcher attempts a CODE fetch bus cycle when the queue is full

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc

Reply 76 of 122, by superfury

User metadata
Rank l33t++
GloriousCow wrote on 2023-08-04, 12:39:
superfury wrote on 2023-08-04, 06:41:

I've been thinking... Is there a difference in bus stalls?

As in, is a BIU stall and BUS stall the same or are they different in some way internally to the BIU?

Edit: As a side note, just implemented DRAM refresh setup for the DMA and PIT (PIT channel 1 and DMA channel 0 only, others unaffected) into the startup of the BIOS COM ROM.
So that should fix it to work on real hardware as well, as well as having proper DRAM refresh when timing.
Edit: Just added support, with two extra variables, for specifying the COM ROM offset and length of the executable as well, allowing for loading from, for example, its own ROM area if required (like in the XT server for example, which can only load 1 ROM (although in RAM)).

I don't have a separate concept of a "bus stall"; I only handle BIU stalls (or suspensions, depending on your terminology).
These are considered different from fetch delays, in that BIU stalls will affect every bus operation after, not just fetches.

I'd be interested in comparing logic between you, reenigne and myself regarding stall/delay triggers and timing.
I do BIU stalls when
- a SUSP microcode operation is issued
- the prefetcher attempts a CODE fetch bus cycle when the queue is full

Huh? A code fetch bus cycle when the queue is full? Can that even happen?
Doesn't the BIU perform a 1-cycle NOP essentially (no bus activity) when the PIQ is full and the EU isn't requesting any reads/write to bus/memory?

Although UniPCemu does perform those stalls, it performs two kinds of stalls essentially:
- RAM waitstate stalls: Happens during requesting memory or I/O(bus as I usually call it) with waitstates (like the 1-cycle I/O waitstate on XT, others apply to AT and up and the 386/486 XT/AT Inboard as well, which is configurable).
- EU-induced waitstates (done by jumps, calls etc., pretty much anything that affects IP as an EU instruction phase and not fetching (opcode, modr/m category on 80(1)86, immediate as well on 286+)).

I've just improved the BIU a bit. Now it won't claim the bus (preventing DMA) if the T1/T2 states or T3 waitstates are being performed. Thus allowing DMA to intervene during T3 or T1 (but not T4). It also allows DMA using HLDA if the T3 is performing waitstates on a cycle, delaying the next T3 without waitstate recheck to after the DMA transfer releases the bus.

Stalling the bus (EU-based) also prevents HLDA from being raised (forced low).

Edit: Just checked the stalls again:
- There's a waitstate that's just ticked always (waitstateRAM variable in the BIU and DMA controllers).
- There's a stallBUS variable that's ticked for opcodes/interrupts delaying the BIU for a certain amount of cycles (overriding all T1-T4 cycles to become -- cycles in the logging).
- There's a stallBIU variable that's blocking prefetches and RAM accesses (T3&T4 on 808x, T1 on 286+) that's used by all jumps and calls for the entire duration of the jmp/call timing of the final clocks of the instruction (for example 5 or 3 cycles at the end of a RET). Pretty much always after the jmp has been done and a few cycles depending on the instruction are ticked before the instruction finishes executing and the next instruction starts.
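The three counters above could be pictured roughly as follows. This is only a minimal Python sketch, not UniPCemu's actual code: the variable names come from the list above, but the priority order, the "Tw" and "stall" labels, and the idea that each counter simply counts down one per clock are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BIUStalls:
    """Hypothetical container for the three stall counters named in the post."""
    waitstateRAM: int = 0  # memory/I/O waitstates, always ticked
    stallBUS: int = 0      # opcode/interrupt delay, logged as "--" cycles
    stallBIU: int = 0      # jump/call tail, blocks prefetches and RAM accesses

    def tick(self, t_state: str) -> str:
        """Advance one clock and return the label this cycle would get in a log.

        The priority order below is an assumption, not taken from UniPCemu.
        """
        if self.waitstateRAM > 0:
            self.waitstateRAM -= 1
            return "Tw"
        if self.stallBUS > 0:
            self.stallBUS -= 1
            return "--"
        if self.stallBIU > 0:
            self.stallBIU -= 1
            return "stall"
        return t_state


# Example: the 5 stalled cycles at the end of a RETF, then the next T1.
stalls = BIUStalls(stallBIU=5)
labels = [stalls.tick("T1") for _ in range(6)]
print(labels)
```

Keeping the counters separate like this makes it easy to see in a log which kind of delay consumed a given cycle.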

Those 5 cycles of a RETF ending at the start of the log (I usually remove those before posting) can also be counted as such (those are the stallBIU variant of BIU delays).

Edit: Latest 8088tst3 results (table cleaned up for the next batch):

real:  disp1:  comp1:
FF43   FF38    <(-11)
FE59   FE48    <(-17)
FDC5   FDB3    <(-18)
FD58   FD44    <(-20)
FD2A   FD1A    <(-16)
FC6B   FC63    <(-8)
FBB7   FBAE    <(-9)
F9A9   F990    <(-25)
CPU test complete. Elapsed timer ticks:
07CA   0786    <(-68)

8088 MPH reports 1590 (5% deviation) metric cycles now.

Edit: A capture of Area 5150 running now:
https://www.dropbox.com/s/wwixn26egqyc7w9/Uni … 0-39-06.7z?dl=0

It seems to have improved slightly on the credits but is still wrong otherwise, mostly when rendering the text from left to right on the screen, and it also seems a bit unstable vertically in some parts?

It's recognisable now, at least.
Still, it's interesting that the credits are displayed 'correctly' (as in, mostly correct on the horizontal timings) with the same issue for the first block (if you can call it that) of the display (say the first 32 character clocks or something like that), but the Chaplin part (I don't know if it has an official name) still seems to be displayed mostly incorrectly.

Also, does anyone know what happens at the end of the UFO part? When running within UniPCemu, the UFO snaps to the top of active display, with the remaining layers immediately following it, but the black bars that widen until the whole screen fades out never appear. Is it performing something undocumented there?
Afaik, UniPCemu uses bit 3 of the CGA Mode Control Register to display the overscan color while rendering active display (overriding the active display signal)? Is that correct behaviour?
Edit: Actually, what UniPCemu does is the following with bit 3 of the CGA Mode control register:
- Clearing bit 3 sets the VGA Sequencer clocking mode register bit 5, which triggers blanking (basically inverting it) on active display and overscan.
- UniPCemu special-cases blanking during graphics mode to replace blanking with the overscan color.

Another interesting thing happens when UniPCemu displays the text 'can you teach an elephant to tap-dance' (this part: https://scalibq.files.wordpress.com/2022/09/s … roll.jpg?w=1024). Both above and below the area where the text scrolls into and out of display range (horizontally), where the thick-small-small lines (reversed at the bottom) should be, the entire area is shown as one big block of overscan? Is that supposed to happen? What is the application doing for those scanlines to appear that way?
I do see those weird black vertical scrolling blocks over the text at the left border, like in https://www.youtube.com/watch?v=O1j5ycBXXcc&t=331s , but the area above and below it (until the pure overscan blocks at the top and bottom of the screen) display overscan's color only for the entire area where the text is displayed? Can anyone explain what the demo does at those scanlines to render those thick-small-small and small-small-thick scanlines during the horizontal area where it displays the text? UniPCemu just displays all those scanlines as being the overscan color?

Perhaps the issue with overscan on the sine scroller (as Scali called it on his blog, here) and the fade-in and fade-out of the parallax scroller are the same issue in UniPCemu? Both are supposed to blank some scanlines, but not others, which obviously doesn't happen somehow?
https://www.dropbox.com/s/n7nuu2t41ss7eg7/Uni … 1-25-32.7z?dl=0

Edit: The ANSI animation at the end of the demo also fails somehow? The characters aren't displayed normally: they show up at full height while the text scrolls left and right on the screen, instead of at the proper height of about 16 lines (they seem to be full-blown 8-pixel text lines with full characters in them?).

Edit: Hmmm... Also, the second vector object sometimes looks like it's being rendered incorrectly in some text-mode style: the entire graphics is reduced to horizontal stripes and blocky blocks, like some low-res text being partly drawn in a weird color on a black background. It looks like this happens when the object is in the top 1/4 of the 3D vector area (whenever the border moves within the top 1/4 of the rendering area that the vector bobs up and down within).

The credits seem semi-stable, although the display is shaky and the left plant part (until roughly where "GARGAJ" starts, or the U of "Ut2,") is duplicated. The display is stable for the erasure part, but unstable while the text is rendered in the red bars (vertically it sometimes loses sync, and in between there is sometimes an interesting 'fill', like a percent bar going from 0 to 100%, with the left overscan filling with green pixels from a black screen). Vertically it seems to shake up and down by exactly 1 scanline when rendering the credits image, while horizontally shaking by exactly the width of the left side of the D character (the "|" part of "|)", exactly 4 or 8 pixels it seems?) at the same time? Also, the left border of the credits is filled with a duplicate of the first part of the rendered image and the water effect at the bottom, instead of being an overscan color? Can anyone explain what would cause this kind of weird behaviour?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 77 of 122, by GloriousCow

User metadata
Rank Member
superfury wrote on 2023-08-04, 15:48:
GloriousCow wrote on 2023-08-04, 12:39:

I do BIU stalls when
- a SUSP microcode operation is issued
- the prefetcher attempts a CODE fetch bus cycle when the queue is full

Huh? A code fetch bus cycle when the queue is full? Can that even happen?
Doesn't the BIU perform a 1-cycle NOP essentially (no bus activity) when the PIQ is full and the EU isn't requesting any reads/write to bus/memory?

You pack so many different ideas into your replies it's hard to respond to everything, so I will just address this question for now.

A code fetch bus cycle will not begin when the queue is full. However, the decision to start a code fetch is typically made during the previous bus cycle, which itself may have been a code fetch. The decision is made on T3, when, say, the queue length was 3 at that specific cycle. So the CPU schedules another fetch to occur two cycles later, on the next T1. The next cycle is then T4, and the byte that was read from the code fetch is placed into the queue, so the queue length is now 4 and the queue is full. See, the 8088 prefetcher is kinda dumb - it doesn't think "gee, the queue length is 3 and I'm currently fetching a byte, so I shouldn't schedule anything". It just sees the queue length of 3, so it assumes there is room.

Next cycle, it is now T1 (or would be), and the prefetch scheduler says a fetch should begin now. But the queue is full - so the fetch does not begin and we do not enter a CODE bus cycle.

Let me preface what I am about to say with a bit of a caveat: I cannot say for 100% sure that this is how the CPU logic works or what the correct terminology is. I can only say how I model the behavior to emulate things in a way that matches my observations from the real thing, and the terminology that I use is just what I came up with. Since we're in sort of undocumented territory, that's as good as you are probably going to get.

Anyway, I digress - In this state, on T1, with a scheduled fetch, and a full queue, instead of entering a CODE bus cycle, the BIU stalls. In this stalled state, all atomic requests from the EU are delayed 3 cycles, and prefetching is disabled. Prefetching will only resume when a byte has been read from the queue to make room for a new fetch. It takes 3 cycles to resume.
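To make the T3/T4/T1 sequence above concrete, here is a small Python sketch of the scheduling decision as described. It is an illustration of the model in this post, not MartyPC's code; the function name and the "stall"/"fetch" labels are made up, and it assumes the EU reads nothing from the queue between T3 and the next T1.

```python
QUEUE_SIZE = 4  # the 8088 prefetch queue is full at 4 bytes


def next_t1_action(queue_len_at_t3: int) -> tuple[int, str]:
    """Follow one CODE fetch from T3 to the next would-be T1.

    Returns (queue length after T4, action at the would-be T1).
    """
    if not 0 <= queue_len_at_t3 < QUEUE_SIZE:
        raise ValueError("a fetch can only be in flight while the queue has room")
    # T3: the scheduler sees room and schedules another fetch for the next T1,
    # ignoring the byte that is still in flight.
    # T4: the byte that was just read is pushed into the queue.
    queue_len = queue_len_at_t3 + 1
    # Would-be T1: the scheduled fetch can only start if there is still room.
    if queue_len >= QUEUE_SIZE:
        return queue_len, "stall"
    return queue_len, "fetch"


print(next_t1_action(3))  # queue fills on T4, so the BIU stalls
print(next_t1_action(2))  # still room, so a new CODE bus cycle begins
```

In this model the stall only appears when the fetch in flight is the one that fills the queue, which matches the queue-length-3 case described above.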

I don't expect you to just believe me, because this sounds pretty wild, but let's take a look at a hardware cycle trace courtesy of reenigne's xtserver:

00000	Ip....	70004	00	00	FD	01	.......							I	A4	MOVSB
00000 .C.... 70004 00 00 FD 01 .......
01217 .C.... 01217 00 00 FD 00 ....... T1
21217 .C.... 01217 FF 00 FD 03 ..r.... T2
212AD .p.... 01217 AD 00 FD 02 ..r.... T3 AD <-f [ 01217]
212AD .r.... 01217 AD 00 FD 02 ....... T4
70005 .r.... 70005 AD 00 FD 01 ....... T1
30005 .r.... 70005 7E 00 FD 00 ..r.... T2
30000 .p.... 70005 00 00 FD 03 ..r.... T3 00 <-- [DS 70005]
30000 .p.... 70005 00 00 FD 03 ....... T4
30000 .p.... 70005 00 00 FD 02 .......
30000 .p.... 70005 00 00 FD 01 .......
30000 .w.... 70005 00 00 FD 00 .......
70005 .w.... 70005 00 00 FD 00 ....... T1
00000 .w.... 70005 00 00 FD 03 ...w... T2
00000 .p.... 70005 00 00 FD 02 ...w... T3 00 --> [ES 70005]
00000 .p.... 70005 00 00 FD 01 ....... T4
00000 .p.... 70005 00 00 FD 01 .......
00000 .p.... 70005 00 00 FD 00 .......

MOVSB performs a read and a write. We can see the read cycle, T1-T4, on lines 7-10. The read operation is actually complete on Line 9, T3. On line 10, we execute a line of unrelated microcode - just something that allows the microcode for MOVS and LODS to be shared - and then there's an immediate write request from the EU on line 11. You'll notice though, that the bus cycle doesn't begin until line 14. We have to explain this delay. It's not time explained by microcode operations, so it has to be some sort of BIU delay.

The theory that I am modelling here, is that the EU request came in on line 11 - it is past the point where we could do a prefetch abort; so a code fetch tries to occur now, can't, and the BIU stalls, and so the write request incurs a 3 cycle delay. We see similar delays during instructions that use the SUSP microcode routine to suspend prefetching - I have a hunch it might be the same mechanism underlying both scenarios.

So what happens when the queue isn't full? Well, then the code fetch on line 11 can begin - and there are now four cycles and a full CODE bus cycle between the read and write. So ironically, the BIU stalling makes MOVSB execute faster; but we do get a penalty on the tail end by delaying the next code fetch.

Since stalling the BIU like this is undesirable, I have a hunch that the fetch delays encountered when the queue length is 3 in certain circumstances were intended to partially avoid this scenario.

EDIT: Ignore most of this post. I have since revised my BIU logic to simplify things: https://martypc.blogspot.com/2023/08/the-8088 … -algorithm.html

Last edited by GloriousCow on 2023-08-15, 15:39. Edited 1 time in total.

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc

Reply 78 of 122, by superfury

User metadata
Rank l33t++

Just implemented an extra timer on the BIU emulation of the 808x:
- When T3 is ticked (proceeding onto T4), it will set a flag if the prefetch isn't empty.
- When T1 arrives to tick and either said flag is set, or no request is made, an additional check is made before checking the requests from the EU:
-- (in both cases below the above flag is cleared, preventing it from retriggering until after the EU request finishes after this)
-- PIQ not full? Perform a prefetch instead.
-- PIQ full? Perform 4 idle clock cycles instead (thus taking T1-T4 cycles with idle bus).
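The steps above can be sketched as two small functions. This is a hedged Python illustration of the description only, not the actual UniPCemu implementation; the function names, the returned labels and the default queue size of 4 are assumptions. The idle length is a parameter because the post first uses 4 cycles and a later edit changes the stall to 3.

```python
def on_t3(piq_len: int) -> bool:
    """T3 ticked (moving on to T4): the flag is set if the PIQ isn't empty."""
    return piq_len > 0


def on_t1(flag: bool, eu_request: bool, piq_len: int,
          piq_size: int = 4, idle_cycles: int = 4) -> tuple[bool, str, int]:
    """Decide what the T1 slot does.

    Returns (new flag value, action, idle cycles spent).
    """
    if flag or not eu_request:
        # The extra check runs before the EU request is looked at,
        # and the flag is cleared in both outcomes.
        if piq_len < piq_size:
            return False, "prefetch", 0
        return False, "idle", idle_cycles
    return flag, "eu_request", 0


flag = on_t3(piq_len=3)
print(on_t1(flag, eu_request=True, piq_len=4))
```

With the flag set and a full queue, the pending EU request waits for the idle cycles before it gets the bus.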

Edit: 8088 MPH still reports 1590 metric cycles.
Edit: OK. There was an issue with both the cycle check on T3 (it checked T4 instead, which never happens), as well as not taking the minimum free size into account (so 2 or 4 bytes on newer CPUs, depending on BIU data bus size).
Edit: 8088 MPH now reports 1632 cycles (3% deviation).
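For reference, the deviation percentages can be reproduced with a quick calculation. This Python sketch assumes a real-hardware score of 1678 metric cycles; that reference value is not stated in this post and is an assumption here, as is the function name.

```python
REFERENCE_SCORE = 1678  # assumed real-hardware result, not taken from this post


def deviation_percent(reported: int, reference: int = REFERENCE_SCORE) -> float:
    """Absolute deviation of a reported score from the reference, in percent."""
    return abs(reported - reference) / reference * 100.0


for score in (1590, 1632):
    print(score, round(deviation_percent(score)), "%")
```

Under that assumption the two scores reported so far round to the 5% and 3% quoted above.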

The 8088 MPH sprite compiler starts blinking again during the final 2-4 scanlines of the sprite scrolling off the screen.

Edit: Hmmm.... 16/256 colors of 8088 MPH has no snow anymore? Only sometimes on the first row?
Edit: Fixed the T1 stall to be 3 cycles instead of 4.

Edit: 8088tst3 latest results:

real:  disp1:  comp1:   disp3:  comp3:
FF43   FF38    <(-11)   FF35    -14
FE59   FE48    <(-17)   FE3D    -28
FDC5   FDB3    <(-18)   FDA4    -33
FD58   FD44    <(-20)   FD33    -37
FD2A   FD1A    <(-16)   FD07    -35
FC6B   FC63    <(-8)    FC49    -34
FBB7   FBAE    <(-9)    FB92    -37
F9A9   F990    <(-25)   F96C    -61
CPU test complete. Elapsed timer ticks:
07CA   0786    <(-68)   07AF    -27
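The comp columns are just the emulated reading minus the real one. A small Python sketch of that subtraction, using the values from the table; the function name is made up, and it assumes the readings never wrap around 16 bits between the two machines.

```python
def tick_delta(real_hex: str, emulated_hex: str) -> int:
    """Emulated minus real reading, both given as 16-bit hex strings."""
    return int(emulated_hex, 16) - int(real_hex, 16)


# (real, disp1, disp3) rows copied from the table; the last row is the
# "Elapsed timer ticks" line.
ROWS = [
    ("FF43", "FF38", "FF35"),
    ("FE59", "FE48", "FE3D"),
    ("FDC5", "FDB3", "FDA4"),
    ("FD58", "FD44", "FD33"),
    ("FD2A", "FD1A", "FD07"),
    ("FC6B", "FC63", "FC49"),
    ("FBB7", "FBAE", "FB92"),
    ("F9A9", "F990", "F96C"),
    ("07CA", "0786", "07AF"),
]

for real, disp1, disp3 in ROWS:
    print(real, tick_delta(real, disp1), tick_delta(real, disp3))
```

Running it over a new batch of readings gives the next comp column without doing the hex arithmetic by hand.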

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 79 of 122, by GloriousCow

User metadata
Rank Member
superfury wrote on 2023-08-06, 11:52:

Just implemented an extra timer on the BIU emulation of the 808x:
- When T3 is ticked (proceeding onto T4), it will set a flag if the prefetch isn't empty.
- When T1 arrives to tick and either said flag is set, or no request is made, an additional check is made before checking the requests from the EU:
-- (in both cases below the above flag is cleared, preventing it from retriggering until after the EU request finishes after this)
-- PIQ not full? Perform a prefetch instead.
-- PIQ full? Perform 4 idle clock cycles instead (thus taking T1-T4 cycles with idle bus).

Bear in mind I model this as a state we enter; not a one time event.

It's perhaps most noticeable in string instructions with REP prefixes, for example a REP MOVSB will incur the 3 cycle delay after the first R/W iteration as the queue fills up. Once full, we now delay 3 extra cycles per iteration (effectively 18% of iteration time) because we do not 'resume' the BIU until a byte is actually read out of the queue again, which it won't until the operation is complete. There are other possible ways to model this logic - perhaps you could assume that the prefetcher still schedules on a full queue, and so the delay is explained by a fetch attempt each time instead of specifically delaying EU operations - then you wouldn't need to track state. I don't yet know how to test which underlying theory is correct.
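The difference between a persistent state and a one-time event can be sketched like this. It is a hedged Python illustration only, not MartyPC's code: the class and method names are hypothetical, and the 17 cycles per iteration used for the percentage is an assumed figure that is not stated in this post.

```python
STALL_DELAY = 3  # cycles an EU bus request waits while the BIU is stalled


class StalledBIU:
    """Stall modelled as a state that lasts until a byte is read from the queue."""

    def __init__(self) -> None:
        self.stalled = False

    def fetch_blocked_by_full_queue(self) -> None:
        self.stalled = True

    def queue_byte_read(self) -> None:
        self.stalled = False

    def eu_request_delay(self) -> int:
        return STALL_DELAY if self.stalled else 0


def rep_extra_cycles(iterations: int) -> int:
    """Extra cycles over a REP string op once the queue is full.

    Nothing reads the queue during the loop, so the stalled state persists.
    """
    biu = StalledBIU()
    biu.fetch_blocked_by_full_queue()
    return sum(biu.eu_request_delay() for _ in range(iterations))


print(rep_extra_cycles(10))              # extra cycles over 10 iterations
print(round(STALL_DELAY / 17 * 100))     # share of an assumed 17-cycle iteration
```

Because the state is only cleared by a queue read, every iteration pays the delay, which is what a one-shot flag would miss.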

I don't know if you can just add these delays into your code easily, since you are likely accounting for them in some sort of static cycle count already. To properly emulate all the 8088 bus delays, I think one pretty much has to model the microcode execution time exactly and let the BIU delay logic fill in the rest.

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc