VOGONS


First post, by Battler

User metadata
Rank Member

I'm currently working on getting 86Box's 808x emulation at least close to the real thing. I've already fixed the prefetch queue (at least enough that the 8088mph credits and Snatch-It now work) and brought the cycles detected by 8088mph to 1637 (not far off the expected 1678 +/- 10) after implementing the MUL/IMUL bit-set etc. cycle penalties per reenigne's blog, the EA calculation cycles, etc. But here's where I hit a roadblock: there's no information anywhere about DIV/IDIV or the REP stuff other than the official cycle counts, which I assume are not 100% exact. Does anyone have any measurements done on the real thing and, for the REPs specifically, some sort of flow chart showing exactly how they behave with respect to the prefetch queue and interrupts (and do NMIs and the TRAP interrupt get serviced during REP on an 808x)?

Edit: This is not committed yet (and will not be until I'm done); the last committed code still has 8088mph return 1485-1492 cycles and a broken prefetch queue.

Reply 1 of 26, by peterferrie

User metadata
Rank Oldbie

Yes, interrupts get serviced during REP, and that introduces the issue that if it has more than one prefix (CS: REP: xxx), one of the prefixes will be dropped. That causes the obvious effect of the wrong data being copied.
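
To make the dropped-prefix effect concrete, here is a minimal C sketch. It assumes (this is not stated in the post above) that the return address saved by the interrupt points at the last prefix byte before the string opcode, so only that one prefix survives the resume. `rep_resume_ip` and its parameters are hypothetical names for illustration, not 86Box or PCem functions.

```c
#include <stdint.h>

/*
 * Hypothetical helper: IP that gets pushed when an interrupt breaks into
 * a prefixed string instruction on an 808x.
 * ASSUMPTION: the CPU rewinds to the last prefix byte only, so any
 * earlier prefix is lost when the instruction is resumed.
 */
static uint16_t rep_resume_ip(uint16_t opcode_ip, int prefix_count)
{
    if (prefix_count == 0)
        return opcode_ip;              /* nothing to rewind to */
    return (uint16_t)(opcode_ip - 1);  /* last prefix byte before the opcode */
}
```

With `CS: REP MOVSB` starting at 0x0100 (opcode at 0x0102), this resumes at 0x0101, i.e. at `REP MOVSB` without the segment override, which is the wrong-data case described above.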
I have no idea about the cycle counts, though.

Reply 2 of 26, by Battler

User metadata
Rank Member

So the CPU loops like:
- Start timer period;
- Execute instruction (this includes fetching the opcode);
- Fill the prefetch queue based on instruction cycles + memory R/W cycles + any accumulated prefetch-induced cycles*;
- Take care of memory R/W cycles if any;
- End timer period;
- Service trap, NMI, and IRQ interrupts;
- Go back to the top of the loop.

* This should be done at the beginning, but the PCem code that I forked from was doing it at the end, in order to fetch the same number of bytes the real CPU would fetch during the instruction's execution, so I solved it by buffering 24 bytes from CS around the current prefetch IP and making the prefetch queue read from the buffer if the buffer is filled. That makes prefetch-modify-run sequences work (used by 8088mph Credits and Snatch-It, for example).

So I would imagine the REP loop would look like this:
- Start timer period;
- Execute string instruction;
- If the conditions for breaking out from the loop are present, break out from the loop and let the main loop take over;
- Otherwise, set IP back to the beginning of the REP instruction;
- Fill the prefetch queue based on instruction cycles + memory R/W cycles + any accumulated prefetch-induced cycles*;
- Take care of memory R/W cycles if any;
- End timer period;
- Service trap, NMI, and IRQ interrupts;
- Go back to the top of the loop.

So the question is - is this right? Is IP reset back after every REP iteration (except of course when the condition was met and REP is over) and the prefetch queue flushed due to that? I know that from the later steppings of the 386 onwards, this should be the case as the TRAP flag (which is used for example by DOS DEBUG.EXE) caused execution to be trapped after every iteration of the REP, but I know that earlier steppings of the 386 had a bug where this wasn't the case and it only stopped at the end of REP, and I have no idea what happens on 808x and 286.

Reply 3 of 26, by superfury

User metadata
Rank l33t++

As far as I know, REP instructions don't affect IP/EIP. Execution should just continue without any prefetch reads for as long as there's data left to process (depending on (E)CX and/or the zero flag).

So only the initial execution and executions after faults/interrupts are fetched from the PIQ. The instruction keeps its running state (not terminating the instruction) until either an interrupt/fault occurs or the REP finishes.

So even if you disabled the prefetch, the REP would still proceed, as no PIQ->EU transfers are done during its execution. Imagine the overhead if there were (up to two bytes of overhead for every REP transfer done): performance would go down the drain.

Essentially, only interrupts/faults reset (E)IP, while the PIQ is flushed by said interrupt only. Thus the same case as with many other exceptions returning to the opcode.
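
As a rough illustration of the model described in this post (not a verified description of the real silicon), the REP loop could be sketched in C like this. The `cpu_t` structure, `run_rep` and the single `irq_pending` flag are made-up names; the sketch ignores REPE/REPNE zero-flag termination and the multiple-prefix quirk mentioned earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced CPU state for illustration only. */
typedef struct {
    uint16_t ip;          /* instruction pointer */
    uint16_t cx;          /* repeat counter */
    bool     irq_pending; /* any interrupt/fault waiting to be serviced */
    unsigned fetches;     /* opcode fetches taken from the prefetch queue */
    unsigned iterations;  /* string elements processed */
} cpu_t;

/* Returns true if the REP ran to completion, false if it was interrupted. */
static bool run_rep(cpu_t *cpu, uint16_t insn_start, uint16_t insn_len)
{
    cpu->fetches++;                               /* fetched and decoded once */
    cpu->ip = (uint16_t)(insn_start + insn_len);  /* IP stays put while looping */

    while (cpu->cx != 0) {
        if (cpu->irq_pending) {
            cpu->ip = insn_start;  /* rewind so the return re-enters the REP */
            return false;          /* the interrupt itself flushes the queue */
        }
        cpu->iterations++;         /* one string element, no new opcode fetch */
        cpu->cx--;
    }
    return true;
}
```

The point of the sketch is that `fetches` stays at 1 no matter how many iterations run; IP is only touched again when an interrupt breaks in.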

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 4 of 26, by Battler

User metadata
Rank Member

Yes, but what about prefetch queue writes? Those happen at around 4 cycles per write to prefetch queue, for a maximum of 4 writes on 8088, and 6 on 8086. Though I guess I could refactor the code to do that at REP stop due to either condition met or interrupt.
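
The rule of thumb in this post (roughly 4 cycles per fetch, capped by the queue size) can be written down as a small C helper. This is only a sketch of that rule: `prefetch_fill` and its parameters are hypothetical names, and it counts every fetch as one byte, which matches the 8088; the 8086 fetches words, so it would need adjusting there.

```c
/*
 * Hypothetical helper: how many bytes the prefetcher can add to the queue
 * during `free_cycles` bus-idle cycles, given `queued` bytes already in a
 * queue that holds `queue_size` bytes (4 on the 8088, 6 on the 8086).
 * ASSUMPTION: one byte per 4 free cycles, no partial fetches carried over.
 */
static int prefetch_fill(int free_cycles, int queued, int queue_size)
{
    int by_time  = free_cycles / 4;     /* fetches that fit in the time */
    int by_space = queue_size - queued; /* room left in the queue */

    if (by_space < 0)
        by_space = 0;
    return by_time < by_space ? by_time : by_space;
}
```

For example, 10 free cycles with one byte already queued on an 8088 gives two more bytes; the leftover 2 cycles are simply dropped in this sketch, whereas a real implementation would presumably carry them over.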
Also, does the 8088 stop REP on TRAP? And I imagine it would on a NMI.

Reply 5 of 26, by superfury

User metadata
Rank l33t++

Well, since REP-prefixed instructions execute bus cycles (when available), I can imagine that during those the prefetch isn't prefetching. But during the remainder of the cycles it might, depending on the timings of the instruction being repeated. So it's probably switching between prefetches to the PIQ (while not full), EU accesses (e.g. MOVS) and occasionally DMA (when active). Those depend on the instruction's timings (EU) and the hardware timings respectively (there are just no PIQ->EU transfers, since the repeated instruction is buffered inside the EU somehow).

From the BIU and DMA perspective, nothing changes. They just keep running normally. So what differs in these cases is mainly the EU timings and the transfers to/from the BIU, which keep running on the EU's own timing (EU cycles).

NMI would trap, just like normal interrupts. It's just that while running, the EU would assume the same instruction is to be executed (reusing the previously loaded/decoded state) instead of fetching/decoding a next instruction (that's already been done, after all). So there's simply no PIQ fetch and decode. Everything else would work like normal sequential instructions: essentially a new opcode (interrupt check and state reset for a new instruction) that skips the fetch/decode phases (already done) and continues to the execution state, maybe with a check afterwards for completion. So essentially a hardware-type loop (in microcode-like instructions?) being applied?

Traps will probably be handled the same, except newer traps (e.g. interrupt 6+), which didn't exist yet back then.


Reply 6 of 26, by Battler

User metadata
Rank Member

By trap, I mean INT 1, the single-step trap enabled by the T flag. Do those stop the REP on 808x? I know they do from the later steppings of the 386 onwards, and I know they don't on earlier 386 steppings but it's marked as a defect, and I have no idea what happens on 808x and 286.

Reply 7 of 26, by Battler

User metadata
Rank Member

After fixing some REP stuff, making DMA 0 always call refreshread() even when the channel is otherwise inactive, and fixing the cycles of LODSW and REP LODSW, 8088mph reports 1642 cycles and this seems to work:
(Attachment: 20181122_044201.png)
However I now have two more questions:
1. On real hardware, does a read/write on DMA 0 always trigger a RAM refresh, even if the channel is turned off / no data is being transferred?
2. What happens when the refresh is triggered? Is anything done to the prefetch queue? I notice that this code calls FETCHCOMPLETE(), which adds a byte to the queue if there's space for it and inserts a few more cycles, but I'm not sure whether or not that's what happens on real hardware.

Reply 8 of 26, by superfury

User metadata
Rank l33t++

@Battler: congratulations on getting that part of the 8088 MPH demo working! I'm still a few cycles off in UniPCemu, though (implementing almost all timings from reenigne's original release in a block-processing way, using cycle delays). Can you tell us if it deviates from reenigne's findings and where?

That's a few cycles more than my emulation (163X afaik).

Also, buffering in non-byte(8088)/word(8086) PIQ prefetches can't be right, especially 20+ bytes at once?


Reply 9 of 26, by Battler

User metadata
Rank Member
superfury wrote:

Also, buffering in non-byte(8088)/word(8086) PIQ prefetches can't be right, especially 20+ bytes at once?

The prefetch queue operates correctly, at 4 or 6 bytes. It's just that the PCem code (which I started from) calls FETCHADD() (which adds the bytes to the prefetch queue) after the instruction has executed, so that the number of bytes to add to the queue can be determined from the instruction's cycles. So rather than trying to move the FETCHADD() calls to the beginning of the instruction execution and attempting to predict what the cycles would be in advance, I kept the calls where they are and worked around the issue by saving a snapshot (which I call a buffer, maybe I should rename it) of the 24 bytes around CS:IP as they were before the instruction executed, and adding bytes to the prefetch queue from that snapshot rather than from the current state of that part of memory. That way I make sure that when the 1 to 4 (or 6 on the 8086) bytes get added to the prefetch queue, they are the bytes that were there before the instruction executed. This makes anything that relies on prefetch-modify-run sequences work.
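
The snapshot workaround described above can be sketched roughly as follows. This is not the actual 86Box code: `snapshot_t`, `snapshot_take` and `prefetch_read` are hypothetical names, memory is modeled as a flat wrapping array, and for simplicity the 24 bytes are taken starting at the given address rather than "around" it.

```c
#include <stdint.h>

#define SNAP_LEN 24u  /* size of the pre-execution snapshot, as in the post */

typedef struct {
    uint8_t  bytes[SNAP_LEN];
    uint32_t base;   /* address corresponding to bytes[0] */
    int      valid;  /* nonzero once a snapshot has been taken */
} snapshot_t;

/* Copy SNAP_LEN bytes of code memory before the instruction executes. */
static void snapshot_take(snapshot_t *s, const uint8_t *mem,
                          uint32_t mem_size, uint32_t addr)
{
    for (uint32_t i = 0; i < SNAP_LEN; i++)
        s->bytes[i] = mem[(addr + i) % mem_size];
    s->base  = addr;
    s->valid = 1;
}

/* Byte for the prefetch queue: the old (snapshot) byte when available. */
static uint8_t prefetch_read(const snapshot_t *s, const uint8_t *mem,
                             uint32_t mem_size, uint32_t addr)
{
    uint32_t off = addr - s->base;  /* wraps to a huge value if addr < base */

    if (s->valid && off < SNAP_LEN)
        return s->bytes[off];
    return mem[addr % mem_size];
}
```

The idea is to call `snapshot_take` before executing an instruction and `prefetch_read` when filling the queue afterwards, so a write made by that instruction does not show up in bytes the real CPU would already have fetched.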

Also, my code deviates from reenigne's findings only in the CGA wait states: I currently have them fixed at 11, but that's not correct. Because of the way the emulation is implemented, whenever a read/write to video memory happens, the CGA timer is paused, because the timers are subordinate to the CPU instruction loop, which starts the timer period at the beginning of the instruction's execution and ends it when the execution is over. I guess I could find a way to determine the current hdot whenever a read/write to video memory occurs, but I have no idea how to then apply reenigne's findings.
The other thing I might not be getting right is the DMA channel 0 RAM refresh reads, which is why I ask when they happen and what they do on real hardware. In the original PCem code, RAM refresh is triggered on every DMA channel read/write (so also on channels 1, 2, and 3), but only when the DMA channel has been properly initialized and there is data to transfer. In my current code, RAM refresh is always triggered (so also when the channel is not initialized and has no data), but only on channel 0. I have no idea which of the two implementations is right, if either of them is right at all. In both, the RAM refresh read does a prefetch queue fill (FETCHCOMPLETE()) and adds 4 cycles to memcycs (the CPU cycles used for memory R/W, which are subtracted from the number of cycles given to FETCHADD()). And I have no idea how correct that is, either.
The rest seems to be OK - I implemented his findings about the MUL and IMUL timings, and I have kind of attempted to add them to DIV and IDIV too, based on an observation by scali made here on Vogons some time ago. And I implemented his findings about opcode aliases and that one opcode / MOD / R/M combination that returns the last effective address. I also have the undocumented SETALC instruction in, but I'm not sure if the timings are right (I can't find the timings for it anywhere). I have no idea if he discovered anything else, maybe I should go read his blog more. :p

Reply 10 of 26, by Battler

User metadata
Rank Member

OK, fixed DMA channel 0 so that it only triggers a refresh read when the DMA is programmed to read or write. reenigne's blog says the BIOS takes care of that, and my logging has verified that this indeed happens, so that's one thing settled. However, my other question remains: what's with that FETCHCOMPLETE()? I suppose that's to complete the prefetch queue filling before the CPU is interrupted by the refresh? But I'm not sure.

Reply 11 of 26, by reenigne

User metadata
Rank Oldbie

Wow, very cool! Does 8088 MPH run correctly now? I might need to get a move on with the next (even more sensitive) testcase!

As for the original question about multiply/divide timings - there was a thread here a while ago (8086 multiplication algorithm?) where I pointed superfury to my cycle-exact multiply and divide code. The REP code there should be cycle-exact as well, though is less thoroughly tested. SALC should be there too, and I don't think there's anything complicated about the timings of that.

I haven't done any testing with NMI or trap yet, but I'd be surprised if they were different from other hardware interrupts in terms of the timing.

A DMA read on channel 0 is required to refresh all system RAM. Reading from RAM will also refresh, but will only refresh the bank that you read from (other banks won't see the refresh). A DMA write from channel 0 I'm not sure about - I have a nasty feeling that it may write the same value to all banks (but would also refresh those rows at the same time).

The CGA wait states can be done by finding how many CPU cycles between some arbitrary point (power up, say) and the CPU cycle where the CGA access happens, modulo 16. Then look up the result in an array which is some permutation of {3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8}. Some playing around may be needed to determine the permutation corresponding to the "wait 8 hdots, wait for the next 16 hdot boundary, wait for the next CPU cycle to start" algorithm that I've previously described.
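
A minimal C sketch of that lookup is below. The table holds the values listed above in the order they were given, which is only a placeholder: the correct permutation still has to be determined by experiment, as the post says. `cga_wait_table`, `cga_wait_states` and the free-running cycle counter are assumptions for illustration.

```c
#include <stdint.h>

/*
 * PLACEHOLDER ORDER: these are the sixteen values from the post, but the
 * real table is some permutation of them that still has to be found.
 */
static const uint8_t cga_wait_table[16] = {
    3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8
};

/*
 * Wait states for a CGA memory access that starts on the given CPU cycle.
 * ASSUMPTION: cycles_since_reference counts up monotonically from some
 * fixed reference point (e.g. power-up) and never resets.
 */
static unsigned cga_wait_states(uint64_t cycles_since_reference)
{
    return cga_wait_table[cycles_since_reference & 15u];
}
```

Because only the count modulo 16 matters, the reference point is arbitrary as long as it never moves relative to the CGA clock.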

"RAM refresh read does a prefetch queue fill" does not sound right to me at all. The RAM refreshing is all done by the motherboard and the CPU doesn't know anything about it. The prefetch queue is entirely inside the CPU and the motherboard doesn't know anything about it. A CPU bus access (including memory, including prefetch) can delay or be delayed by a DMA bus access (including refresh).

Let me know if there are any questions in this thread that I missed.

There's more that I have discovered that I haven't written up on my blog yet. Please feel free to ask me questions here to save yourself from having to trawl through my blog.

Reply 12 of 26, by Battler

User metadata
Rank Member

So far it appears to run fine, except that the credits have no sound but that's because the emulation of the PIT and the PC speaker leaves a lot to be desired, so it's on my list to be eventually either rewritten or ported from DOSBox where I heard PIT mode 1 (which I presume the credits use) works.
And thanks for linking to that thread, found the link to your code there, and I'm certainly going to look at that.

The CGA wait states can be done by finding how many CPU cycles between some arbitrary point (power up, say) and the CPU cycle where the CGA access happens,

I have the cycles variable, which is the total cycles of one block of execx86 (so CPU clock speed in Hz divided by 100) minus the cycles used by the instructions executed up to that point, it's globally accessible, so I guess I could do cycles & 0xf and then use that as the array index.

"RAM refresh read does a prefetch queue fill" does not sound right to me at all. The RAM refreshing is all done by the motherboard and the CPU doesn't know anything about it. The prefetch queue is entirely inside the CPU and the motherboard doesn't know anything about it. A CPU bus access (including memory, including prefetch) can delay or be delayed by a DMA bus access (including refresh).

Yeah, this simulates that, it basically adds 4 cycles to the execution time, and then runs FETCHCOMPLETE() which seems to be roughly equivalent to a FETCHADD(4), so one fetch, as fetches are, even according to the official 808x documentation from Intel, one per 4 cycles. And I think this is done because clockhardware() which ends the timer period (and causes all device timers, including the PIT which then writes to DMA refresh) to be executed, is called after the prefetch queue byte adding is done. Maybe I should move it to before and get rid of the FETCHCOMPLETE()? I'll try.

Reply 13 of 26, by reenigne

User metadata
Rank Oldbie
Battler wrote:

So far it appears to run fine, except that the credits have no sound but that's because the emulation of the PIT and the PC speaker leaves a lot to be desired, so it's on my list to be eventually either rewritten or ported from DOSBox where I heard PIT mode 1 (which I presume the credits use) works.

The credits use PIT mode 0 - that way we don't need to touch the gate to start the pulse, we just load the PIT value and it immediately starts counting down, so it's just a single byte write to port 0x42 per sample.
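
For emulator authors, the mode 0 behavior relied on here can be sketched in C as follows. This is a deliberately simplified model, not PCem/86Box/DOSBox code: the names are made up, the gate input is assumed to be high, only the single-byte (LSB) load used by the credits is handled, and the one-clock delay before a newly written count is loaded is ignored.

```c
#include <stdint.h>

/* Hypothetical, simplified state of one PIT channel in mode 0. */
typedef struct {
    uint16_t count;  /* current counter value */
    int      out;    /* OUT pin: 0 while counting, 1 after terminal count */
} pit_mode0_t;

/* A single-byte write to the channel's data port (0x42 for channel 2). */
static void pit_mode0_load(pit_mode0_t *ch, uint8_t lsb)
{
    ch->count = lsb;  /* high byte is zero with LSB-only access */
    ch->out   = 0;    /* writing a count drives OUT low, no gate toggle needed */
}

/* One PIT input clock. */
static void pit_mode0_clock(pit_mode0_t *ch)
{
    if (ch->out)
        return;        /* pulse finished, wait for the next load */
    ch->count--;       /* a loaded 0 wraps and acts as the maximum count */
    if (ch->count == 0)
        ch->out = 1;   /* terminal count: OUT goes high and stays high */
}
```

Each sample written this way produces one low pulse whose length is proportional to the sample value, which is why a single port write per sample is enough.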

And thanks for linking to that thread, found the link to your code there, and I'm certainly going to look at that.

Battler wrote:

I have the cycles variable, which is the total cycles of one block of execx86 (so CPU clock speed in Hz divided by 100) minus the cycles used by the instructions executed up to that point, it's globally accessible, so I guess I could do cycles & 0xf and then use that as the array index.

Sounds like it should work!

Battler wrote:

Yeah, this simulates that, it basically adds 4 cycles to the execution time, and then runs FETCHCOMPLETE() which seems to be roughly equivalent to a FETCHADD(4), so one fetch, as fetches are, even according to the official 808x documentation from Intel, one per 4 cycles. And I think this is done because clockhardware() which ends the timer period (and causes all device timers, including the PIT which then writes to DMA refresh) to be executed, is called after the prefetch queue byte adding is done. Maybe I should move it to before and get rid of the FETCHCOMPLETE()? I'll try.

The number of wait states introduced by a DRAM refresh can be anything from 0 to 6 cycles (assuming no wait state for the refresh memory access itself). The zero case happens when the bus would otherwise be idle. The other cases depend on when the refresh happens with respect to the adjacent CPU bus accesses.

Reply 14 of 26, by Battler

User metadata
Rank Member

The number of wait states introduced by a DRAM refresh can be anything from 0 to 6 cycles (assuming no wait state for the refresh memory access itself). The zero case happens when the bus would otherwise be idle. The other cases depend on when the refresh happens with respect to the adjacent CPU bus accesses.

How exactly do I determine when to return how many cycles for the DRAM refresh wait state?

Reply 15 of 26, by reenigne

User metadata
Rank Oldbie
Battler wrote:

How exactly do I determine when to return how many cycles for the DRAM refresh wait state?

It's complicated! Here are some sniffer logs I made that show the possible timings:

DMA states:
sDREQ
sHRQ, sHoldWait
sAEN
s0
s1
s2
s3
sWait
s4
sDelayedT1
sDelayedT2
sDelayedT3



20FFF .p... 00F16 FF 00 FC .......
20FFF Ip... 00F16 FF 00 FC ....... I
20FF1 SC... 00F16 FF 00 FC ....... S F6E1 MUL CL
00F17 .C... 00F17 FF 00 FC ....... T1
20F17 .C... 00F17 FF 00 FC ..r.... T2
20FF6 .p... 00F17 F6 00 FC ..r.... T3 F6 <-f [ 00F17]
20FF6 .C... 00F17 F6 00 FC ....... T4
00F18 .C... 00F18 F6 00 FC ....... T1
20F18 .C... 00F18 FF 01 FC ..r.... T2 S0 DREQ
20FE1 .p... 00F18 E1 01 FC ..r.... T3 S0 E1 <-f [ 00F18] passive
20FE1 .p... 00F18 E1 01 FC .....D. T4 S0 AEN
20FE1 .p... 00F18 E1 01 FC .....D. S0
20FE1 .p... 00204 E1 10 FC .....D. S1 DACK
20FE1 .p... 00204 E1 10 FC ..r..D. S2
20FE1 .p... 00204 E1 10 FC .Wr..D. S3 E1 <-d [ 00204]
20FE1 .p... 00204 E1 00 FC .....D. S4 -DACK
20FE1 .p... 00F18 E1 00 FC ....... -AEN

20FFF .p... 00F18 FF 00 FC .......
20FFF Ip... 00F18 FF 00 FC ....... I
20FF1 SC... 00F18 FF 00 FC ....... S F6E1 MUL CL
00F19 .C... 00F19 FF 00 FC ....... T1
20F19 .C... 00F19 FF 00 FC ..r.... T2
20FF6 .p... 00F19 F6 00 FC ..r.... T3 F6 <-f [ 00F19]
20FF6 .C... 00F19 F6 00 FC ....... T4
00F1A .C... 00F1A F6 01 FC ....... T1 S0 DREQ
20F1A .C... 00F1A FF 01 FC ..r.... T2 S0
20FE1 .p... 00F1A E1 01 FC ..r.... T3 S0 E1 <-f [ 00F1A] passive
20FE1 .p... 00F1A E1 01 FC .....D. T4 S0 AEN
20FE1 .p... 00F1A E1 01 FC .....D. S0
20FE1 .p... 00205 E1 10 FC .....D. S1 DACK
20FE1 .p... 00205 E1 10 FC ..r..D. S2
20FE1 .p... 00205 E1 10 FC .Wr..D. S3 E1 <-d [ 00205]
20FE1 .p... 00205 E1 00 FC .....D. S4 -DACK
20FE1 .p... 00F1A E1 00 FC ....... -AEN

20FFF .p... 00F1A FF 00 FC .......
20FFF Ip... 00F1A FF 00 FC ....... I
20FF1 SC... 00F1A FF 00 FC ....... S F6E1 MUL CL
00F1B .C... 00F1B FF 00 FC ....... T1
20F1B .C... 00F1B FF 00 FC ..r.... T2
20FF6 .p... 00F1B F6 00 FC ..r.... T3 F6 <-f [ 00F1B]
20FF6 .C... 00F1B F6 01 FC ....... T4 S0 DREQ
00F1C .C... 00F1C F6 01 FC ....... T1 S0
20F1C .C... 00F1C FF 01 FC ..r.... T2 S0
20FE1 .p... 00F1C E1 01 FC ..r.... T3 S0 E1 <-f [ 00F1C] passive
20FE1 .p... 00F1C E1 01 FC .....D. T4 S0 AEN
20FE1 .p... 00F1C E1 01 FC .....D. S0
20FE1 .p... 00206 E1 10 FC .....D. S1 DACK
20FE1 .p... 00206 E1 10 FC ..r..D. S2
20FE1 .p... 00206 E1 10 FC .Wr..D. S3 E1 <-d [ 00206]
20FE1 .p... 00206 E1 00 FC .....D. S4 -DACK
20FE1 .p... 00F1C E1 00 FC ....... -AEN

20FFF .p... 00F1C FF 00 FC .......
20FFF Ip... 00F1C FF 00 FC ....... I
20FF1 SC... 00F1C FF 00 FC ....... S F6E1 MUL CL
00F1D .C... 00F1D FF 00 FC ....... T1
20F1D .C... 00F1D FF 00 FC ..r.... T2
20FF6 .p... 00F1D F6 01 FC ..r.... T3 S0 F6 <-f [ 00F1D] DREQ, passive
20FF6 .C... 00F1D F6 01 FC ....... T4 S0
00F1E .C... 00F1E F6 01 FC .....D. T1 S0 AEN
20F1E .C... 00F1E F6 01 FC .....D. T2 S0
20F1E .C... 00207 F6 10 FC .....D. T3 S1 DACK
20F1E .C... 00207 F6 10 FC ..r..D. Tw S2
20F1E .C... 00207 F6 10 FC .Wr..D. Tw S3 F6 <-d [ 00207]
20F1E .C... 00207 F6 00 FC .....D. Tw S4 -DACK
20FF6 .C... 00F1E F6 00 FC ..r.... Tw -AEN T1'
20FE1 .C... 00F1E E1 00 FC ..r.... Tw T2'
20FE1 .p... 00F1E E1 00 FC ..r.... Tw E1 <-f [ 00F1E] T3'
20FE1 .p... 00F1E E1 00 FC ....... T4

20FFF .p... 00F1E FF 00 FC .......
20FFF Ip... 00F1E FF 00 FC ....... I
20FF1 SC... 00F1E FF 00 FC ....... S F6E1 MUL CL
00F1F .C... 00F1F FF 00 FC ....... T1
20F1F .C... 00F1F FF 01 FC ..r.... T2 S0 DREQ
20FF6 .p... 00F1F F6 01 FC ..r.... T3 S0 F6 <-f [ 00F1F] passive
20FF6 .C... 00F1F F6 01 FC .....D. T4 S0 AEN
00F20 .C... 00F1F F6 01 FC .....D. T1 S0
20F20 .C... 00208 F6 10 FC .....D. T2 S1 DACK
20F20 .C... 00208 F6 10 FC ..r..D. T3 S2
20F20 .C... 00208 F6 10 FC .Wr..D. Tw S3 F6 <-d [ 00208]
20F20 .C... 00208 F6 00 FC .....D. Tw S4 -DACK
20FF6 .C... 00F20 F6 00 FC ..r.... Tw -AEN T1'
20FE1 .C... 00F20 E1 00 FC ..r.... Tw T2'
20FE1 .p... 00F20 E1 00 FC ..r.... Tw E1 <-f [ 00F20] T3'
20FE1 .p... 00F20 E1 00 FC ....... T4

20FFF .p... 00F20 FF 00 FC .......
20FFF Ip... 00F20 FF 00 FC ....... I
20FF1 SC... 00F20 FF 00 FC ....... S F6E1 MUL CL
00F21 .C... 00F21 FF 01 FC ....... T1 S0 DREQ
20F21 .C... 00F21 FF 01 FC ..r.... T2 S0
20FF6 .p... 00F21 F6 01 FC ..r.... T3 S0 F6 <-f [ 00F21] passive
20FF6 .C... 00F21 F6 01 FC .....D. T4 S0 AEN
00F22 .C... 00F21 F6 01 FC .....D. T1 S0
20F22 .C... 00209 F6 10 FC .....D. T2 S1 DACK
20F22 .C... 00209 F6 10 FC ..r..D. T3 S2
20F22 .C... 00209 F6 10 FC .Wr..D. Tw S3 F6 <-d [ 00209]
20F22 .C... 00209 F6 00 FC .....D. Tw S4 -DACK
20FF6 .C... 00F22 F6 00 FC ..r.... Tw -AEN T1'
20FE1 .C... 00F22 E1 00 FC ..r.... Tw T2'
20FE1 .p... 00F22 E1 00 FC ..r.... Tw E1 <-f [ 00F22] T3'
20FE1 .p... 00F22 E1 00 FC ....... T4

20FFF .p... 00F22 FF 00 FC .......
20FFF Ip... 00F22 FF 00 FC ....... I
20FF1 SC... 00F22 FF 01 FC ....... S0 S F6E1 MUL CL DREQ
00F23 .C... 00F23 FF 01 FC ....... T1 S0
20F23 .C... 00F23 FF 01 FC ..r.... T2 S0
20FF6 .p... 00F23 F6 01 FC ..r.... T3 S0 F6 <-f [ 00F23] passive
20FF6 .C... 00F23 F6 01 FC .....D. T4 S0 AEN
00F24 .C... 00F23 F6 01 FC .....D. T1 S0
20F24 .C... 0020A F6 10 FC .....D. T2 S1 DACK
20F24 .C... 0020A F6 10 FC ..r..D. T3 S2
20F24 .C... 0020A F6 10 FC .Wr..D. Tw S3 F6 <-d [ 0020A]
20F24 .C... 0020A F6 00 FC .....D. Tw S4 -DACK
20FF6 .C... 00F24 F6 00 FC ..r.... Tw -AEN T1'
20FE1 .C... 00F24 E1 00 FC ..r.... Tw T2'
20FE1 .p... 00F24 E1 00 FC ..r.... Tw E1 <-f [ 00F24] T3'
20FE1 .p... 00F24 E1 00 FC ....... T4

20FFF .p... 00F24 FF 00 FC .......
20FFF Ip... 00F24 FF 01 FC ....... S0 I DREQ, passive
20FF1 SC... 00F24 FF 01 FC ....... S0 S F6E1 MUL CL
00F25 .C... 00F25 FF 01 FC .....D. T1 S0 AEN
20F25 .C... 00F25 FF 01 FC .....D. T2 S0
20F25 .C... 0020B FF 10 FC .....D. T3 S1 DACK
20F25 .C... 0020B FF 10 FC ..r..D. Tw S2
20F25 .C... 0020B FF 10 FC .Wr..D. Tw S3 FF <-d [ 0020B]
20F25 .C... 0020B FF 00 FC .....D. Tw S4 -DACK
20FFF .C... 00F25 FF 00 FC ..r.... Tw -AEN T1'
20FF6 .C... 00F25 F6 00 FC ..r.... Tw T2'
20FF6 .p... 00F25 F6 00 FC ..r.... Tw F6 <-f [ 00F25] T3'
20FF6 .C... 00F25 F6 00 FC ....... T4
00F26 .C... 00F26 F6 00 FC ....... T1
20F26 .C... 00F26 FF 00 FC ..r.... T2
20FE1 .p... 00F26 E1 00 FC ..r.... T3 E1 <-f [ 00F26]
20FE1 .p... 00F26 E1 00 FC ....... T4

20FFF .p... 00F26 FF 00 FC .......
20FFF .p... 00F26 FF 01 FC ....... S0 DREQ, passive
20FFF Ip... 00F26 FF 01 FC ....... S0 I
20FF1 SC... 00F26 FF 01 FC .....D. S0 S F6E1 MUL CL AEN
00F27 .C... 00F26 FF 01 FC .....D. T1 S0
20F27 .C... 0020C FF 10 FC .....D. T2 S1 DACK
20F27 .C... 0020C FF 10 FC ..r..D. T3 S2
20F27 .C... 0020C FF 10 FC .Wr..D. Tw S3 FF <-d [ 0020C]
20F27 .C... 0020C FF 00 FC .....D. Tw S4 -DACK
20FFF .C... 00F27 FF 00 FC ..r.... Tw -AEN T1'
20FF6 .C... 00F27 F6 00 FC ..r.... Tw T2'
20FF6 .p... 00F27 F6 00 FC ..r.... Tw F6 <-f [ 00F27] T3'
20FF6 .C... 00F27 F6 00 FC ....... T4
00F28 .C... 00F28 F6 00 FC ....... T1
20F28 .C... 00F28 FF 00 FC ..r.... T2
20FE1 .p... 00F28 E1 00 FC ..r.... T3 E1 <-f [ 00F28]
20FE1 .p... 00F28 E1 00 FC ....... T4

20FFF .p... 00F28 FF 00 FC .......
20FFF .p... 00F28 FF 01 FC ....... S0 DREQ, passive
20FFF .p... 00F28 FF 01 FC ....... S0
20FFF Ip... 00F28 FF 01 FC .....D. S0 I AEN
20FF1 SC... 00F28 FF 01 FC .....D. S0 S F6E1 MUL CL
00F29 .C... 0020D FF 10 FC .....D. T1 S1 DACK
20F29 .C... 0020D FF 10 FC ..r..D. T2 S2
20F29 .C... 0020D FF 10 FC .Wr..D. T3 S3 FF <-d [ 0020D]
20F29 .C... 0020D FF 00 FC .....D. Tw S4 -DACK
20FFF .C... 00F29 FF 00 FC ..r.... Tw -AEN T1'
20FF6 .C... 00F29 F6 00 FC ..r.... Tw T2'
20FF6 .p... 00F29 F6 00 FC ..r.... Tw F6 <-f [ 00F29] T3'
20FF6 .C... 00F29 F6 00 FC ....... T4
00F2A .C... 00F2A F6 00 FC ....... T1
20F2A .C... 00F2A FF 00 FC ..r.... T2
20FE1 .p... 00F2A E1 00 FC ..r.... T3 E1 <-f [ 00F2A]
20FE1 .p... 00F2A E1 00 FC ....... T4

20FFF .p... 00F2A FF 00 FC .......
20FFF .p... 00F2A FF 01 FC ....... S0 DREQ, passive
20FFF .p... 00F2A FF 01 FC ....... S0
20FFF .p... 00F2A FF 01 FC .....D. S0 AEN
20FFF Ip... 00F2A FF 01 FC .....D. S0 I
20FF1 SC... 0020E FF 10 FC .....D. S1 S F6E1 MUL CL DACK
00F2B .C... 0020E FF 10 FC ..r..D. T1 S2
20F2B .C... 0020E FF 10 FC .Wr..D. T2 S3 FF <-d [ 0020E]
20F2B .C... 0020E FF 00 FC .....D. T3 S4 -DACK
20FFF .C... 00F2B FF 00 FC ..r.... Tw -AEN T1'
20FF6 .C... 00F2B F6 00 FC ..r.... Tw T2'
20FF6 .p... 00F2B F6 00 FC ..r.... Tw F6 <-f [ 00F2B] T3'
20FF6 .C... 00F2B F6 00 FC ....... T4
00F2C .C... 00F2C F6 00 FC ....... T1
20F2C .C... 00F2C FF 00 FC ..r.... T2
20FE1 .p... 00F2C E1 00 FC ..r.... T3 E1 <-f [ 00F2C]
20FE1 .p... 00F2C E1 00 FC ....... T4

20FFF .p... 00F2C FF 00 FC .......
20FFF .p... 00F2C FF 01 FC ....... S0 DREQ, passive
20FFF .p... 00F2C FF 01 FC ....... S0
20FFF .p... 00F2C FF 01 FC .....D. S0 AEN
20FFF .p... 00F2C FF 01 FC .....D. S0
20FFF Ip... 0020F FF 10 FC .....D. S1 I DACK
20FF1 SC... 0020F FF 10 FC ..r..D. S2 S F6E1 MUL CL
00F2D .C... 0020F FF 10 FC .Wr..D. T1 S3 FF <-d [ 0020F]
20F2D .C... 0020F FF 00 FC .....D. T2 S4 -DACK
20FFF .C... 00F2D FF 00 FC ..r.... T3 -AEN T1'
20FF6 .C... 00F2D F6 00 FC ..r.... Tw T2'
20FF6 .p... 00F2D F6 00 FC ..r.... Tw F6 <-f [ 00F2D] T3'
20FF6 .C... 00F2D F6 00 FC ....... T4
00F2E .C... 00F2E F6 00 FC ....... T1
20F2E .C... 00F2E FF 00 FC ..r.... T2
20FE1 .p... 00F2E E1 00 FC ..r.... T3 E1 <-f [ 00F2E]
20FE1 .p... 00F2E E1 00 FC ....... T4

20FFF .p... 00F2E FF 00 FC .......
20FFF .p... 00F2E FF 01 FC ....... S0 DREQ, passive
20FFF .p... 00F2E FF 01 FC ....... S0
20FFF .p... 00F2E FF 01 FC .....D. S0 AEN
20FFF .p... 00F2E FF 01 FC .....D. S0
20FFF .p... 00210 FF 10 FC .....D. S1 DACK
20FFF Ip... 00210 FF 10 FC ..r..D. S2 I
20FF1 SC... 00210 FF 10 FC .Wr..D. S3 FF <-d [ 00210] S F6E1 MUL CL
00F2F .C... 00210 FF 00 FC .....D. T1 S4 -DACK
20F2F .C... 00F2F FF 00 FC ..r.... T2 -AEN T1'
20FF6 .C... 00F2F F6 00 FC ..r.... T3 T2'
20FF6 .p... 00F2F F6 00 FC ..r.... Tw F6 <-f [ 00F2F] T3'
20FF6 .C... 00F2F F6 00 FC ....... T4
00F30 .C... 00F30 F6 00 FC ....... T1
20F30 .C... 00F30 FF 00 FC ..r.... T2
20FE1 .p... 00F30 E1 00 FC ..r.... T3 E1 <-f [ 00F30]
20FE1 .p... 00F30 E1 00 FC ....... T4

20FFF .p... 00F30 FF 00 FC .......
20FFF .p... 00F30 FF 01 FC ....... S0 DREQ, passive
20FFF .p... 00F30 FF 01 FC ....... S0
20FFF .p... 00F30 FF 01 FC .....D. S0 AEN
20FFF .p... 00F30 FF 01 FC .....D. S0
20FFF .p... 00211 FF 10 FC .....D. S1 DACK
20FFF .p... 00211 FF 10 FC ..r..D. S2
20FFF Ip... 00211 FF 10 FC .Wr..D. S3 FF <-d [ 00211] I
20FF1 SC... 00211 FF 00 FC .....D. S4 S F6E1 MUL CL -DACK
00F31 .C... 00F31 FF 00 FC ....... T1 -AEN
20F31 .C... 00F31 FF 00 FC ..r.... T2
20FF6 .p... 00F31 F6 00 FC ..r.... T3 F6 <-f [ 00F31]
20FF6 .C... 00F31 F6 00 FC ....... T4
00F32 .C... 00F32 F6 00 FC ....... T1
20F32 .C... 00F32 FF 00 FC ..r.... T2
20FE1 .p... 00F32 E1 00 FC ..r.... T3 E1 <-f [ 00F32]
20FE1 .p... 00F32 E1 00 FC ....... T4

20FFF .p... 00F32 FF 00 FC .......
20FFF .p... 00F32 FF 01 FC ....... S0 DREQ, passive
20FFF .p... 00F32 FF 01 FC ....... S0
20FFF .p... 00F32 FF 01 FC .....D. S0 AEN
20FFF .p... 00F32 FF 01 FC .....D. S0
20FFF .p... 00212 FF 10 FC .....D. S1
20FFF .p... 00212 FF 10 FC ..r..D. S2
20FFF .p... 00212 FF 10 FC .Wr..D. S3 FF <-d [ 00212]
20FFF Ip... 00212 FF 00 FC .....D. S4 I
20FF1 SC... 00F32 FF 00 FC ....... S F6E1 MUL CL
00F33 .C... 00F33 FF 00 FC ....... T1
20F33 .C... 00F33 FF 00 FC ..r.... T2
20FF6 .p... 00F33 F6 00 FC ..r.... T3 F6 <-f [ 00F33]
20FF6 .C... 00F33 F6 00 FC ....... T4
00F34 .C... 00F34 F6 00 FC ....... T1
20F34 .C... 00F34 FF 00 FC ..r.... T2
20FE1 .p... 00F34 E1 00 FC ..r.... T3 E1 <-f [ 00F34]
20FE1 .p... 00F34 E1 00 FC ....... T4

There are always at least 4 (sometimes 5 or 6) cycles of DREQ before the DACK goes active


60000 .p... 00000 00 00 FC 6 .W..... Tw 00 --> port[0000]
60000 .W... 00000 00 01 FC 6 ....... T4 DREQ
00001 .W... 00001 00 01 FC 7 ....... T1
60000 .W... 00001 00 01 FC 7 .W..... T2
60000 .W... 00001 00 01 FC 6 .W..... T3
60000 .p... 00001 00 01 FC 6 .W..... Tw 00 --> port[0001] passive
60000 IC... 00001 00 01 FC 7 ....... T4 I preAEN
10A90 .C... 10A90 00 01 FC 7 .....D. T1 S0 AEN
60A90 .C... 10A90 00 01 FC 6 .....D. T2 S1
60A90 .C... 00000 00 10 FC 6 .....D. T3 S2
60A90 .C... 00000 00 10 FC 7 ..r..D. Tw S3
60A90 .C... 00000 00 10 FC 7 .Wr..D. Tw S4 00 <-d [ 00000]
60A90 .C... 00000 00 00 FC 6 .....D. Tw
60A00 .C... 10A90 00 00 FC 6 ..r.... Tw
60A00 .C... 10A90 00 00 FC 7 ..r.... Tw
60A00 .p... 10A90 00 00 FC 7 ..r.... Tw 00 <-f [ 10A90]
60A00 .C... 10A90 00 00 FC 6 ....... T4
10A91 .C... 10A91 00 00 FC 6 ....... T1
60A91 SC... 10A91 FD 00 FC 7 ..r.... T2 S 0400 ADD AL, 00
60AEB .p... 10A91 EB 00 FC 7 ..r.... T3 EB <-f [ 10A91]
60AEB .C... 10A91 EB 00 FC 6 ....... T4

There's also code in https://github.com/reenigne/reenigne/blob/mas … 088/xtce/xtce.h that emulates this with cycle accuracy, but I'm afraid it works by emulating the bus, CPU and DMAC cycle-by-cycle; there's no function that returns the number of cycles (0 to 6) to wait. The _dmaState switch in BusEmulator::wait() (line 1814) is probably the interesting bit.

Reply 16 of 26, by superfury

User metadata
Rank l33t++

@reenigne: such a cycle-by-cycle method is also used in UniPCemu(using the 14MHz clock ticks or hardware's own oscillator(e.g. sound blaster PCM). But since I don't have any prebuffer on the PIQ fetching data, combined with non-100% cycle accuracy on the EU, the SMC on the credits crashes executing partially modified/unmodified code.

The CGA runs like you said (all those delays), essentially adding wait states until the horizontal conditions are met one by one (within the video renderer) on T3/T4 (I don't remember which one at the moment, but it should be correct).

So the main issue in my emulator is that the remaining EU timings are still slightly incorrect, causing all the scanline-racing issues and the credits crash.

Perhaps I could make a cycle log of 8088 MPH with EU timing added. That way we might find out what's timed incorrectly. So a cycle log with the EU timing on an extra tab (the EU delays being executed, i.e. the counter of remaining EU clocks for the running instruction)?

Edit: Actually, I've been thinking for a bit. Assuming timings are identical (e.g. both running at 3 MIPS in IPS clocking mode), how about redirecting console output (or a pipe) between emulators to validate the opcodes executed? So pipe the disassembly, registers etc. (in a common log format) from one emulator to another (e.g. cmd /C "UniPCemu_x64.exe pipeout | pcem.exe pipein") and let the latter verify the instructions against its own state? Could that also be done in cycle-accurate mode?
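The comparison half of that idea is small enough to sketch. The code below assumes both emulators emit one text record per executed instruction in the same format; that record format, the function name and the sample data are all made up for illustration and do not come from either emulator.

```python
from itertools import zip_longest

def first_divergence(trace_a, trace_b):
    """Return (index, record_a, record_b) for the first differing record,
    or None when both traces match. If one trace ends early, the missing
    record is reported as None."""
    for i, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a is None or b is None or a.strip() != b.strip():
            return i, a, b
    return None

# Hypothetical records: CS:IP, disassembly, then a register dump.
reference = ["F000:E05B CLI AX=0000", "F000:E05C MOV AL,01 AX=0001"]
candidate = ["F000:E05B CLI AX=0000", "F000:E05C MOV AL,01 AX=0000"]

print(first_divergence(reference, candidate))
```

In a piped setup, one of the two arguments could be sys.stdin, since any iterable of lines works. Comparing whole records keeps the checker independent of what each emulator chooses to log, at the cost of requiring both to agree on the format exactly.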

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 17 of 26, by superfury

User metadata
Rank l33t++
Rank
l33t++

Also, UniPCemu's DMA emulation only implements the states SI, S0, S1, S2, S3 and S4, where S4 can become SI (or go even further, into S0), depending on whether a block transfer is running (i.e. on whether the bus is released).
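One possible reading of that state sequence, as a sketch: SI is the idle state, S0 waits for the CPU to grant the bus, S1 to S4 carry out the transfer, and S4 either drops back to SI or goes straight to S0 when a request is still pending. This is not UniPCemu's code; the function names and the exact S4 rule are assumptions made for illustration.

```python
from enum import Enum

class DmaState(Enum):
    SI = 0  # idle, no request pending
    S0 = 1  # request seen, waiting for the CPU to release the bus
    S1 = 2
    S2 = 3
    S3 = 4
    S4 = 5  # last cycle of a transfer

def next_state(state, dreq, hlda):
    """Advance the (simplified) state machine by one DMA clock."""
    if state is DmaState.SI:
        return DmaState.S0 if dreq else DmaState.SI
    if state is DmaState.S0:
        return DmaState.S1 if hlda else DmaState.S0
    if state is DmaState.S4:
        # Assumption: skip the idle state when another request is pending.
        return DmaState.S0 if dreq else DmaState.SI
    return DmaState(state.value + 1)  # S1 -> S2 -> S3 -> S4

def run(inputs):
    """Feed (dreq, hlda) pairs, one per clock, and return the visited states."""
    state = DmaState.SI
    trace = [state.name]
    for dreq, hlda in inputs:
        state = next_state(state, dreq, hlda)
        trace.append(state.name)
    return trace

single_transfer = [(True, False), (True, False), (True, True),
                   (False, True), (False, True), (False, True), (False, False)]
print(run(single_transfer))
```

Wait states inserted between S3 and S4 are left out here, so a model like this cannot by itself reproduce the Tw cycles visible in a hardware cycle log.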

https://bitbucket.org/superfury/unipcemu/src/ … dma.c?at=master

The only things not emulated in UniPCemu (compared to reenigne's code) are the odd things (BIU T-cycle-based emulation) in instructions like HLT etc.

8088MPH reports 1539 cycles in my emulator.

Since 8088 MPH requires 1678(+/- 10) cycles, it's about 139 cycles off. Now where are those? What instruction is generating not enough cycles(are those HLT instructions etc. even used?)?
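The gap is plain subtraction. A small sketch that also checks the +/- 10 window, using only the figures quoted in this post (the function name is made up for illustration):

```python
EXPECTED_CYCLES = 1678  # value 8088 MPH expects
TOLERANCE = 10          # accepted deviation in cycles, either way

def cycle_gap(measured):
    """Return (missing_cycles, within_tolerance) for a measured count."""
    missing = EXPECTED_CYCLES - measured
    return missing, abs(missing) <= TOLERANCE

print(cycle_gap(1539))  # the UniPCemu figure from this post
```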

Edit: Hmmmm....
https://github.com/reenigne/reenigne/blob/mas … efrens.asm#L569

Seems like there's a HLT in there! Could that be what's throwing off most of the loop in my case?


Reply 18 of 26, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie
superfury wrote:
Also, UniPCemu's DMA emulation only implements the states SI, S0, S1, S2, S3 and S4, where S4 can become SI (or go even further, into S0), depending on whether a block transfer is running (i.e. on whether the bus is released).

https://bitbucket.org/superfury/unipcemu/src/ … dma.c?at=master

The only things not emulated in UniPCemu (compared to reenigne's code) are the odd things (BIU T-cycle-based emulation) in instructions like HLT etc.

8088MPH reports 1539 cycles in my emulator.

Since 8088 MPH requires 1678(+/- 10) cycles, it's about 139 cycles off. Now where are those? What instruction is generating not enough cycles(are those HLT instructions etc. even used?)?

Edit: Hmmmm....
https://github.com/reenigne/reenigne/blob/mas … efrens.asm#L569

Seems like there's a HLT in there! Could that be what's throwing off most of the loop in my case?

Does the code between kefrensScanline and kefrensScanlineEnd take 304 cycles? If not, then the problem isn't (only) with HLT.
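For reference, the 304 figure is one CGA scanline expressed in CPU cycles: a scanline lasts 912 ticks of the 14.318 MHz clock, and the 8088 runs at a third of that clock. A minimal check of that arithmetic, with constant and function names that are illustrative rather than taken from the kefrens source:

```python
HDOTS_PER_SCANLINE = 912   # 14.318 MHz clock ticks per CGA scanline
HDOTS_PER_CPU_CYCLE = 3    # 14.318 MHz / 3 = 4.77 MHz CPU clock

def cpu_cycles_per_scanline():
    """CPU cycles available in one CGA scanline."""
    return HDOTS_PER_SCANLINE // HDOTS_PER_CPU_CYCLE

def scanline_error(measured_cycles):
    """Cycles by which a measured loop body misses one scanline (0 = exact)."""
    return measured_cycles - cpu_cycles_per_scanline()

print(cpu_cycles_per_scanline())
print(scanline_error(300))
```

An emulator that measures anything other than zero error for that loop will drift against the raster by that many cycles per scanline.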

Chances are that most of the problems are in the interaction between memory/port bus accesses and prefetch bus accesses. How the CPU decides exactly when to start each seems to be extremely complicated. Way more so than it has any right to be - there's a 560-line function in XTCE (busInit()) which handles this, all determined by trial and error, and it seems like it should be just a handful of lines. I'm wondering if there are some vestiges of the 8086's 16-bit BIU and prefetch queue that are causing some of this complexity. I will have to have a play and see if I can come up with a simpler algorithm once I've reworked it so that I can test one version of XTCE against another rather than against the real hardware (slow) or against a massive file full of testcase measurement data (unwieldy - the file would be too big to fit into RAM with all the variations I really want to test).

Reply 19 of 26, by superfury

User metadata
Rank l33t++
Rank
l33t++

@reenigne: If your EU cycle counts for those instructions are correct (the totals, that is; the split into a part before and a part after the memory access is ignored, with the cycles only being applied after the memory transaction, e.g. for mov al,[bx]), then those should be working (theoretically). The CRTC (and everything related to it, like the CPU wait states for horizontal timing) should be functioning correctly, seeing as the 4K colors part still works.

I'd need a starting CS:IP address (i.e. segment:offset) to check said code, though.
