808x MUL/IMUL/DIV/IDIV/REP cycles/operation \ VOGONS

Reply 1 of 26, by peterferrie

Posted on 2018-11-21, 20:47

peterferrie Offline

Rank: Oldbie
Posts: 649
Joined: 2008-05-08, 21:54

Yes, interrupts get serviced during REP, and that introduces the issue that if it has more than one prefix (CS: REP: xxx), one of the prefixes will be dropped. That causes the obvious effect of the wrong data being copied.
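The prefix loss can be pictured with a small sketch. It assumes the commonly documented 8086/8088 behaviour that an interrupted REP instruction is resumed from the last prefix byte only; the post itself only says that one prefix is dropped. `is_prefix` and `resume_offset` are hypothetical helper names, not taken from any emulator.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero for 8086/8088 prefix bytes: segment overrides, LOCK, REPNE, REP. */
static int is_prefix(uint8_t b)
{
    return b == 0x26 || b == 0x2E || b == 0x36 || b == 0x3E ||
           b == 0xF0 || b == 0xF2 || b == 0xF3;
}

/* Hypothetical helper: offset (from the first prefix byte) at which an
   interrupted string instruction resumes, ASSUMING only the last prefix
   is kept. Every prefix before that offset is lost on resume. */
static size_t resume_offset(const uint8_t *insn, size_t len)
{
    size_t n = 0;
    while (n < len && is_prefix(insn[n]))
        n++;
    return n > 0 ? n - 1 : 0;
}
```

For `2E F3 A4` (CS: REP MOVSB) this gives offset 1, so under that assumption the resumed instruction is a plain REP MOVSB reading from DS instead of CS, which would match the "wrong data being copied" symptom.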
I have no idea about the cycle counts, though.

Reply 2 of 26, by Battler

Posted on 2018-11-21, 21:15

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

So the CPU loops like:
- Start timer period;
- Execute instruction (this includes fetching the opcode);
- Fill the prefetch queue based on instruction cycles + memory R/W cycles + any accumulated prefetch-induced cycles*;
- Take care of memory R/W cycles if any;
- End timer period;
- Service trap, NMI, and IRQ interrupts;
- Go back to the top of the loop.

* This should be done at the beginning, but the PCem code that I forked from was doing it at the end, in order to fetch the same number of bytes the real CPU would fetch during the instruction's execution, so I solved it by buffering 24 bytes from CS around the current prefetch IP and making the prefetch queue read from the buffer if the buffer is filled. That makes prefetch-modify-run sequences work (used by 8088mph Credits and Snatch-It, for example).

So I would imagine the REP loop would look like this:
- Start timer period;
- Execute string instruction;
- If the conditions for breaking out from the loop are present, break out from the loop and let the main loop take over;
- Otherwise, set IP back to the beginning of the REP instruction;
- Fill the prefetch queue based on instruction cycles + memory R/W cycles + any accumulated prefetch-induced cycles*;
- Take care of memory R/W cycles if any;
- End timer period;
- Service trap, NMI, and IRQ interrupts;
- Go back to the top of the loop.

So the question is - is this right? Is IP reset back after every REP iteration (except of course when the condition was met and REP is over) and the prefetch queue flushed due to that? I know that from the later steppings of the 386 onwards, this should be the case as the TRAP flag (which is used for example by DOS DEBUG.EXE) caused execution to be trapped after every iteration of the REP, but I know that earlier steppings of the 386 had a bug where this wasn't the case and it only stopped at the end of REP, and I have no idea what happens on 808x and 286.

Reply 3 of 26, by superfury

Posted on 2018-11-21, 22:42

superfury Offline

Rank: l33t++
Posts: 5489
Joined: 2014-03-08, 11:25
Location: Netherlands

As far as I know, REP instructions don't affect IP/EIP. It should just continue to run without any prefetch reads for as long as there's data left to process(depending on (E)CX and/or zero flags).

So only the initial execution and executions after faults/interrupts are fetched from the PIQ. The instruction keeps its running state (not terminating the instruction) until either an interrupt/fault occurs or the REP finishes.

So even if you'd disable the prefetch, the REP would still process, as no PIQ->EU transfers are done during its execution. Imagine the required overhead if it did (up to two bytes of overhead for every REP transfer done). Then performance would go down the drain.

Essentially, only interrupts/faults reset (E)IP, while the PIQ is flushed by said interrupt only. Thus the same case as with many other exceptions returning to the opcode.
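A minimal sketch of this model, assuming a toy CPU state: the instruction is fetched and decoded once, iterations run without touching the queue, and only an interrupt rewinds IP and flushes the queue. The `Cpu` struct, `run_rep` and the `irq_after` parameter are hypothetical and exist only for illustration.

```c
#include <stdint.h>

typedef struct {
    uint16_t ip;            /* past the instruction while it is running */
    uint16_t cx;            /* remaining iteration count */
    unsigned fetches;       /* times the opcode was fetched and decoded */
    unsigned queue_flushes; /* times the prefetch queue was flushed */
} Cpu;

/* Runs a REP-prefixed string instruction located at insn_start.
   irq_after: number of completed iterations after which an interrupt
   becomes pending (negative means never). Returns 1 if interrupted. */
static int run_rep(Cpu *cpu, uint16_t insn_start, uint16_t insn_len,
                   int irq_after)
{
    int done = 0;

    cpu->ip = (uint16_t)(insn_start + insn_len); /* fetched/decoded once */
    cpu->fetches++;

    while (cpu->cx != 0) {
        if (irq_after >= 0 && done == irq_after) {
            cpu->ip = insn_start;   /* return address is the instruction */
            cpu->queue_flushes++;   /* flushed by the interrupt, not by REP */
            return 1;
        }
        cpu->cx--;                  /* one string-operation iteration */
        done++;
    }
    return 0;
}
```

With no interrupt, five iterations cost a single fetch and no flush; with an interrupt after three iterations, IP points back at the instruction and CX holds the two remaining iterations.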

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 4 of 26, by Battler

Posted on 2018-11-21, 23:07

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

Yes, but what about prefetch queue writes? Those happen at around 4 cycles per write to prefetch queue, for a maximum of 4 writes on 8088, and 6 on 8086. Though I guess I could refactor the code to do that at REP stop due to either condition met or interrupt.
Also, does the 8088 stop REP on TRAP? And I imagine it would on a NMI.

Reply 5 of 26, by superfury

Posted on 2018-11-22, 00:53

superfury Offline

Rank: l33t++
Posts: 5489
Joined: 2014-03-08, 11:25
Location: Netherlands

Well, since REP-prefixed instructions execute bus cycles (when available), I can imagine that during those, the prefetcher isn't prefetching. But during the remainder of the cycles it might, depending on the timings of the repeating instruction. So it's probably switching between prefetches to the PIQ (while not full), EU accesses (e.g. MOVS) and occasionally DMA (when active). And those depend on the instruction's timings (EU) and the hardware timings accordingly (just no PIQ->EU transfers, due to repeated instructions being buffered inside the EU somehow).

From the BIU perspective and the DMA perspective, nothing changes. They just keep running normally. So it's mainly the EU timings and the transfers to/from the BIU, which keep running on their own timing (EU cycles), that are different in said cases.

NMI would trap, just like normal interrupts. It's just that while running, the EU would assume the same instruction is to be executed(reusing the previously loaded/decoded state again) instead of fetching/decoding a next instruction(already done after all). So simply no PIQ fetch and decode. Everything else would work like normal sequential instructions(essentially a new opcode(interrupt check & state reset for a new instruction), skipping fetch/decode(already done) phases and continuing to execution state, maybe a check afterwards to check for completion). So essentially a hardware-type loop(in microcode-like instructions?) being applied?

Traps will probably be handled the same way, except newer traps (e.g. interrupt 6+), which didn't exist yet back then.


Reply 6 of 26, by Battler

Posted on 2018-11-22, 01:08

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

By trap, I mean INT 1, the single-step trap enabled by the T flag. Do those stop the REP on 808x? I know they do from the later steppings of the 386 onwards, and I know they don't on earlier 386 steppings but it's marked as a defect, and I have no idea what happens on 808x and 286.

Reply 7 of 26, by Battler

Posted on 2018-11-22, 03:47

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

After fixing some REP stuff, making DMA 0 always call refreshread() even when the channel is otherwise inactive, and fixing the cycles of LODSW and REP LODSW, 8088mph reports 1642 cycles and this seems to work.
However I now have two more questions:
1. On real hardware, does a read/write on DMA 0 always trigger a RAM refresh, even if the channel is turned off / no data is being transferred?
2. What happens when the refresh is triggered? Is anything done to the prefetch queue? I notice that this code calls FETCHCOMPLETE() which adds a byte into the queue if there's space for it and inserts a few more cycles, but I'm not sure whether or not that's what happens on real hardware.

Reply 8 of 26, by superfury

Posted on 2018-11-22, 08:00

superfury Offline

Rank: l33t++
Posts: 5489
Joined: 2014-03-08, 11:25
Location: Netherlands

@Battler: congratulations on the working part of the 8088MPH demo! I'm still a few cycles off in UniPCemu, though (implementing almost all timings from reenigne's original release in a block-processing way, using cycle delays). Can you tell us if it deviates from reenigne's findings and where?

Those cycles are a few more than my emulation(163X afaik).

Also, buffering in non-byte(8088)/word(8086) PIQ prefetches can't be right, especially 20+ bytes at once?


Reply 9 of 26, by Battler

Posted on 2018-11-22, 13:17

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

superfury wrote:
Also, buffering in non-byte(8088)/word(8086) PIQ prefetches can't be right, especially 20+ bytes at once?

The prefetch queue operates correctly, 4 or 6 bytes. It's just that the PCem code (which I started from) calls FETCHADD() (which adds the bytes to the prefetch queue) after the instruction has executed, which it does so that the number of bytes to be added to the queue is determined based on the instruction's cycles. So rather than trying to move the FETCHADD() calls to the beginning of the instruction execution and attempting to predict what the cycles would be in advance, I kept the calls where they are, and worked around the issue by saving a snapshot (which I call buffer, maybe I should rename it) of the 24 bytes around CS:IP as they were before the instruction executed, and adding bytes to the prefetch queue from that snapshot rather than from the current state of that part of memory. That way I make sure that when the 1 to 4 (or 6 on the 8086) bytes get added to the prefetch queue, they will be the bytes that were there before the instruction executed. This makes anything that relies on the prefetch-modify-execute sequence work.
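The snapshot workaround described above could look roughly like the following sketch. It is not the actual PCem-derived code: `PrefetchSnapshot`, `snapshot_take` and `prefetch_read` are hypothetical names, memory is a flat byte array, and for simplicity the 24-byte window starts at the given address instead of surrounding it.

```c
#include <stdint.h>
#include <string.h>

#define SNAP_SIZE 24

typedef struct {
    uint8_t  bytes[SNAP_SIZE]; /* memory as it was before the instruction */
    uint32_t base;             /* address of bytes[0] */
    int      valid;
} PrefetchSnapshot;

/* Capture SNAP_SIZE bytes at addr before the instruction executes.
   The caller must ensure addr + SNAP_SIZE stays inside mem. */
static void snapshot_take(PrefetchSnapshot *s, const uint8_t *mem,
                          uint32_t addr)
{
    memcpy(s->bytes, mem + addr, SNAP_SIZE);
    s->base = addr;
    s->valid = 1;
}

/* Byte to push into the prefetch queue: taken from the snapshot when it
   covers addr, otherwise from live memory. */
static uint8_t prefetch_read(const PrefetchSnapshot *s, const uint8_t *mem,
                             uint32_t addr)
{
    if (s->valid && addr >= s->base && addr - s->base < SNAP_SIZE)
        return s->bytes[addr - s->base];
    return mem[addr];
}
```

If an instruction overwrites a byte that the real CPU would already have prefetched, a late queue fill through `prefetch_read` still returns the old byte, which is the behaviour self-modifying code such as the 8088 MPH credits depends on.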

Also, my code deviates from reenigne's findings only in the CGA wait states, I currently have them fixed to 11, but that's not correct. Because of the way the emulation is implemented, whenever a read/write to video memory happens, the CGA timer is paused because the timers are subordinate to the CPU instruction loop which starts the period at the beginning of the instruction execution, and ends them when the execution is over. I guess I could find a way to determine the current hdot whenever a read/write to video memory occurs, but how to then apply reenigne's findings, I have no idea.
The other thing I might not be getting right is the DMA channel 0 RAM refresh reads, which is why I ask when they happen and what they do on real hardware. On the original PCem code, RAM refresh is triggered on every DMA channel read/write (so also on channels 1, 2, and 3), but only when the DMA channel has been properly initialized and there is data to transfer. On my current code, RAM refresh is triggered always (so also when the channel is not initialized and the channel has no data), but only on channel 0. And I have no idea which of the two implementations is right, if either of them is right at all. And on both, the RAM refresh read does a prefetch queue fill (FETCHCOMPLETE()) and adds 4 cycles to memcycs (which are CPU cycles used for memory R/W that are to be subtracted from the amount of cycles to give to FETCHADD()). And I have no idea how correct that is, either.
The rest seems to be OK - I implemented his findings about the MUL and IMUL timings, and I have kind of attempted to add them to DIV and IDIV too, based on an observation by scali made here on Vogons some time ago. And I implemented his findings about opcode aliases and that one opcode / MOD / R/M combination that returns the last effective address. I also have the undocumented SETALC instruction in, but I'm not sure if the timings are right (I can't find the timings for it anywhere). I have no idea if he discovered anything else, maybe I should go read his blog more. :p

Reply 10 of 26, by Battler

Posted on 2018-11-22, 13:33

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

OK, fixed DMA channel 0 so that it only triggers a refresh read when the DMA is programmed to read or write. reenigne's blog says the BIOS takes care of that, and my logging has verified that that indeed happens, so that's one thing settled. However, my other question remains - what's with that FETCHCOMPLETE()? I suppose that's to complete the prefetch queue filling before the CPU is interrupted by the refresh? But I'm not sure.

Reply 11 of 26, by reenigne

Posted on 2018-11-22, 15:11

reenigne Offline

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

Wow, very cool! Does 8088 MPH run correctly now? I might need to get a move on with the next (even more sensitive) testcase!

As for the original question about multiply/divide timings - there was a thread here a while ago (8086 multiplication algorithm?) where I pointed superfury to my cycle-exact multiply and divide code. The REP code there should be cycle-exact as well, though is less thoroughly tested. SALC should be there too, and I don't think there's anything complicated about the timings of that.

I haven't done any testing with NMI or trap yet, but I'd be surprised if they were different from other hardware interrupts in terms of the timing.

A DMA read on channel 0 is required to refresh all system RAM. Reading from RAM will also refresh, but will only refresh the bank that you read from (other banks won't see the refresh). A DMA write from channel 0 I'm not sure about - I have a nasty feeling that it may write the same value to all banks (but would also refresh those rows at the same time).

The CGA wait states can be done by finding how many CPU cycles between some arbitrary point (power up, say) and the CPU cycle where the CGA access happens, modulo 16. Then look up the result in an array which is some permutation of {3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8}. Some playing around may be needed to determine the permutation corresponding to the "wait 8 hdots, wait for the next 16 hdot boundary, wait for the next CPU cycle to start" algorithm that I've previously described.
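The lookup described above can be sketched as follows. The table holds the values listed in the post, but their order here is only a placeholder: the correct permutation still has to be determined as described. `cga_wait_table` and `cga_wait_states` are hypothetical names.

```c
#include <stdint.h>

/* Values from the post. The ORDER is a placeholder, not the measured
   permutation for the "wait 8 hdots, next 16-hdot boundary, next CPU
   cycle" algorithm. */
static const uint8_t cga_wait_table[16] = {
    3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8
};

/* cycles_since_reference: CPU cycles elapsed between a fixed reference
   point (power-up, say) and the cycle where the CGA access happens. */
static unsigned cga_wait_states(uint64_t cycles_since_reference)
{
    return cga_wait_table[cycles_since_reference % 16];
}
```

The counter has to be a monotonic cycle count taken from one fixed reference point, so that the phase relative to the 16-cycle pattern is preserved across accesses.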

"RAM refresh read does a prefetch queue fill" does not sound right to me at all. The RAM refreshing is all done by the motherboard and the CPU doesn't know anything about it. The prefetch queue is entirely inside the CPU and the motherboard doesn't know anything about it. A CPU bus access (including memory, including prefetch) can delay or be delayed by a DMA bus access (including refresh).

Let me know if there's any questions in this thread that I missed.

There's more that I have discovered that I haven't written up on my blog yet. Please feel free to ask me questions here to save yourself from having to trawl through my blog.

Reply 12 of 26, by Battler

Posted on 2018-11-22, 15:25

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

So far it appears to run fine, except that the credits have no sound but that's because the emulation of the PIT and the PC speaker leaves a lot to be desired, so it's on my list to be eventually either rewritten or ported from DOSBox where I heard PIT mode 1 (which I presume the credits use) works.
And thanks for linking to that thread, found the link to your code there, and I'm certainly going to look at that.

The CGA wait states can be done by finding how many CPU cycles between some arbitrary point (power up, say) and the CPU cycle where the CGA access happens,

I have the cycles variable, which is the total cycles of one block of execx86 (so CPU clock speed in Hz divided by 100) minus the cycles used by the instructions executed up to that point, it's globally accessible, so I guess I could do cycles & 0xf and then use that as the array index.

"RAM refresh read does a prefetch queue fill" does not sound right to me at all. The RAM refreshing is all done by the motherboard and the CPU doesn't know anything about it. The prefetch queue is entirely inside the CPU and the motherboard doesn't know anything about it. A CPU bus access (including memory, including prefetch) can delay or be delayed by a DMA bus access (including refresh).

Yeah, this simulates that, it basically adds 4 cycles to the execution time, and then runs FETCHCOMPLETE() which seems to be roughly equivalent to a FETCHADD(4), so one fetch, as fetches are, even according to the official 808x documentation from Intel, one per 4 cycles. And I think this is done because clockhardware() which ends the timer period (and causes all device timers, including the PIT which then writes to DMA refresh) to be executed, is called after the prefetch queue byte adding is done. Maybe I should move it to before and get rid of the FETCHCOMPLETE()? I'll try.

Reply 13 of 26, by reenigne

Posted on 2018-11-22, 16:02

reenigne Offline

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

Battler wrote:
So far it appears to run fine, except that the credits have no sound but that's because the emulation of the PIT and the PC speaker leaves a lot to be desired, so it's on my list to be eventually either rewritten or ported from DOSBox where I heard PIT mode 1 (which I presume the credits use) works.

The credits use PIT mode 0 - that way we don't need to touch the gate to start the pulse, we just load the PIT value and it immediately starts counting down, so it's just a single byte write to port 0x42 per sample.
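As a rough illustration of why one count write per sample is enough, here is a deliberately simplified model of a mode 0 counter: writing a count drives OUT low, and OUT goes high once the count reaches zero. It assumes single-byte counts, ignores the clock needed to load the count, and does not model a written value of 0 (which real hardware treats as the maximum count). `PitCh`, `pit_write_count`, `pit_tick` and `high_ticks` are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    uint16_t count; /* remaining count */
    int      out;   /* OUT pin level */
} PitCh;

/* Mode 0: loading a new count drives OUT low and restarts the countdown. */
static void pit_write_count(PitCh *ch, uint8_t value)
{
    ch->count = value;
    ch->out = 0;
}

/* One PIT clock: OUT goes high when the count reaches zero and stays high. */
static void pit_tick(PitCh *ch)
{
    if (ch->count > 0 && --ch->count == 0)
        ch->out = 1;
}

/* Number of PIT clocks with OUT high during one sample period that
   starts with a single count write. */
static unsigned high_ticks(uint8_t sample, unsigned period)
{
    PitCh ch = {0, 0};
    unsigned high = 0;

    pit_write_count(&ch, sample);
    for (unsigned i = 0; i < period; i++) {
        high += (unsigned)ch.out;
        pit_tick(&ch);
    }
    return high;
}
```

In this model the written value sets how long OUT stays low within the sample period, so the pulse width follows the sample value without any access to the gate.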

And thanks for linking to that thread, found the link to your code there, and I'm certainly going to look at that.

Battler wrote:
I have the cycles variable, which is the total cycles of one block of execx86 (so CPU clock speed in Hz divided by 100) minus the cycles used by the instructions executed up to that point, it's globally accessible, so I guess I could do cycles & 0xf and then use that as the array index.

Sounds like it should work!

Battler wrote:
Yeah, this simulates that, it basically adds 4 cycles to the execution time, and then runs FETCHCOMPLETE() which seems to be roughly equivalent to a FETCHADD(4), so one fetch, as fetches are, even according to the official 808x documentation from Intel, one per 4 cycles. And I think this is done because clockhardware() which ends the timer period (and causes all device timers, including the PIT which then writes to DMA refresh) to be executed, is called after the prefetch queue byte adding is done. Maybe I should move it to before and get rid of the FETCHCOMPLETE()? I'll try.

The number of wait states introduced by a DRAM refresh can be anything from 0 to 6 cycles (assuming no wait state for the refresh memory access itself). The zero case happens when the bus would otherwise be idle. The other cases depend on when the refresh happens with respect to the adjacent CPU bus accesses.

Reply 14 of 26, by Battler

Posted on 2018-11-22, 16:04

Battler Offline

Rank: Member
Posts: 168
Joined: 2014-03-22, 21:27

The number of wait states introduced by a DRAM refresh can be anything from 0 to 6 cycles (assuming no wait state for the refresh memory access itself). The zero case happens when the bus would otherwise be idle. The other cases depend on when the refresh happens with respect to the adjacent CPU bus accesses.

How exactly do I determine when to return how many cycles for the DRAM refresh wait state?

Reply 15 of 26, by reenigne

Posted on 2018-11-22, 16:28

reenigne Offline

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

Battler wrote:
How exactly do I determine when to return how many cycles for the DRAM refresh wait state?

It's complicated! Here are some sniffer logs I made that show the possible timings:

DMA states:
  sDREQ
  sHRQ, sHoldWait
  sAEN
  s0
  s1
  s2
  s3
  sWait
  s4
  sDelayedT1
  sDelayedT2
  sDelayedT3

20FFF .p...  00F16 FF 00 FC .......
20FFF Ip...  00F16 FF 00 FC .......                          I
20FF1 SC...  00F16 FF 00 FC .......                          S F6E1         MUL CL
00F17 .C...  00F17 FF 00 FC .......  T1
20F17 .C...  00F17 FF 00 FC ..r....  T2
20FF6 .p...  00F17 F6 00 FC ..r....  T3    F6 <-f [   00F17]
20FF6 .C...  00F17 F6 00 FC .......  T4
00F18 .C...  00F18 F6 00 FC .......  T1
20F18 .C...  00F18 FF 01 FC ..r....  T2 S0                                          DREQ
20FE1 .p...  00F18 E1 01 FC ..r....  T3 S0 E1 <-f [   00F18]                        passive
20FE1 .p...  00F18 E1 01 FC .....D.  T4 S0                                          AEN
20FE1 .p...  00F18 E1 01 FC .....D.     S0
20FE1 .p...  00204 E1 10 FC .....D.     S1                                          DACK
20FE1 .p...  00204 E1 10 FC ..r..D.     S2
20FE1 .p...  00204 E1 10 FC .Wr..D.     S3 E1 <-d [   00204]
20FE1 .p...  00204 E1 00 FC .....D.     S4                                          -DACK
20FE1 .p...  00F18 E1 00 FC .......                                                 -AEN

20FFF .p...  00F18 FF 00 FC .......
20FFF Ip...  00F18 FF 00 FC .......                          I
20FF1 SC...  00F18 FF 00 FC .......                          S F6E1         MUL CL
00F19 .C...  00F19 FF 00 FC .......  T1
20F19 .C...  00F19 FF 00 FC ..r....  T2
20FF6 .p...  00F19 F6 00 FC ..r....  T3    F6 <-f [   00F19]
20FF6 .C...  00F19 F6 00 FC .......  T4
00F1A .C...  00F1A F6 01 FC .......  T1 S0                                          DREQ
20F1A .C...  00F1A FF 01 FC ..r....  T2 S0
20FE1 .p...  00F1A E1 01 FC ..r....  T3 S0 E1 <-f [   00F1A]                        passive
20FE1 .p...  00F1A E1 01 FC .....D.  T4 S0                                          AEN
20FE1 .p...  00F1A E1 01 FC .....D.     S0
20FE1 .p...  00205 E1 10 FC .....D.     S1                                          DACK
20FE1 .p...  00205 E1 10 FC ..r..D.     S2
20FE1 .p...  00205 E1 10 FC .Wr..D.     S3 E1 <-d [   00205]
20FE1 .p...  00205 E1 00 FC .....D.     S4                                          -DACK
20FE1 .p...  00F1A E1 00 FC .......                                                 -AEN

20FFF .p...  00F1A FF 00 FC .......
20FFF Ip...  00F1A FF 00 FC .......                          I
20FF1 SC...  00F1A FF 00 FC .......                          S F6E1         MUL CL
00F1B .C...  00F1B FF 00 FC .......  T1
20F1B .C...  00F1B FF 00 FC ..r....  T2
20FF6 .p...  00F1B F6 00 FC ..r....  T3    F6 <-f [   00F1B]
20FF6 .C...  00F1B F6 01 FC .......  T4 S0                                          DREQ
00F1C .C...  00F1C F6 01 FC .......  T1 S0
20F1C .C...  00F1C FF 01 FC ..r....  T2 S0
20FE1 .p...  00F1C E1 01 FC ..r....  T3 S0 E1 <-f [   00F1C]                        passive
20FE1 .p...  00F1C E1 01 FC .....D.  T4 S0                                          AEN
20FE1 .p...  00F1C E1 01 FC .....D.     S0
20FE1 .p...  00206 E1 10 FC .....D.     S1                                          DACK
20FE1 .p...  00206 E1 10 FC ..r..D.     S2
20FE1 .p...  00206 E1 10 FC .Wr..D.     S3 E1 <-d [   00206]
20FE1 .p...  00206 E1 00 FC .....D.     S4                                          -DACK
20FE1 .p...  00F1C E1 00 FC .......                                                 -AEN

20FFF .p...  00F1C FF 00 FC .......
20FFF Ip...  00F1C FF 00 FC .......                          I
20FF1 SC...  00F1C FF 00 FC .......                          S F6E1         MUL CL
00F1D .C...  00F1D FF 00 FC .......  T1
20F1D .C...  00F1D FF 00 FC ..r....  T2
20FF6 .p...  00F1D F6 01 FC ..r....  T3 S0 F6 <-f [   00F1D]                        DREQ, passive
20FF6 .C...  00F1D F6 01 FC .......  T4 S0
00F1E .C...  00F1E F6 01 FC .....D.  T1 S0                                          AEN
20F1E .C...  00F1E F6 01 FC .....D.  T2 S0
20F1E .C...  00207 F6 10 FC .....D.  T3 S1                                          DACK
20F1E .C...  00207 F6 10 FC ..r..D.  Tw S2
20F1E .C...  00207 F6 10 FC .Wr..D.  Tw S3 F6 <-d [   00207]
20F1E .C...  00207 F6 00 FC .....D.  Tw S4                                          -DACK
20FF6 .C...  00F1E F6 00 FC ..r....  Tw                                             -AEN  T1'
20FE1 .C...  00F1E E1 00 FC ..r....  Tw                                                   T2'
20FE1 .p...  00F1E E1 00 FC ..r....  Tw    E1 <-f [   00F1E]                              T3'
20FE1 .p...  00F1E E1 00 FC .......  T4

20FFF .p...  00F1E FF 00 FC .......
20FFF Ip...  00F1E FF 00 FC .......                          I
20FF1 SC...  00F1E FF 00 FC .......                          S F6E1         MUL CL
00F1F .C...  00F1F FF 00 FC .......  T1
20F1F .C...  00F1F FF 01 FC ..r....  T2 S0                                          DREQ
20FF6 .p...  00F1F F6 01 FC ..r....  T3 S0 F6 <-f [   00F1F]                        passive
20FF6 .C...  00F1F F6 01 FC .....D.  T4 S0                                          AEN
00F20 .C...  00F1F F6 01 FC .....D.  T1 S0
20F20 .C...  00208 F6 10 FC .....D.  T2 S1                                          DACK
20F20 .C...  00208 F6 10 FC ..r..D.  T3 S2
20F20 .C...  00208 F6 10 FC .Wr..D.  Tw S3 F6 <-d [   00208]
20F20 .C...  00208 F6 00 FC .....D.  Tw S4                                          -DACK
20FF6 .C...  00F20 F6 00 FC ..r....  Tw                                             -AEN  T1'
20FE1 .C...  00F20 E1 00 FC ..r....  Tw                                                   T2'
20FE1 .p...  00F20 E1 00 FC ..r....  Tw    E1 <-f [   00F20]                              T3'
20FE1 .p...  00F20 E1 00 FC .......  T4

20FFF .p...  00F20 FF 00 FC .......
20FFF Ip...  00F20 FF 00 FC .......                          I
20FF1 SC...  00F20 FF 00 FC .......                          S F6E1         MUL CL
00F21 .C...  00F21 FF 01 FC .......  T1 S0                                          DREQ
20F21 .C...  00F21 FF 01 FC ..r....  T2 S0
20FF6 .p...  00F21 F6 01 FC ..r....  T3 S0 F6 <-f [   00F21]                        passive
20FF6 .C...  00F21 F6 01 FC .....D.  T4 S0                                          AEN
00F22 .C...  00F21 F6 01 FC .....D.  T1 S0
20F22 .C...  00209 F6 10 FC .....D.  T2 S1                                          DACK
20F22 .C...  00209 F6 10 FC ..r..D.  T3 S2
20F22 .C...  00209 F6 10 FC .Wr..D.  Tw S3 F6 <-d [   00209]
20F22 .C...  00209 F6 00 FC .....D.  Tw S4                                          -DACK
20FF6 .C...  00F22 F6 00 FC ..r....  Tw                                             -AEN  T1'
20FE1 .C...  00F22 E1 00 FC ..r....  Tw                                                   T2'
20FE1 .p...  00F22 E1 00 FC ..r....  Tw    E1 <-f [   00F22]                              T3'
20FE1 .p...  00F22 E1 00 FC .......  T4

20FFF .p...  00F22 FF 00 FC .......
20FFF Ip...  00F22 FF 00 FC .......                          I
20FF1 SC...  00F22 FF 01 FC .......     S0                   S F6E1         MUL CL  DREQ
00F23 .C...  00F23 FF 01 FC .......  T1 S0
20F23 .C...  00F23 FF 01 FC ..r....  T2 S0
20FF6 .p...  00F23 F6 01 FC ..r....  T3 S0 F6 <-f [   00F23]                        passive
20FF6 .C...  00F23 F6 01 FC .....D.  T4 S0                                          AEN
00F24 .C...  00F23 F6 01 FC .....D.  T1 S0
20F24 .C...  0020A F6 10 FC .....D.  T2 S1                                          DACK
20F24 .C...  0020A F6 10 FC ..r..D.  T3 S2
20F24 .C...  0020A F6 10 FC .Wr..D.  Tw S3 F6 <-d [   0020A]
20F24 .C...  0020A F6 00 FC .....D.  Tw S4                                          -DACK
20FF6 .C...  00F24 F6 00 FC ..r....  Tw                                             -AEN  T1'
20FE1 .C...  00F24 E1 00 FC ..r....  Tw                                                   T2'
20FE1 .p...  00F24 E1 00 FC ..r....  Tw    E1 <-f [   00F24]                              T3'
20FE1 .p...  00F24 E1 00 FC .......  T4

20FFF .p...  00F24 FF 00 FC .......
20FFF Ip...  00F24 FF 01 FC .......     S0                   I                      DREQ, passive
20FF1 SC...  00F24 FF 01 FC .......     S0                   S F6E1         MUL CL
00F25 .C...  00F25 FF 01 FC .....D.  T1 S0                                          AEN
20F25 .C...  00F25 FF 01 FC .....D.  T2 S0
20F25 .C...  0020B FF 10 FC .....D.  T3 S1                                          DACK
20F25 .C...  0020B FF 10 FC ..r..D.  Tw S2
20F25 .C...  0020B FF 10 FC .Wr..D.  Tw S3 FF <-d [   0020B]
20F25 .C...  0020B FF 00 FC .....D.  Tw S4                                          -DACK
20FFF .C...  00F25 FF 00 FC ..r....  Tw                                             -AEN  T1'
20FF6 .C...  00F25 F6 00 FC ..r....  Tw                                                   T2'
20FF6 .p...  00F25 F6 00 FC ..r....  Tw    F6 <-f [   00F25]                              T3'
20FF6 .C...  00F25 F6 00 FC .......  T4
00F26 .C...  00F26 F6 00 FC .......  T1
20F26 .C...  00F26 FF 00 FC ..r....  T2
20FE1 .p...  00F26 E1 00 FC ..r....  T3    E1 <-f [   00F26]
20FE1 .p...  00F26 E1 00 FC .......  T4

20FFF .p...  00F26 FF 00 FC .......
20FFF .p...  00F26 FF 01 FC .......     S0                                          DREQ, passive
20FFF Ip...  00F26 FF 01 FC .......     S0                   I
20FF1 SC...  00F26 FF 01 FC .....D.     S0                   S F6E1         MUL CL  AEN
00F27 .C...  00F26 FF 01 FC .....D.  T1 S0
20F27 .C...  0020C FF 10 FC .....D.  T2 S1                                          DACK
20F27 .C...  0020C FF 10 FC ..r..D.  T3 S2
20F27 .C...  0020C FF 10 FC .Wr..D.  Tw S3 FF <-d [   0020C]
20F27 .C...  0020C FF 00 FC .....D.  Tw S4                                          -DACK
20FFF .C...  00F27 FF 00 FC ..r....  Tw                                             -AEN  T1'
20FF6 .C...  00F27 F6 00 FC ..r....  Tw                                                   T2'
20FF6 .p...  00F27 F6 00 FC ..r....  Tw    F6 <-f [   00F27]                              T3'
20FF6 .C...  00F27 F6 00 FC .......  T4
00F28 .C...  00F28 F6 00 FC .......  T1
20F28 .C...  00F28 FF 00 FC ..r....  T2
20FE1 .p...  00F28 E1 00 FC ..r....  T3    E1 <-f [   00F28]
20FE1 .p...  00F28 E1 00 FC .......  T4

20FFF .p...  00F28 FF 00 FC .......
20FFF .p...  00F28 FF 01 FC .......     S0                                          DREQ, passive
20FFF .p...  00F28 FF 01 FC .......     S0
20FFF Ip...  00F28 FF 01 FC .....D.     S0                   I                      AEN
20FF1 SC...  00F28 FF 01 FC .....D.     S0                   S F6E1         MUL CL
00F29 .C...  0020D FF 10 FC .....D.  T1 S1                                          DACK
20F29 .C...  0020D FF 10 FC ..r..D.  T2 S2
20F29 .C...  0020D FF 10 FC .Wr..D.  T3 S3 FF <-d [   0020D]
20F29 .C...  0020D FF 00 FC .....D.  Tw S4                                          -DACK
20FFF .C...  00F29 FF 00 FC ..r....  Tw                                             -AEN  T1'
20FF6 .C...  00F29 F6 00 FC ..r....  Tw                                                   T2'
20FF6 .p...  00F29 F6 00 FC ..r....  Tw    F6 <-f [   00F29]                              T3'
20FF6 .C...  00F29 F6 00 FC .......  T4
00F2A .C...  00F2A F6 00 FC .......  T1
20F2A .C...  00F2A FF 00 FC ..r....  T2
20FE1 .p...  00F2A E1 00 FC ..r....  T3    E1 <-f [   00F2A]
20FE1 .p...  00F2A E1 00 FC .......  T4

20FFF .p...  00F2A FF 00 FC .......
20FFF .p...  00F2A FF 01 FC .......     S0                                          DREQ, passive
20FFF .p...  00F2A FF 01 FC .......     S0
20FFF .p...  00F2A FF 01 FC .....D.     S0                                          AEN
20FFF Ip...  00F2A FF 01 FC .....D.     S0                   I
20FF1 SC...  0020E FF 10 FC .....D.     S1                   S F6E1         MUL CL  DACK
00F2B .C...  0020E FF 10 FC ..r..D.  T1 S2
20F2B .C...  0020E FF 10 FC .Wr..D.  T2 S3 FF <-d [   0020E]
20F2B .C...  0020E FF 00 FC .....D.  T3 S4                                          -DACK
20FFF .C...  00F2B FF 00 FC ..r....  Tw                                             -AEN  T1'
20FF6 .C...  00F2B F6 00 FC ..r....  Tw                                                   T2'
20FF6 .p...  00F2B F6 00 FC ..r....  Tw    F6 <-f [   00F2B]                              T3'
20FF6 .C...  00F2B F6 00 FC .......  T4
00F2C .C...  00F2C F6 00 FC .......  T1
20F2C .C...  00F2C FF 00 FC ..r....  T2
20FE1 .p...  00F2C E1 00 FC ..r....  T3    E1 <-f [   00F2C]
20FE1 .p...  00F2C E1 00 FC .......  T4

20FFF .p...  00F2C FF 00 FC .......
20FFF .p...  00F2C FF 01 FC .......     S0                                          DREQ, passive
20FFF .p...  00F2C FF 01 FC .......     S0
20FFF .p...  00F2C FF 01 FC .....D.     S0                                          AEN
20FFF .p...  00F2C FF 01 FC .....D.     S0
20FFF Ip...  0020F FF 10 FC .....D.     S1                   I                      DACK
20FF1 SC...  0020F FF 10 FC ..r..D.     S2                   S F6E1         MUL CL
00F2D .C...  0020F FF 10 FC .Wr..D.  T1 S3 FF <-d [   0020F]
20F2D .C...  0020F FF 00 FC .....D.  T2 S4                                          -DACK
20FFF .C...  00F2D FF 00 FC ..r....  T3                                             -AEN  T1'
20FF6 .C...  00F2D F6 00 FC ..r....  Tw                                                   T2'
20FF6 .p...  00F2D F6 00 FC ..r....  Tw    F6 <-f [   00F2D]                              T3'
20FF6 .C...  00F2D F6 00 FC .......  T4
00F2E .C...  00F2E F6 00 FC .......  T1
20F2E .C...  00F2E FF 00 FC ..r....  T2
20FE1 .p...  00F2E E1 00 FC ..r....  T3    E1 <-f [   00F2E]
20FE1 .p...  00F2E E1 00 FC .......  T4

20FFF .p...  00F2E FF 00 FC .......
20FFF .p...  00F2E FF 01 FC .......     S0                                          DREQ, passive
20FFF .p...  00F2E FF 01 FC .......     S0
20FFF .p...  00F2E FF 01 FC .....D.     S0                                          AEN
20FFF .p...  00F2E FF 01 FC .....D.     S0
20FFF .p...  00210 FF 10 FC .....D.     S1                                          DACK
20FFF Ip...  00210 FF 10 FC ..r..D.     S2                   I
20FF1 SC...  00210 FF 10 FC .Wr..D.     S3 FF <-d [   00210] S F6E1         MUL CL
00F2F .C...  00210 FF 00 FC .....D.  T1 S4                                          -DACK
20F2F .C...  00F2F FF 00 FC ..r....  T2                                             -AEN  T1'
24020FF6 .C...  00F2F F6 00 FC ..r....  T3                                                   T2'
24120FF6 .p...  00F2F F6 00 FC ..r....  Tw    F6 <-f [   00F2F]                              T3'
24220FF6 .C...  00F2F F6 00 FC .......  T4
24300F30 .C...  00F30 F6 00 FC .......  T1
24420F30 .C...  00F30 FF 00 FC ..r....  T2
24520FE1 .p...  00F30 E1 00 FC ..r....  T3    E1 <-f [   00F30]
24620FE1 .p...  00F30 E1 00 FC .......  T4
247
24820FFF .p...  00F30 FF 00 FC .......
24920FFF .p...  00F30 FF 01 FC .......     S0                                          DREQ, passive
25020FFF .p...  00F30 FF 01 FC .......     S0
25120FFF .p...  00F30 FF 01 FC .....D.     S0                                          AEN
25220FFF .p...  00F30 FF 01 FC .....D.     S0
25320FFF .p...  00211 FF 10 FC .....D.     S1                                          DACK
25420FFF .p...  00211 FF 10 FC ..r..D.     S2
25520FFF Ip...  00211 FF 10 FC .Wr..D.     S3 FF <-d [   00211] I
25620FF1 SC...  00211 FF 00 FC .....D.     S4                   S F6E1         MUL CL  -DACK
25700F31 .C...  00F31 FF 00 FC .......  T1                                             -AEN
25820F31 .C...  00F31 FF 00 FC ..r....  T2
25920FF6 .p...  00F31 F6 00 FC ..r....  T3    F6 <-f [   00F31]
26020FF6 .C...  00F31 F6 00 FC .......  T4
26100F32 .C...  00F32 F6 00 FC .......  T1
26220F32 .C...  00F32 FF 00 FC ..r....  T2
26320FE1 .p...  00F32 E1 00 FC ..r....  T3    E1 <-f [   00F32]
26420FE1 .p...  00F32 E1 00 FC .......  T4
265
26620FFF .p...  00F32 FF 00 FC .......
26720FFF .p...  00F32 FF 01 FC .......     S0                                          DREQ, passive
26820FFF .p...  00F32 FF 01 FC .......     S0
26920FFF .p...  00F32 FF 01 FC .....D.     S0                                          AEN
27020FFF .p...  00F32 FF 01 FC .....D.     S0
27120FFF .p...  00212 FF 10 FC .....D.     S1
27220FFF .p...  00212 FF 10 FC ..r..D.     S2
27320FFF .p...  00212 FF 10 FC .Wr..D.     S3 FF <-d [   00212]
27420FFF Ip...  00212 FF 00 FC .....D.     S4                   I
27520FF1 SC...  00F32 FF 00 FC .......                          S F6E1         MUL CL
27600F33 .C...  00F33 FF 00 FC .......  T1
27720F33 .C...  00F33 FF 00 FC ..r....  T2
27820FF6 .p...  00F33 F6 00 FC ..r....  T3    F6 <-f [   00F33]
27920FF6 .C...  00F33 F6 00 FC .......  T4
28000F34 .C...  00F34 F6 00 FC .......  T1
28120F34 .C...  00F34 FF 00 FC ..r....  T2
28220FE1 .p...  00F34 E1 00 FC ..r....  T3    E1 <-f [   00F34]
28320FE1 .p...  00F34 E1 00 FC .......  T4

There are always at least 4 (sometimes 5 or 6) cycles of DREQ before the DACK goes active.

60000 .p...  00000 00 00 FC 6 .W.....  Tw    00 --> port[0000]
60000 .W...  00000 00 01 FC 6 .......  T4                                           DREQ
00001 .W...  00001 00 01 FC 7 .......  T1
60000 .W...  00001 00 01 FC 7 .W.....  T2
60000 .W...  00001 00 01 FC 6 .W.....  T3
60000 .p...  00001 00 01 FC 6 .W.....  Tw    00 --> port[0001]                      passive
60000 IC...  00001 00 01 FC 7 .......  T4                      I                    preAEN
10A90 .C...  10A90 00 01 FC 7 .....D.  T1 S0                                        AEN
60A90 .C...  10A90 00 01 FC 6 .....D.  T2 S1
60A90 .C...  00000 00 10 FC 6 .....D.  T3 S2
60A90 .C...  00000 00 10 FC 7 ..r..D.  Tw S3
60A90 .C...  00000 00 10 FC 7 .Wr..D.  Tw S4 00 <-d [   00000]
60A90 .C...  00000 00 00 FC 6 .....D.  Tw
60A00 .C...  10A90 00 00 FC 6 ..r....  Tw
60A00 .C...  10A90 00 00 FC 7 ..r....  Tw
60A00 .p...  10A90 00 00 FC 7 ..r....  Tw    00 <-f [   10A90]
60A00 .C...  10A90 00 00 FC 6 .......  T4
10A91 .C...  10A91 00 00 FC 6 .......  T1
60A91 SC...  10A91 FD 00 FC 7 ..r....  T2                      S 0400         ADD AL, 00
60AEB .p...  10A91 EB 00 FC 7 ..r....  T3    EB <-f [   10A91]
60AEB .C...  10A91 EB 00 FC 6 .......  T4

There's also code in https://github.com/reenigne/reenigne/blob/mas … 088/xtce/xtce.h that emulates this with cycle accuracy, but I'm afraid it works by emulating the bus, CPU and DMAC cycle-by-cycle, so there's no function that returns the number of cycles (0 to 6) to wait. The _dmaState switch in BusEmulator::wait() (line 1814) is probably the interesting bit.
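
To illustrate why the wait can't easily be returned by a single function, here is a minimal Python sketch of a DMA controller stepped once per clock. It is not the XTCE code: the state names S0 to S4 come from the traces above and SI is the 8237 datasheet's idle state, but the class name, the `bus_passive` input and the transition rules are simplifying assumptions (single transfers only, no wait states).

```python
# Hypothetical, simplified model of a per-clock DMA state machine.
# Not taken from xtce.h; transitions are an assumption for illustration.
class DmaSketch:
    def __init__(self):
        self.state = "SI"  # idle, no request pending

    def tick(self, dreq, bus_passive):
        """Advance one clock. Returns True on the clock that ends a transfer."""
        if self.state == "SI":
            if dreq:
                self.state = "S0"   # request seen, now waiting for the bus
        elif self.state == "S0":
            if bus_passive:
                self.state = "S1"   # CPU has released the bus
        elif self.state == "S1":
            self.state = "S2"
        elif self.state == "S2":
            self.state = "S3"
        elif self.state == "S3":
            self.state = "S4"
        elif self.state == "S4":
            self.state = "SI"
            return True
        return False


def cycles_until_s1(busy_cycles):
    """Clocks from DREQ until S1 is reached, if the CPU keeps the bus
    busy for `busy_cycles` clocks after the request is first seen."""
    dma = DmaSketch()
    n = 0
    while dma.state != "S1":
        dma.tick(dreq=True, bus_passive=(n > busy_cycles))
        n += 1
    return n
```

Even in this toy version the latency depends on what the CPU bus happens to be doing on each clock, which is why stepping the bus, CPU and DMAC together is the simpler way to get it right.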

Reply 16 of 26, by superfury

Posted on 2018-11-22, 18:55

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 5489
Joined: 2014-03-08, 11:25
Location: Netherlands

@reenigne: Such a cycle-by-cycle method is also used in UniPCemu (using the 14MHz clock ticks or the hardware's own oscillator, e.g. Sound Blaster PCM). But since I don't have any prebuffer on the PIQ fetching data, combined with non-100% cycle accuracy on the EU, the SMC in the credits crashes by executing partially modified/unmodified code.

The CGA runs like you said (all those delays), essentially adding wait states until the horizontal conditions are met one by one (within the video renderer) on T3/T4 (I don't remember which one at the moment, but it should be correct).

So the main issue in my emulator is that the remaining EU timings are slightly incorrect, causing all the scanline-racing issues and the credits crash.

Perhaps I could make a cycle log of 8088 MPH with EU timing added. That way we might find out what's being timed incorrectly. So a cycle log with EU timing on an extra tab (the EU delays being executed, i.e. the counter of remaining EU clocks for the running instruction)?

Edit: Actually, I've been thinking for a bit. Assuming the timings are identical (e.g. both running at 3 MIPS in IPS clocking mode), how about redirecting console output (or a pipe) between emulators to validate the opcodes executed? So pipe the disassembly, registers etc. (in a common log format) from one emulator to another (e.g. cmd /C "UniPCemu_x64.exe pipeout | pcem.exe pipein") and let the latter verify the instructions against its own state? Could that also be done in cycle-accurate mode?
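
A minimal sketch of the comparison side of that idea, in Python. It assumes both emulators can already write one log line per executed instruction in the same text format; that shared format, and the function name, are hypothetical. It just reports the first place where the two logs disagree.

```python
def first_divergence(reference, candidate):
    """Compare two iterables of log lines.

    Returns None if they match line for line, otherwise a tuple
    (index, reference_line, candidate_line). A missing line (one log
    ended early) is reported as None.
    """
    ref_iter, cand_iter = iter(reference), iter(candidate)
    index = 0
    while True:
        ref = next(ref_iter, None)
        cand = next(cand_iter, None)
        if ref is None and cand is None:
            return None  # both logs ended without a mismatch
        if ref is None or cand is None or ref.strip() != cand.strip():
            return index, ref, cand
        index += 1
```

Because it only needs iterables, the same function works on two saved log files or on one file plus a pipe read line by line, so the emulators would not have to run in true lockstep.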

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 17 of 26, by superfury

Posted on 2018-11-23, 19:01

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 5489
Joined: 2014-03-08, 11:25
Location: Netherlands

Also, UniPCemu's DMA emulation only uses the states SI, S0, S1, S2, S3 and S4, where S4 can become SI (or go even further, into S0), depending on a running block transfer (i.e. on whether the bus is released).

https://bitbucket.org/superfury/unipcemu/src/ … dma.c?at=master

The only things not emulated in UniPCemu (compared to reenigne's code) are the odd behaviours (BIU T-cycle-based emulation) in instructions like HLT etc.

8088MPH reports 1539 cycles in my emulator.

Since 8088 MPH requires 1678 (+/- 10) cycles, it's about 139 cycles off. Now where are those? Which instruction isn't generating enough cycles (are those HLT instructions etc. even used)?

Edit: Hmmmm....
https://github.com/reenigne/reenigne/blob/mas … efrens.asm#L569

Seems like there's a HLT in there! Could that be what's throwing off most of the loop in my case?
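
For reference, the shortfall worked out in Python. The two cycle counts and the tolerance are the figures quoted above; the variable names are just for illustration.

```python
measured = 1539   # cycles reported when running in the emulator
expected = 1678   # cycles 8088 MPH requires
tolerance = 10    # allowed deviation, +/- cycles

shortfall = expected - measured
within_tolerance = abs(shortfall) <= tolerance
percent_off = 100.0 * shortfall / expected

print(shortfall, within_tolerance, round(percent_off, 1))
```

So the emulator runs roughly 8% too fast on this measurement, far outside the allowed window.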


Reply 18 of 26, by reenigne

Posted on 2018-11-23, 22:46

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
Also, in UniPCemu's DMA emulation, only SI, S0, S1, S2, S3 and S4, where S4 can become SI(or even further into S0), depending on a running block transfer(depending on if the bus is released).

https://bitbucket.org/superfury/unipcemu/src/ … dma.c?at=master

The only thing not emulated in UniPCemu(compared to reenigne's code) is the odd things(BIU T-cycle-based emulation) in instructions like HLT etc.

8088MPH reports 1539 cycles in my emulator.

Since 8088 MPH requires 1678(+/- 10) cycles, it's about 139 cycles off. Now where are those? What instruction is generating not enough cycles(are those HLT instructions etc. even used?)?

Edit: Hmmmm....
https://github.com/reenigne/reenigne/blob/mas … efrens.asm#L569

Seems like there's a HLT in there! Could that be what's throwing off most of the loop in my case?

Does the code between kefrensScanline and kefrensScanlineEnd take 304 cycles? If not, then the problem isn't (only) with HLT.
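
For context on the 304 figure, a small Python sketch. The constants are not stated in this thread; they assume standard IBM PC/XT and CGA timing, where a scanline lasts 912 cycles of the 14.318 MHz master clock and the CPU clock is that master clock divided by 3. The helper name is hypothetical.

```python
HDOTS_PER_SCANLINE = 912   # master-clock cycles per CGA scanline (assumed standard timing)
CPU_DIVIDER = 3            # CPU clock = master clock / 3 (about 4.77 MHz)

cpu_cycles_per_scanline = HDOTS_PER_SCANLINE // CPU_DIVIDER


def scanline_drift(measured_cycles):
    """Cycles by which a per-scanline loop misses the budget.
    Negative means the loop ran too fast, positive too slow."""
    return measured_cycles - cpu_cycles_per_scanline
```

A loop that has to stay locked to the raster needs a drift of exactly zero on every iteration, so even a one-cycle error per scanline accumulates visibly.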

Chances are that most of the problems are in the interaction between memory/port bus accesses and prefetch bus accesses. How the CPU decides exactly when to start each seems to be extremely complicated. Way more so than it has any right to be - there's a 560-line function in XTCE (busInit()) which handles this, all determined by trial and error, and it seems like it should be just a handful of lines. I'm wondering if there are some vestiges of the 8086's 16-bit BIU and prefetch queue that are causing some of this complexity. I will have to have a play and see if I can come up with a simpler algorithm once I've reworked it so that I can test one version of XTCE against another rather than against the real hardware (slow) or against a massive file full of testcase measurement data (unwieldy - the file would be too big to fit into RAM with all the variations I really want to test).

Reply 19 of 26, by superfury

Posted on 2018-11-23, 23:39

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 5489
Joined: 2014-03-08, 11:25
Location: Netherlands

@reenigne: If your EU cycle counts for those instructions are correct (the totals, excluding specific timings in between, e.g. the part before and the part after a memory access only being applied after a memory transaction, as in mov al,[bx]), then those should be working (theoretically). The CRTC should be functioning correctly (and everything related, like CPU wait states for horizontal timings etc.), seeing as the 4K colors part still works.

I'd need a starting CS:IP address (i.e. segment:offset) to check said code, though.
