UniPCemu cycle accurate 8088 implementation

Reply 160 of 198, by superfury

Posted on 2019-05-19, 13:12

Rank: l33t++
Posts: 5822
Joined: 2014-03-08, 11:25
Location: Netherlands

So perhaps the STOSW instruction is failing for some unknown reason?

Also, see my previous post that I edited in just before your post about the missing timings in UniPCemu. Could those be the timings that cause UniPCemu's CPU to miss some EU cycles, making the CPU too fast (with 8088 MPH then reporting too few elapsed PIT cycles during its startup speed check)?

The difficult part there is that UniPCemu keeps taking single EU cycles until the BIU has completed its transfer (if there actually is a BIU memory transfer). That amounts to waiting for T1 (completing the current transfer, if the BIU isn't there already) and for any running DMA transfer to release the bus, then ticking to T1 (if nothing was running) and ticking the transfer through to T4 (the memory transfer itself). After that, at T1, the BIU returns success to the EU (the EU runs before the BIU starts a next transfer at its T1 clock).

So should I just add those cycles from the busInit function to UniPCemu's timings for said instructions (ignoring the timings inside the if statements, since the BIU probably already handles those as said parallel process)?

So, ignoring those BIU-related timings, should I add the following timings for those instructions using those _accessNumber settings?

1,6=1
2=1
3=2
4=3
5=2(INT 3 instruction) or 3 otherwise
7=1
8=1
9=1
10=3
11=2
12,13=4
14=4
15=2
16=2
17=1
18,19=3
20,21,24=1
22,23=2
25=2
26=2
27,32,37=3
28=1
29,30=4
31=6
33=4
34,39,41=4
35=2
36=5+m(1 if memory)
38=5
40=6
42=3
43=3
44,45=2
46=2
47,48,49,50,51=1
52=2
53=1
54=2
55=1
56=1
57,58=4+m(2 if memory)
59=5+m(1 if memory)
60=4
62=1
65=3+m(1 if memory)
68=1
70=5
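
A minimal sketch of how the list above could be encoded as a lookup, assuming each entry reads as "_accessNumber(s)=extra EU cycles" and that "+m" adds the stated amount only for memory operands. The function name and its parameters are hypothetical and appear in neither xtce.h nor UniPCemu.

```c
/* Hypothetical helper: extra EU cycles for a given _accessNumber, using the
 * values from the list above. Returns -1 for access numbers that the list
 * does not mention (61, 63, 64, 66, 67, 69 and anything outside 1..70). */
int extra_eu_cycles(int access_number, int is_memory, int is_int3)
{
    switch (access_number) {
    case 1: case 6: case 2: case 7: case 8: case 9: case 17:
    case 20: case 21: case 24: case 28:
    case 47: case 48: case 49: case 50: case 51:
    case 53: case 55: case 56: case 62: case 68:
        return 1;
    case 3: case 11: case 15: case 16: case 22: case 23: case 25: case 26:
    case 35: case 44: case 45: case 46: case 52: case 54:
        return 2;
    case 4: case 10: case 18: case 19: case 27: case 32: case 37:
    case 42: case 43:
        return 3;
    case 5:
        return is_int3 ? 2 : 3;           /* 2 for INT 3, 3 otherwise */
    case 12: case 13: case 14: case 29: case 30: case 33:
    case 34: case 39: case 41: case 60:
        return 4;
    case 38: case 70:
        return 5;
    case 31: case 40:
        return 6;
    case 36: case 59:
        return 5 + (is_memory ? 1 : 0);   /* 5+m, m = 1 if memory */
    case 57: case 58:
        return 4 + (is_memory ? 2 : 0);   /* 4+m, m = 2 if memory */
    case 65:
        return 3 + (is_memory ? 1 : 0);   /* 3+m, m = 1 if memory */
    default:
        return -1;                        /* not in the list */
    }
}
```

Keeping the values in one place like this would at least make it easy to diff them against another emulator's numbers.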

These are my current EU instruction emulation timings for the 808X core:
https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c

Look for cycles_OP for the timings I've implemented for those instructions so far. Note that all timings are applied at only one point during the instruction, not at two locations (before and after the memory access cycles). Is that an issue? It's mainly a result of the way UniPCemu handles all BIU transfers: using requests which are accepted, transferred and returned, all the while ticking the EU in NOP cycles of 1 cycle at a time.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 161 of 198, by reenigne

Posted on 2019-05-19, 13:49

Rank: Oldbie
Posts: 649
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
Edit: Looking a bit further, it seems all those timings concerning "_accessNumber = " are the timings that are probably missing from UniPCemu(except perhaps the CS/IP-related timings, which don't seem to match at all, perhaps because it's based on one of the earlier replies in this thread instead). So perhaps I would need to take all those timings and add them to UniPCemu's timings for said instruction?

Yes, that represents the limits of my knowledge of the 8088 cycle-exact timings at the moment. It's not perfect yet (there are some corner cases I haven't written tests for yet) but it does give the same results as the real hardware for millions of testcases.

Having said that, _accessNumber is a horrible hack and busInit() really should be much simpler. Some trial-and-error will be needed to find those simplifications. I have a suspicion that the key may be taking into account the 8088's 8086 heritage and original 16-bit bus width. The even and odd bytes of the prefetch queue might not be symmetrical. There is also some official documentation in the "8086 Instruction Sequence" section starting on page 4-37 of the iAPX 86,88 User's Manual, which I recently discovered covers some details of the timing that I had previously only discovered by observation, but which is documented differently to my model. In particular, that document says (third paragraph in second column of page 4-37):

Instead of completing the opcode fetch and forcing the EU to wait four additional clock cycles, the BIU immediately aborts the fetch cycle (resulting in two idle clock cycles (T_I) in clock cycles 19 and 20) and performs the required memory write. This interaction between the EU and BIU results in a single clock extension to the execution time of the PUSH AX instruction, the maximum delay that can occur in response to an EU bus cycle request.

I had observed these two idle clock cycles before but hadn't modelled them as an aborted fetch cycle. Doing so might simplify the code significantly.

Reply 162 of 198, by superfury

Posted on 2019-05-23, 22:28

reenigne wrote:
superfury wrote:
Edit: Looking a bit further, it seems all those timings concerning "_accessNumber = " are the timings that are probably missing from UniPCemu(except perhaps the CS/IP-related timings, which don't seem to match at all, perhaps because it's based on one of the earlier replies in this thread instead). So perhaps I would need to take all those timings and add them to UniPCemu's timings for said instruction?

Yes, that represents the limits of my knowledge of the 8088 cycle-exact timings at the moment. It's not perfect yet (there are some corner cases I haven't written tests for yet) but it does give the same results as the real hardware for millions of testcases.

Having said that, _accessNumber is a horrible hack and busInit() really should be much simpler. Some trial-and-error will be needed to find those simplifications. I have a suspicion that the key may be taking into account the 8088's 8086 heritage and original 16-bit bus width. The even and odd bytes of the prefetch queue might not be symmetrical. There is also some official documentation in the "8086 Instruction Sequence" section starting on page 4-37 of the iAPX 86,88 User's Manual, which I recently discovered covers some details of the timing that I had previously only discovered by observation, but which is documented differently to my model. In particular, that document says (third paragraph in second column of page 4-37):

Instead of completing the opcode fetch and forcing the EU to wait four additional clock cycles, the BIU immediately aborts the fetch cycle (resulting in two idle clock cycles (T_I) in clock cycles 19 and 20) and performs the required memory write. This interaction between the EU and BIU results in a single clock extension to the execution time of the PUSH AX instruction, the maximum delay that can occur in response to an EU bus cycle request.

I had observed these two idle clock cycles before but hadn't modelled them as an aborted fetch cycle. Doing so might simplify the code significantly.

What do you mean by 'The even and odd bytes of the prefetch queue might not be symmetrical'? Is there even a concept of even and odd bytes in a PIQ (a circular buffer of sorts)? What's symmetrical about a PIQ? It can't be the data stored within it (instruction data), as that would corrupt the entire instruction stream. What do you mean by that?

Also, how do you suppose I implement this, seeing as UniPCemu works in an entirely different way? It has a simple 1-command FIFO for sending requests to the BIU (empty only when not busy handling a command, and only fillable when ready) and the reverse for its result FIFO (containing 1 for writes, or the data read from memory, filled on reaching T1 when the memory transfer completes).
UniPCemu's BIU simply spins on 'T1' during DMA or when it has nothing to do (PIQ full and no I/O requests from the EU). The EU spins as well, in 1-cycle increments: while waiting for the request FIFO to become empty so it can post a memory access, and while waiting for the BIU's result FIFO to be filled. The basic commands the BIU receives from the EU are just a few: memory read byte/word/dword, memory write byte/word/dword, bus read byte/word/dword and bus write byte/word/dword. And of course, when the BIU has nothing to do on T1 and the PIQ isn't full, it just starts a memory fetch to fill the PIQ (dword on 80386+, word on 8086+, byte on 8088; dword->word->byte also depending on physical memory alignment, of course). And finally there's DMA, which will currently take the bus between any byte/word/dword transfer, unless the LOCK prefix or XCHG is used.

With a DMA transfer, T3 always releases the bus. DMA takes the bus at that cycle with a delay flag set (performing the first S0 one cycle too early). S0 starts at the T1 cycle of the CPU's blocking loop, where the delay flag is cleared; then, at what would be T2 (of course the CPU doesn't tick, it just patiently waits in the T1 state for DMA to release the bus so it can continue with its next request/PIQ fetch), it ticks the actual S0. The following cycles are the usual documented DMA states (S1-S5, looping S5=S5+SI+S0 without releasing the bus during block transfer/burst mode). So that simulates the usual DMA in a compatible way (at S4/S1). Of course, bus locking blocks DMA from taking the bus until the EU finishes the entire instruction (including trailing cycles).

Reply 163 of 198, by superfury

Posted on 2019-05-24, 13:58

Is the REP prefix used in the 8088 MPH credits? I just found a bug: once a REP instruction went past its first iteration (decreasing CX because the iteration was finished), all other instructions effectively became NOP instructions with 1-cycle timings 😖 Simply because the repeat check forgot to reset the execution phase handler to start a new instruction (it thought the instruction was finished and thus had nothing to do, so it no longer called the instruction handler).
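
A minimal sketch of that bug class, using a made-up, heavily simplified model. None of these names come from UniPCemu; the only point is that the execution phase has to be rearmed after every iteration, otherwise only the first iteration does real work.

```c
/* Hypothetical, simplified model of a REP loop. */
typedef struct {
    unsigned cx;            /* remaining REP iterations */
    int phase;              /* 0 = instruction must (re)start, 1 = finished */
    unsigned handler_calls; /* how often the instruction handler really ran */
} rep_cpu;

/* Runs one iteration of the repeated instruction, if it is allowed to start. */
static void run_handler(rep_cpu *cpu)
{
    if (cpu->phase != 0)
        return;             /* "already finished": nothing happens (the bug) */
    cpu->handler_calls++;
    cpu->phase = 1;         /* this iteration is done */
}

/* Executes a REP-prefixed instruction `count` times and returns how many
 * iterations actually ran. With reset_phase == 0 the execution phase is
 * never rearmed, which reproduces the faulty behaviour. */
unsigned run_rep(unsigned count, int reset_phase)
{
    rep_cpu cpu = { count, 0, 0 };
    while (cpu.cx != 0) {
        run_handler(&cpu);
        cpu.cx--;           /* CX is decreased after each iteration */
        if (cpu.cx != 0 && reset_phase)
            cpu.phase = 0;  /* the fix: start a new iteration */
    }
    return cpu.handler_calls;
}
```

With the reset in place, run_rep(5, 1) executes the handler five times; without it, only once.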

Reply 164 of 198, by superfury

Posted on 2019-05-24, 18:04

Just noticed something interesting after fixing the REP prefix. Now that the instructions used with it actually execute (instead of the second byte/word/dword onwards essentially transferring nothing and having no execution phase while still counting 1 cycle each on the BIU), certain parts of the demo simply skip now? Like the Delorean sprite demo, and also immediately after the first 3D pyramid and the vectorballs part?

Edit: Tried it again with the latest updates (which also fix the prefetch buffer clearing and jumping back to the start of the REPeated instruction). It now also properly handles the REP instruction, no longer prefetching during said instruction's runtime (over the multiple iterations being executed that way for CX count times). Now I notice the sprite recompiler somehow failing completely, with the Delorean car disappearing against the background completely (might be a timing issue, though)?

The vector balls still have a lot of noise in the background besides the vector balls' moving area? It seems to switch between two different static backgrounds?

Reply 165 of 198, by superfury

Posted on 2019-06-01, 23:20

Just found a tiny 'bug' in the CGA/MDA timings. It was properly applying waitstates and address wrapping to memory writes, but not to memory reads. That might account for some timing drift?
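
A minimal sketch of routing reads and writes through one shared helper so that wrapping and waitstates cannot diverge between the two paths. It only models the CGA window (16 KB of video RAM mirrored within B8000-BFFFF); the waitstate count is a placeholder and not a measured value, and all names are hypothetical rather than taken from UniPCemu.

```c
#include <stdint.h>

#define CGA_WINDOW_BASE 0xB8000u  /* start of the CGA memory window */
#define CGA_VRAM_MASK   0x3FFFu   /* 16 KB of video RAM, mirrored in the window */
#define CGA_WAITSTATES  4u        /* placeholder value, NOT a measured number */

static uint8_t cga_vram[CGA_VRAM_MASK + 1u];

/* Shared by reads and writes: charges the waitstates and wraps the address. */
static uint32_t cga_access(uint32_t physaddr, unsigned *cycles)
{
    *cycles += CGA_WAITSTATES;
    return (physaddr - CGA_WINDOW_BASE) & CGA_VRAM_MASK;
}

void cga_write(uint32_t physaddr, uint8_t value, unsigned *cycles)
{
    cga_vram[cga_access(physaddr, cycles)] = value;
}

uint8_t cga_read(uint32_t physaddr, unsigned *cycles)
{
    return cga_vram[cga_access(physaddr, cycles)];
}
```

Because both accessors go through cga_access(), a byte written at B8000 reads back at BC000 and both accesses cost the same number of extra cycles.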

Reply 166 of 198, by superfury

Posted on 2020-03-11, 18:10

I've just been looking at https://github.com/reenigne/reenigne/blob/mas … r/8088/8088.txt again.

I see those IN and OUT instructions mentioning 4+4/8(for AL/AX using DX) and 6+4(for AL/AX).

You've mentioned the 2 cycles before the transfer(and 1 cycle waitstate)? I'd assume the second 4/8 is the actual transfer to the port(T1-T4). Why does it mention the first being 4 or 6? Didn't you mention 2 cycles only?

Edit: At least, the most recent changes increase the metric cycle count to ~1545. Those changes are: proper fetch termination; a 1-cycle startup for perhaps most of the IN/OUT instructions (except those using DX), followed by up to 4 cycles to complete the current prefetch (4 cycles when starting at T1) or the previous one (when at T2-T4, i.e. until it reaches T1 again); the 1-cycle waitstate on the bus transactions (in/out); and, for the E4-E7 opcodes, a 2-cycle idle bus after those cycles. The rolling over fake text screen at the start of the demo (after the calibration screen) also seems to run without visible issues now?

Reply 167 of 198, by reenigne

Posted on 2020-03-12, 10:13

superfury wrote on 2020-03-11, 18:10:

I've just been looking at https://github.com/reenigne/reenigne/blob/mas … r/8088/8088.txt again.

Pay no attention to that document - it's an older one and the timings there are just from the published ones if I recall correctly. https://github.com/reenigne/reenigne/blob/mas … 088/xtce/xtce.h is the second best source of timing information I know of right now, the best being the XT Server and ISA bus sniffer.

Reply 168 of 198, by superfury

Posted on 2020-03-12, 11:44

OK. But the 1-cycle startup after the start of IN/OUT, the 2-cycle BIU idle following that for non-DX IN/OUT, and the 1-cycle waitstate on all I/O bus operations are correct?

8088MPH reports 1547 metric cycles now.

Reply 169 of 198, by reenigne

Posted on 2020-03-12, 12:12

superfury wrote on 2020-03-12, 11:44:

OK. But the 1-cycle startup after the start of IN/OUT, the 2-cycle BIU idle following that for non-DX IN/OUT, and the 1-cycle waitstate on all I/O bus operations are correct?

8088MPH reports 1547 metric cycles now.

The PC/XT motherboard imposes a 1-cycle waitstate on all port IO instructions, yes.

As for the other questions, I'm not sufficiently familiar with your model of the 8088's timings to say for sure. But take a look at http://www.reenigne.org/misc/inout_sniffer.zip for some ISA bus sniffer logs of tricky sequences involving IN and OUT. If your emulator has the same timings for these sequences, it's probably correct for these instructions.

Reply 170 of 198, by superfury

Posted on 2020-03-12, 13:13

Do you have some sniffer log of the metric cycle check done at the start of the demo (as well as its starting address, perhaps the instruction jumping to it for the first time)?

Edit: Btw, UniPCemu now (new addition to the CPU emulation!) waits for the prefetch timings to complete before starting the request (on that cycle) for the instruction's execution phase (e.g. the memory access or port I/O part of the instruction). Previously it did the request and the start of the I/O/MMU request cycle (T1) during the last (few) cycles of the prefetch fetch for execution (essentially overlapping them incorrectly).

Now, after the prefetch cycles, the execution phase(normal instruction handling) starts executing the cycle(s) after that, timing properly.

So for port I/O(BUS as UniPCemu calls it, the other one being MMU/memory), it's one cycle normal behaviour by the BIU(prefetching if T1), then wait for T1 again(if not T1 after that yet), then the 2-cycle idle(for non-DX), then the actual T1-T2-T3-Tw(only 1)-T4 cycles, finishing the instruction on reaching T1(which will start to prefetch, if possible, which it will due to the instruction not lasting long enough to fill it fully again).
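
The sequence described above can be turned into a small cycle count. This is only a sketch of my reading of that description, assuming a prefetch is a plain 4-cycle T1-T4 bus cycle without waitstates; the function and type names are hypothetical and the result is not checked against real hardware.

```c
/* BIU T-states used by this sketch. */
typedef enum { IO_T1, IO_T2, IO_T3, IO_T4 } io_tstate;

/* Cycles the BIU still needs to get back to T1 from a given state. */
static int cycles_until_t1(io_tstate s)
{
    switch (s) {
    case IO_T2: return 3;
    case IO_T3: return 2;
    case IO_T4: return 1;
    default:    return 0;   /* already at T1 */
    }
}

/* Cycles from the start of the execution phase until the port transfer's
 * T4 has completed.
 *   at_start        BIU state when the execution phase begins
 *   prefetch_starts non-zero if the BIU begins a prefetch when it is at T1
 *   uses_dx         non-zero for the DX forms (no 2-cycle idle) */
int port_io_cycles(io_tstate at_start, int prefetch_starts, int uses_dx)
{
    io_tstate after_first;

    /* 1 cycle of normal BIU behaviour (may start a prefetch at T1). */
    if (at_start == IO_T1)
        after_first = prefetch_starts ? IO_T2 : IO_T1;
    else if (at_start == IO_T4)
        after_first = IO_T1;
    else
        after_first = (io_tstate)(at_start + 1);

    return 1                               /* startup cycle */
         + cycles_until_t1(after_first)    /* wait for T1 again */
         + (uses_dx ? 0 : 2)               /* 2-cycle idle for non-DX */
         + 5;                              /* T1-T2-T3-Tw-T4 */
}
```

For example, a DX form that begins while the BIU sits at T1 and starts a prefetch comes out at 1 + 3 + 0 + 5 = 9 cycles under these assumptions.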

Reply 171 of 198, by reenigne

Posted on 2020-03-12, 13:55

superfury wrote on 2020-03-12, 13:13:

Do you have some sniffer log of the metric cycle check done at the start of the demo (as well as its starting address, perhaps the instruction jumping to it for the first time)?

Unfortunately that's easier said than done - the ISA bus sniffer only has 2kB of RAM on the ATMega328 microcontroller that runs it, limiting it to runs of 2048 cycles. I could capture it with multiple runs but it'd be a bit of work.
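
As a rough illustration of what "multiple runs" would involve, here is a small sketch that splits a long stretch of cycles into 2048-cycle capture windows. It assumes back-to-back windows with no overlap, which glosses over the hard part (getting the machine into the same state at the start of every window); the names are hypothetical.

```c
#define SNIFFER_WINDOW 2048u  /* cycles per sniffer run, as stated above */

/* Number of separate sniffer runs needed to cover total_cycles. */
unsigned runs_needed(unsigned total_cycles)
{
    return (total_cycles + SNIFFER_WINDOW - 1u) / SNIFFER_WINDOW;
}

/* First cycle covered by run number `run` (0-based). */
unsigned run_start(unsigned run)
{
    return run * SNIFFER_WINDOW;
}
```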

However, bear in mind that the 8088MPH speed test was never meant to be an emulator torture test - it was just sufficiently sensitive to tell IBM PC/XTs from contemporary machines with similar (but not identical) timings. Getting the speed test to pass doesn't mean that all instructions are correct, nor will even guarantee that the rest of the demo will run correctly. These logs of carefully curated patterns are much more of a torture test. I plan to make an actual emulator torture test out of them soon.

Reply 172 of 198, by superfury

Posted on 2020-03-13, 13:43

What about just splitting the 8088 MPH speed test part into manageable chunks for the bus sniffer(of course starting with a jump to get the correct start cache situation)? If I can run those as some kind of BIOS on UniPCemu in cycle-accurate mode, it might be able to find the offending instructions not taking up enough time(seeing as the count with UniPCemu is less than what it should be on a real 8088)?

Also, having such smaller chunks to test with might at least indicate what opcodes are actually acting up?

Edit: Btw, don't you have some kind of list of all opcodes that are used in xtce.h and their counts for the different parts of the instruction? Like it is atm I have to keep jumping up and down the code for those _accessNumber methods of accessing memory/bus, which is kind of confusing trying to verify it against other emulators? And from what I remember, some other emulators that copy said behaviour use almost exactly the same method, which is confusing as hell, having to jump up and down the code just to find out one instruction's time(or a group of related timings)?

Edit: Just implemented the timings up to and including INCDEC(at least the 16-bit versions) in your code(skipping 00-3B for now), then implemented all timings using jumpNear/jumpShort from your code.

Is it really true that the conditional jumps and normal jumps using jumpShort seem to wait for T1 in the middle of the instruction?

Edit: Applying the missing 00-3B opcode timings, it reports 1546 cycles right now. Something's still missing obviously.

Last edited by superfury on 2020-03-13, 17:57. Edited 1 time in total.

Reply 173 of 198, by Alegend45

Posted on 2020-03-13, 16:49

Rank: Newbie
Posts: 77
Joined: 2012-06-23, 18:18

superfury wrote on 2020-03-13, 13:43:

What about just splitting the 8088 MPH speed test part into manageable chunks for the bus sniffer(of course starting with a jump to get the correct start cache situation)? If I can run those as some kind of BIOS on UniPCemu in cycle-accurate mode, it might be able to find the offending instructions not taking up enough time(seeing as the count with UniPCemu is less than what it should be on a real 8088)?

Also, having such smaller chunks to test with might at least indicate what opcodes are actually acting up?

Edit: Btw, don't you have some kind of list of all opcodes that are used in xtce.h and their counts for the different parts of the instruction? Like it is atm I have to keep jumping up and down the code for those _accessNumber methods of accessing memory/bus, which is kind of confusing trying to verify it against other emulators? And from what I remember, some other emulators that copy said behaviour use almost exactly the same method, which is confusing as hell, having to jump up and down the code just to find out one instruction's time(or a group of related timings)?

Edit: Just implemented the timings up to and including INCDEC(at least the 16-bit versions) in your code(skipping 00-3B for now), then implemented all timings using jumpNear/jumpShort from your code.

Is it really true that the conditional jumps and normal jumps using jumpShort seem to wait for T1 in the middle of the instruction?

You could try looking at 86Box's 808x.c code, as it does pass the 8088MPH check, and it runs the demo just fine all the way through 😜

Reply 174 of 198, by reenigne

Posted on 2020-03-13, 17:22

superfury wrote on 2020-03-13, 13:43:

What about just splitting the 8088 MPH speed test part into manageable chunks for the bus sniffer(of course starting with a jump to get the correct start cache situation)?

Like I said, that's the tricky bit. The queue and bus state also need to be in the right state for each instruction.

superfury wrote on 2020-03-13, 13:43:
If I can run those as some kind of BIOS on UniPCemu in cycle-accurate mode, it might be able to find the offending instructions not taking up enough time(seeing as the count with UniPCemu is less than what it should be on a real 8088)?

Taking the right number of cycles is only part of the puzzle, though. Each instruction also has to leave the bus and prefetch queue in the right state. You might have some instructions that are taking too long as well. So for your purposes it is better to do a lot of small tests than one big one like the 8088 MPH benchmark.

superfury wrote on 2020-03-13, 13:43:
Edit: Btw, don't you have some kind of list of all opcodes that are used in xtce.h and their counts for the different parts of the instruction?

Unfortunately not as the 8088 timings are more complicated than that.

superfury wrote on 2020-03-13, 13:43:
Like it is atm I have to keep jumping up and down the code for those _accessNumber methods of accessing memory/bus, which is kind of confusing trying to verify it against other emulators?

_accessNumber is a hack which I would like to get rid of, once I've figured out how to do so and keep the same timings. I have some ideas about how to do this, but I need to sit down with it for a while and work through it. However, as I am having to cancel plans all over the place for pandemic-related reasons, I might have time to do this quite soon.

superfury wrote on 2020-03-13, 13:43:
And from what I remember, some other emulators that copy said behaviour use almost exactly the same method, which is confusing as hell, having to jump up and down the code just to find out one instruction's time(or a group of related timings)?

That may be because the emulator that you are thinking of uses XTCE's code (with my blessing).

Edit: Just implemented the timings up to and including INCDEC(at least the 16-bit versions) in your code(skipping 00-3B for now), then implemented all timings using jumpNear/jumpShort from your code.

superfury wrote on 2020-03-13, 13:43:
Is it really true that the conditional jumps and normal jumps using jumpShort seem to wait for T1 in the middle of the instruction?

That is the best explanation that I have so far been able to come up with based on the observed behaviour. I'm hoping that once I sort out the _accessNumber mess then a lot of other things like that can be done in a way that seems more likely to reflect what the chip is actually doing.

Reply 175 of 198, by superfury

Posted on 2020-03-13, 18:32

About UniPCemu's cycle-accurate method, it's pretty simple:
- The BIU runs after each EU cycle (the EU only runs when it's not sleeping) in UniPCemu. The EU posts a request to the BIU for the current cycle(s) to execute. After posting it, the EU essentially sleeps for n cycles (leaving only the BIU running). Once the BIU has finished its job (reached T1 again from T4), it posts its result in a reply buffer (the data read for reads, 1 for writes). Once the BIU has processed all required cycles, the EU starts running again.
- The BIU runs without stopping, either processing T1-T2-T3-Tw(times * )-T4 cycles (when doing a memory access) or staying at T1 with the bus idle when it has nothing to do. It will fill the prefetch queue when it's empty enough; otherwise (without a request from the EU) it will idle. And of course, requests from the EU have priority over prefetching bytes/words/dwords from memory (which depends on the PIQ free size in bytes and the current PIQ prefetch physical memory address).
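
The two points above can be sketched as a tiny state machine. This is a heavily simplified illustration and not UniPCemu's actual biu.c: it models byte transfers only, has no waitstates or DMA, and every name in it is hypothetical.

```c
typedef enum { BIU_T1, BIU_T2, BIU_T3, BIU_T4 } biu_state;
typedef enum { XFER_NONE, XFER_EU, XFER_PREFETCH } xfer_kind;

typedef struct {
    biu_state state;     /* current T-state; T1 doubles as "idle" */
    xfer_kind active;    /* what the running bus cycle is for */
    int request_pending; /* 1-slot request FIFO, filled by the EU */
    int result_ready;    /* 1-slot result FIFO, emptied by the EU */
    unsigned piq_used;   /* bytes currently in the prefetch queue */
    unsigned piq_size;   /* 4 on an 8088 */
} biu;

/* EU side: post a request. Fails (returns 0) while the slot is occupied,
 * in which case the EU burns a cycle and retries. */
int biu_post_request(biu *b)
{
    if (b->request_pending)
        return 0;
    b->request_pending = 1;
    return 1;
}

/* EU side: collect the result. Returns 0 while the BIU is still busy. */
int biu_take_result(biu *b)
{
    if (!b->result_ready)
        return 0;
    b->result_ready = 0;
    return 1;
}

/* Advance the BIU by one clock. */
void biu_tick(biu *b)
{
    switch (b->state) {
    case BIU_T1:
        if (b->request_pending)
            b->active = XFER_EU;         /* EU requests have priority */
        else if (b->piq_used < b->piq_size)
            b->active = XFER_PREFETCH;   /* otherwise fill the queue */
        if (b->active != XFER_NONE)
            b->state = BIU_T2;           /* else keep spinning on T1 */
        break;
    case BIU_T2: b->state = BIU_T3; break;
    case BIU_T3: b->state = BIU_T4; break;
    case BIU_T4:
        if (b->active == XFER_EU) {
            b->request_pending = 0;      /* request slot is free again */
            b->result_ready = 1;         /* reply posted on reaching T1 */
        } else if (b->active == XFER_PREFETCH) {
            b->piq_used++;
        }
        b->active = XFER_NONE;
        b->state = BIU_T1;
        break;
    }
}
```

The EU side then simply calls biu_post_request() and biu_take_result() once per cycle until each succeeds, which is the client/server relationship in miniature.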

https://bitbucket.org/superfury/unipcemu/src/ … PCemu/cpu/biu.c

Most timings by the EU are simply kept using various instruction and internal counters(instruction counters for the instruction level itself, internal counters for the common instruction handlers and interrupt handling).
Said instruction counters are simply instruction state registers, which increment by 1 for each step. Steps come in pairs: the first of each pair is the request (even step number) and the second is the result retrieval (odd step number).
Said request and result functions can either pass(return 0) to make the instruction continue on or skip already done steps, or fail(return 1) to make the instruction abort and wait for the BIU to become ready for the step.

For example the add instruction to modr/m uses this:

 1  byte CPU8086_instructionstepreadmodrmw(word base, word *result, byte paramnr)
 2  {
 3      byte BIUtype;
 4      if (CPU[activeCPU].modrmstep==base) //First step? Request!
 5      {
 6          if ((BIUtype = modrm_read16_BIU(&params,paramnr,result))==0) //Not ready?
 7          {
 8              CPU[activeCPU].cycles_OP += 1; //Take 1 cycle only!
 9              CPU[activeCPU].executed = 0; //Not executed!
10              return 1; //Keep running!
11          }
12          ++CPU[activeCPU].modrmstep; //Next step!
13          if (BIUtype==2) //Register?
14          {
15              ++CPU[activeCPU].modrmstep; //Skip next step!
16          }
17          else //Memory?
18          {
19              BIU_handleRequests(); //Handle all pending requests at once when to be processed!
20          }
21      }
22      if (CPU[activeCPU].modrmstep==(base+1))
23      {
24          if (BIU_readResultw(result)==0) //Not ready?
25          {
26              CPU[activeCPU].cycles_OP += 1; //Take 1 cycle only!
27              CPU[activeCPU].executed = 0; //Not executed!
28              return 1; //Keep running!
29          }
30          ++CPU[activeCPU].modrmstep; //Next step!
31      }
32      return 0; //Ready to process further! We're loaded!
33  }

So, for example, a simple load&store instruction goes like this(e.g. ADD [0],12h):
- Request [0]. On failure (the BIU isn't ready for a request yet because it's still busy handling something), the function reaches line 8 in the code above and aborts the instruction (the return in C) to wait for the BIU. On success, it continues on to check the result (which isn't there yet for new requests), reaching line 12.
- Check the result. If there's a result, read it (the BIU is now ready for a new request). Registers are already read during the previous request step (returning the value 2). Otherwise, abort the instruction until the result is ready to be read (reaching line 26).
- Once the result is successfully read, the result of the function is 0, allowing the caller to continue handling the next step of the instruction timing.

The same kind of request/result method (setting CPU[activeCPU].executed to 0 to not finish the instruction and make the BIU tick some) is basically used for any timing in the EU core.

For the EU core itself, see: https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c

The basics are pretty simple: look up the handler for the opcode (CPU8086_OPxx), which will either handle the instruction using the functions mentioned above (the CPU_instructionstep* functions), or call the generic handlers that are shared by multiple instructions (the CPU_internal_* functions), e.g. for ADD/SUB/XOR/CMP/XCHG, as well as some misc instructions: XLAT, the string instructions, the adjustment instructions (DAA/DAS/AAD/AAS), the RET(F) instructions, INTO, LxS (LDS/LES) and far call (which is partly handled externally, in the 80386 protected mode handlers: see protection.c, the else clause at the end of function segmentWritten, mostly for compatibility with 80286+ segment writes and jumps/calls etc.).

Edit: So currently I'm about 114 cycles short with the 8088 MPH metric cycle count of 1546. That's still quite a lot of cycles missing?

As can be seen in BIU.c, the 808X requests are pretty much always handled on T1, so a request made while the BIU isn't at T1 yet is automatically delayed until T1 is actually reached (the EU might place the request, but the BIU will only start handling it when the state becomes T1 again, after finishing the prefetch operation or the previous request). Normally (as can be seen) the requests are finished by the EU after the BIU posts its result value (what's read, or 1 for writes), so the request should always get posted (although the BIU will finish the prefetch it's handling before getting to said request and clearing the buffer). And since the BIU will sleep the EU while it's handling a previous instruction's timings (e.g. the cycles from a MUL instruction), the EU won't start checking again until its cycles are fully handled, making it sync properly now (previously it didn't do this properly at the start of an instruction, which was an obvious bug).

Essentially it's like the EU (the client) talking to the BIU (the server). That's the way it's built. Of course the EU keeps track of its execution state using some simple counters (for different kinds of steps), which increase by 1 for each request/response or by 2 with functions which simply deal with timings and delays (e.g. CPU8086_instructionstepdelayBIUidle). Those are split up as mentioned above, with separate counters for normal steps and for modr/m-based steps.

Reply 176 of 198, by superfury

Posted on 2020-03-13, 18:57

Thinking about it, can't I just run 86Box with some logging and compare it to UniPCemu's during the metric cycle count part? That would at least indicate to me what might be wrong?
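
A minimal sketch of such a comparison, assuming both emulators can be made to log one entry per executed instruction. The log format (address plus cycle count) is hypothetical and is not something 86Box or UniPCemu is known to produce out of the box.

```c
#include <stddef.h>

/* One logged instruction. Both fields are hypothetical log columns. */
typedef struct {
    unsigned long address; /* linear address of the instruction (CS:IP) */
    unsigned cycles;       /* cycles the emulator charged for it */
} trace_entry;

/* Returns the index of the first entry where the two traces disagree,
 * or n when the first n entries are identical. */
size_t first_divergence(const trace_entry *a, const trace_entry *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (a[i].address != b[i].address || a[i].cycles != b[i].cycles)
            return i;
    }
    return n;
}
```

Only the first mismatch is really meaningful: after one instruction takes a different number of cycles, the bus and prefetch queue state differ, so later entries may diverge as a consequence rather than because of their own bugs.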

Reply 177 of 198, by superfury

Posted on 2020-05-18, 23:25

Hmmm... Was just searching for some other info to use, then found some of reenigne's posts on vcfed.org.

There, he says(http://www.vcfed.org/forum/showthread.php?319 … er-device/page2):

reenigne wrote:
The BIU usually starts a new prefetch 1 cycle after a byte is grabbed from the queue (if the queue is full and a byte is grabbed on cycle 0 then cycle 1 will usually be a T1 state for the next prefetch). Exceptions I've seen so far: "POP rw", "POP segreg", "POPF" and "RET" (start bus cycle for the stack fetch on cycle 3), "IN AL, DX" and "IN AX, DX" (start bus cycle for port IO on cycle 3). In all these cases, the CPU knows from the opcode that it's going to need a fetch, has the address in a register, and knows which register from the opcode.

Is that 2-cycle delay now already implemented in UniPCemu (what we've been talking about a few posts back, with IN/OUT and its variations)? Or is this something I still need to add to the BIU model somehow? Currently, the BIU will start a prefetch after an opcode is fetched from the BIU, unless the EU's execution phase (for 1-byte opcodes) prevents it from doing so (by delaying the BIU's operation by n cycles).
For e.g. IN AL,DX (assuming the PIQ is full), UniPCemu fetches the opcode on the first cycle, with execution starting on the next cycle. That next cycle (the first execution step) lets the BIU start a prefetch bus transaction if it's at T1 and no DMA request is pending (DMA is given priority), which it probably is (T2 in this example). Then on the cycle after that (the 3rd cycle counting from, and including, the PIQ read; T3 in this case), it posts a request for an 8-bit port read to the BIU, which the BIU will start on the very next T1 cycle it reaches (there are Tw and T4 cycles before that, though). So is the behaviour you've mentioned there already emulated in UniPCemu?
Or do I need to prevent a prefetch T1 cycle from starting when the EU grabs a byte from the prefetch queue (i.e. prevent T1 and a PIQ->EU fetch from occurring on the same clock)?

Reply 178 of 198, by superfury

Posted on 2020-05-22, 16:21

Just found some more timing issues on the 808X:
D0/D1: 4 cycles for memory, 1 cycle for non-memory. This was previously 5 for memory, 3 for non-memory.
D2/D3: 9+4*shift for memory, 7+4*shift for non-memory. This was previously (5+4*shift)-8 for memory, 3+4*shift for register. Of course, that would end up with a negative timing for memory operands (5+(4*0)-8=-3 cycles), so it should have been obvious that it's incorrect.

That's based on my observation of xtce.h:

1 for non-memory
d0/d1: 4 for memory
d2/d3: 9 for memory, 6 for non-memory
d2/d3: 4 for each shift

d0/d1: 1(non-memory)+4(memory)
d2/d3: 1(non-memory)+(9(memory)/6(non-memory))+4*shift

Is this correct?
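As a quick sanity check of the numbers above, here is a small C sketch of the new and the previous D0-D3 timings as stated at the top of this post. The function names are made up for illustration, the values are simply the ones quoted here (not verified against xtce.h), and bus transfer cycles are assumed to be handled separately by the BIU.

```c
#include <assert.h>

/* New EU-side cycle counts for opcodes D0-D3, as quoted in this post.
   count is the shift count (from CL) for D2/D3 and is ignored for D0/D1. */
static int grp2_shift_cycles(unsigned int opcode, int is_memory, int count) {
    assert(opcode >= 0xD0 && opcode <= 0xD3);
    assert(count >= 0);
    if (opcode <= 0xD1)                      /* D0/D1: shift by 1 */
        return is_memory ? 4 : 1;
    return (is_memory ? 9 : 7) + 4 * count;  /* D2/D3: shift by CL */
}

/* Previous formula, kept only to show where it went wrong. */
static int grp2_shift_cycles_old(unsigned int opcode, int is_memory, int count) {
    assert(opcode >= 0xD0 && opcode <= 0xD3);
    assert(count >= 0);
    if (opcode <= 0xD1)
        return is_memory ? 5 : 3;
    return is_memory ? (5 + 4 * count) - 8 : 3 + 4 * count;
}
```

With a shift count of 0 and a memory operand, the old formula yields -3 while the new one yields 9, which is the negative-timing problem described above.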

Edit: 8088MPH reports 1628/1629 cycles now. That's only 39/40 PIT cycles off, or 156/160 8088 cycles off.

Of course, 40 PIT cycles sounds awfully like a rounded number. Is that coincidence? Or is it an indication of what's going wrong? Reenigne?

Edit: Just found a bug in the HLT REP-dependent timing. The 1 cycle being applied was inverted (1 when repeating, 0 otherwise, instead of 0 when repeating, 1 otherwise). So it's now 2 when not repeating, 1 otherwise, instead of 1 when not repeating, 2 otherwise.
Edit: Although it doesn't change the 1628/1629 metric cycle count.

Reply 179 of 198, by superfury

Posted on 2020-05-22, 20:07

Just implemented the PUSH and POP timings. The metric cycle count now decreases to 1613.
That's 55 PIT cycles off.

Edit: Implementing the GRP5 PUSH r/m instruction timings(the 1 cycle before it starts) pushes it down to 1611 metric cycles? So that's 57 PIT cycles off.

Although I haven't looked at all the JMP and CALL instructions yet(other than the conditional jumps).

Edit: Further improving the PUSH (seg)reg instructions to delay 3 cycles before the push and none after (effectively 1 for completion) causes it to drop further, to 1608 cycles? So it's 60 PIT ticks off now (240 CPU cycles)?
Edit: With POPF added, it now stands at 1605 metric cycles... So that's 63 PIT ticks off (252 CPU cycles)... It looks like the more accurate it becomes, the lower the metric cycle count drops?
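For reference, the conversion used in these edits can be written out as a small C sketch. One PIT tick equals 4 CPU cycles on a 4.77 MHz PC (CPU clock 14.31818 MHz / 3, PIT clock 14.31818 MHz / 12). The reference count of 1668 is an assumption inferred from the figures in these posts (1613 being 55 PIT cycles off), and the names are made up for illustration.

```c
#include <assert.h>

/* 4.77 MHz CPU clock versus 1.193 MHz PIT clock: 4 CPU cycles per PIT tick. */
#define CPU_CYCLES_PER_PIT_TICK 4

/* Assumption: reference metric inferred from "1613 ... 55 PIT cycles off". */
#define REFERENCE_METRIC 1668

/* PIT ticks missing relative to the assumed reference count. */
static int pit_ticks_short(int measured_metric) {
    assert(measured_metric >= 0);
    return REFERENCE_METRIC - measured_metric;
}

/* The same shortfall expressed in 8088 clock cycles. */
static int cpu_cycles_short(int measured_metric) {
    return pit_ticks_short(measured_metric) * CPU_CYCLES_PER_PIT_TICK;
}
```

For example, a measured count of 1608 gives 60 PIT ticks (240 CPU cycles) and 1605 gives 63 PIT ticks (252 CPU cycles), matching the numbers above.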

What I do in these cases now is basically count the cycles until the access starts (which seems to correspond to your busState checks in busInit()) and ignore everything after that (as UniPCemu performs T1 synchronization automatically by waiting for the BIU to accept the request). The issue might be (in this case) that finishing the instruction executes the instruction handler as well, which causes the EU to tick 0 cycles, but the CPU performs a 1-cycle tick instead, as it cannot handle ticking 0 cycles. Perhaps that's part of the issue?

Edit: Just adjusted the BIU ticking on the end of an instruction to allow the 0-cycle state(which usually only happens when finishing an instruction and nothing is left to be done). It simply doesn't tick the BIU in that case(allowing it to be absorbed into the next instruction fetching process instead).

Edit: Skipping the 0-cycle BIU tick makes it drop even further: to 1587 metric cycles.
Although now finishing the transfer of an instruction should no longer cause a 1-cycle tick on the BIU, which is a good thing: 4 cycles for a transfer are actually 4 cycles instead of 5 EU cycles (1 cycle for the request, 3 cycles of execution, and 1 cycle for the result, which is now discarded by the 0-cycle handling in the generic CPU handler). So the BIU won't tick one extra cycle if the CPU has 0 cycles to tick at the end of an instruction (or at any other location for that matter).

So it's T1-T4 for a memory access, and instead of ticking another T1 cycle before starting the next instruction on the EU (the T1 that ticked the final instruction phase reading the result from the BIU), the EU ticks the BIU for 0 cycles instead (so the BIU doesn't tick, nor does the hardware) and the next instruction fetch from the PIQ to the EU starts on the next T1 cycle again in that case. Of course, T1 will still tick if the instruction handler indicates that it wants the EU to tick some final clocks before finishing the instruction (in which case the EU is stopped and does nothing anymore, waiting for the BIU to catch up on those remaining cycles before the first byte of the next instruction is fetched for execution).

So it's now(e.g. ADD memoryaddress,01h; only the writeback phase):
...
(clock ticks for the execution timing of the instruction)
(waiting for the BIU to reach T1)
T1 start writeback
T2
T3
T4 finish writeback(active BIU cycle), instruction finishes with a 0-cycle count(inactive BIU cycle, skipped by the CPU).
T1 instruction byte fetch from memory when the BIU is ready(no DMA transfer started). PIQ fetch into the EU is performed if the PIQ isn't empty.
T2 when the PIQ wasn't full on the previous instruction or DMA transfer active. Usually busy on a PIQ fetch from memory. The EU could be requesting more bytes, ticking the EU each cycle.
(... transfer if PIQ was empty on the previous instruction).
T4 result is stored in the PIQ(when a transfer was busy).
T1 EU gets the new byte from the PIQ or the PIQ is checked by the BIU if it's supposed to fetch from memory.
etc.
Eventually the EU has the entire instruction, after which timing starts behaving according to the instruction again. Each request is accepted any time the BIU isn't doing work for the CPU, and the access starts when DMA is inactive and the BIU is in the T1 state (until that T1 for the PIQ fetch or EU request starts, the bus is either idle or busy with DMA).

That's what's happening with the new model. Previously, the T1 after the writeback T4 would cause the EU to stall for 1 cycle when the instruction had nothing left to do or time after its final transfer. This no longer happens, thus another metric cycle count drop (to 1587 metric cycles).
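The difference between the old and new handling can be sketched in C as follows. This is only an illustration under simplified assumptions: the names (`Machine`, `tick_old`, `tick_new`, `writeback_clocks`) are made up and not UniPCemu's, and the writeback is modelled as three EU steps of 1, 3 and 0 cycles as described above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the emulated machine: just a BIU clock counter. */
typedef struct { unsigned long biu_clocks; } Machine;

/* Old behaviour: a 0-cycle EU step was rounded up to a 1-cycle tick. */
static void tick_old(Machine *m, int eu_cycles) {
    if (eu_cycles < 1) eu_cycles = 1;
    m->biu_clocks += (unsigned long)eu_cycles;
}

/* New behaviour: a 0-cycle EU step doesn't tick the BIU (or hardware) at all. */
static void tick_new(Machine *m, int eu_cycles) {
    if (eu_cycles <= 0) return;
    m->biu_clocks += (unsigned long)eu_cycles;
}

/* Writeback phase: 1 cycle request, 3 cycles transfer, 0-cycle result read. */
static unsigned long writeback_clocks(void (*tick)(Machine *, int)) {
    static const int eu_steps[] = { 1, 3, 0 };
    Machine m = { 0 };
    for (size_t i = 0; i < sizeof eu_steps / sizeof eu_steps[0]; i++)
        tick(&m, eu_steps[i]);
    return m.biu_clocks;
}
```

Under these assumptions the writeback costs 5 clocks with the old ticking and 4 with the new one, which is the one-cycle-per-transfer saving described above.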

So it's 324 8088 cycles short now.
