UniPCemu cycle accurate 8088 implementation

Reply 60 of 198, by superfury

Posted on 2017-04-09, 13:52

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 5466
Joined: 2014-03-08, 11:25
Location: Netherlands

Vladstamate, one thing strikes me as odd, though: even when subtracting 4 cycles per memory access, the ADD timings still don't match your emulator's timings. You use 3 cycles for all cases, but subtracting 4 cycles per memory access from the documented timings gives different values:
register,register: 3
register,memory: 9-4=5
memory,register: 16-(4*2)=8
register,immediate: 4
memory,immediate: 17-(4*2)=9
accumulator,immediate: 4

Why are those timings in your emulator so different? How did you arrive at 3 cycles for all cases?
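
The subtraction in the list above can be written out as a small table. This is only a sketch of the arithmetic in this post: the struct, the array and the function name are made up for illustration, and the 4 cycles per bus access is the assumption under discussion here, not a confirmed figure.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption from the post: one memory access costs 4 cycles (T1-T4). */
enum { CYCLES_PER_ACCESS = 4 };

/* Hypothetical table entry: documented ADD timing and the number of
   memory accesses that are assumed to be included in it. */
typedef struct {
    const char *form;  /* operand form as listed above */
    int documented;    /* documented cycle count (EA excluded) */
    int accesses;      /* memory accesses assumed to be included */
} AddTiming;

static const AddTiming add_timings[] = {
    { "register,register",     3, 0 },
    { "register,memory",       9, 1 },
    { "memory,register",      16, 2 },
    { "register,immediate",    4, 0 },
    { "memory,immediate",     17, 2 },
    { "accumulator,immediate", 4, 0 },
};

/* EU-only cycles: documented timing minus the assumed bus cycles. */
int eu_cycles(const AddTiming *t)
{
    assert(t != NULL);
    return t->documented - CYCLES_PER_ACCESS * t->accesses;
}
```

Calling eu_cycles() on each entry gives the values in the list above, which is where the 5, 8 and 9 come from.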

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 61 of 198, by vladstamate

Posted on 2017-04-09, 15:47

vladstamate Offline

Rank Oldbie

Rank: Oldbie
Posts: 967
Joined: 2015-08-23, 01:43

It is and it is not wrong.

For example, "acc,imm" and "reg,imm": while the manual says 4, the execution is really 3, because you spend 1 cycle fetching the immediate from the prefetch queue. For the other ones, I do not remember what my reasoning was, but it does seem 3 might be wrong.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 62 of 198, by superfury

Posted on 2017-04-09, 17:47

If the immediate takes 1 cycle in those given timings, then what about the modr/m parameters, since a modr/m parameter has a variable length (1-3 fetches/cycles)? The next problem we face is: how did Intel arrive at the timings given in their manual? If we knew that, much, if not all, of the EU timings could be accurately extrapolated from the documented timings.

Reply 63 of 198, by superfury

Posted on 2017-04-09, 21:20

I still need to change the BIU and DMA so they can work together at the cycle level. Currently:
- The EU ticks first, resulting in 1 or more EU cycles and possibly a BIU request (always 1 EU cycle).
- The BIU ticks off the EU cycles. States T1-T4 advance one tick per EU cycle. If DMA isn't occupying the BUS, each T4 state performs the load/store for memory (on request), I/O (on request) or prefetch (the default).
- Finally, the DMA ticks its states S1-S4 in the same loop, like the BIU. MMU requests are started on S1. S4 releases the BUS to the CPU.
- After this, all other hardware ticks as well.

That is the basic core loop which, after a few ticks, is synchronized to real time using a high-resolution clock (PSP RTC, Windows high-resolution clock or, as a failsafe, getTimeOfDay (ms/ns resolution)).

So, essentially, the EU performs a step (1+ cycles), then the BIU ticks for that same amount of time, then the DMA and finally all other hardware.
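
The ordering described above could be sketched roughly as follows. All names (Machine, eu_step, machine_step) are hypothetical and not taken from UniPCemu, the EU is replaced by a stand-in that just reports a cycle count, and the check for DMA owning the BUS is left out to keep the sketch short.

```c
#include <assert.h>

/* Hypothetical machine state, only what the ordering sketch needs. */
typedef struct {
    unsigned eu_cycles;     /* total EU cycles executed */
    unsigned biu_state;     /* 0..3 = T1..T4 */
    unsigned dma_state;     /* 0..3 = S1..S4 */
    unsigned bus_transfers; /* BIU transfers completed (one per T4) */
    unsigned hw_cycles;     /* cycles handed to the other hardware */
} Machine;

/* Stand-in for the EU: pretend the step took the given number of cycles. */
static unsigned eu_step(Machine *m, unsigned cycles)
{
    m->eu_cycles += cycles;
    return cycles;
}

void machine_step(Machine *m, unsigned cycles_for_this_step)
{
    unsigned i, spent;
    assert(cycles_for_this_step >= 1); /* an EU step is always 1+ cycles */
    spent = eu_step(m, cycles_for_this_step); /* 1. the EU ticks first */
    for (i = 0; i < spent; ++i)
    {
        /* 2. the BIU advances one T state per EU cycle; T4 completes a transfer */
        if (m->biu_state == 3) ++m->bus_transfers;
        m->biu_state = (m->biu_state + 1) & 3;
        /* 3. the DMA advances S1..S4 in the same loop */
        m->dma_state = (m->dma_state + 1) & 3;
    }
    m->hw_cycles += spent; /* 4. all other hardware gets the same time */
}
```

Note that in this form the other hardware only sees the whole EU step at once, which is exactly the granularity problem mentioned at the start of this post.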

Reply 64 of 198, by superfury

Posted on 2017-04-09, 22:49

I've just added result bytes to the 8086 internal handlers to allow more accurate step timings (blocking the rest of the opcode handler from executing during the current step).

MIPS 1.10 now reports 1.05, 1.03, 1.08, 1.02, 1.19, 1.09. I'm getting close, according to it 😀

8088MPH reports 1523 cycles(9%)?

Edit: The credits crash horribly because the self-modifying code is failing (not enough prefetched?). Why would that happen? Is there a problem with the way it prefetches? Or is it the low cycle timings? Why would it be too fast? As far as I understand, decreasing the EU cycles by even more than the pure memory cycles will lead to inaccurate timings (pushing the 1523 cycles even lower).

Last edited by superfury on 2017-04-09, 23:17. Edited 1 time in total.

Reply 65 of 198, by vladstamate

Posted on 2017-04-09, 23:14

superfury wrote:
If the immediate takes 1 cycle in those given timings, then what about the modr/m parameters, since a modr/m parameter has a variable length (1-3 fetches/cycles)? The next problem we face is: how did Intel arrive at the timings given in their manual? If we knew that, much, if not all, of the EU timings could be accurately extrapolated from the documented timings.

The reason an immediate read should take an extra cycle is that, to read the immediate from the prefetch queue, the CPU has to decode the instruction fully, and that takes at least a cycle. In CAPE I specifically keep the mod/rm reading and decoding and the initial opcode decoding in the same 1 cycle. So all this takes 1. The immediate, if necessary, is separate.

Also, one thing to keep in mind is that the immediate might have to be read at the end of the instruction, as in the case of

ADD [mem], Imm

There is a lot of work to do before we even attempt to read the imm: the EA has to be calculated (which might involve reading a displacement), then the bytes have to be read, and only then can you queue a request for the immediate. And that will take yet another cycle, for the BIU to respond and all that.

Reply 66 of 198, by superfury

Posted on 2017-04-09, 23:39

I'm currently taking 1 cycle to fetch each byte of the instruction, while each prefix adds 2 cycles to its step. Finally, it waits for the EA cycles and after that starts the execution phase (the opcode handler), which terminates the decode phase.

In general, the execution phase is handled like this:
1. Read the modr/m or direct memory (opcodes A0/A1) bytes using the BIU, when used.
2. Execute the instruction.
3. Write data back to the modr/m or direct memory, when used.

Reading or writing memory/IO (BUS) uses two steps:
1. Issue a request to the BIU. When unsuccessful (BIU busy), abort. Otherwise, proceed to step 2.
2. Read the result (always 1 during writes) from the BIU. When successful (BIU finished), increase the step to skip future accesses until the next instruction. Otherwise, abort instruction handling.

Each abort takes 1 EU cycle; otherwise the call takes 0 cycles (the BIU transaction completed or had already finished in a previous step). Multiple step counters are used for the different subphases (modr/m I/O, internal handler, bus I/O or opcode handler). The calls automatically 'expire' themselves by increasing the step counter into the next part's range, e.g. steps 0-1 are the first operand, steps 2-3 the second operand, steps 4-5 the writeback of the first operand, steps 6-7 the writeback of the second operand, and step 8 means no memory action, as is the case with an XCHG instruction. The internal step is used to step through the surrounding parts/blocks: check (protection, used on the 286+; always continues on the 80(1)8X), read, execute (swap) and writeback.
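
As an illustration of the two-step access with aborts, here is a minimal sketch. The Biu and EuAccess types and the function names are hypothetical, the BIU is reduced to a single outstanding request, and the 4-tick transfer length is an assumption standing in for T1-T4.

```c
#include <assert.h>

/* Hypothetical BIU model: one outstanding request at a time. */
typedef struct {
    int busy;           /* a request is in flight */
    int ticks_left;     /* bus ticks until the result is ready */
    unsigned char data; /* result byte */
} Biu;

/* Hypothetical per-access EU state. */
typedef struct {
    unsigned step;      /* 0 = issue request, 1 = wait for result, 2 = done */
    unsigned eu_cycles; /* each abort costs 1 EU cycle */
} EuAccess;

void biu_tick(Biu *b)
{
    assert(b != 0);
    if (b->busy && b->ticks_left > 0) --b->ticks_left;
}

/* Returns 1 when the access has completed, 0 when the handler must abort
   and be called again on the next EU cycle. */
int eu_read_byte(EuAccess *a, Biu *b, unsigned char value_in_memory,
                 unsigned char *result)
{
    if (a->step == 0) /* step 1: issue the request */
    {
        if (b->busy) { ++a->eu_cycles; return 0; } /* BIU busy: abort */
        b->busy = 1;
        b->ticks_left = 4; /* assumed T1-T4 */
        b->data = value_in_memory;
        a->step = 1;
    }
    if (a->step == 1) /* step 2: collect the result */
    {
        if (b->ticks_left > 0) { ++a->eu_cycles; return 0; } /* not finished: abort */
        *result = b->data;
        b->busy = 0;
        a->step = 2; /* expire: later calls skip the access at no cost */
    }
    return 1;
}
```

The handler is simply called again every EU cycle until it returns 1. Once the step counter has moved past the access, further calls fall straight through without touching the BIU.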

Current code: https://bitbucket.org/superfury/unipcemu/src/ … 086.c?at=master

Reply 67 of 198, by superfury

Posted on 2017-04-12, 09:07

I've just adjusted the 8086+ instructions (up to the 80286 core) to implement the documented 80286 timings as well. It should now generate proper timings for those, but because 286+ memory waitstates (1-waitstate memory) are still unsupported, it runs just a bit too fast, giving a counter of F7FF, which is just a little too low for the IBM AT BIOS to run (it needs about 60 more).

Edit: After modifying the 80186 core to use the BIU, the DMA refresh speed test now results in a cycle count of F800 (which still isn't enough).
Edit: Just modified the 80286 core as well. It still reports F800, but at least F8A7 is required, so A7h PIT ticks are still missing? That is A7h*4=668 (decimal), so at least 668 cycles are missing somewhere? Could this purely be the DMA not working accurately enough?

Edit: I've just modified the BIU to block the CPU while it's handling the cycles given by the EU. It now handles the EU-provided cycles on a cycle-by-cycle basis and rewrites the amount of cycles to handle for the other hardware to 1 cycle. Thus the BIU generates a 1-cycle clock signal for the hardware (VGA, DMA etc.), while also responding at full accuracy to DMA cycles, which always start on T1 and last until T4 for a single byte/word (IBM AT) transfer. So it still receives the amount of cycles to spend from the EU, but takes them apart, single-stepping through them 1 cycle at a time and giving the other hardware time to respond each cycle (needed for DMA accuracy).

The DMA simply keeps a 2-bit counter that starts any available transfer when it is 0 (S1, requesting the BUS at that cycle) and terminates the transfer (releasing the BUS to the CPU) in its S4 state. The BIU watches the status of the DMA and won't request any new data to/from any hardware while the DMA is keeping the BUS busy. The exchange between DMA and CPU is handled through a simple byte variable (0=BUS available, 1=CPU has BUS, 2=DMA has BUS). Once either requests the BUS, it changes to 1 or 2. When it's done, it changes back to 0. When the CPU or DMA wants to take the BUS, it won't take it while the other one has access (e.g. the BIU seeing 2 or the DMA seeing 1). This gives a simple exchange between the CPU and DMA, sharing the BUS between them, with the CPU getting the first chance to take the BUS when it's idle (state 0).
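
The shared byte variable can be sketched like this. The 0/1/2 values are the ones described above; the function names are hypothetical and the variable is a plain global purely to keep the example short.

```c
#include <assert.h>

/* Values as described above: 0=BUS available, 1=CPU has BUS, 2=DMA has BUS. */
enum { BUS_IDLE = 0, BUS_CPU = 1, BUS_DMA = 2 };

static unsigned char bus_owner = BUS_IDLE;

/* Try to take the BUS for `who`. Succeeds when the BUS is idle or already
   owned by `who`; fails while the other side has it. */
int bus_request(unsigned char who)
{
    assert(who == BUS_CPU || who == BUS_DMA);
    if (bus_owner != BUS_IDLE && bus_owner != who) return 0;
    bus_owner = who;
    return 1;
}

/* Give the BUS back; only the current owner can release it. */
void bus_release(unsigned char who)
{
    if (bus_owner == who) bus_owner = BUS_IDLE;
}

/* One arbitration round: the CPU asks first, so it wins when the BUS is idle. */
unsigned char bus_arbitrate(int cpu_wants, int dma_wants)
{
    if (cpu_wants) bus_request(BUS_CPU);
    if (dma_wants) bus_request(BUS_DMA);
    return bus_owner;
}
```

Because the CPU is asked first in each round, it gets priority on an idle BUS, while an owner that is already transferring is never interrupted by the other side.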

Edit: 8088 MPH now reports 1653 cycles(1%). It seems I'm getting closer now.
Edit: I've just modified the BIU to start transfers of any memory/BUS at T1, then finishing up and giving the result at T4. This should synchronize it with the DMA controller. Waitstates shift the T state, keeping it at T4 until all waitstates are finished.
Edit: 8088 MPH now reports 1639 cycles(2%). So by synchronizing the BIU and DMA controllers, it now decreases in accuracy, although it should be more correct?
Edit: I've just modified the DMA controller to be 100% cycle-accurate, now transferring using the SI and S0-S4 states (which is mostly the same emulation as before, but divided into separate cycle states):

DMA states:
SI: Sample DRQ lines. Set HRQ if DRQn=1.
S0: Sample HLDA. Resolve DRQn priorities.
S1: Present and latch upper address. Present lower address.
S2: Activate read command or advanced write command. Activate DACKn.
S3: Activate write command. Activate Mark and TC if appropriate.
S3: _Ready_ _Verify_: SW sample ready line keeps us in S3. Else, proceed into S4.
S4: Reset enable for channel n if TC stop and TC are active. Deactivate commands. Deactivate DACKn, Mark and TC. Sample DRQn and HLDA. Resolve DRQn priorities. Reset HRQ if HLDA=0 or DRQ=0 (go to SI), else go to S1.

The S3 _Ready_ _Verify_ state handling isn't implemented, though. Also, DACKn is activated and automatically deactivated (strobed) at S2, so it isn't deactivated at S4.
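
The state list can be turned into a small state machine along these lines. This is only a sketch with hypothetical names: it models a single channel, leaves out priorities, Mark, TC and the S3 ready wait, and treats HLDA as a plain input flag.

```c
#include <assert.h>

typedef enum { DMA_SI, DMA_S0, DMA_S1, DMA_S2, DMA_S3, DMA_S4 } DmaState;

/* Hypothetical single-channel DMA model. */
typedef struct {
    DmaState state;
    int drq;            /* request line of the modelled channel */
    int hrq;            /* hold request to the CPU */
    int hlda;           /* hold acknowledge from the CPU */
    unsigned transfers; /* completed transfers */
} Dma;

/* Advance the controller by one cycle. */
void dma_tick(Dma *d)
{
    assert(d != 0);
    switch (d->state)
    {
    case DMA_SI: /* sample DRQ, set HRQ when a request is pending */
        if (d->drq) { d->hrq = 1; d->state = DMA_S0; }
        break;
    case DMA_S0: /* sample HLDA, stay here until the CPU acknowledges */
        if (d->hlda) d->state = DMA_S1;
        break;
    case DMA_S1: d->state = DMA_S2; break; /* address phase */
    case DMA_S2: d->state = DMA_S3; break; /* read command, DACK */
    case DMA_S3: d->state = DMA_S4; break; /* write command (ready wait omitted) */
    case DMA_S4: /* finish the transfer, then continue or release */
        ++d->transfers;
        if (d->hlda && d->drq) d->state = DMA_S1;
        else { d->hrq = 0; d->state = DMA_SI; }
        break;
    }
}
```

With DRQ held high the machine loops S1-S4 once it has the BUS, so each further transfer costs 4 ticks; the SI and S0 ticks are only paid when a transfer starts from idle.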

8088 MPH now reports 1609 cycles(4%)? If the hardware becomes more accurate, why would the cycle count keep decreasing? Does this mean there's a problem with the cycles applied in the EU core(They're not taking enough cycles to execute)?

Edit: Of course, the demo crashes once it reaches the credits. Is this purely because the cycle-accurate emulation is too fast (1609 < ~1673 cycles)? But if it's too fast and only the memory timings are subtracted from the documented timings, does that mean that those 'memory cycles' that are supposed to be included in those timings are actually EU execution cycles themselves?
Edit: Disabling those memory cycles makes 8088 MPH respond with 1744 cycles. The average of the two is 1676.5, which should be in that range. So does that mean only half of the memory cycles are included?

Reply 68 of 198, by superfury

Posted on 2017-04-14, 11:57

I've just modified the BIU so that its check for ownership of the BUS (against the DMA controller) blocks the entire CPU bus cycle (keeping it stuck at T3) while the DMA is busy. This increases the 8088 MPH cycle count to 1638 (2%, although quickly tested on VGA graphics card emulation instead of CGA; CGA reports 1636 cycles). It's getting closer. Only 32 cycles off so far (assuming adding 2% is the actual amount of cycles to expect). Does anyone have any idea what these 32 cycles might be? Could it just be the (I)DIV instructions that aren't fully cycle accurate yet?

There's also the strange case of 8088 MPH crashing when reaching the credits, due to the self-modifying code in the credits messing up and actually overwriting the next instruction that's still to be loaded into the BIU for some reason (although the EU timings and fetching timings should protect against that?). Can any of you see what's wrong with the timing?

EU phase processing (all the steps that the EU uses during the execution phase; it also provides the EU timing itself for all instructions (cycles_OP is actually EU timing, which the BIU spends together with other specified timings (like direct memory I/O etc., which is still used for interrupts and 80286+ protection handling))): https://bitbucket.org/superfury/unipcemu/src/ … 086.c?at=master
BIU stepping(Steps 1 BIU&EU cycle at a time): https://bitbucket.org/superfury/unipcemu/src/ … biu.c?at=master

Reply 69 of 198, by superfury

Posted on 2017-04-14, 15:41

I've currently got the following timings implemented (not including the 1-cycle timings during fetching and the EA cycles):
* means that the cycles are reduced by 4 because of 1 read or write access.
** means that the cycles are reduced by 8 because of 1 read and 1 write access (writeback phase).

(*** is followed by the timing in your emulator where it differs, whether it's correct or not.)

Implemented UniPCemu EU execution timings:

Global helper handlers:
INT3: 52(TODO) ***12
INTX: 52(TODO) ***12
IRET: 24 ***11
CMP byte/word: 4 for accumulator(***3), 9* for modr/m addressing memory(***8), 3 otherwise. 10* for modr/m with immediate addressing memory (***5/6), 4 otherwise. 18(*x2) for memory-memory(***14).
INC/DEC word: 15** for memory(***7), 2 for reg.
INC/DEC byte: 15** for memory(***7), 3 for reg(***2).
AND/OR/XOR/ADD/ADC/SUB/SBB byte/word: 3 for reg-reg, 4 for accumulator, modr/m2: 9*-16**-3, modr/m3: 4-17**-3. (***always 3)
TEST byte: 3 for reg-reg(***2), 4 for accumulator, modr/m2: 9*(***4/6)-3, modr/m3: 5-11*-3. (***always 4)
MOV byte/word to/from register: 0 for reg-reg(unused?), 10* for accumulator from memory, modr/m2: 8*-2, modr/m3: 10*-2(***4), 4 for register-register(***2), 2 for segreg to/from register, 8*(***From: 1, To: 6) for segreg to/from memory
MOV byte/word to/from memory: custom(to/from) immediate address(opcodes A0-A3): 10*
MOV byte/word to/from memory(other): 0 for reg-reg(unused?), 10* for accumulator, modr/m2: 9*-2 ***6, modr/m3: 10*-4(***6), Register immediate-register(nonexistent?) 4, 2 for segreg to/from register***6, 9* for segreg to/from memory.
DAA/DAS/AAA/AAS: 4
CBW: 2
CWD: 5
MOVS byte/word: 9+17** for first in REP, 17** for second+ in REP, 18** for non-REP. (***always 2)
CMPS byte/word: 9+22(*x2) for first in REP, 22(*x2) for second+ in REP, 22(*x2) for non-REP. (***Always 14)
STOS byte/word: 9+10* for first in REP, 10* for second+ in REP, 11* for non-REP. (***Always 4(+10/REP???))
LODS byte/word: 9+13* for first in REP, 13* for second+ in REP, 12* for non-REP. (***Always 4(+10/REP???))
SCAS byte/word: 9+15* for first in REP, 15* for second+ in REP, 15* for non-REP. (*** Always 4(+10/REP???))
RET: 12* for immediate, 8* for non-immediate.
RETF: 17(*x2) for immediate, 18(*x2) for non-immediate.
INTO(TODO): 53 for taken, 4 for not-taken. (***This is always 12? It's supposed to be 4 when not taken!!!)
AAM: 83
AAD: 60
XLAT: 11*
XCHG byte/word: 0 for unknown, 3 for accumulator(***4), 17(**x2) for reg-mem(***4), 4 for modr/m reg-reg.
LXS(LDS, LES, also used for 16-bit LFS, LGS, LSS on newer CPUs): memory: 16(*x2), register(special case): 2. (***Always 1)

-------------------------------------------------
Normal instruction handlers(specific for all opcodes 00h-FFh):
PUSH Segreg: 10*
POP Segreg: 8* (***5)
PUSH reg: 11*
POP reg: 8* (***5)
70-7F conditional jumps: 16 if taken, 4 if not taken. (*** 4 if taken, 0 if not taken??? How does this compare to the manual???)
LEA: General mov timings + 2. (***Always 2)
CALL intersegment direct(9A): 28 (*** Always 5?)
PUSHF: 10*
POPF: 8* (***5)
SAHF: 4
LAHF: 4
MOV modr/m immediate byte/word(C6/C7): 10* for memory, 4 for reg.
SALC: 2 (***4)
LOOPNZ: 19(***14) for taken, 5(***1) for not taken.
LOOPZ: 18(***14) for taken, 6(***1) for not taken.
LOOP: 17 for taken, 5 for not taken.
JCXZ: 18 for taken(***10), 6 for not taken(***2).
IN AL/AX,imm8: 10*
OUT imm8,AL/AX: 10* (***3)
CALL(E8): 19* (***11)
JMP(E9): 15 (***8)
JMP(EA): 15 (***8)
JMP(EB): 15 (***8)
IN AL/AX,DX: 8*
OUT DX,AL/AX: 8*
HLT/CMC/CLC/STC/CLI/STI/CLD/STD: 2 (***HLT: 3)

Then the remaining GRP opcodes and 8F instruction:
8F: 17** for memory, 8* for register. (***Always 5)
Coprocessor opcodes(not connected): 8 for memory, 2 for register. (***1)

GRP2 byte/word:
- Reg/mem with 1 shift: 15** for memory, 2 for reg. (***2)
- Reg/mem with variable shift: 20**(***8)+(cnt*4) for memory, 8+(cnt*4) for reg.
- Reg/mem with immediate variable shift(80186+) is the same as second case.

- Note: (I)DIV and IMUL are inaccurate and unknown. Maximum timings taken from the documentation:
DIV byte: 96* for memory(***78), 90(***80) for reg.
IDIV byte: 118*(***107) for memory, 112(***101) for reg.

GRP3 instructions:
NOT byte/word: 16** for memory, 3 for reg.
NEG byte: 16** for memory, 3 for reg.
MUL byte: 76*(***70) for memory, 70 for reg. Add more bits set than 1 to cycles(2 bits adds 1, 3 bits adds 2 etc., ***not implemented this way).
IMUL byte: 86*(***80) for memory, 80 for reg.
DIV word: 168*(***134) for memory, 162(***144) for reg.
IDIV word: 190*(***171) for memory, 184(***165) for reg.
MUL word: 124*(***108) for memory, 118(***108) for reg. Add more bits set than 1 to cycles(see MUL byte, ***not implemented this way).
IMUL word: 128*(***128) for memory, 134 for reg(***128).

GRP5 instructions:
GRP5 /2: 21* for memory, 16 for reg. (***Always 8)
GRP5 /3: 37(*x2) for memory, 28 for reg. (***Always 5)
GRP5 /4: 18* for memory, 11 for reg. (***Always 8)
GRP5 /5: 24(*x2) for memory, 11 for reg. (***Always 8)
GRP5 /6: 16**(***4) for memory, 11* for reg.

All other timings (push/pop/memory) are 1 cycle for requesting the access and 1 cycle for every tick until the BIU has finished fetching it (on cycle T4). The operation itself starts at cycle T1. DMA transfers stop the BIU for at least 5 cycles (SI, the 1-cycle DREQ loop, transfers to the S0, S1, S2, S3 and S4 cycles). SI simply sees there's something to do and S0-S4 actually do something, with S0 taking the BUS and S4 releasing the BUS (when there's nothing more to transfer). Interrupt timings are still processed using the old method. The data/result from the BIU is received by the EU the cycle after it's finished (so on T1).

Edit: I've added the differences between your emulator's cycles and mine: your emulator's value follows the (*** marker, and my own value is to the left of it. 'Always' means that your emulator uses a fixed cycle count that doesn't depend on anything, while mine does make a difference there.

Last edited by superfury on 2017-04-17, 08:55. Edited 1 time in total.

Reply 70 of 198, by superfury

Posted on 2017-04-16, 20:15

Can you see if my emulation has troubles with those cases? Or does your emulator have bugs in (some of) those cases?

Reply 71 of 198, by vladstamate

Posted on 2017-04-16, 23:51

It just means I need to re-visit my timings. I did not have much time recently between my job, travel and kids 🙁

Reply 72 of 198, by superfury

Posted on 2017-04-22, 16:28

I've just modified the DMA to take 1 more cycle during the S0 state to 'wait' for HLDA after requesting it (which is essentially just a simple toggle switch that switches the virtual BUS between used by the CPU, used by the DMA and idle). So instead of executing the S0 state in 1 cycle (virtually executing a request and HLDA in 1 cycle), it now takes the BUS in the first cycle and waits one cycle before 'acknowledging' the HLDA (which has already happened when requesting and taking the BUS from the idling CPU in the previous cycle). This causes the detected 8088 MPH timing to increase to its 1% timings again, at 166X cycles. So it's very close now (I've based this on the information at the start of reenigne's notes at https://github.com/reenigne/reenigne/blob/mas … /8088/notes.txt ).

So this means I'm getting really close, according to that cycle count. Assuming it needs only about 1673 cycles to be fully accurate, I'm only about 10 cycles off now.

Edit: It's running at 1662 cycles now, which is slower again due to the extra DMA cycle taken when starting a transfer and requesting the BUS from the CPU (1 cycle becoming 2 cycles in total before advancing to the S1 state). The full S1-S2 states don't need to be emulated, though, as the address on the BUS isn't directly emulated. It wouldn't make sense to, since it's passed directly to the function when accessing the memory. Simply delaying long enough after it (during the following S* states) should be enough in this case. So only about 10 cycles are left until perfection, according to the cycle count metric. Probably either 8 or 12, seeing as the PIT ticks every 4 cycles.

Reply 73 of 198, by superfury

Posted on 2017-04-23, 11:42

I've tried to implement a universal (unsigned) divide routine according to reenigne's documentation at "Few questions about 8088MPH".

This is what I've gotten so far:

//Universal DIV instruction for x86 DIV instructions!
/*

Parameters:
	val: The value to divide
	divisor: The value to divide by
	result: Result container
	modulo: The modulo container
	error: 1 on error(DIV0), 0 when valid.
	resultbits: The amount of bits the result contains(16 or 8 on 8086).
	SHLcycle: The amount of cycles for each SHL.
	ADDSUBcycle: The amount of cycles for ADD&SUB instruction to execute.

*/
void CPU8086_internal_DIV(uint_32 val, word divisor, word *result, word *modulo, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle)
{
	uint_32 temp, temp2, currentresult; //Remaining value and current divisor!
	sbyte shift; //The shift to apply! No match on 0 shift is done!
	temp = val; //Load the value to divide!
	if (divisor==0) //Not able to divide?
	{
		*result = 0;
		*modulo = temp; //Unable to comply!
		*error = 1; //Divide by 0 error!
		CPU[activeCPU].cycles_OP += 1; //We're taking 1 cycle for this!
		return; //Abort: division by 0!
	}

	temp = val; //Load the remainder to use!
	*result = 0; //Default: nothing divided yet!
	nextstep:
	//First step: calculate shift so that (divisor<<shift)<=remainder and ((divisor<<(shift+1))>remainder)
	temp2 = divisor; //Load the default divisor for x1!
	if (temp2>temp) //Not enough left to divide(divisor larger than the remainder)? We're done!
	{
		goto gotresult; //We've gotten a result!
	}
	currentresult = 1; //We're starting with x1 factor!
	for (shift=0;shift<(resultbits+1);++shift) //Check for the biggest factor to apply(we're going from bit 0 to maxbit)!
	{
		if ((temp2<=temp) && ((temp2<<1)>temp)) //Found our value to divide?
		{
			CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 more SHL cycle for this!
			break; //We've found our shift!
		}
		temp2 <<= 1; //Shift to the next position!
		currentresult <<= 1; //Shift to the next result!
		CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 SHL cycle for this! Assuming parallel shifting!
	}
	if (shift==(resultbits+1)) //We've overflown? We're too large to divide!
	{
		*error = 1; //Raise divide by 0 error due to overflow!
		return; //Abort!
	}
	//Second step: subtract divisor<<n from remainder and increase result with 1<<n.
	temp -= temp2; //Subtract divisor<<n from remainder!
	*result += currentresult; //Increase result(divided value) with the found power of 2 (1<<n).
	CPU[activeCPU].cycles_OP += ADDSUBcycle; //We're taking 1 subtract and 1 addition cycle for this(ADD/SUB register take 3 cycles)!
	goto nextstep; //Start the next step!
	//Finished when remainder<divisor or remainder==0.
	gotresult: //We've gotten a result!
	*modulo = temp; //Give the modulo! The result is already calculated!
	*error = 0; //We're having a valid result!
}

It does assume that two timings are provided (the combined timing of the add and sub, and the timing of a single shift for the SHL subcode), and that the shifts that calculate the value to subtract from the remainder (divisor<<n) and the value to add to the result (1<<n) are executed in parallel (the 'Shift to the next position/result' block).

Is this somewhat correct? Should signed division be handled the same way (but instead of executing directly, saving the sign bit beforehand and giving the correct sign afterwards (a simple toggle) on both dividend and modulo, while limiting it to one bit less than the unsigned call)?
By sign bit I mean: simply save the signs of both EAX/AX and the divisor, convert them to positive numbers (NOT&INC) when negative (calculating the result and modulus signs by XORing the signs of the divisor and dividend), execute the IDIV using the normal DIV with one bit less (15/7 or 31/15 bits instead of 16/8 and 32/16 bits (divisor/dividend as well as result/modulo)), then restore the sign using NOT&INC when needed?

Edit: A simple sign addition to support IDIV instructions as well using the same algorithm:

void CPU8086_internal_IDIV(uint_32 val, word divisor, word *result, word *modulo, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle)
{
	byte resultnegative, remaindernegative; //To toggle the result and apply sign after and before?
	resultnegative = remaindernegative = 0; //Default: don't toggle the result nor the remainder!
	//Note: the sign bits tested below assume a 32-bit value to divide and a 16-bit divisor!
	if (((val>>31)!=(divisor>>15))) //Are we to change signs on the result? The result is negative instead! (We're a +/- or -/+ division)
	{
		resultnegative = 1; //We're to toggle the result sign if not zero!
	}
	if (val&0x80000000) //Negative value to divide?
	{
		val = ((~val)+1); //Convert the negative value to be positive!
		remaindernegative = 1; //We're to toggle the remainder if any, because the value to divide is negative!
	}
	if (divisor&0x8000) //Negative divisor? Convert to a positive divisor!
	{
		divisor = ((~divisor)+1); //Convert the divisor to be positive!
	}
	CPU8086_internal_DIV(val,divisor,result,modulo,error,resultbits-1,SHLcycle,ADDSUBcycle); //Execute the division as an unsigned division!
	if (*error==0) //No error has occurred? Do post-processing of the results!
	{
		if (resultnegative) //The result is negative?
		{
			*result = (~*result)+1; //Apply the new sign to the result!
		}
		if (remaindernegative) //The remainder is negative?
		{
			*modulo = (~*modulo)+1; //Apply the new sign to the remainder!
		}
	}
}

It essentially precalculates the resulting signs, converts to positive values, executes a normal unsigned division and finally negates the results where needed to give the correct results.

Is this behaviour correct?
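
One way to check the results (not the timings) is to restate the same shift-and-subtract idea as standalone C and compare it against C's own / and % operators. The sketch below is not the UniPCemu code: it uses stdint types instead of the emulator's typedefs, the function names are made up, and the cycle accounting is simplified to one assumed cost per shift and one per subtract/add pair.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned shift-and-subtract division. Returns 0 on success and 1 on a
   divide error (divisor 0, or quotient not fitting in resultbits bits).
   *cycles accumulates the assumed per-shift and per-add/sub costs. */
int div_shift_sub(uint32_t val, uint16_t divisor, unsigned resultbits,
                  uint32_t *quotient, uint32_t *remainder,
                  unsigned shl_cost, unsigned addsub_cost, unsigned *cycles)
{
    uint32_t q = 0, rem = val;
    assert(resultbits == 7 || resultbits == 8 || resultbits == 15 || resultbits == 16);
    if (divisor == 0) return 1;
    while (rem >= divisor)
    {
        uint64_t d = divisor; /* 64-bit so the shifted divisor cannot wrap */
        uint32_t bit = 1;
        while ((d << 1) <= rem) /* find the largest n with (divisor<<n) <= rem */
        {
            d <<= 1;
            bit <<= 1;
            *cycles += shl_cost;
        }
        rem -= (uint32_t)d; /* subtract divisor<<n from the remainder */
        q += bit;           /* add 1<<n to the quotient */
        *cycles += addsub_cost;
    }
    if (q >> resultbits) return 1; /* quotient overflow */
    *quotient = q;
    *remainder = rem;
    return 0;
}

/* Signed wrapper: divide the magnitudes with one bit less, then restore signs.
   The remainder takes the sign of the dividend, the quotient the XOR of both. */
int idiv_shift_sub(int32_t val, int16_t divisor, unsigned resultbits,
                   int32_t *quotient, int32_t *remainder,
                   unsigned shl_cost, unsigned addsub_cost, unsigned *cycles)
{
    uint32_t uq, ur;
    uint32_t uval = (val < 0) ? (0u - (uint32_t)val) : (uint32_t)val;
    uint16_t udiv = (uint16_t)((divisor < 0) ? -(int32_t)divisor : divisor);
    if (div_shift_sub(uval, udiv, resultbits - 1, &uq, &ur,
                      shl_cost, addsub_cost, cycles)) return 1;
    *quotient = ((val < 0) != (divisor < 0)) ? -(int32_t)uq : (int32_t)uq;
    *remainder = (val < 0) ? -(int32_t)ur : (int32_t)ur;
    return 0;
}
```

Since C99 division truncates toward zero and % takes the sign of the dividend, comparing against val/divisor and val%divisor for a range of inputs exercises the same sign rules as the IDIV wrapper above.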

Reply 74 of 198, by vladstamate

Posted on 2017-04-23, 13:24

I like your algorithm. If it is ok with you, I will use it in CAPE.

Reply 75 of 198, by superfury

Posted on 2017-04-23, 13:52

I'm still testing it, though. So far (testing up to booting MS-DOS, where it crashes for some unknown reason with this new algorithm), normal unsigned division seems to be working without problems (it gives the correct results for both 16-bit and 8-bit results). Maybe there's still a problem with the signed division. I'm currently looking into it.

Edit: Looking at the current code, MS-DOS is now booting again after a few small calling and handling bugfixes:

//Universal DIV instruction for x86 DIV instructions!
/*

Parameters:
	val: The value to divide
	divisor: The value to divide by
	result: Result container
	modulo: The modulo container
	error: 1 on error(DIV0), 0 when valid.
	resultbits: The amount of bits the result contains(16 or 8 on 8086).
	SHLcycle: The amount of cycles for each SHL.
	ADDSUBcycle: The amount of cycles for ADD&SUB instruction to execute.

*/
void CPU8086_internal_DIV(uint_32 val, word divisor, word *result, word *modulo, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle, byte *applycycles)
{
	uint_32 temp, temp2, currentresult; //Remaining value and current divisor!
	byte shift; //The shift to apply! No match on 0 shift is done!
	temp = val; //Load the value to divide!
	*applycycles = 1; //Default: apply the cycles normally!
	if (divisor==0) //Not able to divide?
	{
		*result = 0;
		*modulo = temp; //Unable to comply!
		*error = 1; //Divide by 0 error!
		return; //Abort: division by 0!
	}

	if (CPU_apply286cycles()) /* No 80286+ cycles instead? */
	{
		SHLcycle = ADDSUBcycle = 0; //Don't apply the cycle counts for this instruction!
		*applycycles = 0; //Don't apply the cycles anymore!
	}

	temp = val; //Load the remainder to use!
	*result = 0; //Default: we have nothing after division!
	nextstep:
	//First step: calculate shift so that (divisor<<shift)<=remainder and ((divisor<<(shift+1))>remainder)
	temp2 = divisor; //Load the default divisor for x1!
	if (temp2>temp) //Not enough to divide? We're done!
	{
		goto gotresult; //We've gotten a result!
	}
	currentresult = 1; //We're starting with x1 factor!
	for (shift=0;shift<resultbits;++shift) //Check for the biggest factor to apply(we're going from bit 0 to maxbit)!
	{
		if ((temp2<=temp) && ((temp2<<1)>temp)) //Found our value to divide?
		{
			CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 more SHL cycle for this!
			break; //We've found our shift!
		}
		temp2 <<= 1; //Shift to the next position!
		currentresult <<= 1; //Shift to the next result!
		CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 SHL cycle for this! Assuming parallel shifting!
	}
	if (shift==resultbits) //We've overflown? The result doesn't fit in resultbits bits!
	{
		*error = 1; //Raise divide by 0 error due to overflow!
		return; //Abort!
	}
	//Second step: subtract divisor<<n from remainder and increase result with 1<<n.
	temp -= temp2; //Subtract divisor<<n from remainder!
	*result += currentresult; //Increase result(divided value) with the found power of 2 (1<<n).
	CPU[activeCPU].cycles_OP += ADDSUBcycle; //We're taking 1 subtract and 1 addition cycle for this(ADD/SUB register take 3 cycles)!
	goto nextstep; //Start the next step!
	//Finished when remainder<divisor or remainder==0.
	gotresult: //We've gotten a result!
	if (temp>((1<<resultbits)-1)) //Modulo overflow?
	{
		*error = 1; //Raise divide by 0 error due to overflow!
		return; //Abort!
	}
	*modulo = temp; //Give the modulo! The result is already calculated!
	*error = 0; //We're having a valid result!
}

void CPU8086_internal_IDIV(uint_32 val, word divisor, word *result, word *modulo, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle, byte *applycycles)
{
	byte resultnegative, remaindernegative; //To toggle the result and apply sign after and before?
	resultnegative = remaindernegative = 0; //Default: don't toggle the result nor the remainder!
	if (((val>>31)!=(divisor>>15))) //Are we to change signs on the result? The result is negative instead! (We're a +/- or -/+ division)
	{
		resultnegative = 1; //We're to toggle the result sign if not zero!
	}
	if (val&0x80000000) //Negative value to divide?
	{
		val = ((~val)+1); //Convert the negative value to be positive!
		remaindernegative = 1; //We're to toggle the remainder if any, because the value to divide is negative!
	}
	if (divisor&0x8000) //Negative divisor? Convert to a positive divisor!
	{
		divisor = ((~divisor)+1); //Convert the divisor to be positive!
	}
	CPU8086_internal_DIV(val,divisor,result,modulo,error,resultbits-1,SHLcycle,ADDSUBcycle,applycycles); //Execute the division as an unsigned division!
	if (*error==0) //No error has occurred? Do post-processing of the results!
	{
		if (resultnegative) //The result is negative?
		{
			*result = (~*result)+1; //Apply the new sign to the result!
		}
		if (remaindernegative) //The remainder is negative?
		{
			*modulo = (~*modulo)+1; //Apply the new sign to the remainder!
		}
	}
}

These seem to be working without problems now (after modifying the instructions calling them, so that the 8-bit and 16-bit variants use correct sign extension etc.). 😀
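The sign handling in the IDIV wrapper follows the usual truncating-division rule: the quotient is negative when the operand signs differ, and the remainder takes the sign of the value being divided. A minimal standalone sketch of just that rule is below. The name `signed_div_sketch` and the use of plain `stdint.h` types are assumptions made for illustration; it is not UniPCemu code, and it leaves out the overflow checks and the cycle counting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone helper (not UniPCemu code): signed division done
   as a division on magnitudes plus sign fix-ups. No overflow checks. */
static void signed_div_sketch(int32_t val, int16_t divisor,
                              int32_t *quotient, int32_t *remainder)
{
	assert(divisor != 0); /* division by zero is assumed to be handled elsewhere */

	/* Work on magnitudes in 64 bits so negating INT32_MIN cannot overflow. */
	int64_t mval = (val < 0) ? -(int64_t)val : (int64_t)val;
	int64_t mdiv = (divisor < 0) ? -(int64_t)divisor : (int64_t)divisor;

	int64_t q = mval / mdiv; /* divide the magnitudes */
	int64_t r = mval % mdiv;

	if ((val < 0) != (divisor < 0)) q = -q; /* signs differ: negative quotient */
	if (val < 0) r = -r;                    /* remainder follows the dividend */

	*quotient = (int32_t)q;
	*remainder = (int32_t)r;
}
```

For example, -7 / 2 gives quotient -3 and remainder -1, while 7 / -2 gives quotient -3 and remainder 1.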

I've currently set SHLcycle to 2 and ADDSUBcycle to 3 in my testing. I think ADDSUBcycle would need to be 6 instead (seeing as each step does both an add and a sub, which take 3 cycles each). SHLcycle stays at 2 if the shifts of the subtraction value and the result are done in parallel (both shifting left during the same cycles). Otherwise it would need to be 4 instead. Do you guys know if the shift left of the two values is done in parallel (taking 2 cycles) or one after the other (taking 4 cycles)?

Edit: Also, it's fine to use it. If you find out the exact timings for it, could you post them in this thread?
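To see how much the choice of SHLcycle and ADDSUBcycle matters, the cycle accounting of the posted routine can be replayed on its own. The sketch below is an assumption-laden estimator, not UniPCemu code: `div_cycle_estimate` is a hypothetical name, it only mirrors how the posted loop charges one SHL per doubling plus one for the matching compare and one ADD/SUB per subtract step, and it assumes a nonzero divisor and a quotient that fits. It says nothing about what real 8088 hardware does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical estimator (not UniPCemu code): counts the cycles that the
   shift-and-subtract loop would charge for one unsigned division. */
static unsigned long div_cycle_estimate(uint32_t val, uint16_t divisor,
                                        unsigned shl_cycles, unsigned addsub_cycles)
{
	unsigned long cycles = 0;
	uint32_t remainder = val;

	assert(divisor != 0); /* division by zero is assumed to be handled elsewhere */

	while (remainder >= divisor) {
		uint64_t shifted = divisor; /* 64-bit so the doubling below cannot wrap */

		/* Double the divisor while the doubled value still fits. */
		while ((shifted << 1) <= remainder) {
			shifted <<= 1;
			cycles += shl_cycles;
		}
		cycles += shl_cycles;           /* the compare that finds the match */
		remainder -= (uint32_t)shifted; /* subtract divisor<<n */
		cycles += addsub_cycles;        /* one subtract step (plus result update) */
	}
	return cycles;
}
```

Dividing 100 by 7 takes three subtract steps and nine charged shifts under this accounting, so the estimate is 27 cycles with SHLcycle=2 and ADDSUBcycle=3, 36 cycles with 2 and 6, and 54 cycles with 4 and 6.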

Reply 76 of 198, by superfury

Posted on 2017-04-24, 07:26

superfury Offline

Rank: l33t++
Posts: 5466
Joined: 2014-03-08, 11:25
Location: Netherlands

I've improved the names of the variables a bit, making it a bit clearer what the code is doing:

//Universal DIV instruction for x86 DIV instructions!
/*

Parameters:
	val: The value to divide
	divisor: The value to divide by
	quotient: Quotient result container
	remainder: Remainder result container
	error: 1 on error(DIV0), 0 when valid.
	resultbits: The amount of bits the result contains(16 or 8 on 8086) of quotient and remainder.
	SHLcycle: The amount of cycles for each SHL.
	ADDSUBcycle: The amount of cycles for ADD&SUB instruction to execute.

*/
void CPU8086_internal_DIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle, byte *applycycles)
{
	uint_32 temp, temp2, currentquotient; //Remaining value and current divisor!
	byte shift; //The shift to apply! No match on 0 shift is done!
	temp = val; //Load the value to divide!
	*applycycles = 1; //Default: apply the cycles normally!
	if (divisor==0) //Not able to divide?
	{
		*quotient = 0;
		*remainder = temp; //Unable to comply!
		*error = 1; //Divide by 0 error!
		return; //Abort: division by 0!
	}

	if (CPU_apply286cycles()) /* No 80286+ cycles instead? */
	{
		SHLcycle = ADDSUBcycle = 0; //Don't apply the cycle counts for this instruction!
		*applycycles = 0; //Don't apply the cycles anymore!
	}

	temp = val; //Load the remainder to use!
	*quotient = 0; //Default: we have nothing after division!
	nextstep:
	//First step: calculate shift so that (divisor<<shift)<=remainder and ((divisor<<(shift+1))>remainder)
	temp2 = divisor; //Load the default divisor for x1!
	if (temp2>temp) //Not enough to divide? We're done!
	{
		goto gotresult; //We've gotten a result!
	}
	currentquotient = 1; //We're starting with x1 factor!
	for (shift=0;shift<resultbits;++shift) //Check for the biggest factor to apply(we're going from bit 0 to maxbit)!
	{
		if ((temp2<=temp) && ((temp2<<1)>temp)) //Found our value to divide?
		{
			CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 more SHL cycle for this!
			break; //We've found our shift!
		}
		temp2 <<= 1; //Shift to the next position!
		currentquotient <<= 1; //Shift to the next result!
		CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 SHL cycle for this! Assuming parallel shifting!
	}
	if (shift==resultbits) //We've overflown? The quotient doesn't fit in resultbits bits!
	{
		*error = 1; //Raise divide by 0 error due to overflow!
		return; //Abort!
	}
	//Second step: subtract divisor<<n from remainder and increase result with 1<<n.
	temp -= temp2; //Subtract divisor<<n from remainder!
	*quotient += currentquotient; //Increase result(divided value) with the found power of 2 (1<<n).
	CPU[activeCPU].cycles_OP += ADDSUBcycle; //We're taking 1 subtract and 1 addition cycle for this(ADD/SUB register take 3 cycles)!
	goto nextstep; //Start the next step!
	//Finished when remainder<divisor or remainder==0.
	gotresult: //We've gotten a result!
	if (temp>((1<<resultbits)-1)) //Modulo overflow?
	{
		*error = 1; //Raise divide by 0 error due to overflow!
		return; //Abort!
	}
	*remainder = temp; //Give the modulo! The result is already calculated!
	*error = 0; //We're having a valid result!
}

void CPU8086_internal_IDIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle, byte *applycycles)
{
	byte quotientnegative, remaindernegative; //To toggle the result and apply sign after and before?
	quotientnegative = remaindernegative = 0; //Default: don't toggle the result nor the remainder!
	if (((val>>31)!=(divisor>>15))) //Are we to change signs on the result? The result is negative instead! (We're a +/- or -/+ division)
	{
		quotientnegative = 1; //We're to toggle the result sign if not zero!
	}
	if (val&0x80000000) //Negative value to divide?
	{
		val = ((~val)+1); //Convert the negative value to be positive!
		remaindernegative = 1; //We're to toggle the remainder if any, because the value to divide is negative!
	}
	if (divisor&0x8000) //Negative divisor? Convert to a positive divisor!
	{
		divisor = ((~divisor)+1); //Convert the divisor to be positive!
	}
	CPU8086_internal_DIV(val,divisor,quotient,remainder,error,resultbits-1,SHLcycle,ADDSUBcycle,applycycles); //Execute the division as an unsigned division!
	if (*error==0) //No error has occurred? Do post-processing of the results!
	{
		if (quotientnegative) //The result is negative?
		{
			*quotient = (~*quotient)+1; //Apply the new sign to the result!
		}
		if (remaindernegative) //The remainder is negative?
		{
			*remainder = (~*remainder)+1; //Apply the new sign to the remainder!
		}
	}
}

It's working without problems now (it's currently given SHLcycle=2 and ADDSUBcycle=6 for its timings), with 8088 MPH reporting 1631 cycles (3% off).

Reply 77 of 198, by superfury

Posted on 2017-04-24, 10:12

superfury Offline

Rank: l33t++
Posts: 5466
Joined: 2014-03-08, 11:25
Location: Netherlands

Does 8088 MPH crash at the credits with CAPE as well? I've just tried CAPE 0.6 with an MS-DOS 3.3 boot disk and the 8088 MPH final floppy in the second drive (Turbo XT BIOS), and it didn't run very well at all. There were various problems throughout the demo: scrolling (the Start register effect) failing, the 1K color screen hanging the demo, and almost all screens messing up royally (is it emulating a VGA-style card even though instructed to do CGA instead?). About the only things that somewhat worked before the crash were the part requiring SALC, the calibration screen (although the exact rendering failed too, displaying more like vertical stripes instead of gradients) and the initial scrollover (with the car). That's far more off than I would've thought. Or is there a lot of work behind the scenes that's yet unreleased?

Reply 78 of 198, by vladstamate

Posted on 2017-04-24, 12:10

vladstamate Offline

Rank: Oldbie
Posts: 967
Joined: 2015-08-23, 01:43

No, there is no VGA emulation in CAPE. However, I am working on a rewrite of all MDA/Herc/CGA/EGA parts using a 6845 emulator. Most of the current issues in CAPE with 8088 MPH are related to CGA inaccuracies, which this rewrite should solve.

Currently I have 4 individual emulations for each card; however, in the new scheme I have a 6845 emulator that each card can use. It's still early, but I am hoping to iron out the CGA issues in 8088 MPH with this.

Reply 79 of 198, by Scali

Posted on 2017-04-24, 12:38

Scali Offline

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

vladstamate wrote:
Currently I have 4 individual emulations for each card however in the new scheme I have a 6845 emulator that each card can use.

I wouldn't use that for EGA/VGA and newer.
Only MDA, CGA and Hercules use a real 6845.
EGA and VGA use their own CRTC, which is not entirely compatible with 6845. Some registers work slightly differently, or have entirely different functions.
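One way to picture the difference is to compare what the first few CRTC register indices mean on each chip. The sketch below is only an illustration with hypothetical names (`crtc_variant`, `crtc_register_name`, `crtc_name_differs`); it is not taken from CAPE or UniPCemu. The register names are written down as they are commonly given for the MC6845 and for the EGA CRTC, so they should be checked against the datasheet and the IBM technical reference before relying on them, and a matching name does not guarantee identical programming.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical illustration: names of CRTC registers 0..8 per chip. */
enum crtc_variant { CRTC_MC6845 = 0, CRTC_EGA = 1 };

#define CRTC_SKETCH_REGS 9

static const char *const crtc_names[2][CRTC_SKETCH_REGS] = {
	{ /* MC6845, as used on MDA, CGA and Hercules */
		"Horizontal Total", "Horizontal Displayed", "Horizontal Sync Position",
		"Sync Width", "Vertical Total", "Vertical Total Adjust",
		"Vertical Displayed", "Vertical Sync Position", "Interlace Mode"
	},
	{ /* EGA CRTC */
		"Horizontal Total", "Horizontal Display End", "Start Horizontal Blanking",
		"End Horizontal Blanking", "Start Horizontal Retrace", "End Horizontal Retrace",
		"Vertical Total", "Overflow", "Preset Row Scan"
	}
};

/* Look up the name of a register index for one chip variant. */
static const char *crtc_register_name(enum crtc_variant variant, unsigned index)
{
	assert(index < CRTC_SKETCH_REGS);
	return crtc_names[variant][index];
}

/* Returns nonzero when the two chips name the same index differently. */
static int crtc_name_differs(unsigned index)
{
	return strcmp(crtc_register_name(CRTC_MC6845, index),
	              crtc_register_name(CRTC_EGA, index)) != 0;
}
```

In this table, Vertical Total sits at index 4 on the 6845 but at index 6 on the EGA, which is the kind of mismatch a shared 6845 core would have to work around with a per-card register decode.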

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/
