I was thinking, if a 8086 divides by shifting and substracting, does it multiply by shifting and adding?
Result starts at zero.
So, for each bit in the multiplicand(and bit position):
- If not set, one cycle, check next bit.
- Otherwise(set multiplicant bit), shift number to multiply left by bit number(using a temporary register) and add the result to the final result. Process next multiplicant bit.
- Finish when finished bit 0 of the multiplicant(taking at least n cycles in total(n being the amount of bits in the multiplicant, e.g. 8 or 16 on a 8086).
When having finished bit 0, break the loop.
Since an ADD takes 2 cycles afaik(like SUB), it would take 16 cycles for a 8-bit multiplicand, being zero and 32 cycles for a 16-bit operand. Maybe less using an early-out algorithm? E.g. add a CMP between the bits comparing to 0 for an early-out algorithm?
Edit: Optimize it by shifting left the base and shifting right the multiplicand, adding the pre-shift base to the result if bit 0 of the pre-shift multiplicand is set until the multiplicand is zero(all set bits are processed and shifted out).
Last edited by superfury on 2018-10-04, 08:30. Edited 1 time in total.
Yes, that's essentially the multiplication algorithm as implemented in the 8088/8086 microcode. These chips do not implement the early-out and pre-shift optimisations that you describe.
I have actually now implemented cycle-exact emulation of all the multiplication and division algorithms for the 8088/8086 in my own emulator. The code for this can be found at https://github.com/reenigne/reenigne/blob/mas … ts/gentests.cpp lines 4741-4904. I also gave a talk last month in which I described some of my work in cycle-exact emulation (and, in particular, how hard it was to get the division algorithm right). There is a recording of this talk on YouTube which can be found at https://www.youtube.com/watch?v=BJR3rWZ-FYg . I'm afraid I've been neglecting my retrocomputing hobbies for the past few months but I hope to get back into it soon and finally get my emulator cycle-exact in all circumstances and make it actually usable for running real programs. In the meantime, the code as it is so far is available for use in other emulators.
Looking at said code for DIV looks a bit like my implementation of it, although my algorithm uses 32/64-bit variables instead(see unipcemu's cpu/opcodes_8086.c, function CPU8086_internal_DIV). I have a simple wrapper around that one for IDIV(which negates the divisor÷nt, as well as the results in much the same way(using 32-bit integers) as well as some extra post-processing overflow logic(using masks).
The (I)MUL method seem to do the same in your case, I'll do that too(easy to adjust IDIV to IMUL, just adding to write a MUL handler for it). Also, UniPCemu uses simple SUB,ADD,SHL timings instead of your method atm.
IDIV:
- Negating the divident takes 4 cycles(NEG). None yet in UniPCemu.
- Not negating the divisor takes 1 cycle(Not in UniPCemu).
- Nine cycles on top of those with IDIV?
Main operation (I)DIV:
- 3 for non-AAM opcode.
- 8 to mask?
- 1 cycle for non-AAM interrupt(always in UniPCemu)?
- 1 cycle for non-AAM again.
- 2 cycles for all divisions.
- 8 cycles for each bit from high to bit 0.
- 2 for final bit for highest divisor finding?
- 1 for each substraction.
Then for IDIV:
- 4 cycles.
- 1 during overflow.
- 7 for negating back.
Is that correct?
Last edited by superfury on 2018-10-04, 12:00. Edited 1 time in total.
I've taken a look at the (I)MUL as well, which seems more simple:
- Non-AAD byte takes 8 cycles.
- Negating takes 1 cycle(IMUL).
- Something with highest multiplier bit takes 1 cycle?
- Only negative source takes four cycles more?
- 10 cycles for GRP opcodes.
- 3 cycles for non-AAD opcodes.
Then the main multiplication loop, for each multiplier bit:
- 7 cycles for the whole shift/addition.
- 1 cycle for each set bit.
Finally, for negated inputs, 9 cycles more.
Is that correct too?
Last edited by superfury on 2018-10-04, 12:02. Edited 1 time in total.
superfury wrote:I've been looking at your code for (I)DIV: […] Show full quote
I've been looking at your code for (I)DIV:
IDIV:
- Negating the divident takes 4 cycles(NEG). None yet in UniPCemu.
- Not negating the divisor takes 1 cycle(Not in UniPCemu).
Right.
superfury wrote:
- Nine cycles on top of those with IDIV?
On top of those of DIV I think you mean.
superfury wrote:
Main operation (I)DIV:
- 3 for non-D4 opcode.
Yes, the D4 code path probably starts partway through the non-D4 code path in the microcode.
superfury wrote:
- 8 to mask?
I don't know exactly what those 8 are for - there's probably a few things which happen there (and masking itself probably doesn't take any cycles on the CPU - operand size seems to be implicit in most parts of the microcode). It's a catch-all for all the always-happens cycles before the overflow check.
superfury wrote:
- 1 cycle for non-D4 interrupt(always in UniPCemu)?
- 1 cycle for non-D4 again.
There may be a better way to handle that. If I find one I'll clean it up at some point.
superfury wrote:
- 2 cycles for all divisions.
As with the 8 cycle wait earlier, this is always-happens code that is after the overflow check.
superfury wrote:
- 8 cycles for each bit from high to bit 0.
- 2 for final bit for highest divisor finding?
- 1 for each substraction.
That's effectively it. The source of all those waits is much clearer when you understand the microcode program!
superfury wrote:Then for IDIV:
- 4 cycles.
- 1 during overflow.
- 7 for negating back. […] Show full quote
Then for IDIV:
- 4 cycles.
- 1 during overflow.
- 7 for negating back.
Is that correct?
That 1 cycle only seems to happen for a register operand. I'm not sure why that is yet.
superfury wrote:Perhaps those 2 cycles: […] Show full quote
Perhaps those 2 cycles:
- 2 for final bit for highest divisor finding?
- 1 for each substraction.
Maybe it's a result of:
- 1 for each substraction(SUB for the new remainder).
- 2 for a conditional jump taken, 0 otherwise(jump not taken).
Could that be correct?
It's a bit more complicated than that. The other 8 cycles are spread throughout the loop in the microcode. There are 6 possible paths through the code (3 which end up going on to the following iteration, 3 which terminate the loop). Here's my analysis of the division microcode from the 4363091 patent:
1CORD: (0x122). Division described 31.47-32.65 2 3CR S D Type a b F 4------------------------------------- 50 1 SUBT tmpa tmpa-tmpb 61 SIGMA no dest 4 MAXC none F set counter to 7 or 15. Set flags based on result of tmpa-tmpb (0-divisor) 72 5 NCY 7 if (no carry) goto division-by-zero 83 1 LRCY tmpc A left rotation with carry is then executed on TMPC. The result is that the set carry bit is left rotated into TMPC which initially held the dividend. 94 SIGMA tmpc 1 LRCY tmpa Similarly, the set carry bit is rotated into temporary register A which was previously zeroed 105 SIGMA tmpa 1 SUBT tmpa tmpa-tmpb 116 0 CY 13 The carry bit is tested, but will be zero on the first pass, since a zero bit will have been shifted into the carry bit from the zeroed contents of temporary register A. 127 SIGMA no dest F set flags based on tmpa-tmpb result 138 0 NCY 14 If the divisor TMPB is greater than the contents of TMPA, which has now been rotated, the carry bit will be set thus the conditional jump NCY and microcode instruction 8 will be passed through. 149 0 NCZ 3 short jump back to microcode instruction 3 together with the decrement of the internal counter within ALU 134. 1510 1 LRCY tmpc perform an additional left rotation with the carry on the contents of TMPC 1611 SIGMA tmpc 1712 SIGMA no dest 4 none RTN 1813 4 RCY none reset carry bit 1914 SIGMA tmpa 0 NCZ 3 The difference between the divisor and dividend will be put into TMPA. Return to 3 2015 0 UNC 10 Unconditional jump to 10
It might not make much sense without reading the patent several times...
The inner loop of the div() function in my emulator is my attempt at rendering the above in C++ while keeping the same timings, avoiding excess wait() calls and not using gotos.
First, it sets the carry flag. The remainder is loaded as l and h.
1 Word r = (l << 1) + (carry ? 1 : 0); 2 carry = topBit(l); 3 l = r; 4 r = (h << 1) + (carry ? 1 : 0); 5 carry = topBit(h); 6 h = r; 7 wait(8);
That's just a 32-bit RCL on the remainder?
1 if (carry) { 2 carry = false; 3 h -= _source; 4 if (b == bitCount - 1) 5 wait(2); 6 } 7 else { 8 carry = _source > h; 9 if (!carry) { 10 h -= _source; 11 wait(1); 12 if (b == bitCount - 1) 13 wait(2); 14 } 15 }
That one's a bit more difficult. Looking at it, it looks like some logic for the SUB part of long division, at least row 4801. Other than that, I'm a bit lost there. Of course I didn't know you could do it this way as well. I do know the basic SHL/SUB method, though.
1 l = ~((l << 1) + (carry ? 1 : 0));
Finally, after all 8/16-bits, a 16-bit RCL on the remainder for the resulting value?
Ah, I see you've posted that patent thing. I'll look at it later.
superfury wrote:Looking at your code: […] Show full quote
Looking at your code:
First, it sets the carry flag. The remainder is loaded as l and h.
1 Word r = (l << 1) + (carry ? 1 : 0); 2 carry = topBit(l); 3 l = r; 4 r = (h << 1) + (carry ? 1 : 0); 5 carry = topBit(h); 6 h = r; 7 wait(8);
That's just a 32-bit RCL on the remainder?
Or 16-bit, if the operand is 8-bit. As the remainder is rotated left, the bits that are rotated into the low bit of l are the inverted bits of the quotient, so the same rotate operates on both the intermediate remainders and intermediate quotients. This also saves having to have a fourth temporary register - it's extremely clever!
superfury wrote:That one's a bit more difficult. Looking at it, it looks like some logic for the SUB part of long division, at least row 4801. O […] Show full quote
1 if (carry) { 2 carry = false; 3 h -= _source; 4 if (b == bitCount - 1) 5 wait(2); 6 } 7 else { 8 carry = _source > h; 9 if (!carry) { 10 h -= _source; 11 wait(1); 12 if (b == bitCount - 1) 13 wait(2); 14 } 15 }
That one's a bit more difficult. Looking at it, it looks like some logic for the SUB part of long division, at least row 4801. Other than that, I'm a bit lost there. Of course I didn't know you could do it this way as well. I do know the basic SHL/SUB method, though.
It's a shift-and-subtract method like the one I outlined some time ago (long division usually refers to the algorithm for implementing multi-digit division in terms of single-digit division and is much more complicated). The optimizations made to minimize the size of the microcode and the amount of temporary register space needed obfuscate it a bit.
superfury wrote:2
Finally, after all 8/16-bits, a 16-bit RCL on the remainder for the resulting value? […] Show full quote
1 l = ~((l << 1) + (carry ? 1 : 0));
2
Finally, after all 8/16-bits, a 16-bit RCL on the remainder for the resulting value?
Yes (or 8-bit, for an 8-bit operand). This is step 10 in the microcode. The final value is also inverted, though this isn't shown in the microcode.
I've implemented the (I)DIV instruction based on your code, but according to the 80386 test suite it fails executing properly:
1//Universal DIV instruction for x86 DIV instructions! 2/* 3 4Parameters: 5 val: The value to divide 6 divisor: The value to divide by 7 quotient: Quotient result container 8 remainder: Remainder result container 9 error: 1 on error(DIV0), 0 when valid. 10 resultbits: The amount of bits the result contains(16 or 8 on 8086) of quotient and remainder. 11 applycycles: Apply the normal cycles? Start out as 1! 12 issigned: Signed division? 13 quotientnegative: Quotient is signed negative result? 14 remaindernegative: Remainder is signed negative result? 15 isAdjust: AAM/AAD=1, otherwise 0. 16 isSigned: 1 for IMUL, otherwise 0. 17 18*/ 19void CPU8086_internal_DIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte *applycycles, byte issigned, byte quotientnegative, byte remaindernegative, byte isAdjust, byte isSigned, byte isRegister) 20{ 21 if (*applycycles) 22 { 23 if (CPU_apply286cycles()) /* No 80286+ cycles instead? */ 24 { 25 *applycycles = 0; //Don't apply the cycles anymore! 26 } 27 } 28 29 if ((isAdjust == 0) && *applycycles) CPU[activeCPU].cycles_OP += 3; //3 cycles for non-AA*! 30 31 word l, h; //h=remainder, l=quotient! 32 word r; 33 byte carry; 34 byte c; 35 if (resultbits == 16) 36 { 37 h = ((val >> 16) & 0xFFFF); //High word or byte! 38 l = (val & 0xFFFF); //Low word or byte! 39 } 40 else //8-bit? 41 { 42 h = ((val >> 8) & 0xFF); //High word or byte! 43 l = (val & 0xFF); //Low word or byte! 44 } 45 if (*applycycles) CPU[activeCPU].cycles_OP += 8; //8 cycles! 46 if (h >= divisor) //Overflow? 47 { 48 if ((isAdjust == 0) && *applycycles) ++CPU[activeCPU].cycles_OP; //1 cycle! 49 *error = 1; //Divide by 0 error! 50 return; //Abort: overflow! 51 } 52 if (isAdjust == 0) ++CPU[activeCPU].cycles_OP; //1 cycle! 53 54 CPU[activeCPU].cycles_OP += 2; //2 cycles to prepare CLC! 55 carry = 0; //No carry! 56 for (c = 0; c < resultbits; ++c) 57 { 58 //Perform 32-bit or 16-bit RCL on both quotient and remainder! 59 r = (l << 1) | carry; //RCL! 60 carry = (l>>(resultbits - 1))&1; //Carry-out!
…Show last 85 lines
61 l = r; 62 r = (h << 1) | carry; //RCL! 63 carry = (h >> (resultbits - 1)) & 1; //Carry-out! 64 h = r; 65 66 if (*applycycles) CPU[activeCPU].cycles_OP = 8; //Takes 8 cycles! 67 68 if (carry) 69 { 70 carry = 0; //No carry anymore! 71 h -= divisor; //Substract! 72 if ((c == (resultbits - 1)) && *applycycles) CPU[activeCPU].cycles_OP += 2; //Takes 2 cycles on the final turn! 73 } 74 else //No carry? 75 { 76 carry = (divisor > h)?1:0; //Overflow? 77 if (carry == 0) 78 { 79 h -= divisor; //Substract! 80 if (*applycycles) ++CPU[activeCPU].cycles_OP; //Takes 1 cycle! 81 if ((c == (resultbits - 1)) && *applycycles) CPU[activeCPU].cycles_OP += 2; //Takes 2 cycles on the final bit! 82 } 83 } 84 } 85 l = ~((l << 1) | carry); //Final RCL and negate for the result! 86 87 if ((isAdjust == 0) && isSigned) //IDIV? 88 { 89 if (*applycycles) CPU[activeCPU].cycles_OP += 4; //Takes 4 cycles! 90 if (l&(1 << (resultbits-1))) //Overflow? 91 { 92 if (isRegister) ++CPU[activeCPU].cycles_OP; //Register takes 1 cycle? 93 *error = 1; //Throw division by 0! 94 return; 95 } 96 97 } 98 *remainder = h; //Remainder! 99 *quotient = l; //Quotient! 100} 101 102void CPU8086_internal_IDIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte *applycycles, byte isAdjust, byte isRegister) 103{ 104 byte quotientnegative, remaindernegative; //To toggle the result and apply sign after and before? 105 106 *applycycles = 1; //Default: apply cycles! 107 if (CPU_apply286cycles()) /* No 80286+ cycles instead? */ 108 { 109 *applycycles = 0; //Don't apply the cycles anymore! 110 } 111 112 quotientnegative = remaindernegative = 0; //Default: don't toggle the result not remainder! 113 if (((val>>31)!=(divisor>>15))) //Are we to change signs on the result? The result is negative instead! (We're a +/- or -/+ division) 114 { 115 quotientnegative = 1; //We're to toggle the result sign if not zero! 116 } 117 if (val&0x80000000) //Negative value to divide? 118 { 119 val = ((~val)+1); //Convert the negative value to be positive! 120 remaindernegative = 1; //We're to toggle the remainder is any, because the value to divide is negative! 121 if (*applycycles) CPU[activeCPU].cycles_OP += 4; //Takes 4 cycles! 122 } 123 if (divisor&0x8000) //Negative divisor? Convert to a positive divisor! 124 { 125 divisor = ((~divisor)+1); //Convert the divisor to be positive! 126 } 127 else 128 { 129 if (*applycycles) CPU[activeCPU].cycles_OP += 1; //Takes 1 cycles! 130 } 131 if (*applycycles) CPU[activeCPU].cycles_OP += 9; //Takes 9 cycles! 132 CPU8086_internal_DIV(val,divisor,quotient,remainder,error,resultbits,applycycles,1,quotientnegative,remaindernegative,isAdjust,1,isRegister); //Execute the division as an unsigned division! 133 if (*error==0) //No error has occurred? Do post-processing of the results! 134 { 135 if (*applycycles) CPU[activeCPU].cycles_OP += 7; //Takes 7 cycles! 136 if (quotientnegative) //The result is negative? 137 { 138 *quotient = (~*quotient)+1; //Apply the new sign to the result! 139 } 140 if (remaindernegative) //The remainder is negative? 141 { 142 *remainder = (~*remainder)+1; //Apply the new sign to the remainder! 143 } 144 } 145}
Edit: Whoops, error in AAM executing using 6-bit result instead of 8-bit result.
Edit: The 80386 testsuite by Barotto still fails on this.
Edit: After some fixes, it's running now... Waiting for results...
Edit: -128/-128 gives a divide error. The same with -0x8001/-0x8001 in the testsuite. Both because the MSB is set. That's row 4813 in your code.
Looking at your signed division overflow check, I see that negative max(-128 and -32768) count as overflow(due to bit 7/15 being set in the quotient), but they shouldn't(only with positive results(non-negated outputs)? Since +32768(16-bit) and +128(8-bit) and upwards are illegal indeed, but for negative results it's one further(-128 and -32768 are valid, anything more negative(positive at the point of checking) is invalid). Since the values are negated to perform positive division, the exception of -128 and -32768 should be taken into account.
I've also found out that somehow higher bits than bit7 are set during the main loop of the div instruction. Might be a CPU/compiler issue. I managed to fix that by moving the masking to 8/16-bits before checking for signed over/underflow.
Now the testsuite checks out correctly without errors. Also added the missing AAD multiplication, which of course runs correctly from the start.
Looking at your signed division overflow check, I see that negative max(-128 and -32768) count as overflow(due to bit 7/15 being set in the quotient), but they shouldn't(only with positive results(non-negated outputs)? Since +32768(16-bit) and +128(8-bit) and upwards are illegal indeed, but for negative results it's one further(-128 and -32768 are valid, anything more negative(positive at the point of checking) is invalid). Since the values are negated to perform positive division, the exception of -128 and -32768 should be taken into account.
I'll double-check my code against the real hardware in that case - it's possible that my testing missed some edge cases.
superfury wrote:
Btw, have you already gotten all cycle timing accurate for all instructions, or is it still a WIP?
Cycle timing is now accurate for every combination of instruction, prefix, data, initial bus state, queue state, refresh frequency and refresh phase that I've thrown at it so far (more than 9 million cases). I need to rework it so that it doesn't need to store all testcases in memory, then I'll be able to do a proper soak test which will be running for several weeks if it doesn't find any more bugs. Cycle timing may still be inaccurate in the face of hardware interrupts and/or multiple prefixes - I need to add those to the testsuite as well before I can truly call it a cycle-exact emulation of the 8088.
Tried the current emulation against various software. Now:
- Turbo XT BIOS doesn't run anymore(8088)? No video. Odd. 😖
Edit: CGA text mode confirmed.
Edit: BIOS POST confirmed.
On 80386(IPS cycle mode):
- Jazz Jackrabbit still crashes(unhandled interrupt fault B). 😒
- Doom finally runs, although extremly slow! 😁
- Checkit system board tests succeed! 😁
Running 8088@4.77MHz at 6% realtime speed on Android Samsung Galaxy S7.