8086 multiplication algorithm? \ VOGONS

8086 multiplication algorithm?

Topic actions

First post, by superfury

Posted on 2018-10-04, 08:13

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

I was thinking, if a 8086 divides by shifting and substracting, does it multiply by shifting and adding?

Result starts at zero.

So, for each bit in the multiplicand(and bit position):
- If not set, one cycle, check next bit.
- Otherwise(set multiplicant bit), shift number to multiply left by bit number(using a temporary register) and add the result to the final result. Process next multiplicant bit.
- Finish when finished bit 0 of the multiplicant(taking at least n cycles in total(n being the amount of bits in the multiplicant, e.g. 8 or 16 on a 8086).

When having finished bit 0, break the loop.

Since an ADD takes 2 cycles afaik(like SUB), it would take 16 cycles for a 8-bit multiplicand, being zero and 32 cycles for a 16-bit operand. Maybe less using an early-out algorithm? E.g. add a CMP between the bits comparing to 0 for an early-out algorithm?

Edit: Optimize it by shifting left the base and shifting right the multiplicand, adding the pre-shift base to the result if bit 0 of the pre-shift multiplicand is set until the multiplicand is zero(all set bits are processed and shifted out).

Last edited by superfury on 2018-10-04, 08:30. Edited 1 time in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 31, by Scali

Posted on 2018-10-04, 08:29

Scali Offline

Rank l33t

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Yes, probably something like that.
I've done some articles on implementing multiply and division in software, for 6502/Z80 CPUs, which do not have this in hardware.
The algorithms will probably be very similar:
https://scalibq.wordpress.com/2013/03/20/just … t-real-part-10/
https://scalibq.wordpress.com/2013/03/24/just … real-part-10-1/

For multiply, I also explain the quarter-square method. Funny enough this is faster than the built-in multiply on 8088 in some cases, see also reenigne's blog:
https://www.reenigne.org/blog/multiplying-fas … r-with-squares/

Newer CPUs will have a small table in ROM to speed up multiplies, using a method similar to this.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 2 of 31, by reenigne

Posted on 2018-10-04, 10:25

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 673
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

Yes, that's essentially the multiplication algorithm as implemented in the 8088/8086 microcode. These chips do not implement the early-out and pre-shift optimisations that you describe.

I have actually now implemented cycle-exact emulation of all the multiplication and division algorithms for the 8088/8086 in my own emulator. The code for this can be found at https://github.com/reenigne/reenigne/blob/mas … ts/gentests.cpp lines 4741-4904. I also gave a talk last month in which I described some of my work in cycle-exact emulation (and, in particular, how hard it was to get the division algorithm right). There is a recording of this talk on YouTube which can be found at https://www.youtube.com/watch?v=BJR3rWZ-FYg . I'm afraid I've been neglecting my retrocomputing hobbies for the past few months but I hope to get back into it soon and finally get my emulator cycle-exact in all circumstances and make it actually usable for running real programs. In the meantime, the code as it is so far is available for use in other emulators.

Reply 3 of 31, by superfury

Posted on 2018-10-04, 11:01

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Looking at said code for DIV looks a bit like my implementation of it, although my algorithm uses 32/64-bit variables instead(see unipcemu's cpu/opcodes_8086.c, function CPU8086_internal_DIV). I have a simple wrapper around that one for IDIV(which negates the divisor&divident, as well as the results in much the same way(using 32-bit integers) as well as some extra post-processing overflow logic(using masks).

The (I)MUL method seem to do the same in your case, I'll do that too(easy to adjust IDIV to IMUL, just adding to write a MUL handler for it). Also, UniPCemu uses simple SUB,ADD,SHL timings instead of your method atm.

https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c line 2179.

SHL taking 2 cycles and ADD&SUB taking 6 cycles combined.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 4 of 31, by superfury

Posted on 2018-10-04, 11:36

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

I've been looking at your code for (I)DIV:

IDIV:
- Negating the divident takes 4 cycles(NEG). None yet in UniPCemu.
- Not negating the divisor takes 1 cycle(Not in UniPCemu).
- Nine cycles on top of those with IDIV?

Main operation (I)DIV:
- 3 for non-AAM opcode.
- 8 to mask?
- 1 cycle for non-AAM interrupt(always in UniPCemu)?
- 1 cycle for non-AAM again.
- 2 cycles for all divisions.
- 8 cycles for each bit from high to bit 0.
- 2 for final bit for highest divisor finding?
- 1 for each substraction.

Then for IDIV:
- 4 cycles.
- 1 during overflow.
- 7 for negating back.

Is that correct?

Last edited by superfury on 2018-10-04, 12:00. Edited 1 time in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 5 of 31, by superfury

Posted on 2018-10-04, 11:58

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

I've taken a look at the (I)MUL as well, which seems more simple:
- Non-AAD byte takes 8 cycles.
- Negating takes 1 cycle(IMUL).
- Something with highest multiplier bit takes 1 cycle?
- Only negative source takes four cycles more?
- 10 cycles for GRP opcodes.
- 3 cycles for non-AAD opcodes.

Then the main multiplication loop, for each multiplier bit:
- 7 cycles for the whole shift/addition.
- 1 cycle for each set bit.

Finally, for negated inputs, 9 cycles more.

Is that correct too?

Last edited by superfury on 2018-10-04, 12:02. Edited 1 time in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 6 of 31, by reenigne

Posted on 2018-10-04, 12:01

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 673
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
I've been looking at your code for (I)DIV: […]
Show full quote
I've been looking at your code for (I)DIV:

IDIV:
- Negating the divident takes 4 cycles(NEG). None yet in UniPCemu.
- Not negating the divisor takes 1 cycle(Not in UniPCemu).

Right.

superfury wrote:
- Nine cycles on top of those with IDIV?

On top of those of DIV I think you mean.

superfury wrote:
Main operation (I)DIV:
- 3 for non-D4 opcode.

Yes, the D4 code path probably starts partway through the non-D4 code path in the microcode.

superfury wrote:
- 8 to mask?

I don't know exactly what those 8 are for - there's probably a few things which happen there (and masking itself probably doesn't take any cycles on the CPU - operand size seems to be implicit in most parts of the microcode). It's a catch-all for all the always-happens cycles before the overflow check.

superfury wrote:
- 1 cycle for non-D4 interrupt(always in UniPCemu)?
- 1 cycle for non-D4 again.

There may be a better way to handle that. If I find one I'll clean it up at some point.

superfury wrote:
- 2 cycles for all divisions.

As with the 8 cycle wait earlier, this is always-happens code that is after the overflow check.

superfury wrote:
- 8 cycles for each bit from high to bit 0.
- 2 for final bit for highest divisor finding?
- 1 for each substraction.

That's effectively it. The source of all those waits is much clearer when you understand the microcode program!

superfury wrote:
Then for IDIV: - 4 cycles. - 1 during overflow. - 7 for negating back. […]
Show full quote
Then for IDIV:
- 4 cycles.
- 1 during overflow.
- 7 for negating back.

Is that correct?

That 1 cycle only seems to happen for a register operand. I'm not sure why that is yet.

Reply 7 of 31, by reenigne

Posted on 2018-10-04, 12:10

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 673
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
I've taken a look at the MUL, which seems more simple:
- Non-AAD byte takes 8 cycles.

Only for byte-sized operands, for some reason.

superfury wrote:
- Negating takes 1 cycle(IMUL).

Right.

superfury wrote:
- Something with highest multiplier bit takes 1 cycle?

Hmm... that condition doesn't seem like it will ever be true. I will check that one, thanks.

superfury wrote:
- Only negative source takes four cycles more?

Right.

superfury wrote:
- 10 cycles for GRP opcodes.

For IMUL.

superfury wrote:
- 3 cycles for non-AAD opcodes.

Right (again probably AAD jumps into the routine somewhere other than the top).

superfury wrote:
Then the main multiplication loop, for each multiplier bit:
- 7 cycles for the whole shift/addition.
- 1 cycle for each set bit.

Right.

superfury wrote:
Finally, for negated inputs, 9 cycles more.

Is that correct too?

Yes (note that "negate" only ever gets set to true for IMUL).

Reply 8 of 31, by superfury

Posted on 2018-10-04, 13:43

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Perhaps those 2 cycles:

- 2 for final bit for highest divisor finding?
- 1 for each substraction.

Maybe it's a result of:
- 1 for each substraction(SUB for the new remainder).
- 2 for a conditional jump taken, 0 otherwise(jump not taken).

Could that be correct?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 9 of 31, by reenigne

Posted on 2018-10-04, 14:51

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 673
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
Perhaps those 2 cycles: […]
Show full quote
Perhaps those 2 cycles:

- 2 for final bit for highest divisor finding?
- 1 for each substraction.

Maybe it's a result of:
- 1 for each substraction(SUB for the new remainder).
- 2 for a conditional jump taken, 0 otherwise(jump not taken).

Could that be correct?

It's a bit more complicated than that. The other 8 cycles are spread throughout the loop in the microcode. There are 6 possible paths through the code (3 which end up going on to the following iteration, 3 which terminate the loop). Here's my analysis of the division microcode from the 4363091 patent:

1CORD: (0x122). Division described 31.47-32.65
2
3CR  S      D      Type  a     b     F
4-------------------------------------
50                 1     SUBT  tmpa         tmpa-tmpb
61   SIGMA no dest 4     MAXC  none  F      set counter to 7 or 15. Set flags based on result of tmpa-tmpb (0-divisor)
72                 5     NCY   7            if (no carry) goto division-by-zero
83                 1     LRCY  tmpc         A left rotation with carry is then executed on TMPC. The result is that the set carry bit is left rotated into TMPC which initially held the dividend.
94   SIGMA  tmpc   1     LRCY  tmpa         Similarly, the set carry bit is rotated into temporary register A which was previously zeroed
105   SIGMA  tmpa   1     SUBT  tmpa         tmpa-tmpb
116                 0     CY    13           The carry bit is tested, but will be zero on the first pass, since a zero bit will have been shifted into the carry bit from the zeroed contents of temporary register A.
127   SIGMA no dest                   F      set flags based on tmpa-tmpb result
138                 0     NCY   14           If the divisor TMPB is greater than the contents of TMPA, which has now been rotated, the carry bit will be set thus the conditional jump NCY and microcode instruction 8 will be passed through.
149                 0     NCZ   3            short jump back to microcode instruction 3 together with the decrement of the internal counter within ALU 134.
1510                1     LRCY  tmpc         perform an additional left rotation with the carry on the contents of TMPC
1611  SIGMA  tmpc
1712  SIGMA no dest 4     none  RTN
1813                4     RCY   none         reset carry bit
1914  SIGMA  tmpa   0     NCZ   3            The difference between the divisor and dividend will be put into TMPA. Return to 3
2015                0     UNC   10           Unconditional jump to 10

It might not make much sense without reading the patent several times...

The inner loop of the div() function in my emulator is my attempt at rendering the above in C++ while keeping the same timings, avoiding excess wait() calls and not using gotos.

Reply 10 of 31, by superfury

Posted on 2018-10-04, 14:58

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Looking at your code:

First, it sets the carry flag. The remainder is loaded as l and h.

1            Word r = (l << 1) + (carry ? 1 : 0);
2            carry = topBit(l);
3            l = r;
4            r = (h << 1) + (carry ? 1 : 0);
5            carry = topBit(h);
6            h = r;
7            wait(8);

That's just a 32-bit RCL on the remainder?

1            if (carry) {
2                carry = false;
3                h -= _source;
4                if (b == bitCount - 1)
5                    wait(2);
6            }
7            else {
8                carry = _source > h;
9                if (!carry) {
10                    h -= _source;
11                    wait(1);
12                    if (b == bitCount - 1)
13                        wait(2);
14                }
15            }

That one's a bit more difficult. Looking at it, it looks like some logic for the SUB part of long division, at least row 4801. Other than that, I'm a bit lost there. Of course I didn't know you could do it this way as well. I do know the basic SHL/SUB method, though.

1        l = ~((l << 1) + (carry ? 1 : 0));

Finally, after all 8/16-bits, a 16-bit RCL on the remainder for the resulting value?

Ah, I see you've posted that patent thing. I'll look at it later.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 11 of 31, by reenigne

Posted on 2018-10-04, 15:43

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 673
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
Looking at your code: […]
Show full quote
Looking at your code:

First, it sets the carry flag. The remainder is loaded as l and h.
1            Word r = (l << 1) + (carry ? 1 : 0);
2            carry = topBit(l);
3            l = r;
4            r = (h << 1) + (carry ? 1 : 0);
5            carry = topBit(h);
6            h = r;
7            wait(8);
That's just a 32-bit RCL on the remainder?

Or 16-bit, if the operand is 8-bit. As the remainder is rotated left, the bits that are rotated into the low bit of l are the inverted bits of the quotient, so the same rotate operates on both the intermediate remainders and intermediate quotients. This also saves having to have a fourth temporary register - it's extremely clever!

superfury wrote:
That one's a bit more difficult. Looking at it, it looks like some logic for the SUB part of long division, at least row 4801. O […]
Show full quote
1            if (carry) {
2                carry = false;
3                h -= _source;
4                if (b == bitCount - 1)
5                    wait(2);
6            }
7            else {
8                carry = _source > h;
9                if (!carry) {
10                    h -= _source;
11                    wait(1);
12                    if (b == bitCount - 1)
13                        wait(2);
14                }
15            }
That one's a bit more difficult. Looking at it, it looks like some logic for the SUB part of long division, at least row 4801. Other than that, I'm a bit lost there. Of course I didn't know you could do it this way as well. I do know the basic SHL/SUB method, though.

It's a shift-and-subtract method like the one I outlined some time ago (long division usually refers to the algorithm for implementing multi-digit division in terms of single-digit division and is much more complicated). The optimizations made to minimize the size of the microcode and the amount of temporary register space needed obfuscate it a bit.

superfury wrote:
2 Finally, after all 8/16-bits, a 16-bit RCL on the remainder for the resulting value? […]
Show full quote
1        l = ~((l << 1) + (carry ? 1 : 0));
2
Finally, after all 8/16-bits, a 16-bit RCL on the remainder for the resulting value?

Yes (or 8-bit, for an 8-bit operand). This is step 10 in the microcode. The final value is also inverted, though this isn't shown in the microcode.

Reply 12 of 31, by superfury

Posted on 2018-10-04, 16:59

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

I've implemented the (I)DIV instruction based on your code, but according to the 80386 test suite it fails executing properly:

1//Universal DIV instruction for x86 DIV instructions!
2/*
3
4Parameters:
5	val: The value to divide
6	divisor: The value to divide by
7	quotient: Quotient result container
8	remainder: Remainder result container
9	error: 1 on error(DIV0), 0 when valid.
10	resultbits: The amount of bits the result contains(16 or 8 on 8086) of quotient and remainder.
11	applycycles: Apply the normal cycles? Start out as 1!
12	issigned: Signed division?
13	quotientnegative: Quotient is signed negative result?
14	remaindernegative: Remainder is signed negative result?
15	isAdjust: AAM/AAD=1, otherwise 0.
16	isSigned: 1 for IMUL, otherwise 0.
17
18*/
19void CPU8086_internal_DIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte *applycycles, byte issigned, byte quotientnegative, byte remaindernegative, byte isAdjust, byte isSigned, byte isRegister)
20{
21	if (*applycycles)
22	{
23		if (CPU_apply286cycles()) /* No 80286+ cycles instead? */
24		{
25			*applycycles = 0; //Don't apply the cycles anymore!
26		}
27	}
28
29	if ((isAdjust == 0) && *applycycles) CPU[activeCPU].cycles_OP += 3; //3 cycles for non-AA*!
30
31	word l, h; //h=remainder, l=quotient!
32	word r;
33	byte carry;
34	byte c;
35	if (resultbits == 16)
36	{
37		h = ((val >> 16) & 0xFFFF); //High word or byte!
38		l = (val & 0xFFFF); //Low word or byte!
39	}
40	else //8-bit?
41	{
42		h = ((val >> 8) & 0xFF); //High word or byte!
43		l = (val & 0xFF); //Low word or byte!
44	}
45	if (*applycycles) CPU[activeCPU].cycles_OP += 8; //8 cycles!
46	if (h >= divisor) //Overflow?
47	{
48		if ((isAdjust == 0) && *applycycles) ++CPU[activeCPU].cycles_OP; //1 cycle!
49		*error = 1; //Divide by 0 error!
50		return; //Abort: overflow!
51	}
52	if (isAdjust == 0) ++CPU[activeCPU].cycles_OP; //1 cycle!
53
54	CPU[activeCPU].cycles_OP += 2; //2 cycles to prepare CLC!
55	carry = 0; //No carry!
56	for (c = 0; c < resultbits; ++c)
57	{
58		//Perform 32-bit or 16-bit RCL on both quotient and remainder!
59		r = (l << 1) | carry; //RCL!
60		carry = (l>>(resultbits - 1))&1; //Carry-out!

…Show last 85 lines

61		l = r;
62		r = (h << 1) | carry; //RCL!
63		carry = (h >> (resultbits - 1)) & 1; //Carry-out!
64		h = r;
65
66		if (*applycycles) CPU[activeCPU].cycles_OP = 8; //Takes 8 cycles!
67
68		if (carry)
69		{
70			carry = 0; //No carry anymore!
71			h -= divisor; //Substract!
72			if ((c == (resultbits - 1)) && *applycycles) CPU[activeCPU].cycles_OP += 2; //Takes 2 cycles on the final turn!
73		}
74		else //No carry?
75		{
76			carry = (divisor > h)?1:0; //Overflow?
77			if (carry == 0)
78			{
79				h -= divisor; //Substract!
80				if (*applycycles) ++CPU[activeCPU].cycles_OP; //Takes 1 cycle!
81				if ((c == (resultbits - 1)) && *applycycles) CPU[activeCPU].cycles_OP += 2; //Takes 2 cycles on the final bit!
82			}
83		}
84	}
85	l = ~((l << 1) | carry); //Final RCL and negate for the result!
86
87	if ((isAdjust == 0) && isSigned) //IDIV?
88	{
89		if (*applycycles) CPU[activeCPU].cycles_OP += 4; //Takes 4 cycles!
90		if (l&(1 << (resultbits-1))) //Overflow?
91		{
92			if (isRegister) ++CPU[activeCPU].cycles_OP; //Register takes 1 cycle?
93			*error = 1; //Throw division by 0!
94			return;
95		}
96
97	}
98	*remainder = h; //Remainder!
99	*quotient = l; //Quotient!
100}
101
102void CPU8086_internal_IDIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte *applycycles, byte isAdjust, byte isRegister)
103{
104	byte quotientnegative, remaindernegative; //To toggle the result and apply sign after and before?
105
106	*applycycles = 1; //Default: apply cycles!
107	if (CPU_apply286cycles()) /* No 80286+ cycles instead? */
108	{
109		*applycycles = 0; //Don't apply the cycles anymore!
110	}
111
112	quotientnegative = remaindernegative = 0; //Default: don't toggle the result not remainder!
113	if (((val>>31)!=(divisor>>15))) //Are we to change signs on the result? The result is negative instead! (We're a +/- or -/+ division)
114	{
115		quotientnegative = 1; //We're to toggle the result sign if not zero!
116	}
117	if (val&0x80000000) //Negative value to divide?
118	{
119		val = ((~val)+1); //Convert the negative value to be positive!
120		remaindernegative = 1; //We're to toggle the remainder is any, because the value to divide is negative!
121		if (*applycycles) CPU[activeCPU].cycles_OP += 4; //Takes 4 cycles!
122	}
123	if (divisor&0x8000) //Negative divisor? Convert to a positive divisor!
124	{
125		divisor = ((~divisor)+1); //Convert the divisor to be positive!
126	}
127	else
128	{
129		if (*applycycles) CPU[activeCPU].cycles_OP += 1; //Takes 1 cycles!
130	}
131	if (*applycycles) CPU[activeCPU].cycles_OP += 9; //Takes 9 cycles!
132	CPU8086_internal_DIV(val,divisor,quotient,remainder,error,resultbits,applycycles,1,quotientnegative,remaindernegative,isAdjust,1,isRegister); //Execute the division as an unsigned division!
133	if (*error==0) //No error has occurred? Do post-processing of the results!
134	{
135		if (*applycycles) CPU[activeCPU].cycles_OP += 7; //Takes 7 cycles!
136		if (quotientnegative) //The result is negative?
137		{
138			*quotient = (~*quotient)+1; //Apply the new sign to the result!
139		}
140		if (remaindernegative) //The remainder is negative?
141		{
142			*remainder = (~*remainder)+1; //Apply the new sign to the remainder!
143		}
144	}
145}

Edit: Whoops, error in AAM executing using 6-bit result instead of 8-bit result.
Edit: The 80386 testsuite by Barotto still fails on this.
Edit: After some fixes, it's running now... Waiting for results...

Edit: -128/-128 gives a divide error. The same with -0x8001/-0x8001 in the testsuite. Both because the MSB is set. That's row 4813 in your code.

1if (topBit(l)) {

Shouldn't that be:

1if (topBit(l) && (!negative)) {

Running the 80386 testsuite gives me:

1--- Z:\MinGW64\msys\home\Tim\test386.asm\test386-EE-reference.txt	2018-08-23 00:14:56.223617200 +0200
2+++ porte9.log	2018-10-04 20:45:26.140770700 +0200
3@@ -33115,7 +33115,7 @@
4 F6 IDIV B EAX=00000080 EDX=00000080 PS=0000 EAX=000000FF EDX=00000080 PS=0000 
5 F6 IDIV B EAX=00000080 EDX=00000081 PS=0000 EAX=000001FF EDX=00000081 PS=0000 
6 F6 IDIV B EAX=00000080 EDX=000000FE PS=0000 EAX=000000C0 EDX=000000FE PS=0000 
7-F6 IDIV B EAX=00000080 EDX=000000FF PS=0000 EAX=00000080 EDX=000000FF PS=0000 
8+F6 IDIV B EAX=00000080 EDX=000000FF PS=0000 #DE EAX=00000080 EDX=000000FF PS=0000 
9 F6 IDIV B EAX=00000081 EDX=00000000 PS=0000 #DE EAX=00000081 EDX=00000000 PS=0000 
10 F6 IDIV B EAX=00000081 EDX=00000001 PS=0000 #DE EAX=00000081 EDX=00000001 PS=0000 
11 F6 IDIV B EAX=00000081 EDX=00000002 PS=0000 EAX=00000140 EDX=00000002 PS=0000 
12@@ -35687,7 +35687,7 @@
13 F7 IDIV W EAX=00008001 EDX=00000000 PS=0000 EAX=0000FFFF EDX=00000002 PS=0000 
14 F7 IDIV W EAX=00008001 EDX=00000001 PS=0000 EAX=0000FFFD EDX=00000004 PS=0000 
15 F7 IDIV W EAX=00008001 EDX=00000002 PS=0000 EAX=0000FFFB EDX=00000006 PS=0000 
16-F7 IDIV W EAX=00008001 EDX=00003FFF PS=0000 EAX=00008000 EDX=00000001 PS=0000 
17+F7 IDIV W EAX=00008001 EDX=00003FFF PS=0000 #DE EAX=00008001 EDX=00003FFF PS=0000 
18 F7 IDIV W EAX=00008001 EDX=00004000 PS=0000 #DE EAX=00008001 EDX=00004000 PS=0000 
19 F7 IDIV W EAX=00008001 EDX=00004001 PS=0000 #DE EAX=00008001 EDX=00004001 PS=0000 
20 F7 IDIV W EAX=00008001 EDX=00007FFE PS=0000 #DE EAX=00008001 EDX=00007FFE PS=0000

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 13 of 31, by superfury

Posted on 2018-10-05, 10:22

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Can you see what's going wrong with my implementation of your algorithm? It's the DIV followed by the IDIV variant of it(which negates etc.)

https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c line 2180.

Edit: Whoops, initial carry must be set. Still doesn't fix divide 0x80 and 0x8001 logs, though.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 14 of 31, by superfury

Posted on 2018-10-05, 17:45

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

After a bit of work I at least got the (I)MUL working on another branch of UniPCemu, now merged into the main branch 😁

All that's left is the cycle-accurate (I)DIV branch.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 15 of 31, by superfury

Posted on 2018-10-05, 22:47

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Looking at your signed division overflow check, I see that negative max(-128 and -32768) count as overflow(due to bit 7/15 being set in the quotient), but they shouldn't(only with positive results(non-negated outputs)? Since +32768(16-bit) and +128(8-bit) and upwards are illegal indeed, but for negative results it's one further(-128 and -32768 are valid, anything more negative(positive at the point of checking) is invalid). Since the values are negated to perform positive division, the exception of -128 and -32768 should be taken into account.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 16 of 31, by superfury

Posted on 2018-10-06, 01:42

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

I've also found out that somehow higher bits than bit7 are set during the main loop of the div instruction. Might be a CPU/compiler issue. I managed to fix that by moving the masking to 8/16-bits before checking for signed over/underflow.

Now the testsuite checks out correctly without errors. Also added the missing AAD multiplication, which of course runs correctly from the start.

https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c

The starting of the (I)MUL/(I)DIV is at the same location as mentioned before(although might be a few lines lower due to documentation changes).

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 17 of 31, by superfury

Posted on 2018-10-06, 07:51

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Btw, have you already gotten all cycle timing accurate for all instructions, or is it still a WIP?

Edit: Whoops, forgot to disable the old timings.
https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c

Your algorithms are at line 2183 onwards(first (I)DIV, then (I)MUL. Confirmed working on the test386.asm testsuite.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 18 of 31, by reenigne

Posted on 2018-10-06, 09:34

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 673
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

superfury wrote:
Looking at your signed division overflow check, I see that negative max(-128 and -32768) count as overflow(due to bit 7/15 being set in the quotient), but they shouldn't(only with positive results(non-negated outputs)? Since +32768(16-bit) and +128(8-bit) and upwards are illegal indeed, but for negative results it's one further(-128 and -32768 are valid, anything more negative(positive at the point of checking) is invalid). Since the values are negated to perform positive division, the exception of -128 and -32768 should be taken into account.

I'll double-check my code against the real hardware in that case - it's possible that my testing missed some edge cases.

superfury wrote:
Btw, have you already gotten all cycle timing accurate for all instructions, or is it still a WIP?

Cycle timing is now accurate for every combination of instruction, prefix, data, initial bus state, queue state, refresh frequency and refresh phase that I've thrown at it so far (more than 9 million cases). I need to rework it so that it doesn't need to store all testcases in memory, then I'll be able to do a proper soak test which will be running for several weeks if it doesn't find any more bugs. Cycle timing may still be inaccurate in the face of hardware interrupts and/or multiple prefixes - I need to add those to the testsuite as well before I can truly call it a cycle-exact emulation of the 8088.

Reply 19 of 31, by superfury

Posted on 2018-10-07, 15:17

superfury Offline

Rank l33t++

Rank: l33t++
Posts: 6032
Joined: 2014-03-08, 11:25
Location: Netherlands

Tried the current emulation against various software. Now:
- Turbo XT BIOS doesn't run anymore(8088)? No video. Odd. 😖
Edit: CGA text mode confirmed.
Edit: BIOS POST confirmed.

On 80386(IPS cycle mode):
- Jazz Jackrabbit still crashes(unhandled interrupt fault B). 😒
- Doom finally runs, although extremly slow! 😁
- Checkit system board tests succeed! 😁

Running 8088@4.77MHz at 6% realtime speed on Android Samsung Galaxy S7.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Go to top of page Go to top of page

Back to PC Emulation