VOGONS


8086 multiplication algorithm?


Reply 20 of 31, by superfury

Rank l33t++

8088MPH now reports 1562 metric cycles with the new (I)MUL/(I)DIV cycle counts implemented.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 21 of 31, by superfury

Rank l33t++

Based on your code, I've arrived at the following cycle counts for the different ALU operations:

switch (flags) //What type of operation?
{
case 0: //Reg+Reg?
    CPU[activeCPU].cycles_OP += 3; //Reg->Reg!
    break;
case 1: //Reg+imm?
    CPU[activeCPU].cycles_OP += 1; //Accumulator!
    break;
case 2: //Determined by ModR/M?
    if (params.EA_cycles) //Memory is used?
    {
        if (dest) //Mem->Reg?
        {
            CPU[activeCPU].cycles_OP += 4; //Mem->Reg!
        }
        else //Reg->Mem?
        {
            CPU[activeCPU].cycles_OP += 3; //Reg->Mem!
        }
    }
    else //Reg->Reg?
    {
        CPU[activeCPU].cycles_OP += 3; //Reg->Reg!
    }
    break;
case 3: //ModR/M+imm?
    if (params.EA_cycles) //Memory is used?
    {
        if (dest) //Imm->Reg?
        {
            CPU[activeCPU].cycles_OP += 2; //Imm->Reg!
        }
        else //Imm->Mem?
        {
            CPU[activeCPU].cycles_OP += 5; //Imm->Mem!
        }
    }
    else //Imm->Reg?
    {
        CPU[activeCPU].cycles_OP += 2; //Imm->Reg!
    }
    break;
default:
    break;
}

Is that correct?
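For anyone who wants to check these values in isolation, here is a minimal self-contained sketch of the same table as a pure function. It only restates the numbers posted above and does not confirm them against hardware. The names alu_eu_cycles, alu_form, uses_memory and dest_is_reg are made up for this example and are not UniPCemu identifiers.

```c
#include <stdbool.h>

/* Hypothetical names for the four "flags" values used in the post above. */
enum alu_form {
    ALU_REG_REG   = 0, /* Reg+Reg */
    ALU_ACC_IMM   = 1, /* Accumulator+imm */
    ALU_MODRM     = 2, /* Operands determined by ModR/M */
    ALU_MODRM_IMM = 3  /* ModR/M+imm */
};

/* Returns the EU cycles the posted switch adds for one ALU operation.
   The values are copied from the post, not measured. */
static unsigned alu_eu_cycles(int form, bool uses_memory, bool dest_is_reg)
{
    switch (form) {
    case ALU_REG_REG:
        return 3;
    case ALU_ACC_IMM:
        return 1;
    case ALU_MODRM:
        if (!uses_memory)
            return 3;               /* Reg->Reg */
        return dest_is_reg ? 4 : 3; /* Mem->Reg : Reg->Mem */
    case ALU_MODRM_IMM:
        if (!uses_memory)
            return 2;               /* Imm->Reg */
        return dest_is_reg ? 2 : 5; /* Imm->Reg : Imm->Mem */
    default:
        return 0;                   /* Unknown form: add nothing */
    }
}
```

Calling alu_eu_cycles(ALU_MODRM, true, false) gives the Reg->Mem value (3), which makes it easy to compare each entry against a reference trace one case at a time.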


Reply 22 of 31, by superfury

Rank l33t++

I've modified all timings to match your code, as long as they don't result in 0 cycles on the EU side(?).

Something strikes me as odd: many jumps take barely any time on the EU, effectively far less than documented (e.g. 9 cycles when taken and 4 when not taken, while 16 and 4 are the officially documented figures). Or is there something I'm missing?

https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c

Though I've simply added together those wait(n) statements of yours for UniPCemu's timings.

Result: Metric cycle count of 1444 on 8088MPH.
Edit: Whoops, I forgot to add the 1 cycle for non-REP instructions, as well as the repeating instruction (the repAction function in your code), to the cycles.


Reply 23 of 31, by reenigne

Rank Oldbie

superfury wrote:

I've modified all timings to match your code, as long as they don't result in 0 cycles on the EU side(?).

Something strikes me as odd: many jumps take barely any time on the EU, effectively far less than documented (e.g. 9 cycles when taken and 4 when not taken, while 16 and 4 are the officially documented figures). Or is there something I'm missing?

If you have a single tight loop consisting of a taken conditional jump, it'll take 17 cycles to execute each iteration. 8 of those cycles will be fetching the next instruction since the prefetch queue will be cleared out each time, so 9 cycles is right for the EU time. I'm not sure what the reasoning behind the documented timing is.
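The arithmetic behind those numbers can be sketched as follows. This is a simplified model, not XTCE's logic: it assumes a 2-byte short conditional jump, a 4-clock bus cycle that fetches one byte on the 8088's 8-bit bus, no wait states, and no overlap between the EU time and the refetch. The function names are made up for the example.

```c
/* Assumption: one 8088 bus cycle takes 4 CPU clocks and fetches one byte. */
enum { CLOCKS_PER_BUS_CYCLE = 4 };

/* Clocks needed to refetch an instruction of `bytes` bytes after a taken
   jump has emptied the prefetch queue. */
static unsigned refetch_clocks(unsigned bytes)
{
    return bytes * CLOCKS_PER_BUS_CYCLE;
}

/* Clocks per iteration of a loop that consists of a single taken jump:
   the EU time plus the time to fetch the jump instruction again. */
static unsigned taken_jump_loop_clocks(unsigned eu_clocks, unsigned instr_bytes)
{
    return eu_clocks + refetch_clocks(instr_bytes);
}
```

With 9 EU clocks and a 2-byte jump, taken_jump_loop_clocks(9, 2) gives 17, of which refetch_clocks(2) = 8 are spent on the bus.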

Reply 24 of 31, by superfury

Rank l33t++

OK.

With the latest improvements (the missing REP cycles and the instruction startup cycle, i.e. 1 cycle at the beginning of your handler), 8088 MPH now reports 1563 metric cycles. Is something still missing?

Edit: I do remember some of your instructions (like HLT) adding some cycles based on T-state and REP prefixes. Could that be it?


Reply 25 of 31, by reenigne

Rank Oldbie

superfury wrote:

OK.

With the latest improvements (the missing REP cycles and the instruction startup cycle, i.e. 1 cycle at the beginning of your handler), 8088 MPH now reports 1563 metric cycles. Is something still missing?

Edit: I do remember some of your instructions (like HLT) adding some cycles based on T-state and REP prefixes. Could that be it?

Can't really tell without looking at cycle-by-cycle instruction traces from your emulator and XTCE (or real hardware). There is a lot of complexity to the cycle timings, as you can see from gentests.cpp, particularly where the interactions between the Execution Unit, Bus Interface Unit and prefetch queue come into play (i.e. where in the execution of each instruction each bus access starts). And there are still a lot of timings where I don't really understand why they are what they are (basically everything in the busInit() function, which can probably be made drastically simpler once it's properly understood).

Btw, you probably should not expect 8088 MPH to run correctly even once you have reached 1678 cycles because the timings of the hardwareCheck test code can be correct without the individual instruction timings (and the Kefrens bars inner loop code) being correct. I'm working on making a much more thorough testsuite (based on all the tests that ever failed while I was making XTCE) which is what you'll really need!
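As a rough illustration of the trace comparison described above, the sketch below finds the first instruction at which two per-instruction cycle logs disagree. The array-of-counts trace format and the name first_divergence are assumptions made for this example; neither UniPCemu nor XTCE is known to emit traces in this shape.

```c
#include <stddef.h>

/* Compares two per-instruction cycle logs of length n.
   Returns the index of the first entry that differs, or n if the logs
   match. The log layout is hypothetical. */
static size_t first_divergence(const unsigned *emulated,
                               const unsigned *reference,
                               size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (emulated[i] != reference[i])
            return i; /* First instruction whose timing is off */
    }
    return n; /* No difference found */
}
```

A matching total does not imply matching entries, which is why a per-instruction comparison like this is more telling than the single number the hardware check prints.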

Reply 26 of 31, by superfury

Rank l33t++

This is what I've changed in my 808X emulation with your timings: https://bitbucket.org/superfury/unipcemu/diff … 52c69371236334a

Can you see if that's correct? There are still 100-ish cycles missing somewhere (assuming 8088MPH reports a cycle difference and not a PIT counter difference?).

Do you know which instructions/opcodes 8088 MPH uses to count those cycles (what adds up to those cycles)?


Reply 27 of 31, by Scali

Rank l33t

This is the hardware check code:

procedure hardwareCheck;
{
Performs a speed test and warns the user if they're about to run the demo on a
system it was not designed for (or an emulator). Speed test adapted from the
TOPBENCH opcode metric, however it is just an opcode exercise and is not meant
to be used for anything serious. "I'm a synthetic."

It took a bit of tuning and iteration to find something that produced
a consistent result!
}
const
foow=$1234; foob=$12; {dummy seed constants for the routines}
observed=1678; margin=10;
speedfound:boolean=false;
vidramloc:pointer=ptr($b800,0);

var
cycles:longint;

begin
{enterLockstep;} {locked up keyboard a few times in testing, will debug l8r}
asm

{ensure that we're on the default DRAM refresh, as our dev system
may have left us in some other state}
mov al,$54 {TIMER1 OR LSB OR MODE2 OR $00}
out 43h,al
mov al,18
out 41h,al
(*causing bugs, will debug later *)

cli
call _LZTimerOn

@@speedinit:
{Perform some CGA accesses so CGA wait states can align us}
(* disabled because it caused too much variance between samples
push ds
les di,vidramloc
lds si,vidramloc
cld
mov cx,64
{hit both aligned and unaligned}
lodsw
stosw
lodsb
stosb
movsb
movsw
rep lodsb
pop ds
*)

{init some regs to be non-empty, others in consistent state}
mov ax,foow
xor bx,bx
mov cx,bx
mov dx,$5678
mov si,bx
mov di,bx

{Start exercising opcodes in roughly order of encoding.}
add ax,foow {accum, imm16}
add dx,foow {reg, imm16}
add al,foob {accum, imm8}
add dl,foob {reg, imm8}
add [w],ax {mem16, accum}
add [w],dx {mem16, reg}
add ax,[w] {accum, mem16}
add dx,[w] {reg, mem16}
add ax,dx {accum, reg}
add dx,ax {reg, accum}
add al,dl {accum, reg}
add dl,al {reg, accum}

push es
pop es

or ax,foow {accum, imm16}
or dx,foow {reg, imm16}
or al,foob {accum, imm8}
or dl,foob {reg, imm8}
or w,ax {mem16, accum}
or w,dx {mem16, reg}
or ax,w {accum, mem16}
or dx,w {reg, mem16}
or ax,dx {accum, reg}
or dx,ax {reg, accum}
or al,dl {accum, reg}
or dl,al {reg, accum}

push cs
pop es {POP CS is an undocumented opcode that works on
8088/8086, but we're not going to use it as it means
something completely different on later processors}

adc ax,foow {accum, imm16}
adc dx,foow {reg, imm16}
adc al,foob {accum, imm8}
adc dl,foob {reg, imm8}
adc w,ax {mem16, accum}
adc w,dx {mem16, reg}
adc ax,w {accum, mem16}
adc dx,w {reg, mem16}
adc ax,dx {accum, reg}
adc dx,ax {reg, accum}
adc al,dl {accum, reg}
adc dl,al {reg, accum}

push ax
mov ax,sp
push ss
pop ss {halts all interrupts including NMI for next instr.}
mov sp,ax
pop ax

sbb ax,foow {accum, imm16}
sbb dx,foow {reg, imm16}
sbb al,foob {accum, imm8}
sbb dl,foob {reg, imm8}
sbb w,ax {mem16, accum}
sbb w,dx {mem16, reg}
sbb ax,w {accum, mem16}
sbb dx,w {reg, mem16}
sbb ax,dx {accum, reg}
sbb dx,ax {reg, accum}
sbb al,dl {accum, reg}
sbb dl,al {reg, accum}

push ds
pop ds

and ax,foow {accum, imm16}
and dx,foow {reg, imm16}
and al,foob {accum, imm8}
and dl,foob {reg, imm8}
and w,ax {mem16, accum}
and w,dx {mem16, reg}
and ax,w {accum, mem16}
and dx,w {reg, mem16}
and ax,dx {accum, reg}
and dx,ax {reg, accum}
and al,dl {accum, reg}
and dl,al {reg, accum}

seges mov ax,[bx] {segment override ES is opcode 26h}

daa

sub ax,foow {accum, imm16}
sub dx,foow {reg, imm16}
sub al,foob {accum, imm8}
sub dl,foob {reg, imm8}
sub w,ax {mem16, accum}
sub w,dx {mem16, reg}
sub ax,w {accum, mem16}
sub dx,w {reg, mem16}
sub ax,dx {accum, reg}
sub dx,ax {reg, accum}
sub al,dl {accum, reg}
sub dl,al {reg, accum}

segcs mov ax,[bx] {segment override CS is opcode 2Eh}

das

xor ax,foow {accum, imm16}
xor dx,foow {reg, imm16}
xor al,foob {accum, imm8}
xor dl,foob {reg, imm8}
xor w,ax {mem16, accum}
xor w,dx {mem16, reg}
xor ax,w {accum, mem16}
xor dx,w {reg, mem16}
xor ax,dx {accum, reg}
xor dx,ax {reg, accum}
xor al,dl {accum, reg}
xor dl,al {reg, accum}

segss mov ax,[bx] {segment override SS is opcode 36h}

aaa

cmp ax,foow {accum, imm16}
cmp dx,foow {reg, imm16}
cmp al,foob {accum, imm8}
cmp dl,foob {reg, imm8}
cmp w,ax {mem16, accum}
cmp w,dx {mem16, reg}
cmp ax,w {accum, mem16}
cmp dx,w {reg, mem16}
cmp ax,dx {accum, reg}
cmp dx,ax {reg, accum}
cmp al,dl {accum, reg}
cmp dl,al {reg, accum}

segds lodsw {segment override DS is opcode 3Eh}

aas

inc ax
inc cx
inc dx
inc bx
inc si
inc di
dec ax
dec cx
dec dx
dec bx
dec si
dec di

push ax
push cx
push dx
push bx
push bp
push si
push di
pop di
pop si
pop bp
pop bx
pop dx
pop cx
pop ax

{Jcc and JMP tests -- timings are identical for most forms so we will
only test a few. jcxz is the only one with different timings so it is
explicitly tested as well.}
xor cx,cx {zero out cx}
dec cx {cx := -1}
stc {set carry flag}
jc @L1 {jump if carry - yes}
nop
@L1:
clc {clear carry flag}
jc @L1 {jump if carry - no}
inc cx
jcxz @L1 {jump if cx=0 - yes 1st pass, no 2nd}
sub cx,2
jmp @L3
@L2:
inc cx
clc
@L3:
jbe @L2 {jump if cf=1 or zf=1}
mov cx,2
@loopfun:
nop
loop @loopfun
@endofJMPtests:

{test has optimized forms for accumulator}
test ax,foow {accum, imm16}
test dx,foow {reg, imm16}
test al,foob {accum, imm8}
test dl,foob {reg, imm8}
test w,ax {mem16, accum}
test w,dx {mem16, reg}
test ax,w {accum, mem16}
test dx,w {reg, mem16}
test ax,dx {accum, reg}
test dx,ax {reg, accum}
test al,dl {accum, reg}
test dl,al {reg, accum}

lea ax,[w]

{8e mov segreg,rmw}
mov es,[bx+si+1234h]

nop

xchg w,ax {mem16, accum}
xchg w,dx {mem16, reg}
xchg ax,w {accum, mem16}
xchg dx,w {reg, mem16}
xchg ax,dx {accum, reg}
xchg dx,ax {reg, accum}
xchg al,dl {accum, reg}
xchg dl,al {reg, accum}

cbw

push ds
pop es
mov di,si {es:di = ds:si}
movsb
movsw
movsb
movsw
lodsb
stosb
lodsw
stosw
lodsb
stosb
lodsw
stosw {tests both aligned and unaligned moves}

cmpsb
cmpsw
cmpsb
cmpsw {aligned and unaligned}
scasb
scasw
scasb
scasw {aligned and unaligned}


mov al,foob
mov cl,foob
mov dl,foob
mov bl,foob
mov ah,foob
mov ch,foob
mov dh,foob
mov bh,foob
mov ax,foow
mov cx,foow
mov dx,foow
mov bx,foow
{A lot of hassle just to test the mov encodings of sp and bp :-P }
mov si,foow
mov di,foow

les bx,[foow]

mov bx,$FFFF
rol bl,1
rol [b],1
ror bl,1
ror [b],1
rcl bl,1
rcl [b],1
rcr bl,1
rcr [b],1
shl bl,1
shl [b],1
shr bl,1
shr [b],1
sal bl,1
sal [b],1
sar bl,1
sar [b],1
rol bx,1
rol [w],1
ror bx,1
ror [w],1
rcl bx,1
rcl [w],1
rcr bx,1
rcr [w],1
shl bx,1
shl [w],1
shr bx,1
shr [w],1
sal bx,1
sal [w],1
sar bx,1
sar [w],1

{Nybble work is common, so let's choose 4. Higher values could be used,
but can be optimized out (ie. rol al,5 = ror al,3) so we'll avoid them.}
mov cl,4
rol bl,cl
rol [b],cl
ror bl,cl
ror [b],cl
rcl bl,cl
rcl [b],cl
rcr bl,cl
rcr [b],cl
shl bl,cl
shl [b],cl
shr bl,cl
shr [b],cl
sal bl,cl
sal [b],cl
sar bl,cl
sar [b],cl
rol bx,cl
rol [w],cl
ror bx,cl
ror [w],cl
rcl bx,cl
rcl [w],cl
rcr bx,cl
rcr [w],cl
shl bx,cl
shl [w],cl
shr bx,cl
shr [w],cl
sal bx,cl
sal [w],cl
sar bx,cl
sar [w],cl

aad
nop {slightly more than 8088 prefetch queue}
nop
nop
nop
nop
nop
aam
nop {slightly more than 8088 prefetch queue}
nop
nop
nop
nop
nop
xlat

mov ax,foow
mov dx,$5678 {get non-zeros in registers again}

cmc

not dl
not ax
neg dl
neg ax

{mul/div tests. Values inspired by "PIT ticks to usec" conversion}
mov dx,8381
mul dx
mov bx,10000
div bx
nop {slightly more than 8088 prefetch queue}
nop
nop
nop
nop
nop
imul dx
nop {slightly more than 8088 prefetch queue}
nop
nop
nop
nop
nop
idiv bx

clc
stc

pushf
cld
std
popf

mov ax,foow {accum, imm16}
mov dx,foow {reg, imm16}
mov al,foob {accum, imm8}
mov dl,foob {reg, imm8}
mov w,ax {mem16, accum}
mov w,dx {mem16, reg}
mov ax,w {accum, mem16}
mov dx,w {reg, mem16}
mov ax,dx {accum, reg}
mov dx,ax {reg, accum}
mov al,dl {accum, reg}
mov dl,al {reg, accum}
{don't forget some segment overrides:}
mov dx,cs:[bx]
mov dx,ss:[bp]
mov dx,es:[si]
mov dx,ds:[di]

lea bx,vidramloc
push word ptr [bx]
pop word ptr [bx]

call _LZTimerOff
sti
end;
cycles:=_lztimercount;
if (cycles >= (observed-margin)) and (cycles <= (observed+margin))
then speedfound:=true
else speedfound:=false;
if not speedfound
then writeln('Metric cycle count of ',cycles,' deviates ',
round(abs((cycles-observed) / observed * 100)),'% from what we were expecting.');

writeln('4.77 MHz 8088: ',speedfound);

if speedfound then begin
{print message; wait 4 seconds}
writeln(#13#10'Thanks for running this on real hardware!');
cycles:=ticksSinceMidnight+(18*4);
repeat until ticksSinceMidnight>=cycles;
end else begin
writeln(#13#10'This system is not the intended target for this program.');
writeln('This demo only runs properly on a 4.77MHz 8088 with a real CGA card.');
writeln('Running it on anything else will at best look incorrect, and at worst');
writeln('may PERMANENTLY DAMAGE YOUR MONITOR. If you continue, you agree that');
writeln('the creators of this program cannot be held responsible for damages!'#13#10);
writeln('Sure you want to continue? (Y/N)');
repeat until not keypressed;
if upcase(readkeychar)<>'Y' then halt;
end;

end;
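The pass/fail decision at the end of that routine can be restated in C as below. This is only a sketch of the comparison and the percentage message, using the observed=1678 and margin=10 constants from the listing as arguments. The function names are invented here, and the rounding is plain round-half-up, which may differ from Turbo Pascal's round() at exact halves.

```c
#include <stdbool.h>
#include <stdlib.h>

/* True when the measured count lies within observed +/- margin,
   mirroring the range test in hardwareCheck. */
static bool within_margin(long cycles, long observed, long margin)
{
    return cycles >= observed - margin && cycles <= observed + margin;
}

/* Deviation from the expected count as a whole percentage, mirroring
   round(abs((cycles-observed) / observed * 100)) with integer math. */
static long deviation_percent(long cycles, long observed)
{
    long diff = labs(cycles - observed);
    return (diff * 100 + observed / 2) / observed; /* Round half up */
}
```

For the counts quoted in this thread, within_margin(1563, 1678, 10) is false and deviation_percent(1563, 1678) is 7, so an emulator reporting 1563 would get the "deviates 7%" warning.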

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 28 of 31, by reenigne

Rank Oldbie

superfury wrote:

There are still 100-ish cycles missing somewhere (assuming 8088MPH reports a cycle difference and not a PIT counter difference?).

It reports PIT counter difference 😀

Reply 29 of 31, by Scali

Rank l33t

reenigne wrote:

It reports PIT counter difference 😀

Yes, the _LZTimerOn/Off calls are an adaptation of Abrash's Zen Timer, which is based on the PIT: http://www.jagregory.com/abrash-zen-of-asm/#the-zen-timer
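Since the reported number is in PIT ticks, converting it to CPU clocks needs the ratio between the two clocks. The sketch below assumes a standard IBM PC/XT, where both are derived from the 14.31818 MHz crystal (CPU at 1/3, PIT at 1/12), so one PIT tick spans 4 CPU clocks. The function name is made up for the example.

```c
/* Assumption: standard PC/XT clocking. Both clocks are divided down from
   the same 14.31818 MHz crystal. */
enum {
    CPU_DIVIDER = 3,  /* 14.31818 MHz / 3  = ~4.77 MHz CPU clock */
    PIT_DIVIDER = 12  /* 14.31818 MHz / 12 = ~1.19 MHz PIT clock */
};

/* Converts a PIT tick count (as reported by the hardware check) into
   8088 clock cycles. */
static long pit_ticks_to_cpu_clocks(long pit_ticks)
{
    return pit_ticks * PIT_DIVIDER / CPU_DIVIDER;
}
```

Under that assumption the expected 1678 ticks correspond to 6712 CPU clocks, and the margin of 10 ticks to 40 CPU clocks.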


Reply 30 of 31, by superfury

User metadata
Rank l33t++
Rank
l33t++

Oddly enough, almost all of those instruction timings are already implemented according to reenigne's code, although I've skipped the instructions resulting in 0 cycles (no wait(n) call in your code) and the instructions resulting in nonsensical cycle counts for a situation (look for the EU_CYCLES constants in my code for those). They're mostly jumps, though.

Those are:
- CMP(might have forgotten that one, or slightly unclear, thus unfinished).
- MOV modr/m memory to reg.
- PUSH/POP segreg.
- PUSH/POP reg.
- PUSHF/POPF.
- IN AL/AX,DX
- OUT DX,AL/AX.
- FF /4 JMP Ev memory.
- FF /5 JMP Mp memory.
- FF /6 PUSH Ev.

Any idea on those? Or have I simply forgotten those?

Edit: Fixed CMP. It seems I had simply forgotten it (easy enough to figure out, though).

So now all that's left is that MOV, those PUSHes and POPs, the IN/OUT accumulator,DX forms and the GRP5 JMP opcodes.

Edit: 8088 MPH now reports 1539 cycles. So that's 139 PIT ticks off (multiply by 4 for 8088 cycles, so ~556 8088 cycles off, with a margin of 10 PIT ticks allowed)?


Reply 31 of 31, by superfury

Rank l33t++

Can that hardware check code be run from the boot sector at 0000:7C00? That way, we might be able to find the offending instruction timings (as well as having an easy hardware reference from a floppy or the XT server). Just place CLI HLT at the end to make it stop the CPU?
