Reply 20 of 31, by superfury
8088MPH now reports 1562 metric cycles with the new (I)MUL/(I)DIV cycle counts implemented.
Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io
Based on your code, I've derived the following cycle counts for the different ALU operations:
switch (flags) //What type of operation?
{
case 0: //Reg+Reg?
    CPU[activeCPU].cycles_OP += 3; //Reg->Reg!
    break;
case 1: //Reg+imm?
    CPU[activeCPU].cycles_OP += 1; //Accumulator!
    break;
case 2: //Determined by ModR/M?
    if (params.EA_cycles) //Memory is used?
    {
        if (dest) //Mem->Reg?
        {
            CPU[activeCPU].cycles_OP += 4; //Mem->Reg!
        }
        else //Reg->Mem?
        {
            CPU[activeCPU].cycles_OP += 3; //Reg->Mem!
        }
    }
    else //Reg->Reg?
    {
        CPU[activeCPU].cycles_OP += 3; //Reg->Reg!
    }
    break;
case 3: //ModR/M+imm?
    if (params.EA_cycles) //Memory is used?
    {
        if (dest) //Imm->Reg?
        {
            CPU[activeCPU].cycles_OP += 2; //Imm->Reg!
        }
        else //Imm->Mem?
        {
            CPU[activeCPU].cycles_OP += 5; //Imm->Mem!
        }
    }
    else //Imm->Reg?
    {
        CPU[activeCPU].cycles_OP += 2; //Imm->Reg!
    }
    break;
default:
    break;
}
Is that correct?
I've modified all timings to match your code, as long as they don't result in 0 cycles on the EU side(?).
Something strikes me as odd: many jumps take barely any time on the EU, far less than documented (e.g. 9 cycles for a taken jump and 4 for a jump not taken, while 16 and 4 are the officially documented figures). Or is there something I'm missing?
https://bitbucket.org/superfury/unipcemu/src/ … /opcodes_8086.c
Though I've simply added together those wait(n) statements of yours for UniPCemu's timings.
Result: Metric cycle count of 1444 on 8088MPH.
Edit: Whoops, I forgot to add the 1 cycle for non-REP instructions, as well as the repeating instruction (the repAction function in your code), to the cycles.
wrote: I've modified all timings to match your code, as long as they don't result in 0 cycles on the EU side(?).
Something strikes me as odd: many jumps take barely any time on the EU, far less than documented (e.g. 9 cycles for a taken jump and 4 for a jump not taken, while 16 and 4 are the officially documented figures). Or is there something I'm missing?
If you have a single tight loop consisting of a taken conditional jump, it'll take 17 cycles to execute each iteration. 8 of those cycles will be fetching the next instruction since the prefetch queue will be cleared out each time, so 9 cycles is right for the EU time. I'm not sure what the reasoning behind the documented timing is.
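The 9 + 8 = 17 breakdown above can be sketched in a few lines of C. This is only an illustration, not code from either emulator: the helper names are hypothetical, and it assumes a 2-byte short conditional jump, one byte fetched per 8088 bus cycle, 4 CPU clocks per bus cycle, and no overlap between the refetch and the EU time in this particular loop.

```c
/* Hypothetical constants for illustration only. */
enum {
    BUS_CYCLE_CLOCKS    = 4, /* assumed: one 8088 bus cycle = 4 CPU clocks, 1 byte each */
    JCC_SHORT_BYTES     = 2, /* assumed: opcode + 8-bit displacement */
    JCC_TAKEN_EU_CLOCKS = 9  /* EU time for a taken jump, as quoted in the thread */
};

/* Clocks needed to refetch an instruction after the prefetch queue was flushed. */
unsigned refetch_clocks(unsigned instruction_bytes)
{
    return instruction_bytes * BUS_CYCLE_CLOCKS;
}

/* One iteration of a loop consisting of a single taken conditional jump. */
unsigned taken_jump_loop_clocks(void)
{
    return JCC_TAKEN_EU_CLOCKS + refetch_clocks(JCC_SHORT_BYTES);
}
```

With these assumptions, taken_jump_loop_clocks() gives the 17 clocks per iteration mentioned above; in real code the prefetcher usually overlaps with the EU, so the simple sum only holds for this degenerate loop.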
OK.
With the latest improvements (the missing REP cycles and the instruction startup cycle, i.e. 1 cycle at the beginning of your handler), 8088 MPH now reports 1563 metric cycles. Is something still missing?
Edit: I do remember some of your instructions (like HLT) adding some cycles based on T-state and REP prefixes. Could that be it?
wrote: OK.
With the latest improvements (the missing REP cycles and the instruction startup cycle, i.e. 1 cycle at the beginning of your handler), 8088 MPH now reports 1563 metric cycles. Is something still missing?
Edit: I do remember some of your instructions (like HLT) adding some cycles based on T-state and REP prefixes. Could that be it?
Can't really tell without looking at cycle-by-cycle instruction traces from your emulator and XTCE (or real hardware). There is a lot of complexity to the cycle timings, as you can see from gentests.cpp, particularly where the interactions between the Execution Unit, Bus Interface Unit and prefetch queue come into play (i.e. where in the execution of each instruction each bus access starts). And there are still a lot of timings where I don't really understand why they are what they are (basically everything in the busInit() function, which can probably be made drastically simpler once it's properly understood).
Btw, you probably should not expect 8088 MPH to run correctly even once you have reached 1678 cycles because the timings of the hardwareCheck test code can be correct without the individual instruction timings (and the Kefrens bars inner loop code) being correct. I'm working on making a much more thorough testsuite (based on all the tests that ever failed while I was making XTCE) which is what you'll really need!
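To illustrate why a matching total doesn't prove the individual timings, here is a small C sketch. The two timing tables are made-up numbers (not measurements from real hardware, XTCE or UniPCemu), and the function names are hypothetical; the point is only that compensating per-instruction errors cancel out in a sum.

```c
#include <stddef.h>

/* Hypothetical per-instruction clock counts for the same 4-instruction trace. */
const unsigned reference_clocks[4] = { 3, 4, 9, 17 };  /* made-up "hardware" values */
const unsigned emulated_clocks[4]  = { 4, 3, 10, 16 }; /* made-up emulator values */

/* Sum of all per-instruction clock counts, like a single end-to-end timer reading. */
unsigned total_clocks(const unsigned *clocks, size_t count)
{
    unsigned sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += clocks[i];
    }
    return sum;
}

/* Number of instructions whose timing differs between the two tables. */
size_t count_mismatches(const unsigned *a, const unsigned *b, size_t count)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < count; i++) {
        if (a[i] != b[i]) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Both tables sum to 33 clocks even though every single entry differs, which is why a per-instruction (or per-cycle) comparison is needed rather than one aggregate number.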
This is what I've changed in my 808X emulation with your timings: https://bitbucket.org/superfury/unipcemu/diff … 52c69371236334a
Can you see if that's correct? There are still 100-ish cycles missing somewhere (assuming 8088MPH reports a cycle difference and not a PIT counter difference?)
Do you know which instructions/opcodes 8088 MPH uses to count those cycles (i.e. what adds up to those cycles)?
This is the hardware check code:
procedure hardwareCheck;
{
Performs a speed test and warns the user if they're about to run the demo on a
system it was not designed for (or an emulator). Speed test adapted from the
TOPBENCH opcode metric, however it is just an opcode exercise and is not meant
to be used for anything serious. "I'm a synthetic."

It took a bit of tuning and iteration to find something that produced
a consistent result!
}
const
  foow=$1234; foob=$12; {dummy seed constants for the routines}
  observed=1678; margin=10;
  speedfound:boolean=false;
  vidramloc:pointer=ptr($b800,0);
var
  cycles:longint;
begin
  {enterLockstep;} {locked up keyboard a few times in testing, will debug l8r}
  asm
    {ensure that we're on the default DRAM refresh, as our dev system
    may have left us in some other state}
    mov al,$54 {TIMER1 OR LSB OR MODE2 OR $00}
    out 43h,al
    mov al,18
    out 41h,al
    (*causing bugs, will debug later *)
    cli
    call _LZTimerOn
@@speedinit:
    {Perform some CGA accesses so CGA wait states can align us}
    (* disabled because it caused too much variance between samples
    push ds
    les di,vidramloc
    lds si,vidramloc
    cld
    mov cx,64
    {hit both aligned and unaligned}
    lodsw
    stosw
    lodsb
    stosb
    movsb
    movsw
    rep lodsb
    pop ds
    *)
    {init some regs to be non-empty, others in consistent state}
    mov ax,foow
    xor bx,bx
    mov cx,bx
    mov dx,$5678
    mov si,bx
    mov di,bx

    {Start exercising opcodes in roughly order of encoding.}
    add ax,foow {accum, imm16}
    add dx,foow {reg, imm16}
    add al,foob {accum, imm8}
    add dl,foob {reg, imm8}
    add [w],ax {mem16, accum}
    add [w],dx {mem16, reg}
    add ax,[w] {accum, mem16}
    add dx,[w] {reg, mem16}
    add ax,dx {accum, reg}
    add dx,ax {reg, accum}
    add al,dl {accum, reg}
    add dl,al {reg, accum}
    push es
    pop es
    or ax,foow {accum, imm16}
    or dx,foow {reg, imm16}
    or al,foob {accum, imm8}
    or dl,foob {reg, imm8}
    or w,ax {mem16, accum}
    or w,dx {mem16, reg}
    or ax,w {accum, mem16}
    or dx,w {reg, mem16}
    or ax,dx {accum, reg}
    or dx,ax {reg, accum}
    or al,dl {accum, reg}
    or dl,al {reg, accum}
    push cs
    pop es {POP CS is an undocumented opcode that works on
            8088/8086, but we're not going to use it as it means
            something completely different on later processors}
    adc ax,foow {accum, imm16}
    adc dx,foow {reg, imm16}
    adc al,foob {accum, imm8}
    adc dl,foob {reg, imm8}
    adc w,ax {mem16, accum}
    adc w,dx {mem16, reg}
    adc ax,w {accum, mem16}
    adc dx,w {reg, mem16}
    adc ax,dx {accum, reg}
    adc dx,ax {reg, accum}
    adc al,dl {accum, reg}
    adc dl,al {reg, accum}
    push ax
    mov ax,sp
    push ss
    pop ss {halts all interrupts including NMI for next instr.}
    mov sp,ax
    pop ax
    sbb ax,foow {accum, imm16}
    sbb dx,foow {reg, imm16}
    sbb al,foob {accum, imm8}
    sbb dl,foob {reg, imm8}
    sbb w,ax {mem16, accum}
    sbb w,dx {mem16, reg}
    sbb ax,w {accum, mem16}
    sbb dx,w {reg, mem16}
    sbb ax,dx {accum, reg}
    sbb dx,ax {reg, accum}
    sbb al,dl {accum, reg}
    sbb dl,al {reg, accum}
    push ds
    pop ds
    and ax,foow {accum, imm16}
    and dx,foow {reg, imm16}
    and al,foob {accum, imm8}
    and dl,foob {reg, imm8}
    and w,ax {mem16, accum}
    and w,dx {mem16, reg}
    and ax,w {accum, mem16}
    and dx,w {reg, mem16}
    and ax,dx {accum, reg}
    and dx,ax {reg, accum}
    and al,dl {accum, reg}
    and dl,al {reg, accum}
    seges mov ax,[bx] {segment override ES is opcode 26h}
    daa
    sub ax,foow {accum, imm16}
    sub dx,foow {reg, imm16}
    sub al,foob {accum, imm8}
    sub dl,foob {reg, imm8}
    sub w,ax {mem16, accum}
    sub w,dx {mem16, reg}
    sub ax,w {accum, mem16}
    sub dx,w {reg, mem16}
    sub ax,dx {accum, reg}
    sub dx,ax {reg, accum}
    sub al,dl {accum, reg}
    sub dl,al {reg, accum}
    segcs mov ax,[bx] {segment override CS is opcode 2Eh}
    das
    xor ax,foow {accum, imm16}
    xor dx,foow {reg, imm16}
    xor al,foob {accum, imm8}
    xor dl,foob {reg, imm8}
    xor w,ax {mem16, accum}
    xor w,dx {mem16, reg}
    xor ax,w {accum, mem16}
    xor dx,w {reg, mem16}
    xor ax,dx {accum, reg}
    xor dx,ax {reg, accum}
    xor al,dl {accum, reg}
    xor dl,al {reg, accum}
    segss mov ax,[bx] {segment override SS is opcode 36h}
    aaa
    cmp ax,foow {accum, imm16}
    cmp dx,foow {reg, imm16}
    cmp al,foob {accum, imm8}
    cmp dl,foob {reg, imm8}
    cmp w,ax {mem16, accum}
    cmp w,dx {mem16, reg}
    cmp ax,w {accum, mem16}
    cmp dx,w {reg, mem16}
    cmp ax,dx {accum, reg}
    cmp dx,ax {reg, accum}
    cmp al,dl {accum, reg}
    cmp dl,al {reg, accum}
    segds lodsw {segment override DS is opcode 3Eh}
    aas
    inc ax
    inc cx
    inc dx
    inc bx
    inc si
    inc di
    dec ax
    dec cx
    dec dx
    dec bx
    dec si
    dec di
    push ax
    push cx
    push dx
    push bx
    push bp
    push si
    push di
    pop di
    pop si
    pop bp
    pop bx
    pop dx
    pop cx
    pop ax

    {Jcc and JMP tests -- timings are identical for most forms so we will
    only test a few. jcxz is the only one with different timings so it is
    explicitly tested as well.}
    xor cx,cx {zero out cx}
    dec cx {cx := -1}
    stc {set carry flag}
    jc @L1 {jump if carry - yes}
    nop
@L1:
    clc {clear carry flag}
    jc @L1 {jump if carry - no}
    inc cx
    jcxz @L1 {jump if cx=0 - yes 1st pass, no 2nd}
    sub cx,2
    jmp @L3
@L2:
    inc cx
    clc
@L3:
    jbe @L2 {jump if cf=1 or zf=1}
    mov cx,2
@loopfun:
    nop
    loop @loopfun
@endofJMPtests:

    {test has optimized forms for accumulator}
    test ax,foow {accum, imm16}
    test dx,foow {reg, imm16}
    test al,foob {accum, imm8}
    test dl,foob {reg, imm8}
    test w,ax {mem16, accum}
    test w,dx {mem16, reg}
    test ax,w {accum, mem16}
    test dx,w {reg, mem16}
    test ax,dx {accum, reg}
    test dx,ax {reg, accum}
    test al,dl {accum, reg}
    test dl,al {reg, accum}
    lea ax,[w]
    {8e mov segreg,rmw}
    mov es,[bx+si+1234h]
    nop
    xchg w,ax {mem16, accum}
    xchg w,dx {mem16, reg}
    xchg ax,w {accum, mem16}
    xchg dx,w {reg, mem16}
    xchg ax,dx {accum, reg}
    xchg dx,ax {reg, accum}
    xchg al,dl {accum, reg}
    xchg dl,al {reg, accum}
    cbw
    push ds
    pop es
    mov di,si {es:di = ds:si}
    movsb
    movsw
    movsb
    movsw
    lodsb
    stosb
    lodsw
    stosw
    lodsb
    stosb
    lodsw
    stosw {tests both aligned and unaligned moves}
    cmpsb
    cmpsw
    cmpsb
    cmpsw {aligned and unaligned}
    scasb
    scasw
    scasb
    scasw {aligned and unaligned}
    mov al,foob
    mov cl,foob
    mov dl,foob
    mov bl,foob
    mov ah,foob
    mov ch,foob
    mov dh,foob
    mov bh,foob
    mov ax,foow
    mov cx,foow
    mov dx,foow
    mov bx,foow
    {A lot of hassle just to test the mov encodings of sp and bp :-P }
    mov si,foow
    mov di,foow
    les bx,[foow]
    mov bx,$FFFF
    rol bl,1
    rol [b],1
    ror bl,1
    ror [b],1
    rcl bl,1
    rcl [b],1
    rcr bl,1
    rcr [b],1
    shl bl,1
    shl [b],1
    shr bl,1
    shr [b],1
    sal bl,1
    sal [b],1
    sar bl,1
    sar [b],1
    rol bx,1
    rol [w],1
    ror bx,1
    ror [w],1
    rcl bx,1
    rcl [w],1
    rcr bx,1
    rcr [w],1
    shl bx,1
    shl [w],1
    shr bx,1
    shr [w],1
    sal bx,1
    sal [w],1
    sar bx,1
    sar [w],1
    {Nybble work is common, so let's choose 4. Higher values could be used,
    but can be optimized out (ie. rol al,5 = ror al,3) so we'll avoid them.}
    mov cl,4
    rol bl,cl
    rol [b],cl
    ror bl,cl
    ror [b],cl
    rcl bl,cl
    rcl [b],cl
    rcr bl,cl
    rcr [b],cl
    shl bl,cl
    shl [b],cl
    shr bl,cl
    shr [b],cl
    sal bl,cl
    sal [b],cl
    sar bl,cl
    sar [b],cl
    rol bx,cl
    rol [w],cl
    ror bx,cl
    ror [w],cl
    rcl bx,cl
    rcl [w],cl
    rcr bx,cl
    rcr [w],cl
    shl bx,cl
    shl [w],cl
    shr bx,cl
    shr [w],cl
    sal bx,cl
    sal [w],cl
    sar bx,cl
    sar [w],cl
    aad
    nop {slightly more than 8088 prefetch queue}
    nop
    nop
    nop
    nop
    nop
    aam
    nop {slightly more than 8088 prefetch queue}
    nop
    nop
    nop
    nop
    nop
    xlat
    mov ax,foow
    mov dx,$5678 {get non-zeros in registers again}
    cmc
    not dl
    not ax
    neg dl
    neg ax
    {mul/div tests. Values inspired by "PIT ticks to usec" conversion}
    mov dx,8381
    mul dx
    mov bx,10000
    div bx
    nop {slightly more than 8088 prefetch queue}
    nop
    nop
    nop
    nop
    nop
    imul dx
    nop {slightly more than 8088 prefetch queue}
    nop
    nop
    nop
    nop
    nop
    idiv bx
    clc
    stc
    pushf
    cld
    std
    popf
    mov ax,foow {accum, imm16}
    mov dx,foow {reg, imm16}
    mov al,foob {accum, imm8}
    mov dl,foob {reg, imm8}
    mov w,ax {mem16, accum}
    mov w,dx {mem16, reg}
    mov ax,w {accum, mem16}
    mov dx,w {reg, mem16}
    mov ax,dx {accum, reg}
    mov dx,ax {reg, accum}
    mov al,dl {accum, reg}
    mov dl,al {reg, accum}
    {don't forget some segment overrides:}
    mov dx,cs:[bx]
    mov dx,ss:[bp]
    mov dx,es:[si]
    mov dx,ds:[di]
    lea bx,vidramloc
    push word ptr [bx]
    pop word ptr [bx]
    call _LZTimerOff
    sti
  end;
  cycles:=_lztimercount;
  if (cycles >= (observed-margin)) and (cycles <= (observed+margin))
    then speedfound:=true
    else speedfound:=false;
  if not speedfound
    then writeln('Metric cycle count of ',cycles,' deviates ',round(abs((cycles-observed) / observed * 100)),'% from what we were expecting.');
  writeln('4.77 MHz 8088: ',speedfound);
  if speedfound then begin
    {print message; wait 4 seconds}
    writeln(#13#10'Thanks for running this on real hardware!');
    cycles:=ticksSinceMidnight+(18*4);
    repeat until ticksSinceMidnight>=cycles;
  end else begin
    writeln(#13#10'This system is not the intended target for this program.');
    writeln('This demo only runs properly on a 4.77MHz 8088 with a real CGA card.');
    writeln('Running it on anything else will at best look incorrect, and at worst');
    writeln('may PERMANENTLY DAMAGE YOUR MONITOR. If you continue, you agree that');
    writeln('the creators of this program cannot be held responsible for damages!'#13#10);
    writeln('Sure you want to continue? (Y/N)');
    repeat until not keypressed;
    if upcase(readkeychar)<>'Y' then halt;
  end;
end;
wrote: There are still 100-ish cycles missing somewhere (assuming 8088MPH reports a cycle difference and not a PIT counter difference?)
It reports PIT counter difference 😀
wrote: It reports PIT counter difference 😀
Yes, the _LZTimerOn/Off calls are an adaptation of Abrash's Zen Timer, which is based on the PIT timer: http://www.jagregory.com/abrash-zen-of-asm/#the-zen-timer
Oddly enough, almost all of those instruction timings are already implemented according to Reenigne's code. However, I've skipped the instructions that result in 0 cycles (no wait(n) call in your code) and the instructions that result in nonsensical cycle counts for a given situation (look for the EU_CYCLES constants in my code for those), although those are mostly jumps.
Those are:
- CMP (I might have forgotten that one, or it was slightly unclear and thus left unfinished).
- MOV modr/m memory to reg.
- PUSH/POP segreg.
- PUSH/POP reg.
- PUSHF/POPF.
- IN AL/AX,DX.
- OUT DX,AL/AX.
- FF /4 JMP Ev memory.
- FF /5 JMP Mp memory.
- FF /6 PUSH Ev.
Any idea on those? Or have I simply forgotten those?
Edit: Fixed CMP. I had simply forgotten those, it seems (easy enough to figure out, though).
So now all that's left is that MOV, those PUSHes and POPs, the IN/OUT accumulator,DX forms and the GRP5 JMP opcodes.
Edit: 8088 MPH now reports 1539 cycles. So that's 139 PIT cycles off (divide by 4 for 8088 cycles, so ~34 8088 cycles off, with 1.5 cycles allowed)?
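As a cross-check on that conversion, here is a small C sketch, assuming the reported count really is a raw PIT tick difference as stated earlier in the thread. On a stock IBM PC/XT both clocks are derived from the 14.31818 MHz crystal (divided by 3 for the CPU, by 12 for the PIT), so one PIT tick spans 4 CPU clocks; that would mean multiplying by 4 rather than dividing. The function names are hypothetical.

```c
enum {
    CPU_CLOCK_DIVISOR = 3,  /* 14.31818 MHz / 3  = ~4.77 MHz CPU clock */
    PIT_CLOCK_DIVISOR = 12  /* 14.31818 MHz / 12 = ~1.19 MHz PIT clock */
};

/* How many CPU clocks elapse during one PIT tick. */
long cpu_clocks_per_pit_tick(void)
{
    return PIT_CLOCK_DIVISOR / CPU_CLOCK_DIVISOR;
}

/* Convert a PIT tick difference into CPU clocks. */
long pit_ticks_to_cpu_clocks(long pit_ticks)
{
    return pit_ticks * cpu_clocks_per_pit_tick();
}
```

Under that assumption the 139-tick shortfall (1678 - 1539) corresponds to roughly 556 CPU clocks, and the 10-tick margin to about 40 CPU clocks.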
Can that hardware check code be run from the boot sector at 0000:7c00? That way, we might be able to find the offending instruction timings, as well as having an easy hardware reference from a floppy (or the XT server). Just place CLI HLT at the end to make it stop the CPU?