VOGONS


First post, by superfury

Rank l33t++

Is there a general formula I can use to calculate the number of cycles each instruction takes on the 80(1/2/3/4/5)86 CPUs? I know that ModR/M decoding takes a certain number of cycles, and that fetching bytes from memory seems to take a fixed number of cycles (4 cycles per byte fetched into the Prefetch Input Queue, maybe less when the byte is already buffered?). Does anyone know a general formula I can use to make my x86 emulator use more accurate cycle counts for each instruction?

Like: ModR/M + Instruction base + Instruction extra + Prefetch ready bytes + Prefetch fetch bytes.

Anyone? Reenigne? Jepael?

Last edited by superfury on 2016-05-28, 14:34. Edited 1 time in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 40, by Scali

Rank l33t

No.
Every CPU has its own type of pipeline, and you need to emulate every part of it that has any external effects as far as timing goes.
Note also that you can't just emulate the CPU in a vacuum.
You also need to keep in mind that other devices can claim the bus and steal cycles. How this affects the CPU will also differ from one CPU to the next.
And don't forget things like caches, which mean that not all accesses go directly to the data bus.

It's not as simple as the formula you propose.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 2 of 40, by BloodyCactus

Rank Oldbie

get each datasheet for each cpu. that will break down the cycles exactly.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 3 of 40, by superfury

Rank l33t++

I think I'll just go the easy way and implement the timings listed at http://www.oocities.org/mc_introtocomputers/I … tion_Timing.PDF .
This should at least provide some basic emulation of CPU speed (only really required for the 80(1)86 CPUs, as they have the most cycle-accurate software written for them). I'll just treat the 80186 the same as the 8086 for now (with the 8086 instructions), making the 80186 instructions the default for the processor (8 cycles on 16-bit, 9 cycles on 8-bit), which is already applied at the moment.

I have already implemented the NEG, DEC/INC and ADD/SUB instruction timings.

However, I don't know yet what to do with the instructions that specify a range of cycles (like MUL etc.). Just take the higher number?

Reply 4 of 40, by Scali

Rank l33t
BloodyCactus wrote:

get each datasheet for each cpu. that will break down the cycles exactly.

Nope. Some details are not covered in the datasheets, and in some cases you need to monitor actual hardware to see what it is doing exactly.
This is what reenigne is doing to study the 8088, using an ISA card with a microcontroller that he built.

Reply 5 of 40, by Scali

Rank l33t
superfury wrote:

I'll just treat the 80186 the same as the 8086 for now (with the 8086 instructions), making the 80186 instructions the default for the processor (8 cycles on 16-bit, 9 cycles on 8-bit), which is already applied at the moment.

Why would you make the 80186 the default?
The 186 is a weird kind of 'microcontroller' where some of the chipset is integrated into the CPU.
The problem is that this integrated functionality is not compatible with the MCS85 chipset that the PC/XT/AT are built around, so you can't build a PC-compatible machine with a 186 CPU.
So a 186 has no place in a PC emulator in the first place, and even if you do support it, there's no point in making it the default, because then you're emulating a machine that cannot physically exist.

Reply 6 of 40, by superfury

Rank l33t++

Well, currently it allows using software requiring the 286+ real mode instructions (Like the Windows 3.0 VGA driver and the MS-DOS game Hocus Pocus). Maybe I should keep the '186' mode intact, but implement those instructions using 286+ timings?

Reply 7 of 40, by Scali

Rank l33t
superfury wrote:

Well, currently it allows using software requiring the 286+ real mode instructions (Like the Windows 3.0 VGA driver and the MS-DOS game Hocus Pocus). Maybe I should keep the '186' mode intact, but implement those instructions using 286+ timings?

How about making it into NEC V20/30 instead?
The V20/30 use the 186 instruction set (but have different timings), but are 'regular' x86 CPUs, and can be found in actual PC-compatibles.

Reply 8 of 40, by superfury

Rank l33t++

Just changed everything labeled 80186 to NEC V20/V30 instead (the V20 and V30 are further differentiated by their 8-bit and 16-bit data bus, respectively).

Reply 9 of 40, by superfury

Rank l33t++

I've modified my 8086 emulation core to use the timings as specified in the document ( http://www.oocities.org/mc_introtocomputers/I … tion_Timing.PDF ). Now most instructions have their timings implemented.

I'm still missing a few though:
- The LxS instructions (LDS, LES; LSS only exists on the 80386 and later). Currently I assume the MOV SegReg, Mem instruction timings for them.
- The flag set/clear instructions (STC, CLC, CMC etc.) and HLT. Currently I assume 2 cycles for them.
- The repeatable string instructions (MOVSB/W, STOSB/W, LODSB/W, SCASB/W).
- The IN/OUT instructions.
- The reserved opcode F1.
- The SALC instruction.
- The decimal and ASCII adjust instructions (DAA, DAS, AAA, AAS, AAM, AAD).

Does anyone know those timings?

Edit: Found something that describes the timings:
http://matthieu.benoit.free.fr/cross/data_she … sers_Manual.pdf

Edit: Implemented those timings for the instructions that still needed them. 8088 MPH now reports a deviation of 39%. I haven't applied memory and I/O timings for different hardware yet, though.

The credits music sounds a bit better, but still not completely right. The same applies to the Kefrens bar effect, although it is too small to judge properly when using the Direct Plot setting, which this 2.0GHz host CPU needs to reach any usable speed (it runs at 30% of realtime speed, where 100% means the emulated CPU and hardware run at the full speed they require).

This is the 8086/8088 source code. It loads the cycles spent into CPU[activeCPU].cycles_OP. The CPU core itself adds the timings it handles (hardware interrupt (IRQ) timings, exception timings and prefix timings) to get the total time spent on the instruction, including any IRQ or exception taken during it.

The emulator converts this total cycle count to the time spent on the instruction, which is added to the global CPU time (and, by extension, hardware time, since the hardware timings are based on the CPU timings provided). The resulting emulation time is synchronized to realtime using high resolution clocks. Which clock is used depends on the system: the slowest, catch-all fallback is the SDL_Ticks() function for crude synchronization, compared to the high resolution clocks on Windows and the PSP.

The whole emulation is thus kept together by the cycles the CPU spends on its instructions, the actual time the computer has been running, and the speed setting, which converts the ticks spent by the CPU into realtime. The Default setting makes the CPU run at 4.77MHz; any other value is interpreted as instructions per millisecond instead (the equivalent of Dosbox's cycles setting).

https://bitbucket.org/superfury/x86emu/src/c1 … 086.c?at=master

Can anyone see anything going wrong here? Or is this simply because I haven't implemented memory wait states and I/O wait states yet?

8088 MPH tells me:

Metric cycle count of 1028 deviates 39% from what we were expecting.
4.77 MHz 8088: FALSE

Edit: Just added wait states to the CGA memory. An access first waits 1 character clock (8 pixels), then waits for the next multiple of 16 pixels (found by counting pixels in a byte variable that overflows to 0, and waiting for its bits 0-3 to become zero), and finally requests the CPU to start again on the next cycle (the ccycle in reenigne's blog).

Is this correct? It's done using just three counters in the VGA/CGA/MDA pixel emulation.

++VGA->PixelCounter; //Simply blindly increase the pixel counter!
if (VGA->WaitState) //Are we applying a CGA memory wait state?
{
    switch (VGA->WaitState) //What state are we waiting for?
    {
    case 1: //Wait 8 hdots!
        if (--VGA->WaitStateCounter == 0) //First wait state done?
        {
            VGA->WaitState = 2; //Enter the next phase: Wait for the next lchar(16 dots period)!
        }
        break;
    case 2: //Wait for the next lchar?
        if ((VGA->PixelCounter & 0xF) == 0) //Second wait state done?
        {
            VGA->WaitState = 0; //Finished here: the wait for the next ccycle(3 hdots) is left to the CPU!
            CPU[activeCPU].halt |= 8; //Start again when the next CPU clock arrives!
            CPU[activeCPU].halt &= ~4; //We're done waiting!
        }
        break;
    case 3: //Wait for the next ccycle(3 hdots)? Never entered by the code above.
    default: //No waitstate?
        break;
    }
}

While CPU[activeCPU].halt isn't zero, the CPU is halted, waiting for one of several conditions before CPU emulation resumes:
- When bit 2 or bit 3 is set, it checks whether only bit 3 is set. If so, CPU emulation is resumed and executes on the next cycle. Otherwise it consumes one HLT cycle (a 4.77MHz cycle).
- Otherwise, when bit 2 is set, it consumes one HLT cycle (at 4.77MHz) and just handles hardware timing (VGA, Adlib, PIT, LPT, COM port etc.), skipping CPU timing.
- Otherwise, when bit 1 is set, the CPU is in the HLT state. When a hardware interrupt arrives and is acknowledged, the hardware interrupt fires and CPU execution resumes immediately.
- When bits 0-3 are all zero, the CPU is running normally, executing instructions.
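
The halt handling described above could be sketched roughly like this. It is only one possible reading of the description, not the actual UniPCemu code: the masks 4 and 8 come from the snippet above, while the mask used for "bit 1" (HALT_HLT), the function name halt_step and the irq_acknowledged parameter are assumptions made for illustration. The function is meant to be called once per CPU clock.

```c
#include <stdint.h>

/* 4 and 8 are the masks used in the wait state snippet above.
   HALT_HLT is an assumed mask for the "bit 1" HLT flag. */
#define HALT_HLT       0x02u
#define HALT_WAITSTATE 0x04u
#define HALT_RESUME    0x08u

/* Returns 1 when the CPU may execute on this clock, 0 when the clock is
   consumed by waiting. irq_acknowledged is a hypothetical input that is
   nonzero when a hardware interrupt has been acknowledged. */
static int halt_step(uint8_t *halt, int irq_acknowledged)
{
    uint8_t wait = (uint8_t)(*halt & (HALT_WAITSTATE | HALT_RESUME));
    if (wait) {
        if (wait != HALT_RESUME)
            return 0;                   /* still inside the wait state */
        *halt &= (uint8_t)~HALT_RESUME; /* the next CPU clock has arrived */
    }
    if (*halt & HALT_HLT) {
        if (!irq_acknowledged)
            return 0;                   /* HLT: only hardware keeps ticking */
        *halt &= (uint8_t)~HALT_HLT;    /* the interrupt wakes the CPU up */
    }
    return 1;                           /* running normally */
}
```

Keeping the wait state bits separate from the HLT bit means a CGA wait state that ends while the CPU is halted by HLT does not wake the CPU up by accident.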

8088 MPH now gives me a metric cycle count of 2243, deviating 34% from what the software expects. Does anyone know why this is? Reenigne? Scali?

Reply 10 of 40, by superfury

Rank l33t++

Does anyone know what the exact difference is between the 8086 and 8088 timings in the manuals I linked to? Is it just 4 cycles added for 8086 word accesses at odd memory addresses, and 4/8/16 cycles (depending on the number of memory reads/writes) on the 8088?
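
As a rough sketch of that rule: Intel's timing tables note that the 8086 needs 4 extra clocks for each 16-bit transfer at an odd address, and the 8088 needs 4 extra clocks for every 16-bit transfer. The names cpu_bus and extra_word_clocks are made up for this example, and it assumes all word transfers of the instruction share the alignment of the given address and that there are no wait states.

```c
#include <stdint.h>

enum cpu_bus { BUS_8088, BUS_8086 }; /* hypothetical names */

/* Extra clocks to add on top of the table value for an instruction that
   performs word_transfers 16-bit memory transfers starting at address. */
static unsigned extra_word_clocks(enum cpu_bus bus, uint32_t address,
                                  unsigned word_transfers)
{
    if (bus == BUS_8088)
        return 4u * word_transfers; /* 8-bit bus: every word costs a second bus cycle */
    /* 16-bit bus: only words at odd addresses are split in two bus cycles */
    return (address & 1u) ? 4u * word_transfers : 0u;
}
```

With one, two or four word transfers this gives the 4/8/16 extra cycles on the 8088, and 0 or 4 per word on the 8086 depending on alignment.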

After the latest additions to the timings, I get an 8088 MPH metric cycle count of 1060, which deviates 37% from what the software expects. Does this mean I'm close to the timing it expects? What timing does it expect exactly?

Reply 11 of 40, by Scali

Rank l33t

37% isn't exactly close 😀
From the source code:
observed=1678; margin=10;
So it expects 1678, with a margin of +/- 10, which is less than 1%.
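
For reference, a small sketch of how those numbers relate. The expected count and margin are the values quoted above. The rounding in deviation_percent and the inclusive margin in passes_check are assumptions: the rounding was chosen because it reproduces the four percentages reported in this thread (1028 -> 39%, 2243 -> 34%, 1060 -> 37%, 2276 -> 36%), not because it was taken from the 8088 MPH source.

```c
#include <stdlib.h>

#define EXPECTED_COUNT 1678 /* "observed" in the quoted source */
#define MARGIN 10           /* "margin" in the quoted source */

/* Deviation from the expected count in percent, rounded to nearest. */
static int deviation_percent(int count)
{
    int diff = abs(count - EXPECTED_COUNT);
    return (diff * 100 + EXPECTED_COUNT / 2) / EXPECTED_COUNT;
}

/* Nonzero when the count is within the margin (assumed inclusive). */
static int passes_check(int count)
{
    return abs(count - EXPECTED_COUNT) <= MARGIN;
}
```

A count has to land between 1668 and 1688 to pass under this reading, so a deviation of 34% to 39% is still far off.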

Reply 12 of 40, by superfury

Rank l33t++

Do I need to make the BIU, prefetch and execution unit separate and have them operate in parallel, like the rest of the hardware, to make this work completely? That is, do I have to split the fetching, execution and I/O (MMU and I/O ports) into separate stages and line them up in time for the prefetch and bus cycles to work properly?

So like this:
CPU state 0: Fetching from Prefetch or waiting for it to fill.
CPU state 1: Fetching ModR/M data from prefetch or waiting for it.
CPU state 2: Fetching parameters or waiting for prefetch.
CPU state 3: Execution phase
CPU state 4+: Execution/Data(MMU or I/O) depending on the instruction.

Prefetch: Every 4 cycles, grab a byte from the current address and increment the address.

BIU: Perform the MMU I/O or hardware I/O, every 4 cycles, when requested by the prefetch or execution unit.

Is this about correct?
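
A minimal sketch of the prefetch part of such a split, assuming a 4-byte queue (the 8088's size; the 8086 has 6 bytes) and one byte fetched per 4 clocks. The names piq, piq_tick and piq_pop are made up for this example, and it ignores bus requests from the execution unit, wait states and the queue flush on jumps.

```c
#include <stdint.h>

#define PIQ_SIZE 4 /* 8088 queue depth; the 8086 has 6 bytes */

struct piq {
    uint8_t  data[PIQ_SIZE];
    unsigned head, count; /* ring buffer read position and fill level */
    unsigned bus_clocks;  /* clocks accumulated toward the next fetch */
    uint32_t fetch_addr;  /* next address to prefetch from */
};

/* BIU side: advance by one CPU clock while the bus is free.
   mem is a hypothetical flat memory array. */
static void piq_tick(struct piq *q, const uint8_t *mem)
{
    if (q->count == PIQ_SIZE) { /* queue full: nothing to fetch */
        q->bus_clocks = 0;
        return;
    }
    if (++q->bus_clocks == 4) { /* one bus cycle of 4 clocks completed */
        q->data[(q->head + q->count) % PIQ_SIZE] = mem[q->fetch_addr++];
        q->count++;
        q->bus_clocks = 0;
    }
}

/* Execution unit side: returns 1 and stores a byte, or 0 if it must wait. */
static int piq_pop(struct piq *q, uint8_t *out)
{
    if (q->count == 0)
        return 0;
    *out = q->data[q->head];
    q->head = (q->head + 1) % PIQ_SIZE;
    q->count--;
    return 1;
}
```

The execution states would then call piq_pop for the opcode, ModR/M and parameter bytes, and stay in the same state for another clock whenever it returns 0.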

Reply 13 of 40, by Scali

Rank l33t

The 8088 MPH test also does some testing on CGA memory access, so you need to simulate waitstates on that as well. Just emulating a CPU-core in a vacuum won't work.

Reply 14 of 40, by superfury

Rank l33t++

Like I said, the 1060 and 2243 counts already include the time added by the VGA (for the CGA actually, but it's a shared object for all the emulated VGA/CGA/MDA cards). So these are the instruction clocks from the manual plus the prefix cycles (2 for each prefix, plus 2 for each ModR/M segment override), with the higher count also including 4 cycles per memory access (byte and word), 4 cycles per prefetch from memory and the specified cycles for the EA calculations.

The prefetch queue is refilled completely after each executed instruction, or whenever it has been fetched empty.

Reply 16 of 40, by superfury

Rank l33t++

The DMA controller's memory refresh currently just adds 4 cycles to the executed time each time it ticks.

Reply 17 of 40, by superfury

Rank l33t++

After adding better prefetching (one byte at a time instead of constantly filling the queue completely), I'm now getting a metric cycle count of 2276 (deviating 36%).

The current formula for the cycles spent on an instruction is:
total cycles = cycles_OP + cycles_HWOP + cycles_Prefix + cycles_Exception + cycles_Prefetch + cycles_MMUR + cycles_MMUW

cycles_OP = cycles from the aforementioned tables + EA timing when used + 16-bit memory overhead (for 8086 word accesses at odd addresses and for 16-bit accesses over the 8088's 8-bit data bus)
EA timing = the timing mentioned in the table (last document) + 2 cycles when a segment override prefix is used.
cycles_HWOP = Same as cycles_Exception below, but for IRQs.
cycles_Prefix = 2 cycles for each prefix issued before the current instruction.
cycles_Exception = the cycles consumed when an exception triggers (DIV0 or comparable).
cycles_Prefetch = The total memory cycles spent prefetching instructions during instruction execution (same input values as cycles_MMUR, but counted separately for Prefetch Input Queue fetches when the PIQ is empty).
cycles_MMUR/W = The number of cycles spent on memory reads and writes (both 8 and 16-bit), according to the MMU. This is 4 cycles for each read/write, which becomes 8 cycles for 16-bit reads/writes on the 8088.

The prefetch queue fills after execution, one byte for every 4 cycles not spent on memory accesses, according to the formula:
prefetchfills = Total cycles - cycles_MMUR - cycles_MMUW - cycles_Prefetch
The filler (which runs after each executed instruction) fills one byte in the PIQ and subtracts 4 cycles from prefetchfills while it's >= 4 (ticking whenever it has enough time to do so during the current instruction).

The PIQ also ticks/fills when it is empty and the CPU emulation requests input (for executing instructions, reading instruction parameters etc.).

Are my calculations correct? Or have I made an error with this?
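
Written out as code, the bookkeeping described in this post looks roughly like the sketch below. It only restates the formulas from the post; the struct and function names are made up for the example and are not the actual UniPCemu identifiers.

```c
/* Per-instruction cycle counters, one field per term of the formula. */
struct instr_cycles {
    unsigned op, hwop, prefix, exception, prefetch, mmur, mmuw;
};

/* total cycles = cycles_OP + cycles_HWOP + cycles_Prefix + cycles_Exception
                  + cycles_Prefetch + cycles_MMUR + cycles_MMUW */
static unsigned total_cycles(const struct instr_cycles *c)
{
    return c->op + c->hwop + c->prefix + c->exception
         + c->prefetch + c->mmur + c->mmuw;
}

/* Number of bytes the filler adds to the PIQ after the instruction:
   one byte per 4 cycles that were not spent on memory accesses. */
static unsigned prefetch_fill_bytes(const struct instr_cycles *c)
{
    unsigned prefetchfills = total_cycles(c) - c->mmur - c->mmuw - c->prefetch;
    return prefetchfills / 4u;
}
```

For example, an instruction with cycles_OP = 9, one prefix (2 cycles) and 4 cycles each of read, write and prefetch totals 23 cycles, of which 11 are free for the filler, so it would add 2 bytes (fewer if the queue fills up first).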

Reply 18 of 40, by Scali

Rank l33t

I think it may be correct for simple instructions, at first glance.
Perhaps what is tripping you up is complex instructions such as mul and div. Their execution time is dependent on the input, since they use early-out algorithms.

Reply 19 of 40, by superfury

Rank l33t++

I have indeed just taken the maximum cycle counts in those cases (MUL, DIV etc.), where the manual gives a range of cycles. For example:
MUL 8-bit reg: 70-77 cycles. So I use 77 cycles for MUL 8-bit reg.
MUL 16-bit reg: 118-133 cycles. So I use 133 cycles for MUL 16-bit reg.
Even worse:
IDIV 16-bit mem: 171-190 + EA. So I use 190+EA for IDIV 16-bit mem.

The problem is that I don't know the formulas for those 'range' instructions, so for now I just take the upper end of the range and hope for the best. Does anyone have the exact formulas for the MUL, IMUL, DIV and IDIV cycles?
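
Until the real formulas are known, the choice within a range can at least be kept in one place, so that it is easy to swap later. A small sketch, using the ranges quoted above; the type and function names are made up for the example, and none of the policies reproduces the real data-dependent timing.

```c
struct cycle_range { unsigned min, max; };

enum range_policy { RANGE_MIN, RANGE_MAX, RANGE_MID };

/* Pick a fixed cycle count from a documented range. */
static unsigned pick_cycles(struct cycle_range r, enum range_policy p)
{
    switch (p) {
    case RANGE_MIN: return r.min;
    case RANGE_MID: return r.min + (r.max - r.min) / 2u;
    case RANGE_MAX:
    default:        return r.max; /* the choice described in this post */
    }
}
```

With RANGE_MAX this gives 77 for MUL 8-bit reg and 133 for MUL 16-bit reg, as above. RANGE_MID would give 73 and 125, which may average out closer over many instructions but is just as much a guess.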

I do notice that the 386 uses such an algorithm as well ( https://pdos.csail.mit.edu/6.828/2014/readings/i386/MUL.htm )

The 80386 uses an early-out multiply algorithm. The actual number of clocks depends on the position of the most significant bit in the optimizing multiplier, shown underlined above. The optimization occurs for positive and negative multiplier values. Because of the early-out algorithm, clock counts given are minimum to maximum. To calculate the actual clocks, use the following formula:
Actual clock = if m <> 0 then max(ceiling(log{2}(m)), 3) + 6 clocks;

Actual clock = if m = 0 then 9 clocks
where m is the multiplier.

So that's it for the 80386 (I)MUL instructions. I don't know about the (I)DIV instructions, though. Perhaps this can be applied to the 8086 MUL/IMUL as well? What about DIV/IDIV?
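
The quoted 80386 formula can be written in C roughly as follows. This is a sketch for the 80386 only and says nothing about 8086 timings. The function names are made up, and m is treated as an unsigned multiplier, so handling of negative IMUL multipliers is left out.

```c
#include <stdint.h>

/* ceiling(log2(m)) for m >= 1, computed without floating point:
   the number of bits needed to represent m - 1. */
static unsigned ceil_log2_u32(uint32_t m)
{
    unsigned bits = 0;
    uint32_t v = m - 1u;
    while (v) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/* Clocks for an 80386 multiply with multiplier m, per the quoted formula. */
static unsigned mul386_clocks(uint32_t m)
{
    unsigned n;
    if (m == 0)
        return 9;
    n = ceil_log2_u32(m);
    return (n > 3 ? n : 3) + 6; /* max(ceiling(log2(m)), 3) + 6 */
}
```

The results run from 9 clocks for small multipliers up to 14, 22 and 38 clocks for the largest 8-bit, 16-bit and 32-bit multipliers.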
