VOGONS


First post, by superfury

Rank l33t++

To make my CPU emulation more cycle-accurate, would it be beneficial to create 16-bit bytecode/microcode blocks, stored inside the emulator for every virtual (8086/8088) CPU instruction, that describe the basic steps a CPU performs to execute its instruction? Steps like load from memory, load from ModR/M, fetch, execute (the actual core that does the calculation part of the instruction, like adding two numbers together or otherwise affecting state, without accessing memory), store ModR/M, store immediate? So essentially the same as modern RISC processing?

Would this result in better and more accurate CPU emulation than a simple function doing all of the above (except the ModR/M instruction reading)?

Maybe some kind of microcode that the emulator executes in parallel with the normal hardware and the prefetch unit? The microsequencer would execute one basic action (fetch a parameter, read ModR/M memory, load a ModR/M register byte/word/dword into a (temporary) CPU register, toggle the operand size (8/16/32-bit), execute a basic operation (adding numbers together etc.), store 8/16/32-bit data to memory (ModR/M, or direct from a parameter)), after which the prefetch unit and other hardware update their state (the prefetch unit fetching from memory when possible and when it has enough room to store the data in its buffer).

Could each of those 'microsequencer' instructions then replicate the timings of the 8086/8088, giving a cycle-accurate 8086/8088 CPU? Anyone?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 5, by Scali

Rank l33t
superfury wrote:

Would this result in better and more accurate CPU emulation than a simple function doing all of the above (except the ModR/M instruction reading)?

I think the main advantage would be that it's easier to develop and maintain.
Instead of having a function for each instruction, you just write a list of 'low-level' operations (sort of microcode) to execute it.
Then you just have to implement functions for each low-level operation.
You can insert things like wait cycles or other sync points as low-level ops. So once you have the basic operations done, you should be able to fine-tune each instruction individually.
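
A minimal sketch of that idea in C, under loud assumptions: the micro-op names, the `Cpu` fields and every cycle count below are made up for illustration and are not measured 8088 timings. It only shows the shape: one instruction is a list of low-level ops, each op has a small handler, and a wait is just another op in the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical micro-op set; names and timings are placeholders. */
typedef enum { UOP_READ_MEM, UOP_ALU_ADD, UOP_WAIT, UOP_WRITE_MEM, UOP_END } UopKind;

typedef struct {
    UopKind kind;
    uint8_t cycles;      /* illustrative cost of this step */
} Uop;

typedef struct {
    uint16_t temp;       /* internal temporary register */
    uint16_t reg;        /* stand-in for the register operand */
    uint16_t mem;        /* stand-in for the ModR/M memory operand */
    uint32_t cycles;     /* cycles consumed so far */
} Cpu;

/* "ADD [mem], reg" expressed as a list of low-level steps. */
static const Uop ADD_MEM_REG[] = {
    { UOP_READ_MEM,  4 },
    { UOP_ALU_ADD,   1 },
    { UOP_WAIT,      2 },   /* a sync point is just another op */
    { UOP_WRITE_MEM, 4 },
    { UOP_END,       0 },
};

/* Execute one micro-op. Returns 0 when the instruction has finished. */
static int step_uop(Cpu *cpu, const Uop *op)
{
    switch (op->kind) {
    case UOP_READ_MEM:  cpu->temp = cpu->mem; break;
    case UOP_ALU_ADD:   cpu->temp = (uint16_t)(cpu->temp + cpu->reg); break;
    case UOP_WAIT:      break;
    case UOP_WRITE_MEM: cpu->mem = cpu->temp; break;
    case UOP_END:       return 0;
    }
    cpu->cycles += op->cycles;
    return 1;
}

/* Run a whole micro-op list; other hardware could be ticked between steps. */
static void run_instruction(Cpu *cpu, const Uop *program)
{
    assert(program != NULL);
    while (step_uop(cpu, program))
        program++;
}
```

With `reg = 5` and `mem = 10`, running `ADD_MEM_REG` leaves `mem` at 15 and adds 11 to the cycle counter. Fine-tuning an instruction then means editing its list rather than its handler code.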

This way you'd be modeling the CPU internals quite explicitly and accurately.
At this level you should also be able to emulate interaction with other devices at the bus level (e.g. a video card inserting wait states via the ISA bus).
The instruction will execute up to the part where it actually accesses the bus, and then you wait until the bus is free before you finish the instruction.
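
A rough sketch of that stall in C. The `Bus` and `PendingAccess` types are hypothetical, and the bus is reduced to "held by another device for N more cycles", which is far simpler than real ISA wait-state signalling. Internal work proceeds while the bus is busy; only the access itself has to wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t busy_cycles;    /* cycles for which another device still holds the bus */
} Bus;

typedef struct {
    uint32_t internal_left;  /* internal (non-bus) cycles left before the access */
    bool     access_done;    /* set once the bus access has been performed */
    uint32_t stalled;        /* cycles spent waiting for the bus */
} PendingAccess;

/* Advance everything by one clock. Returns true once the access completes. */
static bool tick(Bus *bus, PendingAccess *acc)
{
    assert(!acc->access_done);

    bool bus_free = (bus->busy_cycles == 0);
    if (!bus_free)
        bus->busy_cycles--;      /* the other device uses this cycle */

    if (acc->internal_left > 0) {
        acc->internal_left--;    /* internal work overlaps with the busy bus */
        return false;
    }
    if (!bus_free) {
        acc->stalled++;          /* ready to access, but the bus is taken */
        return false;
    }
    acc->access_done = true;     /* bus is free: finish the instruction */
    return true;
}
```

For example, with 2 internal cycles and a bus that is held for 5 cycles, the access completes on the sixth tick after 3 stalled cycles.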

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 2 of 5, by superfury

Rank l33t++

So what should those 'low-level' operations be? I understand the parts like "read a parameter from the prefetch queue" (blocking), "read an immediate address from memory", "read ModR/M from memory" etc., but what should be implemented as the core execution commands? Simple 'add', 'sub', 'wait for the next cycle on the bus to continue'? What about more complicated commands? Should I just create one for every 16- and 32-bit instruction variant (with the 16-bit vs 32-bit variant being selected by a special low-level operation that enables 32-bit, 16-bit or 8-bit mode for any following operation that supports it, kind of like x86 instruction prefixes)?

Finally, once all that is implemented, when should the prefetch unit fetch a byte from memory? When the CPU executes an action lasting at least 4 cycles without accessing memory (the execution phase)?


Reply 3 of 5, by reenigne

Rank Oldbie

Take a look at https://github.com/reenigne/reenigne/tree/mas … rapa/i8088cpu.h - lines 796 to 1848 (at the time of writing this comment) make up a state machine. The "states" here roughly correspond to your low-level operations (except that I haven't implemented cycle-exact multiplication and division). I think this also roughly corresponds to how the real CPU works (not exactly, or all my states would take 1 cycle).

The prefetch queue fetches a byte from memory when all three of these are true:
* there's space in the prefetch queue to put it,
* the bus isn't busy doing something else, and
* the execution unit hasn't turned off prefetching (which happens during instructions like JMP where the result would be useless anyway).
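
The three conditions above can be sketched as a single predicate. This is not taken from the linked i8088cpu.h; the `Biu` struct and its field names are hypothetical. The queue size of 4 matches the 8088 (the 8086 has a 6-byte queue).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 4   /* 8088 prefetch queue length in bytes */

typedef struct {
    uint8_t  bytes[QUEUE_SIZE];
    unsigned count;             /* bytes currently queued */
    bool     bus_busy;          /* bus is doing something else */
    bool     prefetch_enabled;  /* execution unit has not turned prefetching off */
} Biu;

/* True only when all three conditions hold. */
static bool can_prefetch(const Biu *biu)
{
    return biu->count < QUEUE_SIZE
        && !biu->bus_busy
        && biu->prefetch_enabled;
}

/* Queue one byte if allowed. Returns whether a fetch happened. */
static bool try_prefetch(Biu *biu, uint8_t next_byte)
{
    if (!can_prefetch(biu))
        return false;
    assert(biu->count < QUEUE_SIZE);
    biu->bytes[biu->count++] = next_byte;
    return true;
}
```

Calling `try_prefetch` once per emulated cycle, between the execution unit's own steps, is one way to let the queue fill only while the bus is idle.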

Reply 4 of 5, by Jepael

Rank Oldbie

Not sure about being more cycle accurate, but it does seem a bit wasteful to decode each instruction even if you have already decoded them once. Converting it to custom bytecode once is nice, and you can most likely execute the custom bytecode with less decoding, but you still interpret the bytecode, and have to keep track of what x86 physical address corresponds to which address in the bytecode space.

So, what about function pointers? When you decode an instruction such as "MOV AX,IMMED16", you just store a function pointer to a function (or rather an index into a table of function pointers) that knows what to do, in this case pop the 16-bit immediate from the prefetch queue and set the AX variable accordingly.
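
A small sketch of such a table in C, assuming a hypothetical `Cpu` struct in which a plain byte array stands in for the prefetch queue. Only two opcodes are filled in (0xB8 is MOV AX,imm16 and 0x90 is NOP); every other slot is left NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Cpu Cpu;
typedef void (*OpHandler)(Cpu *cpu);

struct Cpu {
    uint16_t       ax;
    const uint8_t *code;   /* stand-in for the prefetch queue contents */
    size_t         pos;    /* next byte to pop */
};

static uint8_t fetch_byte(Cpu *cpu)
{
    return cpu->code[cpu->pos++];
}

/* MOV AX,imm16: pop the little-endian immediate and store it in AX. */
static void op_mov_ax_imm16(Cpu *cpu)
{
    uint16_t lo = fetch_byte(cpu);
    uint16_t hi = fetch_byte(cpu);
    cpu->ax = (uint16_t)(lo | (hi << 8));
}

static void op_nop(Cpu *cpu)
{
    (void)cpu;
}

/* Table indexed by opcode byte; unlisted entries are NULL. */
static const OpHandler HANDLERS[256] = {
    [0x90] = op_nop,
    [0xB8] = op_mov_ax_imm16,
};

/* Decode one opcode and dispatch. Returns 0 for an unimplemented opcode. */
static int step(Cpu *cpu)
{
    assert(cpu->code != NULL);
    OpHandler handler = HANDLERS[fetch_byte(cpu)];
    if (handler == NULL)
        return 0;
    handler(cpu);
    return 1;
}
```

Feeding it the bytes `B8 34 12` sets AX to 0x1234. Caching the handler (or its table index) per address would avoid repeating the lookup, at the cost of tracking which x86 address each cached entry belongs to.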

Reply 5 of 5, by superfury

User metadata
Rank l33t++
Rank
l33t++

@Jepael: The function pointer table is already used for the entire 'execution' phase (reading data from the ModR/M address or a direct address, modifying registers, writing data to memory at the end if required). The problem is that those phases have to be emulated separately, because the split between external processing (the RAM accesses themselves) and internal processing (register modifications, data addition etc.) also affects the prefetch itself. There's a big difference between "reading from memory, processing to register, storing into memory" and "reading, processing to register, reading, processing to register, storing into memory" when the prefetch can activate during the "processing to register" phase (e.g. storing the data in a register, adding the values together, subtracting the values, so any math operations, register loading/saving etc.). That can make all the difference in the cycles issued on the bus (memory access times interleaved with internal processing) and in the CPU. Especially with instructions like CMPSB, XLAT etc., since they affect multiple memory locations (reads and/or writes interleaved with waiting times in which the CPU does internal processing into (temporary) registers).

So the problem here is in the prefetch queue filling and wait states. If I put it all in a single function pointer system (like the jump table it currently uses), I can't separate those phases (and make them affect the prefetch queue, as it can only run before or after the jump table function is called).
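
To make the difference concrete, here is a rough C sketch that describes an instruction as a list of bus and internal phases and counts how many prefetches could fit into the internal ones. The phase lists and their cycle counts are invented for illustration. The model also assumes a prefetch only starts when a whole 4-clock bus cycle fits inside an internal phase, which is a simplification of what the real bus unit does.

```c
#include <assert.h>
#include <stddef.h>

#define BUS_CYCLES 4   /* an 8088 bus cycle takes at least 4 clocks */

typedef enum { PHASE_BUS, PHASE_INTERNAL } PhaseKind;

typedef struct {
    PhaseKind kind;
    unsigned  cycles;   /* illustrative duration, not measured timing */
} Phase;

/* "read, process, store" */
static const Phase READ_PROC_STORE[] = {
    { PHASE_BUS, 4 }, { PHASE_INTERNAL, 5 }, { PHASE_BUS, 4 },
};

/* "read, process, read, process, store" */
static const Phase READ_PROC_READ_PROC_STORE[] = {
    { PHASE_BUS, 4 }, { PHASE_INTERNAL, 5 },
    { PHASE_BUS, 4 }, { PHASE_INTERNAL, 5 }, { PHASE_BUS, 4 },
};

/* Count whole prefetch bus cycles that fit while the EU works internally. */
static unsigned prefetch_slots(const Phase *phases, size_t count)
{
    unsigned slots = 0;
    assert(phases != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (phases[i].kind == PHASE_INTERNAL)
            slots += phases[i].cycles / BUS_CYCLES;
    }
    return slots;
}
```

Under these assumptions the first sequence leaves room for one prefetch and the second for two. A single handler that runs the whole instruction in one call has no point at which to give the prefetcher those slots.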
