You say you subtract them from the EU cycles value at the end? You'll have a problem if those "fake cycles" exceed the number of cycles used in EU_PHASE_XXX: the result will underflow into either a huge or an invalid (none) number of cycles. Say 5 cycles are used with 8 "fake cycles". As an unsigned 16-bit value you'd get 65533 cycles to spend; as a signed value you'd get -3, which counts as none when using a simple <-operator comparison.
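To make the underflow concrete, here's a minimal sketch in C. It assumes the cycle counter is an unsigned 16-bit value; the function names are made up for illustration and aren't taken from either emulator.

```c
#include <stdint.h>

/* What an unguarded 16-bit subtraction yields: the result wraps modulo 65536. */
static uint16_t subtract_unguarded(uint16_t used, uint16_t fake)
{
    return (uint16_t)(used - fake);
}

/* Hypothetical guarded variant: clamp at zero instead of wrapping around. */
static uint16_t subtract_clamped(uint16_t used, uint16_t fake)
{
    return (fake >= used) ? 0 : (uint16_t)(used - fake);
}
```

With 5 used cycles and 8 "fake cycles", `subtract_unguarded(5, 8)` gives 65533, while `subtract_clamped(5, 8)` gives 0.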
I've just finished adjusting the BIU, ModR/M and Stack handling of UniPCemu. It should now be able to handle (as an 8-bit BIU only atm) all possible requests to the BIU by the EU. It works using two 1-entry queues: one queue for requests and one queue for responses (which are 1 for write BIU cycles and the value read for read BIU cycles). The request queue is to be filled by the CPU with access requests. The BIU pops an entry off this queue and starts processing cycles (in parallel with the EU) until the MMU or BUS I/O is complete. Once it's complete, it will push a response (1 for writes, value read for reads) on the response queue.
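Roughly, the two 1-entry queues could look like the sketch below. This is not the actual UniPCemu code: all names, the toy 16-byte memory and the fixed 4-clock access time are assumptions made just to show the request/response flow.

```c
#include <stdbool.h>
#include <stdint.h>

#define BUS_CYCLES 4u /* assumed length of one bus access, in clocks */

typedef struct { bool is_write; uint32_t address; uint8_t data; } BiuRequest;

typedef struct {
    bool request_full;   /* 1-entry request queue */
    BiuRequest request;
    bool response_full;  /* 1-entry response queue */
    uint32_t response;   /* 1 for writes, value read for reads */
    bool busy;           /* a request is currently being processed */
    BiuRequest current;
    unsigned cycles_left;
    uint8_t memory[16];  /* toy memory standing in for the MMU/bus */
} Biu;

/* EU side: try to queue a request; fails while the queue is occupied. */
static bool biu_request(Biu *b, BiuRequest r)
{
    if (b->request_full) return false;
    b->request = r;
    b->request_full = true;
    return true;
}

/* EU side: try to pop a response; fails while none is available. */
static bool biu_response(Biu *b, uint32_t *value)
{
    if (!b->response_full) return false;
    *value = b->response;
    b->response_full = false;
    return true;
}

/* BIU side: advance one clock. */
static void biu_tick(Biu *b)
{
    if (!b->busy) {
        if (!b->request_full) return; /* nothing queued: stay idle */
        b->current = b->request;      /* pop the request */
        b->request_full = false;
        b->busy = true;
        b->cycles_left = BUS_CYCLES;
    }
    if (b->cycles_left > 0) b->cycles_left--;
    if (b->cycles_left == 0 && !b->response_full) {
        uint32_t i = b->current.address % (uint32_t)sizeof b->memory;
        if (b->current.is_write) {
            b->memory[i] = b->current.data;
            b->response = 1;          /* writes answer with 1 */
        } else {
            b->response = b->memory[i];
        }
        b->response_full = true;      /* push the response */
        b->busy = false;
    }
}

/* Usage: write a byte, then read it back; returns the value read, or -1. */
static int demo_write_then_read(void)
{
    Biu b = {0};
    BiuRequest wr = { true, 3, 0x42 };
    BiuRequest rd = { false, 3, 0 };
    uint32_t value = 0;

    if (!biu_request(&b, wr)) return -1;
    while (!biu_response(&b, &value)) biu_tick(&b);
    if (value != 1) return -1;
    if (!biu_request(&b, rd)) return -1;
    while (!biu_response(&b, &value)) biu_tick(&b);
    return (int)value;
}
```

Keeping both queues at one entry means the "queue" is really just a full flag plus a slot, so the EU can check whether it's allowed to queue or pop with a single test.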
The EU handles writes/reads by first queueing a request for an MMU/IO read/write and then idling 1 cycle at a time. Once the response buffer is filled, it pops off the response (1, or the memory/BUS value read) and continues execution. The different stages are handled using a simple increasing counter that keeps the current execution state to return to: a series of if counter equals STEP1, else if counter equals STEP2, and so on, else finish the instruction with 0 cycles (or delay the EU when needed). Each STEP* point simply calls a function to do something (BIU queue functionality or response handling) and then either adds some cycles of delay (1 cycle when waiting for the request to complete, or for room to add a request) or performs an action and delays by the actual execution cycles.
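As a rough sketch of that counter-driven flow (again not the real code: the stub BIU, the step names, the 4-clock access delay, the 3 execution cycles and the value 0x41 are all assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the BIU: one request, answered after a fixed delay. */
typedef struct {
    bool request_full;
    bool response_full;
    uint32_t response;
    unsigned delay;
} StubBiu;

typedef struct {
    unsigned step;    /* execution state to resume at */
    uint32_t operand;
    bool done;
} EuState;

typedef struct { unsigned cycles; uint32_t operand; } RunResult;

enum { STEP_REQUEST = 0, STEP_WAIT = 1, STEP_EXECUTE = 2 };

/* Run one stage of the instruction; returns the EU cycles it consumed. */
static unsigned eu_step(EuState *eu, StubBiu *biu)
{
    if (eu->step == STEP_REQUEST) {
        if (biu->request_full) return 1;   /* no room yet: idle 1 cycle */
        biu->request_full = true;          /* queue the read request */
        biu->delay = 4;                    /* assumed access time in clocks */
        eu->step++;
        return 0;
    } else if (eu->step == STEP_WAIT) {
        if (!biu->response_full) return 1; /* not answered yet: idle 1 cycle */
        eu->operand = biu->response;       /* pop the response */
        biu->response_full = false;
        eu->step++;
        return 0;
    } else if (eu->step == STEP_EXECUTE) {
        eu->operand += 1;                  /* the "actual" work of the instruction */
        eu->step++;
        return 3;                          /* illustrative execution cycles */
    }
    eu->done = true;                       /* finish the instruction with 0 cycles */
    return 0;
}

/* Let the stub BIU run in parallel for the cycles the EU just spent. */
static void stub_biu_tick(StubBiu *biu, unsigned cycles)
{
    unsigned i;
    for (i = 0; i < cycles && biu->request_full; i++) {
        if (--biu->delay == 0) {
            biu->request_full = false;
            biu->response = 0x41;          /* pretend this was read from memory */
            biu->response_full = true;
        }
    }
}

/* Usage: run one instruction to completion and report cycles and result. */
static RunResult run_instruction(void)
{
    EuState eu = {0};
    StubBiu biu = {0};
    RunResult r = {0, 0};
    while (!eu.done) {
        unsigned c = eu_step(&eu, &biu);
        stub_biu_tick(&biu, c);
        r.cycles += c;
    }
    r.operand = eu.operand;
    return r;
}
```

In this toy run the EU idles 4 cycles waiting for the read and then spends 3 cycles executing, so `run_instruction()` reports 7 cycles and an operand of 0x42.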
Using such a step-based system allows the EU to do roughly the same as your version of those execution queues in your EU files, only it's done with one or a few counters in my case.
I'm still thinking about how to handle the interrupts etc. Things get a bit complicated when handling 80286+ interrupts using this method as well, as they can be nested.
Implementing that step-based system in the 8086 core is the only big thing that's still left to do to make it run more like your emulator (and more cycle-accurate in general). Currently the (EU) cycle counts of the instructions haven't been adjusted yet to exclude fetching from the PIQ or memory access cycles. Those are 4 cycles per access, either byte or word, since those cycles are 8086 cycles afaik; 4 cycles need to be added manually to get 8088 timings, which is currently done by adding to the total cycles through the cycles_MMUR, cycles_MMUW or cycles_IO variables. The EA timings, however, have already been separated and moved back to the start of the 'execution-phase'. That's for lack of a better term to describe the actual execution of an instruction itself, which is essentially everything after the prefetch-phase (the prefetch phase being the opcode fetching, modr/m fetching, parameter fetching, 1-cycle idle timings during those, and EA decode cycles). Currently, though, the EA decode is immediately absorbed into the first stage of the 'execution-phase', instead of being separated so the BIU can do a little work before instruction execution actually starts.
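For reference, the 8086 vs 8088 access cost mentioned above can be sketched like this. It assumes the usual 4-clock bus cycle without wait states; the function names are hypothetical and this isn't how the cycles_MMUR/cycles_MMUW/cycles_IO variables are actually updated.

```c
#include <stdbool.h>
#include <stdint.h>

#define CLOCKS_PER_BUS_CYCLE 4u /* assumed: no wait states */

/* Clocks needed for one memory or I/O access. */
static unsigned bus_clocks(bool is_8088, bool is_word, uint32_t address)
{
    unsigned bus_cycles = 1;
    /* A word needs two bus cycles on the 8-bit bus of the 8088,
       and on the 8086 when it sits at an odd address. */
    if (is_word && (is_8088 || (address & 1u)))
        bus_cycles = 2;
    return bus_cycles * CLOCKS_PER_BUS_CYCLE;
}

/* Extra clocks to add on top of the 8086 count to get the 8088 count. */
static unsigned extra_8088_clocks(bool is_word, uint32_t address)
{
    return bus_clocks(true, is_word, address) - bus_clocks(false, is_word, address);
}
```

So an aligned word access costs 4 clocks on the 8086 and 8 on the 8088, which is where the manual +4 comes from; byte accesses cost 4 clocks on both.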
Edit: I've just modified the EA cycles to be consumed first by the BIU (and hardware), before starting the 'execution-phase'. Now the opcode-specific handler itself starts once the EA cycles (if any) have completed.