Well, what it's currently doing is this: it prefetches when needed (using 4 memory cycles for each byte read from memory, which are counted in the cycles_Prefetch variable instead of cycles_MMUR), then starts the instruction handler, which can read the ModR/M byte and parameters through the same cycles_Prefetch path. It then executes the instruction: reading from memory when needed for a memory source (which adds 4/8 cycles to cycles_MMUR for every byte/word read on the 8088), performing the operation and writing the result back to memory (counted in cycles_MMUW). After that it calculates the instruction's cycles (from the tables in the documentation), adds the EA cycles when an effective address is used, and saves the result in cycles_OP. This assumes the timing in the documentation excludes memory/IO cycles.
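To make the bookkeeping above concrete, here is a minimal sketch in C of how the separate counters could be charged. The struct and the charge_* helper names are made up for illustration and are not the emulator's actual code; only the counter names and the 4 cycles per byte (so 8 per word on the 8088) come from the description above.

```c
#include <stdint.h>

/* Hypothetical container for the per-instruction counters described above. */
typedef struct {
    uint32_t cycles_Prefetch; /* instruction bytes fetched by the BIU */
    uint32_t cycles_MMUR;     /* data reads during execution */
    uint32_t cycles_MMUW;     /* data writes during execution */
    uint32_t cycles_OP;       /* table cycles plus EA cycles */
} CycleCounters;

/* The 8088 has an 8-bit bus, so every byte costs one 4-cycle bus access. */
#define CYCLES_PER_BYTE 4u

/* Opcode, ModR/M and parameter bytes are charged to the prefetch counter. */
void charge_prefetch(CycleCounters *c, uint32_t bytes)
{
    c->cycles_Prefetch += bytes * CYCLES_PER_BYTE;
}

/* A byte read costs 4 cycles, a word read 8 (two bus accesses). */
void charge_read(CycleCounters *c, uint32_t bytes)
{
    c->cycles_MMUR += bytes * CYCLES_PER_BYTE;
}

/* Writes are charged the same way, but to their own counter. */
void charge_write(CycleCounters *c, uint32_t bytes)
{
    c->cycles_MMUW += bytes * CYCLES_PER_BYTE;
}
```

Keeping the counters separate like this is what allows the memory cycles to be subtracted again later when working out the execution-only time.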
Then, during cycle calculation after the execution returns, the CPU core calculates the cycles spent using the formula given earlier, saving it in the cycles variable, which the emulator core converts to realtime nanoseconds.
This value, after subtracting cycles_MMUR, cycles_Prefetch and cycles_MMUW, gives the number of cycles spent on execution only. Those cycles are divided by four to obtain the number of bytes that could have been prefetched in that time (assuming the prefetch queue has free space). That many bytes of instruction data are then prefetched from memory.
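As a rough sketch of that calculation in C (the function name and the queue_free parameter are hypothetical; on the 8088 the queue holds 4 bytes, so queue_free would be at most 4):

```c
#include <assert.h>
#include <stdint.h>

#define CYCLES_PER_BYTE 4u /* one 4-cycle bus access per prefetched byte */

/* Returns how many bytes can be prefetched in the time the instruction
   spent purely on execution, limited by the free space in the queue. */
uint32_t prefetch_bytes(uint32_t cycles_total,
                        uint32_t cycles_MMUR,
                        uint32_t cycles_MMUW,
                        uint32_t cycles_Prefetch,
                        uint32_t queue_free)
{
    uint32_t bus = cycles_MMUR + cycles_MMUW + cycles_Prefetch;
    assert(cycles_total >= bus); /* the total must include the bus cycles */

    uint32_t exec_only = cycles_total - bus;          /* execution-only cycles */
    uint32_t bytes = exec_only / CYCLES_PER_BYTE;     /* whole bus accesses that fit */
    return bytes < queue_free ? bytes : queue_free;   /* cannot overfill the queue */
}
```

For example, with 25 total cycles of which 4 are reads, 4 are writes and 8 are prefetch, 9 cycles remain for execution, which is enough for 2 prefetched bytes; the leftover cycle is simply dropped in this sketch.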
The DMA and other hardware simply add the increase in realtime to their own realtime ns counters and update their timing by spending that time on cycles at their own speed (e.g. 1.19MHz PIT ticks, 4.77MHz/4 DMA ticks, 14MHz pixel clocks, 49kHz AdLib audio etc.). With the new information in mind, though, DMA will have to be moved after the PIT update to get that timing correct.
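The per-device update could look roughly like this in C. This is only a sketch of the idea: DeviceClock and device_advance are invented names, and a real device would do its work per tick instead of just counting them.

```c
#include <stdint.h>

/* Hypothetical per-device clock: each device keeps its own ns counter. */
typedef struct {
    double ns_per_tick; /* length of one device tick in nanoseconds */
    double ns_pending;  /* time received but not yet spent on ticks */
} DeviceClock;

/* Adds the elapsed realtime to the device and spends it on whole ticks.
   Returns the number of ticks the device should run; any remainder
   stays in ns_pending for the next update. */
uint32_t device_advance(DeviceClock *dev, double elapsed_ns)
{
    uint32_t ticks = 0;
    dev->ns_pending += elapsed_ns;
    while (dev->ns_pending >= dev->ns_per_tick) {
        dev->ns_pending -= dev->ns_per_tick;
        ++ticks;
    }
    return ticks;
}
```

Because the remainder is carried over, no time is lost between updates, and the order in which the devices are advanced (e.g. PIT before DMA) is decided by the caller.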
Btw, do I actually need to modify the entire CPU execution to perform separate cycles and states (T1, T2 etc.) individually instead of functionally? The functional order is: fetch instruction prefixes and instruction opcode, (opcode function jumptable:) fetch ModR/M, fetch parameters, execute instruction (including reads/writes), return, (returned to the CPU jumptable caller:) calculate cycles from the separately saved cycles (cycles_OP etc.), execute prefetch for the remaining cycles after subtraction.
That is currently executed for every instruction, in that order, using a general core in a (fetch, decode & execute (one step in the opcode jumptable function), update prefetch queue) way.