VOGONS


UniPCemu cycle accurate 8088 implementation

Reply 20 of 198, by reenigne

Rank Oldbie
reenigne wrote:

I'll see if I can get a bus sniffer dump of this code for you tomorrow so you can see what happens when.

Here. For example, take a look at the POP instruction on line 312. The initial byte (8F) is fetched from RAM into the prefetch queue at line 305, then passed from the prefetch queue to the EU at line 308. The second byte (07) happens 4 cycles later in both respects. The instruction doesn't actually start having an effect on the bus until line 320, so there's time to fetch the next instruction (bytes B1 and 01) into the queue on lines 310-317.

Reply 21 of 198, by superfury

Rank l33t++

So it takes 8 cycles just to decode the instruction, as far as I can see? Is that correct?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 22 of 198, by reenigne

superfury wrote:

So it takes 8 cycles just to decode the instruction, as far as I can see? Is that correct?

Well, it's 8 from taking the last byte from the prefetch queue to the start of the stack bus access, or 12 from taking the first byte from the prefetch queue to the start of the stack bus access.
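
To make the two counts concrete, here is a tiny sketch of the arithmetic. It assumes each sniffer log line corresponds to exactly one CPU cycle; the line numbers are the ones quoted earlier in the thread, and the constant names are made up for illustration.

```python
# Log line numbers quoted in the thread for the POP (8F 07) example.
# Assumption: one sniffer log line == one CPU cycle.
FIRST_BYTE_TO_EU = 308    # opcode byte 8F leaves the prefetch queue
SECOND_BYTE_TO_EU = 312   # ModR/M byte 07 leaves the queue 4 cycles later
STACK_ACCESS_START = 320  # first line where the instruction affects the bus


def cycles_between(start_line: int, end_line: int) -> int:
    """Cycles elapsed between two log lines (one line per cycle)."""
    return end_line - start_line


from_last_byte = cycles_between(SECOND_BYTE_TO_EU, STACK_ACCESS_START)
from_first_byte = cycles_between(FIRST_BYTE_TO_EU, STACK_ACCESS_START)
print(from_last_byte, from_first_byte)  # prints: 8 12
```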

I'm not sure "time taken to decode the instruction" is something that is well-defined in general, though - the 8088 doesn't necessarily separate EU operations into a decode step and an execute step. Instead, decoding is a process that happens throughout the execution of the instruction. You can see one of the effects of this in the same dump - in lines 328-333 the instruction isn't using the bus (so one might say it's "decoding some more"). Four of those six cycles are used for another prefetch, two are idle.

That raises the question of why the prefetch from address 00F5A didn't start on line 318, since the bus was idle there and the prefetch queue never got a chance to fill up. I think the answer is that the EU knew (from the decode-so-far of opcode 8F) that it was going to be doing a bus operation, so seized control of the bus in order to prevent a fetch from starting (which would have delayed the execution). So really there's an observable consequence of decoding on line 318, when a new fetch does not start.
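
One way an emulator could express that guess is a "may I start a prefetch?" check like the sketch below. This is only an illustration of the hypothesis above, not known 8088 behaviour: the `eu_reserved_bus` flag and all names are invented, and the only hard fact used is that the 8088 prefetch queue holds 4 bytes.

```python
from dataclasses import dataclass

QUEUE_SIZE = 4  # the 8088 prefetch queue holds 4 bytes


@dataclass
class BusState:
    bus_idle: bool         # no bus cycle currently in progress
    queue_len: int         # bytes currently in the prefetch queue
    eu_reserved_bus: bool  # hypothetical: EU has announced an upcoming access


def may_start_prefetch(state: BusState) -> bool:
    """Return True if the BIU may begin a new prefetch on this clock."""
    if not state.bus_idle:
        return False
    if state.queue_len >= QUEUE_SIZE:
        return False
    # Guess from the thread: once the EU knows it will need the bus,
    # it blocks new fetches so that its own access is not delayed.
    return not state.eu_reserved_bus


# Bus idle and queue not full, but the EU has claimed the bus: no fetch starts.
print(may_start_prefetch(BusState(bus_idle=True, queue_len=2, eu_reserved_bus=True)))
```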

Reply 23 of 198, by superfury

After adding an 8-cycle delay to the 8F instruction before executing it, it now continues to the end without crashing. I've modified the instruction to first wait 8 cycles, then start executing as documented (17+EA cycles for memory or 8 cycles for register, not including the memory cycles themselves; 1 BIU fetch time is skipped for each byte accessed in this case).

8088 MPH now reports 1401 or 1400 cycles before starting.

Reply 24 of 198, by superfury

Do you or vladstamate have some kind of list for these things(instruction/prefetch/biu observations, how many cycles are used in general, things that probably happen during (un)common instructions etc. Like the cycles list in the 8088/8086 manuals, but for the execution unit only)? The BIU seems pretty straightforward: simply start an access each T4 state? So it's essentially a 2-bit counter? T1-T3 don't need to be emulated(can be emulated at T4 only, although T1 can lock the BUS for DMA(T4 unlocking it)), T4 simply loads/stores to/from virtual memory/IO(BUS) to/from either the EU or PIQ. The EU simply posts requests to the BIU(for I/O or PIQ input, idling in the meantime) or does some work in some cycles((partial)instruction)? What causes those idle cycles? Do you guys have some list of observations so far that I can implement?

What about more modern CPUs? How can you properly emulate stuff like protected mode/exceptions/descriptor loads with the BIU without making stuff too complicated(while keeping the responses(checking) accurate and readable)? Think about faults raising faults vs BIU etc.

Although my system still uses the general 286 documented timings for these(like the 8086+ 'cycle' emulation in current UniPCemu builds, applying timing after executing an instruction only).

Reply 25 of 198, by reenigne

superfury wrote:

Do you or vladstamate have some kind of list for these things(instruction/prefetch/biu observations,

No, not other than what I've written here on Vogons.

superfury wrote:

how many cycles are used in general, things that probably happen during (un)common instructions etc. Like the cycles list in the 8088/8086 manuals, but for the execution unit only)?

If I had a list of all that, writing a cycle-exact emulator would be relatively easy!

superfury wrote:

The BIU seems pretty straightforward: simply start an access each T4 state? So it's essentially a 2-bit counter?

It's more complicated than that, because the BIU can also be idle, or in the Tw state (wait state between T3 and T4). And there's all the states to do with DMA accesses. It is at least relatively transparent, though - there aren't any secrets waiting to be discovered.
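
As a rough picture of why a 2-bit counter isn't enough, here is a minimal sketch of the CPU-side bus states with idle and Tw included. It is a simplification under stated assumptions: `ready` stands in for the READY line, the DMA-related states are left out, and the function names are hypothetical.

```python
from enum import Enum
from typing import List


class BusPhase(Enum):
    IDLE = "idle"
    T1 = "T1"
    T2 = "T2"
    T3 = "T3"
    TW = "Tw"
    T4 = "T4"


def next_phase(phase: BusPhase, request_pending: bool, ready: bool) -> BusPhase:
    """Advance the bus cycle by one clock.

    request_pending: a fetch, read or write is waiting to start.
    ready: simplified stand-in for the READY line, only looked at in T3/Tw.
    """
    if phase in (BusPhase.IDLE, BusPhase.T4):
        return BusPhase.T1 if request_pending else BusPhase.IDLE
    if phase is BusPhase.T1:
        return BusPhase.T2
    if phase is BusPhase.T2:
        return BusPhase.T3
    # T3 or Tw: move on to T4 only once the bus reports ready.
    return BusPhase.T4 if ready else BusPhase.TW


def trace_access(wait_states: int) -> List[str]:
    """Trace a single bus access that sees the given number of wait states."""
    phase = next_phase(BusPhase.IDLE, request_pending=True, ready=True)
    trace = [phase.value]
    remaining = wait_states
    while phase is not BusPhase.T4:
        phase = next_phase(phase, request_pending=False, ready=(remaining == 0))
        if phase is BusPhase.TW:
            remaining -= 1
        trace.append(phase.value)
    return trace


print(trace_access(0))
print(trace_access(2))
```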

superfury wrote:

The EU simply posts requests to the BIU(for I/O or PIQ input, idling in the meantime) or does some work in some cycles((partial)instruction)?

It's definitely doing work in between the BIU requests!

superfury wrote:

What causes those idle cycles?

The 8088 is a microcoded processor for many of its instructions (at least the more complicated ones). So the EU is running its own little program (one step per cycle) and we don't have a dump of that program.

superfury wrote:

What about more modern CPUs? How can you properly emulate stuff like protected mode/exceptions/descriptor loads with the BIU without making stuff too complicated(while keeping the responses(checking) accurate and readable)? Think about faults raising faults vs BIU etc.

I wouldn't bother trying to make any system faster than 4.77MHz cycle-exact. With a 4.77MHz machine (where the entire system is driven from a single crystal) it is at least in principle possible to have a program that is timed entirely by counting cycles and which takes the same amount of time every time it is run. The PIT always runs off a 14.318MHz crystal but if your CPU runs at a speed other than 4.77MHz then there will be a separate crystal for the CPU, and the time signals can drift with respect to each other. So even though it might technically be possible to create a cycle-exact emulator for a 286 or later (with the help of a hacked-up system running from a single crystal, and a really good logic analyser) no software could ever be written to take advantage of that so there would be no way to prove that you got it right (to anyone who didn't have a similarly hacked up system and logic analyser).
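
The single-crystal point can be checked with a little arithmetic. This sketch assumes the usual PC/XT dividers (crystal/3 for the CPU, crystal/12 for the PIT) and the nominal 14.31818MHz crystal value; neither divider is stated in this thread, so treat them as assumptions.

```python
from fractions import Fraction

# Nominal crystal frequency; the thread rounds it to 14.318MHz.
CRYSTAL_HZ = Fraction(14_318_180)

# Assumed PC/XT dividers (not stated in the thread).
CPU_HZ = CRYSTAL_HZ / 3    # about 4.77MHz CPU clock
PIT_HZ = CRYSTAL_HZ / 12   # about 1.19MHz PIT input clock

# Because both clocks come from the same crystal, this ratio is exact and
# cannot drift: every PIT tick is a whole number of CPU cycles.
CPU_CYCLES_PER_PIT_TICK = CPU_HZ / PIT_HZ

print(float(CPU_HZ), float(PIT_HZ), CPU_CYCLES_PER_PIT_TICK)
```

With a separate CPU crystal the ratio would no longer be an exact constant, which is the drift problem described above.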

Reply 26 of 198, by superfury

I've just looked at vladstamate's EU code. It seems he just executes the EA cycles first(are these those 8-cycle delays?), then read, then instruction execution, finishing with write. Although some more complicated versions seem to exist(the INT instructions and related)? In my case, EA is only applied when actually reading/writing memory, adding delay cycles in cycles_OP. This is applied by the BIU directly after the read cycles, but maybe they need to be separated and put at the start, before the execution phase? Are they also clocked when unused? Like when only one of the two ModR/M operands is used(no memory)?

Last edited by superfury on 2017-04-03, 21:39. Edited 2 times in total.

Reply 27 of 198, by reenigne

superfury wrote:

I've just looked at vladstamate's EU code. It seems he just executes the EA cycles first(are these those 8-cycle delays?),

Could be! http://stanislavs.org/helppc/instruction_timing.html says the EA penalty for [bx] is 5 cycles.

superfury wrote:

This is applied by the BIU directly after the read cycles, but maybe they need to be separated and put at the start, before the execution phase?

Yes, I think the EA penalty will only apply once even if the memory is accessed twice (e.g. add rm,rw) - the address will still be in the EU's address slot for the second access.

superfury wrote:

Are they also clocked when unused? Like when only one of the two ModR/M operands is used(no memory)?

For example?

superfury wrote:

Edit: Wait a sec. Isn't the R/M part always used? So the EA cycles always need to be applied, except for when mod==3? That isn't instruction-specific? The remaining read/write to/from memory or register is just normal BIU access doing its job?

This sounds right to me.

superfury wrote:

Edit: Looking at vladstamate's BIU, it's indeed generating an EA delay first of 5(BX)+1(CS override) cycles. So that accounts for 6 out of 8 cycles. What about the remaining two? Are these simply because the BIU starts at T4 only?

I don't think we know enough to be able to account for every cycle.

superfury wrote:

Edit: Are those Tw states caused by waitstate memory and CGA memory? Are they those blank parts in your logs? Do they always start after T3?

The blank lines (no T* at all) are bus-idle states. The Tw state always occurs between T3 and T4. It's not just CGA memory, the IO port wait state and other peripheral-induced wait states - the CPU will also see a Tw when the bus is busy servicing a DMA (the same "READY" signal is used to give the DMA controller ownership of the bus for the duration of its access).

Reply 28 of 198, by superfury

My code calculates 5 EA cycles for BX and 2 EA cycles for the CS override. Add 1 cycle for reading the prefetch and you end up with the correct 8-cycle delay? I just need to adjust the 8086 core to apply the EA cycles separately, like the current prefetch cycles, and apply it to the BIU and debugger. That should automatically fix that bug.

Thinking about the EA cycles again, it's probably used in every instruction that uses ModR/M parameters. Then one strange thing is present: Almost all documentation I see on the 808X+ is having either X cycles for register or X+EA cycles for memory. So this doesn't talk about register or R/M used, but rather the top 2 bits of the ModR/M byte being 11b(Reg cases) vs 00b-10b(Mem cases)?

Edit: After the EA cycle change, it runs at 1261 cycles(EU execution and fetch starting at any T cycle, not just T3(+1) during prefetching).

Edit: 8088 MPH runs like crazy, with most CPU-speed-sensitive parts being way too fast (DeLorean: even/odd lines disappearing when it touches the top of the screen; 3D objects turning super fast; credits music in high-speed fast-forward; Kefrens Bars going wrong as usual). The remaining parts run without problems.

Edit: Looking at your sniffer log, I see 'I' at T1 and 'S' at T3 most of the time(although they're all over sometimes?). Also, the BIU fetches into the PIQ or normal memory/IO on T4 only. Any idea how this works?

Reply 29 of 198, by superfury

Hmmm. 1261/0.75=1681.333333~. Is that a clue as to what's going wrong here? So only 3 out of 4 cycles have passed in the time that it's required to take. So it's running exactly/almost 25% too fast, according to the metric cycle count? Is there something/some things in my emulation that is supposed to run 1 out of 4 cycles slower on average? What does the cycle-counting code contain? What kind of instructions does it execute?

Edit: Could the cause be the BIU not being used for I/O(except for prefetching)? Would changing the modr/m and other memory data really account for 25% of the time that's missing?

Reply 30 of 198, by superfury

I've just finished implementing the BIU side of accessing memory and BUS (I/O) cycles. Now the Execution Unit (the opcode handler itself) can request bytes/words/dwords to be read/written to/from memory or I/O ports (BUS). The opcode handler uses the CPU counter (for the instruction itself; interrupts are a little more complex in this respect) to keep track of the current instruction step (substeps within the EU). For each memory or I/O access, it keeps track of the current BIU action with a simple counter that is incremented once a certain step is done and checked when the opcode handler is called, in the same way as was done for fetching instructions.

The memory or I/O accesses are split into stages (each stage is repeated, taking 1 EU cycle each time, until the operation succeeds; execution continues to the next stage when the BIU function returns 1):
Stage X: Repeat calling a BIU_request_* function to request a memory/IO access.
Stage X+1: Repeat calling a BIU_readResult* function to poll for the finished state (this is either a success code (always 1 currently) for writes, or the data read from memory/IO for read requests).
Stage X+2: Process the next stage in the function.

Of course, stages can also simply be idle time(EU execution time), delaying the EU.
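
The staged request/poll pattern described above can be sketched roughly as follows. This is not UniPCemu's code: `ToyBIU`, `ReadByteOp` and `run` are hypothetical names, the BIU handles a single outstanding request, and the fixed 4-cycle latency is just a placeholder for a real bus cycle.

```python
from typing import Dict, Optional


class ToyBIU:
    """Hypothetical BIU: one outstanding read, finished after a fixed latency."""

    def __init__(self, memory: Dict[int, int], latency: int = 4) -> None:
        self.memory = memory
        self.latency = latency
        self.pending_addr: Optional[int] = None
        self.countdown = 0
        self.result: Optional[int] = None

    def request_read(self, addr: int) -> bool:
        """Stage X: returns True once the request has been accepted."""
        if self.pending_addr is not None:
            return False
        self.pending_addr, self.countdown, self.result = addr, self.latency, None
        return True

    def read_result(self) -> Optional[int]:
        """Stage X+1: returns the byte once available, else None."""
        data, self.result = self.result, None
        return data

    def tick(self) -> None:
        """Advance the BIU by one clock."""
        if self.pending_addr is not None:
            self.countdown -= 1
            if self.countdown == 0:
                self.result = self.memory.get(self.pending_addr, 0xFF)
                self.pending_addr = None


class ReadByteOp:
    """Opcode handler that keeps its progress in a stage counter."""

    def __init__(self, addr: int) -> None:
        self.addr = addr
        self.stage = 0
        self.value: Optional[int] = None

    def step(self, biu: ToyBIU) -> bool:
        """Run one EU cycle; returns True when the instruction is done."""
        if self.stage == 0:
            if biu.request_read(self.addr):
                self.stage = 1
        elif self.stage == 1:
            data = biu.read_result()
            if data is not None:
                self.value = data
                self.stage = 2
        return self.stage == 2


def run(op: ReadByteOp, biu: ToyBIU, max_cycles: int = 100) -> int:
    """Clock EU and BIU together; returns the number of cycles used."""
    for cycle in range(1, max_cycles + 1):
        done = op.step(biu)
        biu.tick()
        if done:
            return cycle
    raise RuntimeError("instruction did not finish")


op = ReadByteOp(0x100)
print(run(op, ToyBIU({0x100: 0xAB})), hex(op.value))
```

Idle EU time would simply be extra stages that count down without touching the BIU.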

Reply 31 of 198, by reenigne

superfury wrote:

My code calculates 5 EA cycles for BX and 2 EA cycles for the CS override. Add 1 cycle for reading the prefetch and you end up with the correct 8-cycle delay?

Nice idea, but it doesn't seem to be the case. I just tried the same code without the override and it has the same delay. I think those 2 extra EA cycles are accounted for in the execution of the "CS:" prefix rather than in the instruction itself.

superfury wrote:

Thinking about the EA cycles again, it's probably used in every instruction that uses ModR/M parameters. Then one strange thing is present: Almost all documentation I see on the 808X+ is having either X cycles for register or X+EA cycles for memory. So this doesn't talk about register or R/M used, but rather the top 2 bits of the ModR/M byte being 11b(Reg cases) vs 00b-10b(Mem cases)?

Yes... what's the question?

superfury wrote:

Edit: Looking at your sniffer log, I see 'I' at T1 and 'S' at T3 most of the time(although they're all over sometimes?).

I think that's an emergent behaviour rather than a deliberate one.

superfury wrote:

Also, the BIU fetches into the PIQ or normal memory/IO on T4 only. Any idea how this works?

Well, the T4 step of the bus cycle is the first point at which both the address and the data are available, so it's natural for my sniffer program to print the human-readable description of the bus access on that line. Left of column 38 in the sniffer logs is the actual data from the machine, right of column 38 is my program's easier-to-understand decoding of it.

Reply 32 of 198, by superfury

Do you have any legend of those raw data left of column 38? How will I know when those I/S fetches are happening? When will they occur?

Also, how will I know the actual amount of EU cycles some instruction stage(execution) takes, looking at those sniffer logs?

Edit:

20F97 .C...  0020B B3 10 FC 5 .Wr..D.  Tw S4 B3 <-d [   0020B]   

What is it doing at this point? That 'D' seems to imply that an S* state is running. What does this mean? What is it doing?

Reply 33 of 198, by reenigne

superfury wrote:

Do you have any legend of those raw data left of column 38?

I had a legend but it's out of date since I removed some of the columns. Your best bet is probably to look at the code that generates the logs from the raw data at https://github.com/reenigne/reenigne/blob/mas … ffer_decode.cpp.

superfury wrote:

How will I know when those I/S fetches are happening? When will they occur?

I'm not sure I understand the question. They occur when they occur, the sniffer logs show when they occur. Figuring out how to make an emulator that emulates those fetches happening at the right time is the tricky bit - you'll need to look at a lot of sniffer logs and figure out when each byte of each instruction is fetched and how that depends on the state of the bus and prefetch queue.

superfury wrote:

Also, how will I know the actual amount of EU cycles some instruction stage(execution) takes, looking at those sniffer logs?

That is not something that can be directly observed by looking at the CPU from the outside. But if you need a convention for "when does an instruction start?" then the "I" fetch is probably the best one to use.

superfury wrote:

Edit:

20F97 .C...  0020B B3 10 FC 5 .Wr..D.  Tw S4 B3 <-d [   0020B]   

What is it doing at this point? That 'D' seems to imply that an S* state is running. What does this mean? What is it doing?

The "D" means that the AEN line on the ISA bus is high. "This line is used to de-gate the processor and other devices from the I/O channel to allow DMA transfers to take place. When this line is active (high), the DMA controller has control of the address bus, data bus, read command lines (memory and I/O), and the write command lines (memory and I/O)."

The S* states are the equivalent of the T* states but for bus accesses that are initiated from the DMA controller rather than from the CPU. The "<-d" arrow indicates that the bus access is a DMA one. The DMA controller fetched byte B3 from physical address 0020B, causing addresses xxx0B to be refreshed. The CPU is seeing "Tw" on the bus, so as far as it is concerned there is a wait state for the byte it is trying to fetch.

Reply 34 of 198, by superfury

I've just improved the DMA handling, which shares the BUS with the CPU. When one is working, the other can't perform any BUS transfer cycles (on T4 for the CPU). This should make the DMA slow down the CPU accordingly. Also, the PIT0 input signal to the DMA0 channel is now acknowledged, turning the output to a temporary 0 state until retriggered by going high. This prevents the high PIT output signal (1 for a long time) from keeping the DAC busy indefinitely, which prevented the CPU from fetching any new instructions due to the BUS being constantly busy. The DeLorean still disappears.

Reply 35 of 198, by vladstamate

Rank Oldbie

In my emulator the EU is not always busy doing EA work first. EA starts work whenever I detect that I need to calculate an address for a given operand. That can happen at different points in the timeline of an instruction. This also changes the dynamics of when the EU takes bytes out of prefetch.

For example say you have an instruction like this:

MOV [DI+offset], imm

First I fetch the MOV opcode and the rm, decode those and I realize that I need to let EA do some work. But EA now sees that an offset is needed so it instructs the BIU to provide a byte (or 2 depending on the offset size), it waits for that patiently. Once the BIU is able to provide the offset (which might already be in the prefetch queue, but not always) then EA spends some cycles calculating DI+offset then, and only then I realize that I also need "imm" so another request is sent to the BIU to either give me imm from prefetch or go read it.

Now because EA cycles could be more than 4 imm might already be in the prefetch queue (and most likely it is). But it is not a given. Sometimes the EA takes enough cycles that even the next instruction's opcode is prefetched. This emergent behavior is very interesting to watch.

After "imm" is read the EU spends some cycles figuring out it needs to write data out so now it talks yet again with BIU to write something out.

For me, the BIU execution is implemented around a FIFO of requests, either from the EU saying give me this, output that, etc., or the general: hey, there is nothing to do, let's prefetch a byte (or in case of the 8086/286, 2 bytes).
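
A rough sketch of that FIFO idea, for readers following along. It is not the actual emulator code: `FifoBIU` and its methods are hypothetical, each call models one whole bus access rather than individual T states, and the memory passed in is assumed to be at least as large as the prefetch queue.

```python
from collections import deque
from typing import Deque, List, Tuple


class FifoBIU:
    """Hypothetical BIU: serve EU requests first, otherwise prefetch."""

    def __init__(self, memory: bytes, queue_size: int = 4) -> None:
        self.memory = memory
        self.requests: Deque[Tuple[str, int]] = deque()  # EU requests, oldest first
        self.prefetch: Deque[int] = deque()              # prefetched instruction bytes
        self.fetch_addr = 0
        self.queue_size = queue_size
        self.log: List[str] = []

    def eu_request(self, kind: str, addr: int) -> None:
        """Called by the EU: 'give me this' or 'output that'."""
        self.requests.append((kind, addr))

    def run_bus_cycle(self) -> str:
        """Perform one whole bus access and describe what was done."""
        if self.requests:
            kind, addr = self.requests.popleft()
            action = f"{kind} {addr:05X}"
        elif len(self.prefetch) < self.queue_size:
            # Nothing to do for the EU, so prefetch one byte.
            self.prefetch.append(self.memory[self.fetch_addr])
            action = f"prefetch {self.fetch_addr:05X}"
            self.fetch_addr += 1
        else:
            action = "idle"  # queue full and no EU request
        self.log.append(action)
        return action


biu = FifoBIU(bytes(range(16)))
biu.eu_request("read", 0x200)
for _ in range(6):
    print(biu.run_bus_cycle())
```

An 8086-style variant would append two bytes per prefetch access instead of one.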

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 36 of 198, by superfury

Vlad, to what extent are the actual cycles (as documented in the original 8088/8086 user manual) consistent with the EU? Are the cycles provided actually the amount of cycles the EU spends during the execution phase? What about the 80286's interrupts and exception handling (and nested exceptions)? How does this work? The 80286 is an entirely different beast in that respect (as is descriptor loading etc., which isn't documented at all).

Reply 37 of 198, by vladstamate

Well what I did (and this is really the part where my emulator is not as exact as it could be) is look at the Intel timings in the datasheet for 8088 and extrapolated from those how much is really EU. The Intel timings are all implying data is available in the prefetch buffer. So after I subtract the EA timing and the memory in/out timing (4 cycles per byte) then I find the EU cycles for each instruction. And that is what I count as "execution only" timing, as in cycles when the EU is busy executing (excluding EA). For most instructions that is accurate but not for all.
We would really need to run reenigne's sniffer and extrapolate the EU timings from there.

As for 286 I did the same thing (but use the 286 datasheet timing). That datasheet does give you protected vs real mode cycles which presumably deal with cases where the CPU has to deal with descriptors for example.

Since I physically have almost all early IBM machines in my collection it would be good to build my own bus sniffer (from reenigne's sheets) and then we can extract timing data for a wide variety of machines, not just 5150/5160s. I'll open a separate thread to talk to you and reenigne about this.

Reply 38 of 198, by superfury

Do the PIQ fetches(1 cycle/instruction byte) also need to be subtracted? Or should just the EA timings and memory timings be subtracted? Also, the references say X+EA cycles, so only 4 cycles/memory access need to be subtracted from X?

Reply 39 of 198, by vladstamate

superfury wrote:

Do the PIQ fetches(1 cycle/instruction byte) also need to be subtracted?

Yes. And I do that. I have the concept of "wasted cycles" I think I call them and those I subtract from final EU time.

superfury wrote:

Or should just the EA timings and memory timings be subtracted? Also, the references say X+EA cycles, so only 4 cycles/memory access need to be subtracted from X?

Yes. You need to subtract the 4 cycles/memory access from X.
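
Put as a formula, the estimate discussed here looks something like the sketch below. It is only an approximation of the approach described in this thread: `eu_only_cycles` and the `wasted` parameter are hypothetical names, and as noted earlier the result is not accurate for every instruction.

```python
BUS_CYCLES_PER_BYTE = 4  # one 8088 bus access moves one byte in 4 clocks


def eu_only_cycles(documented: int, memory_bytes: int, wasted: int = 0) -> int:
    """Estimate EU-only cycles from a datasheet timing.

    documented: the 'X' part of an 'X+EA' datasheet entry (EA already excluded).
    memory_bytes: number of operand bytes the instruction reads or writes.
    wasted: hypothetical allowance for queue-fetch cycles subtracted separately.
    """
    eu = documented - BUS_CYCLES_PER_BYTE * memory_bytes - wasted
    if eu < 0:
        raise ValueError("documented timing is smaller than the bus time")
    return eu


# Example: an entry listed as 8+EA that reads one byte from memory.
print(eu_only_cycles(8, memory_bytes=1))  # prints: 4
```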
