VOGONS


First post, by vladstamate

Rank: Oldbie

More specifically can the decoder decode a new opcode (or the EU unit execute it or the EA do some address calculation) while the BIU part of the processor is writing out memory on the bus (or waiting for the free bus cycle to write out data)?

Or will the decoder not attempt to read from prefetch buffer until all the writes are out of the CPU?

What about the 286 or 386?

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 1 of 7, by reenigne

Rank: Oldbie

vladstamate wrote:

More specifically can the decoder decode a new opcode (or the EU unit execute it or the EA do some address calculation) while the BIU part of the processor is writing out memory on the bus (or waiting for the free bus cycle to write out data)?

Or will the decoder not attempt to read from prefetch buffer until all the writes are out of the CPU?

I'm not 100% sure (having neither reverse-engineered these CPUs to gate level nor analysed every possible code sequence), but I'm pretty sure that on 8088 and 8086, code prefetch operations are the only BIU operations which happen asynchronously with the EU. Bus operations (reads or writes, memory or port space) that happen as the result of execution of an instruction happen during the execution of that instruction, and the execution of an instruction can't complete until all its bus operations are complete. In particular, this means that (because prefetches are always reads) the CPU never does anything else during a write. It can't even be doing EA calculations since there aren't any instructions that write to an address and then do an EA read/write to another address.

vladstamate wrote:

What about the 286 or 386?

No idea about those.

Reply 2 of 7, by vladstamate

Rank: Oldbie

Thank you reenigne.

Reply 3 of 7, by superfury

Rank: l33t++

Reenigne, your last post reminds me of something: I currently have the IN and OUT instructions using 10 cycles (opcodes E4/E5/E6/E7) and 8 cycles (opcodes EC/ED/EE/EF). Does that include the cycles spent on the bus (the I/O operation itself)? Or do I still need to add 4 or 8 memory (actually I/O) cycles, depending on alignment etc.?

Edit: Looking at the Intel 8086 Family User's Manual (October 1979), it seems I'd forgotten those timings. I've just implemented them in my emulator's IN/OUT instructions and am now testing them against 8088 MPH...

Edit: It looks like 8088 MPH performs a little better now. The Kefrens bars are still not 100% correct (the vertical timing, which depends on the CPU being cycle accurate: the background movement that isn't determined by horizontal retrace timings), but it's closer (the vertical timing seems closer, showing horizontal lines of a single color; only separate, entire scanlines had errors).

Filename: 8088MPH_RasterBars_screencaptures.zip
File size: 499.72 KiB
Downloads: 32
File comment: Screenshots taken using the UniPCemu screen captures (linear in time, by clicking the Cap (Screen capture) button).
File license: Fair use/fair dealing exception

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 4 of 7, by reenigne

Rank: Oldbie

superfury wrote:

Reenigne, your last post reminds me of something: I currently have the IN and OUT instructions using 10 cycles (opcodes E4/E5/E6/E7) and 8 cycles (opcodes EC/ED/EE/EF). Does that include the cycles spent on the bus (the I/O operation itself)? Or do I still need to add 4 or 8 memory (actually I/O) cycles, depending on alignment etc.?

It's included in the EU timings, but you need to make sure the BIU timings account for it as well. Also note that when accessing a port (as opposed to memory) there is always at least one cycle of wait state per access (added by the motherboard no matter what device is accessed). So the port read/write takes 5 cycles (10 for a 16-bit access on an 8-bit bus) instead of 4 (8).
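
The bus-side arithmetic from this reply can be sketched as follows. This is only an illustration: it assumes a 4-clock bus cycle without wait states and exactly one wait state per port access, as described above, and it only covers an 8-bit bus such as the 8088's. All names are hypothetical and not taken from any emulator.

```c
#include <assert.h>

/* Assumed constants, taken from the figures quoted in the reply above. */
enum {
    BASE_BUS_CYCLE_CLOCKS = 4, /* one bus cycle without wait states */
    IO_WAIT_STATES = 1         /* minimum wait states added to a port access */
};

/* Clocks for a single bus cycle; is_io is non-zero for port space,
   zero for memory space. */
static unsigned bus_cycle_clocks(int is_io)
{
    return BASE_BUS_CYCLE_CLOCKS + (is_io ? IO_WAIT_STATES : 0);
}

/* Clocks spent on the bus for an access of size_bytes (1 or 2) on an
   8-bit bus, where every byte needs its own bus cycle. */
static unsigned bus_clocks_8bit_bus(unsigned size_bytes, int is_io)
{
    assert(size_bytes == 1 || size_bytes == 2);
    return size_bytes * bus_cycle_clocks(is_io);
}
```

With these assumptions, a byte port access costs 5 clocks and a word port access costs 10, against 4 and 8 for memory.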

Reply 5 of 7, by superfury

Rank: l33t++

This is what I get after adjusting the MMUR cycles to 5 for each access (on top of the 10 or 8 cycles already mentioned, so 10+5 or 10+10 cycles and 8+5 or 8+10 cycles).

Filename: 8088MPH_RasterBars_5cyclesPerMemory.zip
File size: 1.22 MiB
Downloads: 33
File comment: 8088MPH at 5 cycles instead of 4 for each aligned word, unaligned word half and byte I/O access.
File license: Fair use/fair dealing exception

Edit: Reducing the 10 and 8 cycles by 5 (which is already included in the cycle count) improves performance again, resulting in the following raster bars:

Filename: 8088MPH_RasterBars_5cyclesPerMemory_10_8_reduced.zip
File size: 2.42 MiB
Downloads: 30
File comment: 10 cycles reduced to 5 and 8 cycles reduced to 3.
File license: Fair use/fair dealing exception

The value of MMUR is added to the normal cycle count (10 becoming 5 and 8 becoming 3):

//evenodd: bit 0 of the accessed port. highaccess: 0 for the first (low) access, 1 for the second (high) access.
void CPU8086_addWordIOMemoryTiming(byte evenodd, byte highaccess)
{
    if (EMULATED_CPU==CPU_8086) //808(6/8)?
    {
        if (CPU_databussize) //8088 (8-bit data bus)?
        {
            CPU[activeCPU].cycles_MMUR += 5; //Add 5 clocks (4 + 1 wait state) for every byte transferred on the 8088!
        }
        else //8086 (16-bit data bus)?
        {
            if (!(evenodd && highaccess)) //Not the odd high byte of an aligned word (which shares the bus cycle of the even low byte)?
            {
                CPU[activeCPU].cycles_MMUR += 5; //Add 5 clocks (4 + 1 wait state) for this bus cycle on the 8086!
            }
        }
    }
}

Evenodd is bit 0 of the accessed port: a word access to an aligned address gives 0 and then 1 (reversed for unaligned word accesses), and a byte access gives only 0 or 1, depending on the port accessed.
Highaccess is 0 for the low part of a 16-bit port access (always 0 for byte accesses) and 1 for the high part. For example, OUT DX,AX with DX=0 gives Evenodd,Highaccess combinations of 0,0 and 1,1, while an unaligned DX=1 gives 1,0 (low half) and 0,1 (high half).

Thus it will result in 5 cycles being added during aligned word accesses or byte accesses, but 10 cycles during unaligned word accesses. This is of course added to another variable (cycles_OP), which contains 5 (imm8 variant) or 3 (DX variant), to get the total number of cycles spent on the instruction. After subtracting the MMUR cycles, the remaining cycles are spent on prefetching (1 byte every 4 remaining cycles).
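
The split between bus cycles and prefetch time described above can be sketched like this. It is a simplified illustration, not the actual UniPCemu code: the function names are hypothetical, and it assumes the prefetch queue always has room, which a real BIU model would have to check.

```c
#include <assert.h>

/* Hypothetical helper: total clocks charged to an instruction, i.e. the
   base count (cycles_OP) plus the clocks of its own bus accesses
   (cycles_MMUR). */
static unsigned total_instruction_clocks(unsigned cycles_op, unsigned cycles_mmur)
{
    return cycles_op + cycles_mmur;
}

/* Bytes that could be prefetched in the clocks not taken by the
   instruction's own bus accesses, at one byte per 4 clocks.
   Assumption: the prefetch queue is never full. */
static unsigned prefetchable_bytes(unsigned total_clocks, unsigned cycles_mmur)
{
    assert(cycles_mmur <= total_clocks);
    return (total_clocks - cycles_mmur) / 4;
}
```

For an imm8 byte access (5 + 5 = 10 clocks), this leaves room for one prefetched byte. For a DX access with 10 MMUR clocks (3 + 10 = 13 clocks), it leaves room for none.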

Reply 6 of 7, by Scali

Rank: l33t

superfury wrote:

Thus it will result in 5 cycles being added during aligned word accesses or byte accesses, but 10 cycles during unaligned word accesses.

I don't think that's correct for the 8088?
The 8086 actually has the capability to access a word, but only if it's aligned. If it is unaligned, it breaks up the word access into two byte accesses (or actually two word accesses, of which it only takes one byte each).
The 8088 always breaks up word accesses into two byte accesses, and as far as I know, it is not sensitive to alignment at all because of this.
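
The difference described here can be sketched as a count of bus cycles per access. This is a minimal illustration under the assumptions stated in this post (the 8086 transfers an aligned word in one bus cycle, the 8088 always transfers bytes). The type and function names are made up for the example.

```c
#include <assert.h>

typedef enum { BUS_8BIT, BUS_16BIT } bus_width; /* 8088 vs 8086 */

/* Number of bus cycles needed for an access of size_bytes (1 or 2)
   starting at the given address. */
static unsigned bus_cycles_needed(bus_width bus, unsigned address, unsigned size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2);
    if (size_bytes == 1)
        return 1;                  /* a byte always fits in one bus cycle */
    if (bus == BUS_8BIT)
        return 2;                  /* 8088: two byte transfers, any alignment */
    return (address & 1u) ? 2 : 1; /* 8086: one cycle only for an even address */
}
```

In this sketch the address only matters on the 16-bit bus, which is the point made above about the 8088 not being sensitive to alignment.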

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 7 of 7, by superfury

Rank: l33t++

What you've said is already implemented if you look at the code:
- On the 8088, the CPU_databussize check ensures that every call (a byte access, or either half of a word access, aligned or not) consumes 5 cycles.
- Otherwise (on the 8086), it takes 5 cycles for each byte, or for both bytes together in the case of an aligned word.

Possible combinations of evenodd and highaccess:

Evenodd,Highaccess = What
0,0 = Even access of an aligned word access, and even byte accesses (adds 5 cycles)
0,1 = Even access of an unaligned word access (adds 5 cycles)
1,0 = Odd access of an unaligned word access, and odd byte accesses (adds 5 cycles)
1,1 = Odd access of an aligned word access (doesn't add cycles on the 8086, adds 5 cycles on the 8088)

Since on the 8086 the 1,1 case doesn't add cycles, this results in:

Even byte access: 0,0 case only (adds 5 cycles)
Odd byte access: 1,0 case only (adds 5 cycles)
Aligned word access: 0,0 case (adds 5 cycles) on the first byte and 1,1 case (doesn't add cycles) on the second byte
Unaligned word access: 1,0 case (adds 5 cycles) on the first byte and 0,1 case (adds 5 cycles) on the second byte

This gives the following sums being added on the 8086:

Even byte access: 5 cycles
Odd byte access: 5 cycles
Aligned word access: 5 cycles
Unaligned word access: 10 cycles

With the imm8 variant's base of 5 cycles (previously 10), the totals on the 8086 become:

Even byte access: 5+5=10 cycles
Odd byte access: 5+5=10 cycles
Aligned word access: 5+5=10 cycles
Unaligned word access: 5+10=15 cycles

And for the DX variant (3 cycles, previously 8):

Even byte access: 3+5=8 cycles
Odd byte access: 3+5=8 cycles
Aligned word access: 3+5=8 cycles
Unaligned word access: 3+10=13 cycles

For the 8088, every byte transferred consumes 5 cycles, so:

Even byte access: 5 cycles
Odd byte access: 5 cycles
Aligned word access: 5+5=10 cycles
Unaligned word access: 5+5=10 cycles

Thus the imm8 variant results in (on the 8088 only):

Even byte access: 5+5=10 cycles
Odd byte access: 5+5=10 cycles
Aligned word access: 5+10=15 cycles
Unaligned word access: 5+10=15 cycles

And the DX variant results in (on the 8088 only):

Even byte access: 3+5=8 cycles
Odd byte access: 3+5=8 cycles
Aligned word access: 3+10=13 cycles
Unaligned word access: 3+10=13 cycles

This can also be seen by looking at the calls to the actual CPU PORT IN/OUT functionality (in cpu.c):

void CPU_PORT_OUT_B(word port, byte data)
{
    ...
    CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit)!
}

void CPU_PORT_OUT_W(word port, word data)
{
    ...
    CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit when needed)!
    ++port; //Check the high port as well!
    CPU8086_addWordIOMemoryTiming(port&1,1); //High I/O access of I/O only(8-bit when needed)!
}

void CPU_PORT_IN_B(word port, byte *result)
{
    ...
    CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit)!
}

void CPU_PORT_IN_W(word port, word *result)
{
    ...
    CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit when needed)!
    ++port; //Check the high port as well!
    CPU8086_addWordIOMemoryTiming(port&1,1); //High I/O access of I/O only(8-bit when needed)!
}

This creates the correct aligned/unaligned timings, which are added to the base timing of the instruction (3 for DX, 5 for imm8). Since the 8088 always adds 5 cycles for every byte port and for both the low and high bytes of a word port (ignoring the parameters), it always results in 5 cycles added for byte accesses and 10 cycles added for word accesses.
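
The tables in this post can be reproduced with a small standalone sketch. It follows the same evenodd/highaccess rule but keeps no emulator state. The names cpu_model, io_access_clocks, port_io_clocks and in_out_total_clocks are hypothetical and exist only for this example. The base counts (5 for imm8, 3 for DX) are the ones used above.

```c
#include <assert.h>

typedef enum { MODEL_8086, MODEL_8088 } cpu_model;

enum { IO_ACCESS_CLOCKS = 5 }; /* 4 clocks + 1 wait state, as used in the thread */

/* Clocks charged for one half of a port access. The 8088 charges every
   byte; the 8086 skips the odd high byte of an aligned word
   (evenodd=1, highaccess=1), which shares the low byte's bus cycle. */
static unsigned io_access_clocks(cpu_model cpu, unsigned evenodd, unsigned highaccess)
{
    if (cpu == MODEL_8088)
        return IO_ACCESS_CLOCKS;
    return (evenodd && highaccess) ? 0 : IO_ACCESS_CLOCKS;
}

/* Bus clocks for a port access of size_bytes (1 or 2) at the given port. */
static unsigned port_io_clocks(cpu_model cpu, unsigned port, unsigned size_bytes)
{
    unsigned clocks;
    assert(size_bytes == 1 || size_bytes == 2);
    clocks = io_access_clocks(cpu, port & 1u, 0);             /* low part */
    if (size_bytes == 2)
        clocks += io_access_clocks(cpu, (port + 1u) & 1u, 1); /* high part */
    return clocks;
}

/* Total clocks for IN/OUT: base_clocks is 5 (imm8) or 3 (DX). */
static unsigned in_out_total_clocks(cpu_model cpu, unsigned base_clocks,
                                    unsigned port, unsigned size_bytes)
{
    return base_clocks + port_io_clocks(cpu, port, size_bytes);
}
```

For example, in_out_total_clocks(MODEL_8086, 5, 1, 2) gives 15 for the unaligned imm8 word case, and in_out_total_clocks(MODEL_8088, 3, 0, 2) gives 13 for a DX word access on the 8088.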