More specifically can the decoder decode a new opcode (or the EU unit execute it or the EA do some address calculation) while the BIU part of the processor is writing out memory on the bus (or waiting for the free bus cycle to write out data)?
Or will the decoder not attempt to read from prefetch buffer until all the writes are out of the CPU?
More specifically can the decoder decode a new opcode (or the EU unit execute it or the EA do some address calculation) while the BIU part of the processor is writing out memory on the bus (or waiting for the free bus cycle to write out data)?
Or will the decoder not attempt to read from prefetch buffer until all the writes are out of the CPU?
I'm not 100% sure (having neither reverse-engineered these CPUs to gate level nor analysed every possible code sequence), but I'm pretty sure that on 8088 and 8086, code prefetch operations are the only BIU operations which happen asynchronously with the EU. Bus operations (reads or writes, memory or port space) that happen as the result of execution of an instruction happen during the execution of that instruction, and the execution of an instruction can't complete until all its bus operations are complete. In particular, this means that (because prefetches are always reads) the CPU never does anything else during a write. It can't even be doing EA calculations since there aren't any instructions that write to an address and then do an EA read/write to another address.
Reenigne, your last post makes me remember something: I currently have the IN and OUT instructions to use 10(opcodes E4/E5/E6/E7) and 8(opcodes EC/ED/EE/EF) cycles. Does that include the cycles spent on the bus(I/O operation itself)? So do I still need to add 4 or 8 memory(actually I/O) cycles depending on alignment etc?
Edit: Looking at the documentation of the Intel 8086 Family User's Manual October 1979 it seems I've forgotten those timings. I've just implemented them in my emulator's IN/OUT instructions and now testing it against 8088 MPH...
Edit: It looks like 8088MPH performs a little bit better now: The kefrens bars are still not 100% correct(vertical timing that's handled by the CPU being cycle accurate: the background movement that's not determined by horizontal retrace timings), but it's more closer(vertical timing seems closer, showing horizontal lines of a single color, only the seperate scanlines(entire scanlines) had errors).
[quote="superfury"]Reenigne, your last post makes me remember something: I currently have the IN and OUT instructions to use 10(opcodes E4/E5/E6/E7) and 8(opcodes EC/ED/EE/EF) cycles. Does that include the cycles spent on the bus(I/O operation itself)? So do I still need to add 4 or 8 memory(actually I/O) cycles depending on alignment etc?[/quotes]
It's included in the EU timings, but you need to make sure the BIU timings account for it as well. Also note that when accessing a port (as opposed to memory) there is always at least one cycle of wait state per access (added by the motherboard no matter what device is accessed). So the port read/write takes 5 cycles (10 for a 16-bit access on an 8-bit bus) instead of 4 (8).
The value of MMUR is added to the normal cycle count(10 becoming 5 and 8 becoming 3):
1//highaccess=0 for the first access, 1 for the second access 2void CPU8086_addWordIOMemoryTiming(byte evenodd, byte highaccess) 3{ 4 if (EMULATED_CPU==CPU_8086) //808(6/8)? 5 { 6 if (CPU_databussize) //8088? 7 { 8 CPU[activeCPU].cycles_MMUR += 5; //Add 4 clocks with all 8/16-bit(as 8-bit) cycles on 8086! 9 } 10 else //8086? 11 { 12 if (!(evenodd && highaccess)) //Not odd address from even location? 13 { 14 CPU[activeCPU].cycles_MMUR += 5; //Add 4 clocks with odd cycles on 8086! 15 } 16 } 17 } 18}
Evenodd is bit 0 of the accessed ports (e.g. a word access to an aligned address gives 0 and 1(reversed for unaligned word accesses) respectively and byte access only 0 or 1 depending on the port accessed).
Highaccess is 0 for the low part of the 16-bit port(always 0 for byte accesses) and 1 for the high part of the 16-bit port(e.g. OUT DX,AX with DX=0, it gives EvenOdd,Highaccess combinations of 0,0 and 1,1), but unaligned DX=1 gives 1,0(low half) and 0,1(high half)).
Thus it will result in 5 cycles being added during aligned word accesses or byte accesses, but 10 cycles during unaligned word accesses). This is of course added to another variable(cycles_OP) which contains 3 or 5 respectively to get the total amount of cycles spent on the instruction(of which, after substraction of the MMUR cycles, the resulting cycles being spent on prefetching (1 byte every 4 remaining cycles)).
Thus it will result in 5 cycles being added during aligned word accesses or byte accesses, but 10 cycles during unaligned word accesses).
I don't think that's correct for the 8088?
The 8086 actually has the capability to access a word, but only if it's aligned. If it is unaligned, it breaks up the word access into two byte accesses (or actually two word accesses, of which it only takes one byte each).
The 8088 always breaks up word accesses in two byte accesses, and as far as I know, it is not sensitive to alignment at all because of this.
What you've said is already implemented if you look at the code:
- The CPU_databussize check ensures that all memory accesses(8-bit, 16-bit unaligned first part, 16-bit unaligned second part and 16-bit aligned whole) consume 5 cycles.
- Otherwise, it will always take 5 cycles for each byte(or 2 bytes in the case of the word).
Possible combinations of evenodd&highaccess:
1Evenodd,Highaccess=What 20,0=Even access of aligned word access and even byte accesses(add 5 cycles) 30,1=Even access of unaligned word access (add 5 cycles) 41,0=Odd access of unaligned word access and odd byte accesses(add 5 cycles) 51,1=Odd access of aligned word access (don't add cycles on the 8086, adds 5 cycles on 8088)
Since on the 8086 the 1,1 case doesn't add cycles, this results in:
1Even byte access: 0,0 case only(adds 5 cycles) 2Odd byte access: 1,0 case only(adds 5 cycles) 3Aligned word access: 0,0 case(adds 5 cycles) on first byte and 1,1 case(doesn't add cycles) on second byte 4Unaligned word access: 1,0 case(adds 5 cycles) on first byte and 0,1 case(adds 5 cycles) on second byte.
Thus this results in the sums being added(on the 8086):
1Even byte access: 5 cycles 2Odd byte access: 5 cycles 3Aligned word access: 5 cycles 4Unaligned word access: 10 cycles
This results in the 5(previously 10) for the imm8 variant cycles becoming a total of(on the 8086):
1Even byte access: 5+5=10 cycles 2Odd byte access: 5+5=10 cycles 3Aligned word access: 5+5=10 cycles 4Unaligned word access: 5+10=15 cycles
And for the DX variant(3 cycles, previously 8 ):
1Even byte access: 3+5=8 cycles 2Odd byte access: 3+5=8 cycles 3Aligned word access: 3+5=8 cycles 4Unaligned word access: 3+10=13 cycles
For the 8088, all cases consume 5 cycles, so:
1Even byte acccess: 5 cycles 2Odd byte accesses: 5 cycles 3Aligned word access: 5+5=10 cycles 4Unaligned word acccess: 5+5=10 cycles
Thus the imm8 variant results in (on the 8088 only):
1Even byte access: 5+5=10 cycles 2Odd byte access: 5+5=10 cycles 3Aligned word access: 5+10=15 cycles 4Unaligned word access: 5+10=15 cycles
And the DX variant results in (on the 8088 only):
1Even byte access: 3+5=8 cycles 2Odd byte access: 3+5=8 cycles 3Aligned word access: 3+10=13 cycles 4Unaligned word access: 3+10=13 cycles
This can also be seen by looking at the calls to the actual CPU PORT IN/OUT functionality(in cpu.c):
1void CPU_PORT_OUT_B(word port, byte data) 2{ 3... 4 CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit)! 5} 6 7void CPU_PORT_OUT_W(word port, word data) 8{ 9... 10 CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit when needed)! 11 ++port; //Check the high port as well! 12 CPU8086_addWordIOMemoryTiming(port&1,1); //High I/O access of I/O only(8-bit when needed)! 13} 14 15void CPU_PORT_IN_B(word port, byte *result) 16{ 17... 18 CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit)! 19} 20 21void CPU_PORT_IN_W(word port, word *result) 22{ 23... 24 CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit when needed)! 25 ++port; //Check the high port as well! 26 CPU8086_addWordIOMemoryTiming(port&1,1); //High I/O access of I/O only(8-bit when needed)! 27}
This creates those correct aligned/unaligned timings, which are added to the base timing of the instruction(3 for DX, 5 for imm8). Since the 8088 will always add 5 cycles on every byte port and both low and high bytes of a word port(ignoring the parameters), it will always result in 5 cycles added for byte accesses and 10 cycles added for word accesses.