Is the 8088/86 pipelined? \ VOGONS

Is the 8088/86 pipelined?

First post, by vladstamate

Posted on 2016-11-04, 14:48

Rank: Oldbie
Posts: 967
Joined: 2015-08-23, 01:43

More specifically can the decoder decode a new opcode (or the EU unit execute it or the EA do some address calculation) while the BIU part of the processor is writing out memory on the bus (or waiting for the free bus cycle to write out data)?

Or will the decoder not attempt to read from prefetch buffer until all the writes are out of the CPU?

What about the 286 or 386?

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 1 of 7, by reenigne

Posted on 2016-11-04, 19:42

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

vladstamate wrote:
More specifically can the decoder decode a new opcode (or the EU unit execute it or the EA do some address calculation) while the BIU part of the processor is writing out memory on the bus (or waiting for the free bus cycle to write out data)?

Or will the decoder not attempt to read from prefetch buffer until all the writes are out of the CPU?

I'm not 100% sure (having neither reverse-engineered these CPUs to gate level nor analysed every possible code sequence), but I'm pretty sure that on 8088 and 8086, code prefetch operations are the only BIU operations which happen asynchronously with the EU. Bus operations (reads or writes, memory or port space) that happen as the result of execution of an instruction happen during the execution of that instruction, and the execution of an instruction can't complete until all its bus operations are complete. In particular, this means that (because prefetches are always reads) the CPU never does anything else during a write. It can't even be doing EA calculations since there aren't any instructions that write to an address and then do an EA read/write to another address.
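As a rough illustration of the model described above (not a gate-level description), an emulator could treat the instruction's own bus cycles as time when no prefetch can happen, and only let the BIU fill the queue in the clocks that are left over. The names `Biu` and `biu_prefetch_during` are hypothetical, and the sketch assumes 4-clock bus cycles without wait states, the 8088's 4-byte queue, and a queue that is not drained while the instruction runs:

```c
#include <assert.h>

enum { BUS_CYCLE_CLOCKS = 4 };      /* one bus cycle without wait states */

/* Hypothetical, simplified BIU state for an 8088-style CPU. */
typedef struct {
    int queue_fill;  /* bytes currently waiting in the prefetch queue */
    int queue_size;  /* 4 on the 8088 */
} Biu;

/*
 * Account for one instruction that takes eu_clocks in total, of which
 * data_accesses bus cycles are the instruction's own reads/writes.
 * The bus is busy with those accesses, so only the remaining clocks
 * are available for prefetching. Returns the number of bytes prefetched.
 */
int biu_prefetch_during(Biu *biu, int eu_clocks, int data_accesses)
{
    int own_bus_clocks = data_accesses * BUS_CYCLE_CLOCKS;
    assert(eu_clocks >= own_bus_clocks); /* EU timing includes its bus time */

    int idle_clocks = eu_clocks - own_bus_clocks;
    int fetched = 0;
    while (idle_clocks >= BUS_CYCLE_CLOCKS && biu->queue_fill < biu->queue_size) {
        biu->queue_fill++;           /* one code byte per free bus cycle */
        fetched++;
        idle_clocks -= BUS_CYCLE_CLOCKS;
    }
    return fetched;
}
```

With an empty queue, a 10-clock instruction that performs one write leaves 6 idle clocks, which is enough for a single prefetched byte under these assumptions.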

vladstamate wrote:
What about the 286 or 386?

No idea about those.

Reply 2 of 7, by vladstamate

Posted on 2016-11-08, 23:03

Thank you reenigne.

Reply 3 of 7, by superfury

Posted on 2016-11-09, 08:49

Rank: l33t++
Posts: 5472
Joined: 2014-03-08, 11:25
Location: Netherlands

Reenigne, your last post reminded me of something: I currently have the IN and OUT instructions using 10 cycles (opcodes E4/E5/E6/E7) and 8 cycles (opcodes EC/ED/EE/EF). Does that include the cycles spent on the bus (the I/O operation itself)? Or do I still need to add 4 or 8 memory (actually I/O) cycles depending on alignment etc.?

Edit: Looking at the Intel 8086 Family User's Manual (October 1979), it seems I'd forgotten those timings. I've just implemented them in my emulator's IN/OUT instructions and am now testing them against 8088 MPH...

Edit: It looks like 8088 MPH performs a little better now. The Kefrens bars are still not 100% correct (the vertical timing that depends on the CPU being cycle-accurate, i.e. the background movement that isn't determined by horizontal retrace timings), but it's closer (the vertical timing seems closer, showing horizontal lines of a single color; only separate, entire scanlines had errors).

Filename: 8088MPH_RasterBars_screencaptures.zip
File size: 499.72 KiB
Downloads: 32 downloads
File comment: Screenshots taken using the UniPCemu screencaptures (linear in time, by clicking the Cap(Screen capture) button).
File license: Fair use/fair dealing exception

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 4 of 7, by reenigne

Posted on 2016-11-10, 09:06

superfury wrote:
Reenigne, your last post reminded me of something: I currently have the IN and OUT instructions using 10 cycles (opcodes E4/E5/E6/E7) and 8 cycles (opcodes EC/ED/EE/EF). Does that include the cycles spent on the bus (the I/O operation itself)? Or do I still need to add 4 or 8 memory (actually I/O) cycles depending on alignment etc.?

It's included in the EU timings, but you need to make sure the BIU timings account for it as well. Also note that when accessing a port (as opposed to memory) there is always at least one cycle of wait state per access (added by the motherboard no matter what device is accessed). So the port read/write takes 5 cycles (10 for a 16-bit access on an 8-bit bus) instead of 4 (8).
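The arithmetic above can be sketched as a small helper. This is only an illustration under the assumptions stated in the post (4 clocks per bus cycle, exactly one motherboard wait state on port accesses, none on memory); the function name `bus_access_clocks` is made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

enum {
    BUS_CYCLE_CLOCKS = 4,   /* a bus cycle without wait states */
    IO_WAIT_STATES   = 1    /* added by the motherboard on every port access */
};

/*
 * Clocks spent on the bus for `bus_cycles` consecutive accesses.
 * bus_cycles is 1 for a byte and 2 for a 16-bit access on an 8-bit bus.
 */
int bus_access_clocks(bool is_port, int bus_cycles)
{
    assert(bus_cycles >= 1);
    int per_cycle = BUS_CYCLE_CLOCKS + (is_port ? IO_WAIT_STATES : 0);
    return per_cycle * bus_cycles;
}
```

So `bus_access_clocks(true, 2)` gives the 10 clocks mentioned for a 16-bit port access on an 8-bit bus, against 8 for the same access to memory.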

Reply 5 of 7, by superfury

Posted on 2016-11-10, 10:20

This is what I get after adjusting the MMUR cycles to 5 per access (on top of the 10 or 8 cycles already mentioned, so 10 plus 5 or 10 cycles, and 8 plus 5 or 10 cycles).

Filename: 8088MPH_RasterBars_5cyclesPerMemory.zip
File size: 1.22 MiB
Downloads: 33 downloads
File comment: 8088MPH at 5 cycles instead of 4 for each aligned word, unaligned word half and byte I/O access.
File license: Fair use/fair dealing exception

Edit: Reducing the 10 and 8 cycles by 5 (the bus access, which is already included in those cycle counts) improves performance again, resulting in the following raster bars:

Filename: 8088MPH_RasterBars_5cyclesPerMemory_10_8_reduced.zip
File size: 2.42 MiB
Downloads: 30 downloads
File comment: 10 cycles reduced to 5 and 8 cycles reduced to 3.
File license: Fair use/fair dealing exception

The value of MMUR is added to the normal cycle count (10 becoming 5 and 8 becoming 3):

//highaccess=0 for the first (low) access, 1 for the second (high) access
void CPU8086_addWordIOMemoryTiming(byte evenodd, byte highaccess)
{
	if (EMULATED_CPU==CPU_8086) //808(6/8)?
	{
		if (CPU_databussize) //8088 (8-bit data bus)?
		{
			CPU[activeCPU].cycles_MMUR += 5; //Add 5 clocks (4 + 1 wait state) for every 8-bit bus cycle on the 8088!
		}
		else //8086?
		{
			if (!(evenodd && highaccess)) //Not the odd high byte of an aligned word?
			{
				CPU[activeCPU].cycles_MMUR += 5; //Add 5 clocks (4 + 1 wait state); the high byte of an aligned word adds nothing on the 8086!
			}
		}
	}
}

Evenodd is bit 0 of the accessed port (e.g. a word access to an aligned address gives 0 and then 1, reversed for unaligned word accesses; a byte access gives only 0 or 1, depending on the port accessed).
Highaccess is 0 for the low part of the 16-bit port (always 0 for byte accesses) and 1 for the high part of the 16-bit port (e.g. OUT DX,AX with DX=0 gives Evenodd,Highaccess combinations of 0,0 and 1,1, but unaligned DX=1 gives 1,0 for the low half and 0,1 for the high half).

Thus it will result in 5 cycles being added during aligned word accesses or byte accesses, but 10 cycles during unaligned word accesses. This is of course added to another variable (cycles_OP), which contains 5 or 3 (for the imm8 and DX variants respectively), to get the total number of cycles spent on the instruction. After subtraction of the MMUR cycles, the remaining cycles are spent on prefetching (1 byte every 4 remaining cycles).

Reply 6 of 7, by Scali

Posted on 2016-11-10, 11:06

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

superfury wrote:
Thus it will result in 5 cycles being added during aligned word accesses or byte accesses, but 10 cycles during unaligned word accesses.

I don't think that's correct for the 8088?
The 8086 actually has the capability to access a word, but only if it's aligned. If it is unaligned, it breaks up the word access into two byte accesses (or actually two word accesses, of which it only takes one byte each).
The 8088 always breaks up word accesses into two byte accesses, and as far as I know, it is not sensitive to alignment at all because of this.
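To make the difference concrete, here is a minimal sketch of how an emulator might count bus cycles per access on the two chips. The type `BusWidth` and the function `bus_cycles_needed` are names invented for this example; the rule itself is just the one described above:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { BUS_8BIT, BUS_16BIT } BusWidth;  /* 8088 vs 8086 */

/* Number of bus cycles needed for an access of `size` bytes (1 or 2). */
int bus_cycles_needed(BusWidth bus, uint16_t address, int size)
{
    assert(size == 1 || size == 2);
    if (size == 1)
        return 1;                    /* a byte always fits in one cycle */
    if (bus == BUS_8BIT)
        return 2;                    /* 8088: two byte cycles, alignment irrelevant */
    return (address & 1) ? 2 : 1;    /* 8086: only an even-aligned word is one cycle */
}
```

On the 8086 a word at 0x3D4 needs one cycle and a word at 0x3D5 needs two; on the 8088 both need two.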

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 7 of 7, by superfury

Posted on 2016-11-10, 14:02

What you've said is already implemented if you look at the code:
- On the 8088, the CPU_databussize check ensures that every bus access (an 8-bit access, or either half of a 16-bit access, aligned or not) consumes 5 cycles.
- Otherwise (on the 8086), it takes 5 cycles for each byte access, or 5 cycles for both bytes together in the case of an aligned word.

Possible combinations of evenodd&highaccess:

Evenodd,Highaccess=What
0,0=Even access of aligned word access and even byte accesses (add 5 cycles)
0,1=Even access of unaligned word access (add 5 cycles)
1,0=Odd access of unaligned word access and odd byte accesses (add 5 cycles)
1,1=Odd access of aligned word access (don't add cycles on the 8086, adds 5 cycles on 8088)

Since on the 8086 the 1,1 case doesn't add cycles, this results in:

Even byte access: 0,0 case only (adds 5 cycles)
Odd byte access: 1,0 case only (adds 5 cycles)
Aligned word access: 0,0 case (adds 5 cycles) on first byte and 1,1 case (doesn't add cycles) on second byte
Unaligned word access: 1,0 case (adds 5 cycles) on first byte and 0,1 case (adds 5 cycles) on second byte

Thus this results in the sums being added(on the 8086):

Even byte access: 5 cycles
Odd byte access: 5 cycles
Aligned word access: 5 cycles
Unaligned word access: 10 cycles

For the imm8 variant, the 5 cycles (previously 10) thus become a total of (on the 8086):

Even byte access: 5+5=10 cycles
Odd byte access: 5+5=10 cycles
Aligned word access: 5+5=10 cycles
Unaligned word access: 5+10=15 cycles

And for the DX variant (3 cycles, previously 8):

Even byte access: 3+5=8 cycles
Odd byte access: 3+5=8 cycles
Aligned word access: 3+5=8 cycles
Unaligned word access: 3+10=13 cycles

For the 8088, every bus access consumes 5 cycles, so:

Even byte access: 5 cycles
Odd byte access: 5 cycles
Aligned word access: 5+5=10 cycles
Unaligned word access: 5+5=10 cycles

Thus the imm8 variant results in (on the 8088 only):

Even byte access: 5+5=10 cycles
Odd byte access: 5+5=10 cycles
Aligned word access: 5+10=15 cycles
Unaligned word access: 5+10=15 cycles

And the DX variant results in (on the 8088 only):

Even byte access: 3+5=8 cycles
Odd byte access: 3+5=8 cycles
Aligned word access: 3+10=13 cycles
Unaligned word access: 3+10=13 cycles

This can also be seen by looking at the calls to the actual CPU PORT IN/OUT functionality(in cpu.c):

void CPU_PORT_OUT_B(word port, byte data)
{
...
	CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit)!
}

void CPU_PORT_OUT_W(word port, word data)
{
...
	CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit when needed)!
	++port; //Check the high port as well!
	CPU8086_addWordIOMemoryTiming(port&1,1); //High I/O access of I/O only(8-bit when needed)!
}

void CPU_PORT_IN_B(word port, byte *result)
{
...
	CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit)!
}

void CPU_PORT_IN_W(word port, word *result)
{
...
	CPU8086_addWordIOMemoryTiming(port&1,0); //Low I/O access of I/O only(8-bit when needed)!
	++port; //Check the high port as well!
	CPU8086_addWordIOMemoryTiming(port&1,1); //High I/O access of I/O only(8-bit when needed)!
}

This creates the aligned/unaligned timings listed above, which are added to the base timing of the instruction (3 for DX, 5 for imm8). Since the 8088 always adds 5 cycles for every byte port and for both the low and high bytes of a word port (ignoring the parameters), it always ends up adding 5 cycles for byte accesses and 10 cycles for word accesses.
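For anyone who wants to check the totals without stepping through the emulator, the whole model from this post can be condensed into one standalone function. This is a sketch for illustration, not UniPCemu code: `in_out_clocks` and `base_clocks` are hypothetical names, and the constants are simply the 5/3 base cycles and 5 cycles per port bus access used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { PORT_CYCLE_CLOCKS = 5 };  /* 4 clocks + 1 wait state per port bus cycle */

/* Base (non-bus) clocks as used in this post: 5 for imm8, 3 for DX. */
static int base_clocks(bool port_in_dx) { return port_in_dx ? 3 : 5; }

/* Total clocks for an IN/OUT instruction under the model in this post. */
int in_out_clocks(bool is_8088, bool port_in_dx, uint16_t port, bool is_word)
{
    assert(port_in_dx || port <= 0xFF); /* imm8 form only reaches ports 0-255 */

    int bus_cycles = 1;
    /* A word needs two bus cycles on the 8088, or on the 8086 when unaligned. */
    if (is_word && (is_8088 || (port & 1)))
        bus_cycles = 2;
    return base_clocks(port_in_dx) + bus_cycles * PORT_CYCLE_CLOCKS;
}
```

For example, `in_out_clocks(false, true, 0x3D5, true)` corresponds to the unaligned word access with the DX variant on the 8086, and `in_out_clocks(true, false, 0x60, true)` to an aligned word access with the imm8 variant on the 8088.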
