VOGONS


First post, by vladstamate

I am not sure if we talked about this before, but I assume that in a given cycle the processor can read more than 1 byte from the data the BIU has already prefetched. In fact, I think it can read 16 bits on processors before the 386 and 32 bits otherwise. So I am thinking it is like this:

8088/8086 - can read 16 bits (2 bytes) per cycle from the prefetch queue (data that is already prefetched)
80286 - 16 bits (2 bytes)
80386SX and DX - 32 bits (4 bytes)

Why is this important? Well, if an instruction requires an immediate and that immediate is 16 bits, then we have the following number of cycles to process it (*):

8088 - 8 cycles to fetch 16 bits and 1 cycle to read it from the BIU
8086 - 4 cycles to fetch 16 bits and 1 cycle to read it from the BIU
80286 - 2 cycles to fetch 16 bits and 1 cycle to read it from the BIU

If the instruction has a 32-bit immediate, an 80386 (DX or SX) will take this long:

80386SX - 2 cycles to fetch 32 bits and 1 cycle to read it from the BIU
80386DX - 1 cycle to fetch 32 bits and 1 cycle to read it from the BIU

Is my assumption correct?

(*) Under ideal circumstances, excluding wait states and other operations that might stall the bus, like DMA, etc.
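
The cycle counts above all follow from one assumed rule: the number of bus transfers needed to cover the immediate, times the clocks per transfer, plus one clock to hand the data to the EU. A minimal sketch of that arithmetic follows. The name `immediate_clocks` and its parameters are illustrative, and the clocks-per-transfer values are the figures assumed in this post, not measured ones (the 1-clock figure for the 386 in particular is only an assumption).

```c
#include <assert.h>

/* Clocks to prefetch an immediate and hand it to the EU, under the
 * assumptions of the post: ideal bus, no wait states, and a single
 * clock to read the whole immediate out of the prefetch queue.
 * All names here are illustrative, not taken from any emulator. */
unsigned immediate_clocks(unsigned imm_bits, unsigned bus_bits,
                          unsigned clocks_per_transfer)
{
    assert(imm_bits > 0 && bus_bits > 0);
    /* Round up: a 16-bit immediate on an 8-bit bus needs 2 transfers. */
    unsigned transfers = (imm_bits + bus_bits - 1) / bus_bits;
    return transfers * clocks_per_transfer + 1; /* +1: queue to EU */
}
```

With these assumptions, `immediate_clocks(16, 8, 4)` models the 8088 row (8 + 1 clocks), `immediate_clocks(16, 16, 4)` the 8086 row and `immediate_clocks(16, 16, 2)` the 80286 row. The 386 rows correspond to `immediate_clocks(32, 16, 1)` and `immediate_clocks(32, 32, 1)`.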

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 1 of 17, by superfury

I think that's mostly correct for plain MMU/IO accesses, but as far as I know the PIQ is always read at 1 byte per cycle, even though the BIU might fetch words or dwords from memory into the PIQ. I'm not 100% sure, though (especially since there are cases like fetching at offset 0xFFFF in 16-bit mode).

Edit: Also, doesn't an 80386DX+ still take a minimum of 2 cycles to fetch data, for any byte or aligned (d)word?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 2 of 17, by vladstamate

superfury wrote:

Edit: Also, doesn't an 80386DX+ still take a minimum of 2 cycles to fetch data, for any byte or aligned (d)word?

Why would it be 2? With a 32-bit bus, it reads one aligned dword in 1 cycle. I am not accounting for wait states, as those can be 0.

Reply 3 of 17, by vladstamate

From the 386 manual, page 24:

When used in a configuration with a 32-bit bus, actual transfers of data between processor and memory take place in units of doublewords beginning at addresses evenly divisible by four;

So 1 cycle.

Reply 4 of 17, by superfury

So, essentially, small instructions that take a bit of time (4 cycles) will always find the prefetch queue non-empty, thanks to the ridiculously fast memory access times (4 prefetch bytes in 1 cycle, then up to 4 cycles to read them into the EU)? That makes 5 cycles for 4 bytes, or simply 1.25 cycles per instruction byte, or lower for small instructions?

Also, that only describes the fetches themselves; nothing is said about them taking 1 cycle?
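
The 1.25 figure comes from adding one fetch cycle for four bytes to four single-byte reads into the EU. A small sketch of that estimate, assuming (as the question does) a 1-cycle dword fetch and a fixed number of bytes taken from the queue per cycle. `cycles_per_byte_x100` is a hypothetical helper that returns hundredths of a cycle so the arithmetic stays in integers.

```c
#include <assert.h>

/* Estimated cost per instruction byte, in hundredths of a cycle.
 * fetch_cycles: cycles the BIU needs to bring `bytes` into the queue.
 * bytes_per_queue_read: bytes the EU can take from the queue per cycle.
 * Both values are assumptions supplied by the caller. */
unsigned cycles_per_byte_x100(unsigned bytes, unsigned fetch_cycles,
                              unsigned bytes_per_queue_read)
{
    assert(bytes > 0 && bytes_per_queue_read > 0);
    /* Queue reads needed to drain `bytes`, rounded up. */
    unsigned reads = (bytes + bytes_per_queue_read - 1) / bytes_per_queue_read;
    return (fetch_cycles + reads) * 100 / bytes;
}
```

`cycles_per_byte_x100(4, 1, 1)` gives 125, matching the 1.25 cycles per byte above. If the queue could instead hand over a full dword per cycle, `cycles_per_byte_x100(4, 1, 4)` would give 50.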

Reply 5 of 17, by vladstamate

superfury wrote:

Also, that only describes the fetches themselves; nothing is said about them taking 1 cycle?

My comment about 1 cycle for the 386DX is about the prefetch itself. Why would it not be 1 cycle? Where would the other cycle be spent?

vladstamate wrote:

80386DX - 1 cycle to fetch 32 bits and 1 cycle to read it from the BIU

I also doubt the 386 reads 1 byte at a time from the prefetch buffer. That would perform terribly. My guess is that it either reads as much as it needs in 1 cycle (up to 4 bytes) or (much less likely) up to 2 bytes.

The question is what the 386SX does, and also what the 286 does, when reading from the prefetch buffer. Again, my guess is that the 386SX can read 4 bytes and the 286 can read 2 bytes at a time (if necessary). Alignment no longer matters, as we are reading from the prefetch buffer, not from actual memory.

Reply 6 of 17, by vladstamate

superfury wrote:

So, essentially, small instructions that take a bit of time (4 cycles) will always find the prefetch queue non-empty, thanks to the ridiculously fast memory access times (4 prefetch bytes in 1 cycle, then up to 4 cycles to read them into the EU)? That makes 5 cycles for 4 bytes, or simply 1.25 cycles per instruction byte, or lower for small instructions?

I think that is true. In reality, however, you might have other things contending for the bus, and IIRC 386 systems also have memory with 1 (or more) wait states.

Reply 7 of 17, by reenigne

The 8088 and 8086 cannot remove more than one byte from the prefetch queue in any given cycle. We know this because the 8087 needs to track the state of the prefetch queue and the interface it uses to do so doesn't have any way to signal more than one byte removed in any given cycle.
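
For reference, the queue status interface mentioned here consists of two maximum-mode pins, QS1 and QS0, whose four encodings each describe at most one byte leaving the queue per clock. Below is a minimal sketch of how an external tracker such as the 8087 could follow the queue depth from those pins. The enum and function names are made up for illustration, and the fill side (bytes arriving from bus cycles) is reduced to a plain parameter, which is a simplification.

```c
#include <assert.h>

/* QS1:QS0 encodings on the 8086/8088 in maximum mode. */
enum queue_status {
    QS_NOP        = 0, /* 00: no queue operation */
    QS_FIRST_BYTE = 1, /* 01: first byte of an opcode taken from the queue */
    QS_FLUSH      = 2, /* 10: queue emptied, e.g. after a taken jump */
    QS_NEXT_BYTE  = 3  /* 11: subsequent byte of an instruction taken */
};

/* Returns the queue depth after one clock, given the status reported
 * for that clock and the number of bytes the bus added to the queue. */
unsigned track_queue(unsigned depth, enum queue_status qs,
                     unsigned bytes_fetched)
{
    switch (qs) {
    case QS_FLUSH:
        depth = 0;
        break;
    case QS_FIRST_BYTE:
    case QS_NEXT_BYTE:
        assert(depth > 0);
        depth -= 1; /* the encoding can only report a single byte */
        break;
    case QS_NOP:
        break;
    }
    return depth + bytes_fetched;
}
```

Since none of the four encodings can express "two bytes removed", a tracker like this only stays in sync if the CPU really removes at most one byte per clock, which is the argument made above.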

Reply 8 of 17, by superfury

Does that one-byte at a time way of fetching instructions from the PIQ happen with later (286+) processors as well? Even for the 16-bit and 32-bit immediate operands of an instruction (e.g. ADD EAX,00000001h with opcode 05h)?

And is the data fetched into the PIQ from RAM/MMIO in dword quantities when possible (dword-aligned address), otherwise in word quantities (word alignment, followed by dword quantities), and else in byte quantities (byte alignment, followed by word or dword quantities depending on alignment)?

Also, there's the whole matter of protection, so how does that all combine? Say you are at a dword-aligned address, but the fourth byte is outside the segment limit or the paging limits. What happens? Three bytes fetched and the fourth cleared? The entire fetch aborted as an error? A word fetch followed by a byte fetch, after which fetching stops?
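
The alignment scheme asked about here (bytes until word-aligned, words until dword-aligned, then dwords) can be written down as a small helper. This is only a sketch of the scheme as described in the question, not a statement of what the hardware does. `next_fetch_size` and `bytes_to_limit` are hypothetical names, and the limit handling shown (shrink the transfer so it never crosses the limit) is just one of the options the question lists.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes (1, 2 or 4) of the next prefetch transfer at `addr`:
 * the largest naturally aligned unit that fits the bus width and does
 * not run past the remaining bytes before the segment/page limit.
 * Returns 0 when no bytes remain before the limit. */
unsigned next_fetch_size(uint32_t addr, unsigned bus_bytes,
                         uint32_t bytes_to_limit)
{
    assert(bus_bytes == 1 || bus_bytes == 2 || bus_bytes == 4);
    if (bytes_to_limit == 0)
        return 0;
    unsigned size = bus_bytes;
    /* Halve the unit while it is misaligned or would cross the limit. */
    while (size > 1 && ((addr & (size - 1)) != 0 || size > bytes_to_limit))
        size >>= 1;
    return size;
}
```

Under this policy a dword-aligned address with only three bytes left before the limit yields a word transfer, and the following call (address advanced by 2, one byte left) yields a byte transfer, i.e. the "word fetch followed by byte fetch" variant.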

Reply 9 of 17, by reenigne

superfury wrote:

Does that one-byte at a time way of fetching instructions from the PIQ happen with later (286+) processors as well?

The 286 doesn't have the queue status interface that 8088/8086/80188/80186 have, so I have no way to know the size of the remove-from-prefetch-queue bus. I'd expect they'd have made it 16 bits by the time of the 80286 and 32 bits by the time of the 80386, but that's just a guess.

Reply 10 of 17, by vladstamate

reenigne wrote:
superfury wrote:

Does that one-byte at a time way of fetching instructions from the PIQ happen with later (286+) processors as well?

The 286 doesn't have the queue status interface that 8088/8086/80188/80186 have, so I have no way to know the size of the remove-from-prefetch-queue bus. I'd expect they'd have made it 16 bits by the time of the 80286 and 32 bits by the time of the 80386, but that's just a guess.

Yes, there is no way the 386 took 1 byte at a time. And yes, there are all the complications with protection faults, unaligned reads and so on. But under ideal circumstances, when none of that happens, if the EU needs a 32-bit quantity (for a 32-bit ADD EAX, imm, let's say), then I suspect it will just scoop it up in one go if all of it is in the queue.
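
The behaviour suggested here is easy to state as a rule: the EU takes the whole operand in one step if every byte is already queued, and otherwise waits. A minimal sketch under that assumption follows, using a hypothetical ring-buffer queue. `struct piq`, `piq_push` and `piq_take` are illustrative names and are not taken from any emulator.

```c
#include <assert.h>
#include <stdint.h>

#define PIQ_SIZE 16u /* queue capacity in bytes; an assumption for the sketch */

struct piq {
    uint8_t  data[PIQ_SIZE];
    unsigned head;  /* index of the oldest byte */
    unsigned count; /* bytes currently queued */
};

/* Append one prefetched byte; returns 0 if the queue is full. */
int piq_push(struct piq *q, uint8_t byte)
{
    if (q->count == PIQ_SIZE)
        return 0;
    q->data[(q->head + q->count) % PIQ_SIZE] = byte;
    q->count++;
    return 1;
}

/* Take n bytes (1, 2 or 4) as one little-endian value in a single step.
 * Returns 0 and leaves the queue untouched if fewer than n are queued. */
int piq_take(struct piq *q, unsigned n, uint32_t *out)
{
    assert(n == 1 || n == 2 || n == 4);
    if (q->count < n)
        return 0; /* operand not fully prefetched yet: the EU stalls */
    uint32_t value = 0;
    for (unsigned i = 0; i < n; i++)
        value |= (uint32_t)q->data[(q->head + i) % PIQ_SIZE] << (8 * i);
    q->head = (q->head + n) % PIQ_SIZE;
    q->count -= n;
    *out = value;
    return 1;
}
```

An emulator could charge one cycle per successful `piq_take` call, regardless of `n`, to model the single-step read; a byte-at-a-time model would instead call it with `n` equal to 1 once per cycle.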

Reply 11 of 17, by Jepael

It is more complex than just about how many bytes per cycle are fetched from PIQ.

The 80286 is more pipelined than the 8086 was. On a 286, there is an instruction unit (IU) between the execution unit (EU) and the bus interface unit (BIU).

The IU fetches opcode data from BIU instruction queue, decodes them, and stores decoded instructions in another queue for execution.

That queue can hold three decoded instructions.

The EU then fetches the decoded instructions and executes them.
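
The structure described here, a decoder feeding a small queue of decoded instructions that the EU drains, could be modelled roughly as follows. The three-entry depth comes from the post; the `decoded_insn` fields and the function names are placeholders invented for this sketch and do not reflect the real internal format.

```c
#include <assert.h>
#include <stdint.h>

#define DECODED_QUEUE_DEPTH 3u /* three decoded instructions, per the post */

struct decoded_insn {
    uint8_t  opcode;    /* placeholder fields for illustration only */
    uint16_t immediate;
    uint8_t  length;
};

struct decoded_queue {
    struct decoded_insn slot[DECODED_QUEUE_DEPTH];
    unsigned head;  /* index of the oldest decoded instruction */
    unsigned count; /* decoded instructions waiting for the EU */
};

/* IU side: returns 0 when the queue is full, so the decoder must stall. */
int iu_enqueue(struct decoded_queue *q, struct decoded_insn insn)
{
    assert(q != 0);
    if (q->count == DECODED_QUEUE_DEPTH)
        return 0;
    q->slot[(q->head + q->count) % DECODED_QUEUE_DEPTH] = insn;
    q->count++;
    return 1;
}

/* EU side: returns 0 when nothing is decoded yet, so the EU must wait. */
int eu_dequeue(struct decoded_queue *q, struct decoded_insn *out)
{
    assert(q != 0 && out != 0);
    if (q->count == 0)
        return 0;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % DECODED_QUEUE_DEPTH;
    q->count--;
    return 1;
}
```

In a model like this, the bytes-per-cycle question applies to the path from the BIU queue to the IU, while the EU only ever sees whole decoded entries.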

Reply 12 of 17, by vladstamate

Ohh, interesting! I did not account for that. Do you happen to have a link or a document where I can find more about this?

This raises even more questions, like: is the IU fetching the immediate values (if any) as part of its decoding?

Reply 13 of 17, by Jepael

vladstamate wrote:

Ohh, interesting! I did not account for that. Do you happen to have a link or a document where I can find more about this?

This raises even more questions, like: is the IU fetching the immediate values (if any) as part of its decoding?

Not much more info, just Wikipedia and the Intel manuals.

Apparently EA calculation is performed in a dedicated unit as well, so it's not done on ALU any more.
That's also obvious from block diagram; there is a block called addressing unit (AU) as well.

https://en.wikipedia.org/wiki/Intel_80286
https://archive.org/details/bitsavers_intel80 … al1987_14090554, figure 3-1

Reply 14 of 17, by vladstamate

Thank you Jepael, this is very good information.

Reply 15 of 17, by vladstamate

Jepael wrote:

Apparently EA calculation is performed in a dedicated unit as well, so it's not done on ALU any more.
That's also obvious from block diagram; there is a block called addressing unit (AU) as well.

In light of what you said, this makes sense: it is probably there to help with the pipelining. I assume it lets the IU do the EA calculation as part of decoding while the EU is executing instructions (which might require the ALU).

Reply 16 of 17, by superfury

I've just modified the instruction fetching to allow single-cycle word/dword fetches from the PIQ, and updated my BIU emulation so it can fetch 16-bit and 32-bit quantities from memory (as long as the address is aligned, although I'm not 100% sure the check for odd 16/32-bit subunits works correctly):

Fetching process itself:

byte PIQ_block = 0; //Blocking any PIQ access now?
OPTINLINE void CPU_fillPIQ() //Fill the PIQ until it's full!
{
	uint_32 realaddress;
	if ((PIQ_block==1) || (PIQ_block==9)) { PIQ_block = 0; return; /* Blocked access: only fetch one byte/word instead of a full word/dword! */ }
	if (unlikely(BIU[activeCPU].PIQ==0)) return; //Not gotten a PIQ? Abort!
	realaddress = BIU[activeCPU].PIQ_Address; //Next address to fetch!
	checkMMUaccess_linearaddr = (CPU[activeCPU].SEG_base[CPU_SEGMENT_CS]+realaddress); //Default 8086-compatible address to use, otherwise, it's overwritten by checkMMUaccess with the proper linear address!
	if (unlikely(checkMMUaccess(CPU_SEGMENT_CS,CPU[activeCPU].registers->CS,realaddress,0x10|3,getCPL(),0,0))) return; //Abort on fault!
	if (unlikely(is_paging())) //Are we paging?
	{
		checkMMUaccess_linearaddr = mappage(checkMMUaccess_linearaddr,0,getCPL()); //Map it using the paging mechanism!
	}
	writefifobuffer(BIU[activeCPU].PIQ, BIU_directrb(checkMMUaccess_linearaddr,0)); //Add the next byte from memory into the buffer!
	if (unlikely(checkMMUaccess_linearaddr&1)) //Read an odd address?
	{
		PIQ_block &= 5; //Start blocking when it's 3(byte fetch instead of word fetch), also include dword odd addresses. Otherwise, continue as normally!
	}
	++BIU[activeCPU].PIQ_Address; //Increase the address to the next location!
	//Next data! Take 4 cycles on 8088, 2 on 8086 when loading words/4 on 8086 when loading a single byte.
}

The main calling code, which fetches bytes/words/dwords in one go (in byte subchunks, checking each of them until an error or a misalignment occurs; the latter is signalled by cleared bits in PIQ_block):

PIQ_RequiredSize = 1; //Minimum of 2 bytes required for a fetch to happen!
PIQ_CurrentBlockSize = 3; //We're blocking after 1 byte access when at an odd address!
if (EMULATED_CPU>=CPU_80386) //386+?
{
	PIQ_RequiredSize |= 2; //Minimum of 4 bytes required for a fetch to happen!
	PIQ_CurrentBlockSize |= 4; //Apply 32-bit quantities as well, when allowed!
}
if (BIU_processRequests(memory_waitstates)) //Processing a request?
{
	BIU[activeCPU].requestready = 0; //We're starting a request!
	++BIU[activeCPU].prefetchclock; //Tick!
}
else if (fifobuffer_freesize(BIU[activeCPU].PIQ)>PIQ_RequiredSize) //Prefetch cycle when no requests are being handled (and enough free space is left)? Else, NOP cycle!
{
	CPU[activeCPU].BUSactive = 1; //Start memory cycles!
	PIQ_block = PIQ_CurrentBlockSize; //We're blocking after 1 byte access when at an odd address at an odd word/dword address!
	CPU_fillPIQ(); CPU_fillPIQ(); //Add a word to the prefetch!
	if (PIQ_RequiredSize&2) //DWord access, when allowed?
	{
		CPU_fillPIQ(); CPU_fillPIQ(); //Add another word to the prefetch!
	}
	++CPU[activeCPU].cycles_Prefetch_BIU; //Cycles spent on prefetching on BIU idle time!
	BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
	BIU[activeCPU].requestready = 0; //We're starting a request!
	++BIU[activeCPU].prefetchclock; //Tick!
}
else //Nothing to do?
{
	BIU[activeCPU].stallingBUS = 2; //Stalling!
}

Edit: Fixed up the data transfer size on 32-bit quantities a bit:

if ((PIQ_RequiredSize&2) && ((EMULATED_CPU>=CPU_80386) && (CPU_databussize==0))) //DWord access on a 32-bit BUS, when allowed?

Reply 17 of 17, by superfury

A little update on the current 80386 timings (with the new DX/SX behaviour and PIQ fetching, as implemented in the above post):

[Attachment: 456-Compaq Deskpro 386 16MHz 80386DX accuracy.jpg (Compaq Deskpro 386 - 386DX running MIPS 1.10)]
[Attachment: 457-Compaq Deskpro 386 16MHz 80386SX accuracy.jpg (Compaq Deskpro 386 - 386SX running MIPS 1.10)]

Of course the same kind of improvements happen on the 80286 CPU, which is also a bit faster due to the new word fetching from memory and from PIQ.
