BIU - EU prefetch interface for all x86

Postby vladstamate » 2017-10-09 @ 18:04

I am not sure if we talked about this before, but I assume that in a given cycle the processor can read more than 1 byte from the BIU's prefetched data: 16 bits on CPUs prior to the 386 and 32 bits otherwise. So I am thinking it is like this:

8088/8086 - can read 16 bits (2 bytes) per cycle from the prefetch queue (data that is already prefetched)
80286 - 16 bits (2 bytes)
80386SX and DX - 32 bits (4 bytes)

Why is this important? Well, if an instruction requires an immediate and that immediate is 16 bits, then we have the following number of cycles to process it (*):

8088 - 8 cycles to fetch 16 bits and 1 cycle to read it from the BIU
8086 - 4 cycles to fetch 16 bits and 1 cycle to read it from the BIU
80286 - 2 cycles to fetch 16 bits and 1 cycle to read it from the BIU

If the instruction has a 32-bit immediate, an 80386 (DX or SX) will take this long:

80386SX - 2 cycles to fetch 32 bits and 1 cycle to read it from the BIU
80386DX - 1 cycle to fetch 32 bits and 1 cycle to read it from the BIU

Is my assumption correct?

(*) Under ideal circumstances, excluding wait states and other bus activity that might stall the bus, like DMA, etc.
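
The arithmetic above can be sketched as a small C helper. This is only an illustration of the estimates in this post, not confirmed hardware behaviour: the BusModel struct, the function names and the constants are hypothetical, the per-transfer cycle counts are simply the figures assumed above (4 clocks per bus transfer on the 8088/8086, 2 on the 80286), and wait states are ignored as in the footnote.

```c
#include <assert.h>

/* Hypothetical description of a CPU's external bus, for illustration only. */
typedef struct {
    unsigned bus_bytes;           /* bytes moved per bus transfer */
    unsigned cycles_per_transfer; /* clocks per transfer at 0 wait states (assumed) */
} BusModel;

/* Clocks needed to prefetch `bytes` aligned bytes into the queue. */
static unsigned fetch_cycles(BusModel bus, unsigned bytes)
{
    assert(bus.bus_bytes > 0);
    unsigned transfers = (bytes + bus.bus_bytes - 1) / bus.bus_bytes; /* round up */
    return transfers * bus.cycles_per_transfer;
}

/* Prefetch time plus one clock for the EU to read the queue, assuming the
   EU can take the whole immediate in a single read (the open question). */
static unsigned immediate_cycles(BusModel bus, unsigned bytes)
{
    return fetch_cycles(bus, bytes) + 1;
}

/* Figures taken from the list above; not verified against the data sheets. */
static const BusModel BUS_8088  = { 1, 4 };
static const BusModel BUS_8086  = { 2, 4 };
static const BusModel BUS_80286 = { 2, 2 };
```

For a 16-bit immediate, fetch_cycles(BUS_8088, 2) gives 8 and immediate_cycles(BUS_8088, 2) gives 9. The 386 rows can be expressed the same way by adding BusModel values with whatever per-transfer cycle count turns out to be right.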

Postby superfury » 2017-10-09 @ 19:39

I think that's mostly correct for plain MMU/IO accesses, but I believe the PIQ is always read at 1 byte per cycle, even though the BIU might fetch words or dwords from memory into the PIQ. I'm not 100% sure, though (especially since there are cases like fetching at offset 0xFFFF in 16-bit mode).

Edit: Also, an 80386DX+ still fetches data in a minimum of 2 cycles, for any byte/aligned (d)word?

Postby vladstamate » 2017-10-09 @ 20:01

superfury wrote: Edit: Also, an 80386DX+ still fetches data in a minimum of 2 cycles, for any byte/aligned (d)word?

Why would it be 2? With a 32-bit bus, it reads one aligned dword in 1 cycle. So why 2? That's not accounting for wait states, as those can be 0.

Postby vladstamate » 2017-10-09 @ 20:39

From the 386 manual, page 24:

"When used in a configuration with a 32-bit bus, actual transfers of data between processor and memory take place in units of doublewords beginning at addresses evenly divisible by four;"

So 1 cycle.

Postby superfury » 2017-10-09 @ 20:53

So, essentially, small instructions that take a bit of time (4 cycles) will always find the prefetch queue non-empty, due to the ridiculously fast memory access times (4 prefetch bytes in 1 cycle, then up to 4 cycles to read them into the EU)? That's 5 cycles for 4 bytes, or simply 1.25 cycles per instruction byte, or lower for small instructions?

Also, that only explains the fetches themselves; nothing is said about them taking 1 cycle?
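
To make the 5-cycles-for-4-bytes estimate easy to compare against the alternative (the EU taking a whole dword from the queue at once), here is a minimal C sketch. The function name and parameters are made up for illustration, and it assumes the prefetch and the queue reads do not overlap, which real hardware may well do.

```c
#include <assert.h>

/* Average clocks per instruction byte when prefetching `bytes` bytes takes
   `fetch_cycles` clocks and the EU then drains the queue at
   `bytes_per_read` bytes per clock. Fetch and drain are assumed serial. */
static double cycles_per_byte(unsigned fetch_cycles, unsigned bytes,
                              unsigned bytes_per_read)
{
    assert(bytes > 0 && bytes_per_read > 0);
    unsigned reads = (bytes + bytes_per_read - 1) / bytes_per_read; /* round up */
    return (double)(fetch_cycles + reads) / (double)bytes;
}
```

cycles_per_byte(1, 4, 1) reproduces the 1.25 figure for byte-at-a-time reads, while cycles_per_byte(1, 4, 4) gives 0.5 if the EU could read the full dword in one clock.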

Postby vladstamate » 2017-10-09 @ 21:28

superfury wrote: Also, that only explains the fetches themselves; nothing is said about them taking 1 cycle?

My comment about 1 cycle for the 386DX is about the prefetch itself. Why would it not be 1 cycle? Where would another cycle be wasted?

vladstamate wrote: 80386DX - 1 cycle to fetch 32 bits and 1 cycle to read it from the BIU

I also doubt the 386 reads 1 byte at a time from the prefetched data buffer. That would perform terribly. My guess is that it reads either as much as it needs in 1 cycle (up to 4 bytes) or (much less likely) up to 2 bytes.

The question is what the 386SX does, and also what the 286 does, when reading from the prefetched data buffer. Again, my guess is that the 386SX can read 4 bytes and the 286 can read 2 bytes at a time (if necessary). Alignment does not matter anymore, as we are reading from the prefetch buffer, not actual memory.

Postby vladstamate » 2017-10-09 @ 22:08

superfury wrote: So, essentially, small instructions that take a bit of time (4 cycles) will always find the prefetch queue non-empty, due to the ridiculously fast memory access times (4 prefetch bytes in 1 cycle, then up to 4 cycles to read them into the EU)? That's 5 cycles for 4 bytes, or simply 1.25 cycles per instruction byte, or lower for small instructions?

I think that is true. In reality, however, you might have other things contending for the bus, and 386 systems also have memory with 1 (or more) wait states, IIRC.

Postby reenigne » 2017-10-10 @ 07:17

The 8088 and 8086 cannot remove more than one byte from the prefetch queue in any given cycle. We know this because the 8087 needs to track the state of the prefetch queue and the interface it uses to do so doesn't have any way to signal more than one byte removed in any given cycle.
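
For reference, this is roughly what that interface looks like from the coprocessor's side. The sketch below decodes the two queue status pins (QS1:QS0, available in maximum mode) using the encoding as I recall it from the Intel 8086/8088 data sheets, so check it against the data sheet before relying on it. The enum names and the track_queue helper are hypothetical. Note that no code can report more than one byte leaving the queue.

```c
#include <assert.h>

/* Queue status codes on the QS1:QS0 pins (8086/8088 maximum mode). */
typedef enum {
    QS_NO_OPERATION    = 0, /* QS1=0, QS0=0: nothing taken from the queue */
    QS_FIRST_OPCODE    = 1, /* QS1=0, QS0=1: first byte of an instruction taken */
    QS_QUEUE_EMPTIED   = 2, /* QS1=1, QS0=0: queue flushed, e.g. after a jump */
    QS_SUBSEQUENT_BYTE = 3  /* QS1=1, QS0=1: a further instruction byte taken */
} QueueStatus;

/* Combine the two pin levels into a status code. */
static QueueStatus decode_qs(unsigned qs1, unsigned qs0)
{
    return (QueueStatus)(((qs1 & 1u) << 1) | (qs0 & 1u));
}

/* Hypothetical tracker: queue length after one clock with the given status.
   Bytes added by the BIU would be accounted for separately. */
static unsigned track_queue(unsigned length, QueueStatus qs)
{
    switch (qs) {
    case QS_FIRST_OPCODE:
    case QS_SUBSEQUENT_BYTE:
        assert(length > 0);
        return length - 1; /* at most one byte leaves per clock */
    case QS_QUEUE_EMPTIED:
        return 0;
    default:
        return length;
    }
}
```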

Postby superfury » 2017-10-10 @ 07:47

Does that one-byte at a time way of fetching instructions from the PIQ happen with later (286+) processors as well? Even for the 16-bit and 32-bit immediate operands of an instruction (e.g. ADD EAX,00000001h with opcode 05h)?

And is the data fetched into the PIQ from RAM/MMIO in dword quantities when possible (dword-aligned address), in word quantities at word alignment (followed by dword quantities), and otherwise in byte quantities (byte alignment, after that word or dword quantities depending on alignment)?

Also, there's the whole matter of protection, so how does that all combine? Say you are at a dword-aligned address, but the fourth byte is outside the segment limit or the paging limits: what happens? Three bytes fetched and the fourth cleared? The entire fetch aborted as an error? A word fetch followed by a byte fetch, after which fetching stops?
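
To pin down the alignment scheme being asked about, here it is as a small C sketch. This only encodes the question (byte until word-aligned, word until dword-aligned, then dwords); whether any real 286/386 BIU behaves this way is exactly what is unknown. Function names are hypothetical, and segment limit and paging checks are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the next prefetch transfer at `addr` on a bus `bus_bytes` wide,
   under the "largest naturally aligned unit" scheme described above. */
static unsigned fetch_size(uint32_t addr, unsigned bus_bytes)
{
    assert(bus_bytes == 1 || bus_bytes == 2 || bus_bytes == 4);
    if (bus_bytes >= 4 && (addr & 3u) == 0) return 4; /* dword aligned */
    if (bus_bytes >= 2 && (addr & 1u) == 0) return 2; /* word aligned */
    return 1;                                         /* odd address */
}

/* Number of bus transfers needed to prefetch `bytes` bytes from `addr`. */
static unsigned transfers_needed(uint32_t addr, unsigned bytes, unsigned bus_bytes)
{
    unsigned transfers = 0;
    while (bytes > 0) {
        unsigned size = fetch_size(addr, bus_bytes);
        while (size > bytes) size >>= 1; /* do not fetch past the requested span */
        addr += size;
        bytes -= size;
        ++transfers;
    }
    return transfers;
}
```

Starting at address 1 on a 32-bit bus, 7 bytes would take three transfers under this scheme: a byte at 1, a word at 2 and a dword at 4.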

Postby reenigne » 2017-10-10 @ 08:24

superfury wrote:Does that one-byte at a time way of fetching instructions from the PIQ happen with later (286+) processors as well?


The 286 doesn't have the queue status interface that 8088/8086/80188/80186 have, so I have no way to know the size of the remove-from-prefetch-queue bus. I'd expect they'd have made it 16 bits by the time of the 80286 and 32 bits by the time of the 80386, but that's just a guess.

Postby vladstamate » 2017-10-10 @ 12:32

reenigne wrote:
superfury wrote:Does that one-byte at a time way of fetching instructions from the PIQ happen with later (286+) processors as well?


The 286 doesn't have the queue status interface that 8088/8086/80188/80186 have, so I have no way to know the size of the remove-from-prefetch-queue bus. I'd expect they'd have made it 16 bits by the time of the 80286 and 32 bits by the time of the 80386, but that's just a guess.


Yes, there is no way the 386 fetched 1 byte at a time. And yes, there are all the complications with protection faults, unaligned reads and so on. But under ideal circumstances, when none of that happens, if the EU needs a 32-bit quantity for, let's say, a 32-bit ADD EAX, imm, then I suspect it will just scoop it up in one go if all of it is in the queue.

Postby Jepael » 2017-10-10 @ 13:14

It is more complex than just how many bytes per cycle are fetched from the PIQ.

The 80286 is more pipelined than the 8086 was. On a 286, there is an instruction unit (IU) between the execution unit (EU) and the bus interface unit (BIU).

The IU fetches opcode data from the BIU instruction queue, decodes it, and stores the decoded instructions in another queue for execution.

That queue can hold three decoded instructions.

The EU then fetches the decoded instructions and executes them.
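
For emulator authors, the IU-to-EU hand-off described above can be modelled as a three-entry FIFO. This is only a sketch of the structure: the DecodedInsn fields and the function names are hypothetical, and nothing here reflects how the real 286 stores a decoded instruction.

```c
#include <assert.h>
#include <stdbool.h>

#define DECODED_SLOTS 3 /* three decoded instructions, as stated above */

typedef struct {
    unsigned opcode; /* hypothetical decoded fields */
    unsigned length; /* instruction length in bytes */
} DecodedInsn;

typedef struct {
    DecodedInsn slot[DECODED_SLOTS];
    unsigned head;  /* index of the oldest entry */
    unsigned count; /* entries currently queued */
} DecodedQueue;

/* IU side: store a decoded instruction; false means the IU must stall. */
static bool iu_push(DecodedQueue *q, DecodedInsn insn)
{
    assert(q);
    if (q->count == DECODED_SLOTS) return false;
    q->slot[(q->head + q->count) % DECODED_SLOTS] = insn;
    ++q->count;
    return true;
}

/* EU side: take the oldest decoded instruction; false means none is ready. */
static bool eu_pop(DecodedQueue *q, DecodedInsn *out)
{
    assert(q && out);
    if (q->count == 0) return false;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % DECODED_SLOTS;
    --q->count;
    return true;
}
```

With this split, the question of how many bytes leave the prefetch queue per cycle moves to the IU, while the EU only ever sees whole decoded instructions.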

Postby vladstamate » 2017-10-10 @ 14:04

Ohh, interesting! I did not account for that. Do you happen to have a link or a document where I can find more about this?

This brings up even more questions, like: is the IU fetching the immediate values (if any) as part of its decoding?

Postby Jepael » 2017-10-10 @ 16:53

vladstamate wrote:Ohh, interesting! I did not account for that. Do you happen to have a link or a document where I can find more about this?

This brings up even more questions, like: is the IU fetching the immediate values (if any) as part of its decoding?


Not much more info, just Wikipedia and the Intel manuals.

Apparently EA calculation is performed in a dedicated unit as well, so it's not done on ALU any more.
That's also obvious from block diagram; there is a block called addressing unit (AU) as well.

https://en.wikipedia.org/wiki/Intel_80286
https://archive.org/details/bitsavers_intel80286areReferenceManual1987_14090554, figure 3-1

Postby vladstamate » 2017-10-10 @ 17:04

Thank you Jepael, this is very good information.

Postby vladstamate » 2017-10-10 @ 17:06

Jepael wrote:Apparently EA calculation is performed in a dedicated unit as well, so it's not done on ALU any more.
That's also obvious from block diagram; there is a block called addressing unit (AU) as well.


In light of what you said, this makes sense: it is probably there to help with the pipelining. I assume it is so that the IU can do the EA calculation as part of decoding while the EU is executing instructions (which might require the ALU).

Postby superfury » 2017-10-12 @ 14:56

I've just modified the instruction fetching to allow single-cycle word/dword fetches from the PIQ, and updated my BIU emulation to be able to fetch 16-bit and 32-bit quantities from memory (as long as they're aligned, although I'm not 100% sure the check for odd 16/32-bit subunits is 100% OK):

Fetching process itself:
Code:
byte PIQ_block = 0; //Blocking any PIQ access now?
OPTINLINE void CPU_fillPIQ() //Fill the PIQ until it's full!
{
   uint_32 realaddress;
   if ((PIQ_block==1) || (PIQ_block==9)) { PIQ_block = 0; return; /* Blocked access: only fetch one byte/word instead of a full word/dword! */ }
   if (unlikely(BIU[activeCPU].PIQ==0)) return; //Not gotten a PIQ? Abort!
   realaddress = BIU[activeCPU].PIQ_Address; //Next address to fetch!
   checkMMUaccess_linearaddr = (CPU[activeCPU].SEG_base[CPU_SEGMENT_CS]+realaddress); //Default 8086-compatible address to use, otherwise, it's overwritten by checkMMUaccess with the proper linear address!
   if (unlikely(checkMMUaccess(CPU_SEGMENT_CS,CPU[activeCPU].registers->CS,realaddress,0x10|3,getCPL(),0,0))) return; //Abort on fault!
   if (unlikely(is_paging())) //Are we paging?
   {
      checkMMUaccess_linearaddr = mappage(checkMMUaccess_linearaddr,0,getCPL()); //Map it using the paging mechanism!      
   }
   writefifobuffer(BIU[activeCPU].PIQ, BIU_directrb(checkMMUaccess_linearaddr,0)); //Add the next byte from memory into the buffer!
   if (unlikely(checkMMUaccess_linearaddr&1)) //Read an odd address?
   {
      PIQ_block &= 5; //Start blocking when it's 3(byte fetch instead of word fetch), also include dword odd addresses. Otherwise, continue as normally!      
   }
   ++BIU[activeCPU].PIQ_Address; //Increase the address to the next location!
   //Next data! Take 4 cycles on 8088, 2 on 8086 when loading words/4 on 8086 when loading a single byte.
}


Main calling code, which fetches bytes/words/dwords in one go (in byte subchunks, checking each of them until an error or a misalignment occurs, the latter signalled by cleared PIQ_block bits):
Code:
PIQ_RequiredSize = 1; //Minimum of 2 bytes required for a fetch to happen!
PIQ_CurrentBlockSize = 3; //We're blocking after 1 byte access when at an odd address!
if (EMULATED_CPU>=CPU_80386) //386+?
{
   PIQ_RequiredSize |= 2; //Minimum of 4 bytes required for a fetch to happen!
   PIQ_CurrentBlockSize |= 4; //Apply 32-bit quantities as well, when allowed!
}
if (BIU_processRequests(memory_waitstates)) //Processing a request?
{
   BIU[activeCPU].requestready = 0; //We're starting a request!
   ++BIU[activeCPU].prefetchclock; //Tick!
}
else if (fifobuffer_freesize(BIU[activeCPU].PIQ)>PIQ_RequiredSize) //Prefetch cycle when not requests are handled(2 free spaces only)? Else, NOP cycle!
{
   CPU[activeCPU].BUSactive = 1; //Start memory cycles!
   PIQ_block = PIQ_CurrentBlockSize; //We're blocking after 1 byte access when at an odd address at an odd word/dword address!
   CPU_fillPIQ(); CPU_fillPIQ(); //Add a word to the prefetch!
   if (PIQ_RequiredSize&2) //DWord access, when allowed?
   {
      CPU_fillPIQ(); CPU_fillPIQ(); //Add another word to the prefetch!
   }
   ++CPU[activeCPU].cycles_Prefetch_BIU; //Cycles spent on prefetching on BIU idle time!
   BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
   BIU[activeCPU].requestready = 0; //We're starting a request!
   ++BIU[activeCPU].prefetchclock; //Tick!
}
else //Nothing to do?
{
   BIU[activeCPU].stallingBUS = 2; //Stalling!
}


Edit: Fixed up the data transfer size on 32-bit quantities a bit:
Code:
if ((PIQ_RequiredSize&2) && ((EMULATED_CPU>=CPU_80386) && (CPU_databussize==0))) //DWord access on a 32-bit BUS, when allowed?
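
For readers without the rest of the emulator source, the EU-side half of this change (taking a word or dword out of the PIQ in a single step) can be sketched in a self-contained way. This is not the emulator's code: the Piq type, the function names and the 16-byte capacity are assumptions for illustration, and the read simply fails without consuming anything when the queue holds too few bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PIQ_SIZE 16 /* capacity chosen for illustration */

typedef struct {
    uint8_t data[PIQ_SIZE];
    unsigned head;  /* index of the oldest byte */
    unsigned count; /* bytes currently queued */
} Piq;

static unsigned piq_free(const Piq *q) { return PIQ_SIZE - q->count; }

/* BIU side: append one prefetched byte; false when the queue is full. */
static bool piq_write(Piq *q, uint8_t b)
{
    assert(q);
    if (q->count == PIQ_SIZE) return false;
    q->data[(q->head + q->count) % PIQ_SIZE] = b;
    ++q->count;
    return true;
}

/* EU side: read 1, 2 or 4 bytes (little-endian) in one step.
   Fails without consuming anything if fewer than `size` bytes are queued. */
static bool piq_read(Piq *q, unsigned size, uint32_t *out)
{
    assert(q && out);
    if (size != 1 && size != 2 && size != 4) return false;
    if (q->count < size) return false;
    uint32_t value = 0;
    for (unsigned i = 0; i < size; ++i)
        value |= (uint32_t)q->data[(q->head + i) % PIQ_SIZE] << (8 * i);
    q->head = (q->head + size) % PIQ_SIZE;
    q->count -= size;
    *out = value;
    return true;
}
```

Feeding in the bytes 05 01 00 00 00 (ADD EAX,00000001h), a 1-byte read returns the opcode 05h and a following 4-byte read returns the immediate 1 in a single step. A caller that gets false back would stall the EU for that cycle and retry once the BIU has added more bytes.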

Postby superfury » 2017-10-12 @ 15:20

A little update on the current 80386 timings (with the new DX/SX behaviour and PIQ fetching, as implemented in the above post):

[Image: Compaq Deskpro 386 - 386DX running MIPS 1.10 (456-Compaq Deskpro 386 16MHz 80386DX accuracy.jpg)]

[Image: Compaq Deskpro 386 - 386SX running MIPS 1.10 (457-Compaq Deskpro 386 16MHz 80386SX accuracy.jpg)]

Of course, the same kind of improvement happens on the 80286 CPU, which is also a bit faster due to the new word fetching from memory and from the PIQ.