UniPCemu cycle accurate 8088 implementation


Postby superfury » 2017-3-30 @ 12:51

I'm just wondering: what does the execution unit of an 8086+ CPU do when it needs to fetch immediate data larger than 8 bits from the prefetch queue? Does it wait until 2 or 4 bytes are available in the prefetch queue and then read the parameter from the prefetch buffer in one go (a 16-bit or 32-bit read, or maybe a 64-bit pointer read on 32-bit CPUs, i.e. 8 bytes of immediate parameters)? Or does it read one byte, wait for the prefetcher to fetch another byte if needed, and repeat until the whole parameter (or all parameters) has been fetched?
Last edited by superfury on 2017-4-09 @ 01:11, edited 1 time in total.
superfury
l33t
 
Posts: 3230
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: x86 prefetching and larger immediate parameters?

Postby reenigne » 2017-3-31 @ 07:09

superfury wrote:I'm just wondering, what does the 8086+ CPU's execution unit do when needing to fetch immediate data larger than 8-bits from the prefetch queue? Does it wait until 2 or 4 bytes are available in the prefetch queue, then read the parameter in one go from the prefetch buffer(16-bit or 32-bit read, maybe 64-bit pointer read on 32-bit CPUs(8 bytes of immediate parameters))? Or does it read the data one byte, then wait for the prefetch to fetch another byte if needed, then repeat until the whole parameter or parameters are fetched?


8088 and 8086 can only read one byte from the prefetch queue into the EU per cycle. I know this is true even for the 8086 because the CPU needs to be able to tell the FPU about the queue and the interface by which it does so doesn't allow for more than one byte per cycle.

If the EU needs a 16-bit immediate from the prefetch queue, it won't wait for the second byte to be in the queue before taking the first byte out (at least on 8088) - I can see this from bus sniffer logs.

I have no idea about later CPUs.
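
For illustration, the byte-at-a-time behaviour described above could be modelled roughly as follows. This is only a minimal sketch under stated assumptions, not code from UniPCemu or any other emulator: PrefetchQueue, Imm16Fetch and imm16_step are hypothetical names, the queue is assumed to hold 4 bytes (as on the 8088), and one call to imm16_step stands for one EU cycle.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 4-byte prefetch queue, sized like the 8088's. */
typedef struct {
    uint8_t data[4];
    size_t  head;   /* index of the oldest byte */
    size_t  count;  /* bytes currently queued */
} PrefetchQueue;

/* Append one byte; returns 0 if the queue is full. */
static int queue_push(PrefetchQueue *q, uint8_t b)
{
    if (q->count == 4) return 0;
    q->data[(q->head + q->count) % 4] = b;
    q->count++;
    return 1;
}

/* Remove the oldest byte; returns 0 if the queue is empty. */
static int queue_pop(PrefetchQueue *q, uint8_t *out)
{
    if (q->count == 0) return 0;
    *out = q->data[q->head];
    q->head = (q->head + 1) % 4;
    q->count--;
    return 1;
}

/* Resumable 16-bit immediate fetch; the state survives stalls. */
typedef struct {
    uint16_t value;
    int      bytes_done; /* 0, 1 or 2 */
} Imm16Fetch;

/* One EU cycle: take at most one byte. Returns 1 once both bytes are in. */
static int imm16_step(Imm16Fetch *f, PrefetchQueue *q)
{
    uint8_t b;
    if (f->bytes_done < 2 && queue_pop(q, &b)) {
        /* Little-endian: the low byte arrives first. */
        f->value |= (uint16_t)((uint16_t)b << (8 * f->bytes_done));
        f->bytes_done++;
    }
    return f->bytes_done == 2;
}
```

The point of the sketch is that the first byte is taken as soon as it is available, and a cycle in which the queue is empty simply leaves the partial state untouched until the next byte arrives.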
reenigne
Oldbie
 
Posts: 509
Joined: 2006-11-30 @ 05:13
Location: Cornwall, UK

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-3-31 @ 07:36

So if I understand it correctly, instruction fetching goes like this?

- Byte available? Fetch from prefetch(1 cycle)
- Byte not available? Do nothing(1 cycle)

Every 4th cycle (state 3), the BIU checks for bus requests. If there is one, it performs that access. Otherwise, it fetches a byte into the prefetch queue.

So fetching an instruction byte takes anywhere from 1 to 4 cycles, depending on whether the prefetch buffer is filled?

Also, a little question about the 8088 MPH metric cycle count: is this the total number of cycles (or instructions) assumed to have been spent, based on the PIT? So it sets the PIT, executes instructions (flushing the prefetch queue with a jmp first), reads the PIT again and converts the difference into a number of cycles? So 1604 cycles means it's running too fast? Currently, PIQ fetches take no time at all. Do they need to take 1 cycle?

Re: x86 prefetching and larger immediate parameters?

Postby reenigne » 2017-3-31 @ 09:39

superfury wrote:So if I understand it correctly, instruction fetching goes like this?

- Byte available? Fetch from prefetch(1 cycle)
- Byte not available? Do nothing(1 cycle)


Right (assuming the EU needs a byte from the queue).

superfury wrote:Every 4th cycle(state 3), the BIU checks for BUS requests. If so, fetch. Else, fetch byte into prefetch.


Not necessarily every 4th cycle - just when the bus is free. If the prefetch queue is full or prefetching is disabled then the bus goes idle and there's no latency for the next bus access.

superfury wrote:So fetching an instruction byte takes either 1 or up to 4 cycles, depending on the prefetch buffer being filled or not?


Not sure what you mean here. If the byte the EU needs is in the prefetch queue then moving it from the queue to the EU takes 1 cycle. Otherwise it'll need to be fetched by the BIU first, which can take any number of cycles (even less than 4 if the byte is currently in the process of being prefetched).
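
As a rough illustration of that last point, the wait seen by the EU could be computed like this. It is a simplified sketch, not a description of the real BIU: BusState and eu_wait_cycles are hypothetical names, wait states are ignored, and it assumes the byte becomes usable once T4 of the prefetch bus cycle has completed.

```c
/* Simplified bus states: idle, or one of the four T-states of a prefetch. */
typedef enum {
    BUS_IDLE = 0,
    BUS_T1 = 1,
    BUS_T2 = 2,
    BUS_T3 = 3,
    BUS_T4 = 4
} BusState;

/*
 * Cycles the EU has to wait before a byte is available in the queue.
 * Assumptions (not from the thread): no wait states, and "bus" is the
 * T-state of the current clock of a prefetch that is already running.
 */
static int eu_wait_cycles(int queued_bytes, BusState bus)
{
    if (queued_bytes > 0) return 0;   /* byte already queued: no wait */
    if (bus == BUS_IDLE)  return 4;   /* a full bus cycle still has to run */
    return 4 - (int)bus + 1;          /* finish the cycle already in flight */
}
```

With an empty queue and a prefetch currently in T3, this gives a wait of 2 cycles, which is the "less than 4" case mentioned above.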

superfury wrote:Also, a little question about the 8088 MPH metric cycle count: is this actually the amount of cycles(or instructions) totally assumed to be spent, based on the PIT? So it sets the PIT, executes instructions(flushing prefetch by jmp first), reads the PIT again and converts it to an amount of cycles it's difference represents? So the 1604 cycles means it's running too fast?


Yes, the code measures the number of PIT cycles taken to run a certain routine consisting of a variety of different instructions, and gives the "This system is not the intended target for this program" if it does not take between 1668 and 1688 PIT cycles, so 1604 is too fast. There's no jmp to flush prefetch, but the timer access is done from a subroutine, so the call/ret will flush. The cycle count on a real system may vary slightly due to different PIT and refresh phases at the start of the routine, but that variance will be much less than 10 PIT cycles.
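
The check described above amounts to something like the following sketch. The bounds are the ones quoted in this post; treating them as inclusive is an assumption, and classify_pit_count and pit_ticks_to_cpu_cycles are hypothetical names, not taken from the demo's source. The conversion relies on the PIT input clock on the IBM PC being the CPU clock divided by 4.

```c
/* Bounds quoted in the post above (assumed inclusive). */
#define MPH_PIT_MIN 1668
#define MPH_PIT_MAX 1688

typedef enum {
    TIMING_TOO_FAST = -1,
    TIMING_OK = 0,
    TIMING_TOO_SLOW = 1
} TimingResult;

/* Classify a measured PIT tick count against the accepted window. */
static TimingResult classify_pit_count(unsigned pit_ticks)
{
    if (pit_ticks < MPH_PIT_MIN) return TIMING_TOO_FAST;
    if (pit_ticks > MPH_PIT_MAX) return TIMING_TOO_SLOW;
    return TIMING_OK;
}

/* One PIT tick corresponds to four CPU clocks on the IBM PC. */
static unsigned long pit_ticks_to_cpu_cycles(unsigned pit_ticks)
{
    return 4UL * pit_ticks;
}
```

For example, classify_pit_count(1604) reports the routine as too fast, and 1604 PIT ticks correspond to 6416 CPU cycles.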

superfury wrote:Currently, PIQ fetches take no time at all, needing to be 1 cycle?


What's a PIQ fetch?

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-3-31 @ 11:43

reenigne wrote:What's a PIQ fetch?



By a PIQ fetch I mean reading an opcode to execute from the PIQ. So currently there are two stages:
1. Fetch opcodes to execute from the PIQ when available (the current build takes 0 cycles to do that). Otherwise, take them directly from memory (taking 4 cycles).
2. After executing each opcode, divide the non-memory cycles (total cycles minus memory cycles) by 4 to get the number of idle bus cycles. It then uses that number to fetch bytes from memory into the PIQ until the PIQ is full.

Code that fetches from memory into the PIQ after each instruction:
Code:
void CPU_tickPrefetch()
{
   if (!CPU[activeCPU].PIQ) return; //Disable invalid PIQ!
   byte cycles;
   cycles = CPU[activeCPU].cycles; //How many cycles have been spent on the instruction?
   cycles -= CPU[activeCPU].cycles_MMUR; //Don't count memory access cycles!
   cycles -= CPU[activeCPU].cycles_MMUW; //Don't count memory access cycles!
   cycles -= CPU[activeCPU].cycles_IO; //Don't count I/O access cycles!
   cycles -= CPU[activeCPU].cycles_Prefetch; //Don't count memory access cycles by prefetching required data!
   //Now we have the amount of cycles we're idling.
   if (EMULATED_CPU<CPU_80286) //Old CPU?
   {
      for (;(cycles >= 4) && fifobuffer_freesize(CPU[activeCPU].PIQ);) //Prefetch left to fill?
      {
         CPU_fillPIQ(); //Add a byte to the prefetch!
         cycles -= 4; //This takes four cycles to transfer!
         CPU[activeCPU].cycles_Prefetch_BIU += 4; //Cycles spent on prefetching on BIU idle time!
      }
   }
   else //286+
   {
      for (;(cycles >= (2+CPU286_WAITSTATE_DELAY)) && fifobuffer_freesize(CPU[activeCPU].PIQ);) //Prefetch left to fill?
      {
         CPU_fillPIQ(); //Add a byte to the prefetch!
         cycles -= (2+CPU286_WAITSTATE_DELAY); //This takes two cycles plus waitstates to transfer!
         CPU[activeCPU].cycles_Prefetch_BIU += (2+CPU286_WAITSTATE_DELAY); //Cycles spent on prefetching on BIU idle time!
      }
   }
}


And the code that reads bytes from the prefetch queue or memory to execute:
Code:
byte CPU_readOP() //Reads the operation (byte) at CS:EIP
{
   byte result; //Buffer from the PIQ and actual memory data!
   uint_32 instructionEIP = CPU[activeCPU].registers->EIP++; //Our current instruction position is increased always!
   if (CPU[activeCPU].PIQ) //PIQ present?
   {
      PIQ_retry: //Retry after refilling PIQ!
      if (readfifobuffer(CPU[activeCPU].PIQ,&result)) //Read from PIQ?
      {
         if (checkMMUaccess(CPU_SEGMENT_CS, CPU[activeCPU].registers->CS, instructionEIP,3,getCPL(),!CODE_SEGMENT_DESCRIPTOR_D_BIT())) //Error accessing memory?
         {
            return 0xFF; //Abort on fault!
         }
         if (cpudebugger) //We're an OPcode retrieval and debugging?
         {
            MMU_addOP(result); //Add to the opcode cache!
         }
         CPU[activeCPU].cycles_OP += 1; //Fetching from prefetch takes 1 cycle!
         return result; //Give the prefetched data!
      }
      //Not enough data in the PIQ? Refill for the next data!
      CPU_fillPIQ(); //Fill instruction cache with next data!
      goto PIQ_retry; //Read again!
   }
   if (checkMMUaccess(CPU_SEGMENT_CS, CPU[activeCPU].registers->CS, instructionEIP,3,getCPL(),!CODE_SEGMENT_DESCRIPTOR_D_BIT())) //Error accessing memory?
   {
      return 0xFF; //Abort on fault!
   }
   result = MMU_rb(CPU_SEGMENT_CS, CPU[activeCPU].registers->CS, instructionEIP, 3,!CODE_SEGMENT_DESCRIPTOR_D_BIT()); //Read OPcode directly from memory!
   if (cpudebugger) //We're an OPcode retrieval and debugging?
   {
      MMU_addOP(result); //Add to the opcode cache!
   }
   return result; //Give the result!
}


The line adding 1 cycle to cycles_OP (essentially the total cycles taken by the current step, i.e. the raw EU cycles of the execution phase) was added just now (it is not in the current build) to test the result against 8088 MPH. It messes the timing up a bit, resulting in 8088 MPH counting 1738 cycles instead of 1604 cycles.
Edit: Just ran it again on CGA instead of VGA. Now it gets 1739 cycles instead.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-3-31 @ 12:01

Just ran the 8088 MPH demo again: now the carrier wave in the credits is a very high-pitched noise instead (sounding a little shaky, with a bit of vibrato), and the Back to the Future car has parts (at the start and when scrolling off screen) where it gets out of sync (lines of the car flicker on and off against the background).

Re: x86 prefetching and larger immediate parameters?

Postby reenigne » 2017-3-31 @ 12:46

superfury wrote:With a PIQ fetch I mean reading an opcode to execute from the PIQ.


So PIQ is what you're calling the prefetch queue?

superfury wrote:So currently, there's two stages:
1. Fetch opcodes to execute either from PIQ when available(current build takes 0 cycles to do that). Otherwise, take directly from memory(Taking 4 cycles).
2. After executing each opcode, divide the non-memory cycles(total cycles minus memory cycles) by 4 to get the amount of idle cycles. Then it uses that number to fetch from memory to the PIQ until the PIQ is full.


That's not how real hardware works, so I'm not convinced it will be possible to do cycle-accurate emulation with this method. For example, sometimes bus cycles take more than 4 cycles (for example if they're delayed by a DRAM refresh).

superfury wrote:The row adding 1 cycle to the cycles_OP(essentially the total cycles taken by the current step(execution phase essentially, so raw EU cycles that are taken in total)) is just added(not in the current build) to test the result against 8088 MPH. This messes timing up a bit, resulting 8088 MPH counting 1738 cycles instead of 1604 cycles.
Edit: Just ran again on CGA instead of VGA, now it gets 1739 cycles instead.


Presumably that's because these "transfer byte from prefetch queue to EU" cycles are already included in the published instruction timings. Remember, these timings are best case timings for real hardware, not theoretical best case timings for an EU connected to an infinitely fast prefetch queue.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-3-31 @ 15:36

I've modified the entire (fetch-decode)-execute process to perform the fetch and decode in separate steps, allowing CPU_readOP (the function that reads from the PIQ, or currently forces the PIQ to load a byte when it is empty and then returns it) to stall the CPU instead. Currently, though, stalling still takes 0 cycles.

I've tried to run the Turbo XT BIOS with my CGA emulation, but for some reason it ends up at F000:0000. This address shouldn't be accessed (no ROM or anything else is loaded at that location)?
Attachment: debugger_TurboXTBIOS_booting.zip (1.58 MiB)
Booting process of the Turbo XT BIOS, quit the emulator when arriving at F000:0000.


Can you see what's going wrong here? It's revision 3.0 of the BIOS (source code is included with the BIOS). Before applying the latest commits (the ones changing and fixing the way instructions are fetched into a format that can handle wait states), the BIOS ran without problems (although slowly on CGA).

Edit: I've just tried various tools like WinDiff to compare the two text files, but all applications on Windows simply crash when trying to diff against the large text file (even with the timestamps stripped). They probably can't handle a 406MB text file? Although the first 13MB should be enough (the log of the emulator up to reaching F000:0000 is only 12.5MB, or 13,146,097 bytes).

Edit: Just found a splitter ( https://sourceforge.net/projects/file-splitter/ ) to split it into 30MB chunks, then compared the new debugger log with the old one. It seems that the ModR/M processing is going completely wrong somehow:

Correct code (old version):
Code:
Writing to memory: 00000000=AA (ª)
Writing to RAM: 00000000=AA (ª)
Writing to memory: 00000001=55 (U)
Writing to RAM: 00000001=55 (U)
ModR/M address: 0000:0000=00000000
F000:E0C9 (268915)MOVW [ES:DI],DX
EU&BIU cycles: 41, Operation cycles: 19, HW interrupt cycles: 0, Prefix cycles: 2, Exception cycles: 0, MMU read cycles: 0, MMU write cycles: 8, I/O bus cycles: 0, Prefetching cycles: 28, BIU prefetching cycles: 16
Registers:
AX: 0000, BX: 0000, CX: 0000, DX: 55AA
CS: F000, DS: 0040, ES: 0000, SS: 0000
SP: 0000, BP: 0000, SI: 0000, DI: 0000
IP: E0C9, FLAGS: F046
FLAGSINFO:c1P0a0Zstido1111
Interrupt status: 0000000000000000
VGA@4,0(CRT:2004,0)
Display=0,0


New code (incorrect handling of ModR/M):
Code:
VGA@4,0(CRT:2100,0)
Display=0,0

F000:E0C9 (268915263B)ADC AX, 3B26
EU&BIU cycles: 31, Operation cycles: 9, HW interrupt cycles: 0, Prefix cycles: 2, Exception cycles: 0, MMU read cycles: 0, MMU write cycles: 0, I/O bus cycles: 0, Prefetching cycles: 28, BIU prefetching cycles: 8
Registers:
AX: 0000, BX: 0000, CX: 0000, DX: 55AA
CS: F000, DS: 0040, ES: 0000, SS: 0000
SP: 0000, BP: 0000, SI: 0000, DI: 0000
IP: E0C9, FLAGS: F046
FLAGSINFO:c1P0a0Zstido1111
Interrupt status: 0000000000000000
VGA@6,0(CRT:2118,0)
Display=0,0

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-3-31 @ 19:52

Just tested again after fixing the opcode reading bug. This code:
Code:
nextprefix: //Try next prefix/opcode?
         if (CPU_readOP(OP)) return 1; //Read opcode or prefix?
         if (CPU_isPrefix(*OP)) //We're a prefix?
         {
            CPU[activeCPU].cycles_Prefix += 2; //Add timing for the prefix!
            if (ismultiprefix && (EMULATED_CPU <= CPU_80286)) //This CPU has the bug and multiple prefixes are added?
            {
               CPU_InterruptReturn = last_eip; //Return to the last prefix only!
            }
            CPU_setprefix(*OP); //Set the prefix ON!
            last_eip = CPU[activeCPU].registers->EIP; //Save the current EIP of the last prefix possibility!
            ismultiprefix = 1; //We're multi-prefix now when triggered again!
            if (CPU_readOP(OP)) return 1; //Next opcode/prefix!
            if (CPU[activeCPU].faultraised) return 1; //Abort on fault!
            goto nextprefix; //Try the next prefix!
         }
         else //No prefix? We've read the actual opcode!
         {
            CPU[activeCPU].instructionfetch.CPU_fetchphase = 3; //Advance to stage 3: Fetching 0F instruction!
         }


needed to be:
Code:
nextprefix: //Try next prefix/opcode?
         if (CPU_readOP(OP)) return 1; //Read opcode or prefix?
         if (CPU[activeCPU].faultraised) return 1; //Abort on fault!
         if (CPU_isPrefix(*OP)) //We're a prefix?
         {
            CPU[activeCPU].cycles_Prefix += 2; //Add timing for the prefix!
            if (ismultiprefix && (EMULATED_CPU <= CPU_80286)) //This CPU has the bug and multiple prefixes are added?
            {
               CPU_InterruptReturn = last_eip; //Return to the last prefix only!
            }
            CPU_setprefix(*OP); //Set the prefix ON!
            last_eip = CPU[activeCPU].registers->EIP; //Save the current EIP of the last prefix possibility!
            ismultiprefix = 1; //We're multi-prefix now when triggered again!
            goto nextprefix; //Try the next prefix!
         }
         else //No prefix? We've read the actual opcode!
         {
            CPU[activeCPU].instructionfetch.CPU_fetchphase = 3; //Advance to stage 3: Fetching 0F instruction!
         }


Thus it was reading the opcodes incorrectly: each time it found an instruction prefix, it read the following byte and then, on jumping back to nextprefix, read yet another byte, discarding the one it had already read.
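
The intended control flow can be shown in isolation with a small standalone sketch. This is not the UniPCemu code: is_prefix and decode_opcode are hypothetical names, and the byte sequence 26 89 15 used in the note below is the instruction from the debugger log earlier in this post.

```c
#include <stdint.h>
#include <stddef.h>

/* 8086 prefix bytes: segment overrides, LOCK, REPNE/REP. */
static int is_prefix(uint8_t b)
{
    switch (b) {
    case 0x26: case 0x2E: case 0x36: case 0x3E: /* ES: CS: SS: DS: */
    case 0xF0: case 0xF2: case 0xF3:            /* LOCK REPNE REP  */
        return 1;
    default:
        return 0;
    }
}

/*
 * Consumes prefixes starting at code[*pos] and returns the opcode.
 * Exactly one byte is read per loop iteration, so the byte that ends
 * the prefix run is the opcode itself and is never skipped.
 * Returns -1 if the buffer ends before an opcode is found.
 */
static int decode_opcode(const uint8_t *code, size_t len,
                         size_t *pos, unsigned *prefix_count)
{
    *prefix_count = 0;
    while (*pos < len) {
        uint8_t b = code[(*pos)++];
        if (!is_prefix(b)) return b;
        (*prefix_count)++;
    }
    return -1;
}
```

Fed the bytes 26 89 15, this returns opcode 89 with one prefix consumed. Reading one extra byte after the prefix, as the broken version did, lands on 15 instead, which matches the bogus ADC AX decode in the log.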

Edit: I've managed to get my own BIOS working again after switching the emulator to prefetch-only input. The current opcode was being discarded if the prefetch unit was empty while fetching the opcode or its parameters. Simply making the opcode variable static fixed this, allowing the emulator 'BIOS' to start, show its boot text and start the real BIOS (Generic Super PC/Turbo XT BIOS v3.0). It still ends up at a location where it isn't supposed to be, but that's the next problem to find and fix :P

The 8086/8088 BIU is now roughly emulated by checking whether the CPU is stalling (<4 cycles without any memory or hardware I/O). If so, it increments its own counter (a 4-cycle BIU clock on the 808X, a 2-cycle clock on the 80286+) and compares it. Otherwise, the counter is only incremented and no action is taken (essentially a BIU NOP). When the counter matches 3 (808X) or 1 (80286+), it forces a fetch from memory into the BIU. The fetched instruction byte is available to the virtual execution unit on the next cycle.

Current source code: https://bitbucket.org/superfury/unipcem ... ?at=master
ModR/M routines (modrm_readparams is used by the CPU_readOP_prefix routine), which now stall when needed as well (they're linked to CPU_readOP): https://bitbucket.org/superfury/unipcem ... ?at=master

Fully accurate timings (EU delays and memory/I/O timings) for the execution part (the opcode-specific handler in CPU_OP, called through the jump table) aren't supported yet.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-01 @ 14:46

On a side note, I've just modified all Makefiles except the PSP one to show simple text when compiling (like the Android NDK does) instead of the full commands. I've also added an install command that copies the compiled executable to the bin directory, and the build now creates .d dependency files so that only files with changed headers or code are recompiled. This should fix many of the problems with dependencies changing when updated (Visual C++ and the Android NDK do this already).

The only little Makefile thing left concerns the parameters (phony triggers) given to the Makefile to make it build correctly (e.g. the SDL2 parameter to select SDL2 instead of SDL, the (re)build/(re)profile/(re)debug/analyze(2) and install parameters, and the Windows x64 parameter to build 64-bit executables): Make keeps complaining that there is nothing to be done for those targets?
Code:
Using 64-bit executable
make: Nothing to be done for `win'.
Compiling exception/exception.c
... lots of compiling project files listed here ...
Creating release executable ../../projects_build/UniPCemu/UniPCemu_x64.exe...
make: Nothing to be done for `SDL2'.
make: Nothing to be done for `x64'.


Is there any way to make it stop spewing out those "Nothing to be done for `phonytrigger'." messages? ".PHONY: ..." targets have been declared inside the Makefiles for the platforms (and in the general multiplatform Makefile), but it still complains about them like that?

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-01 @ 19:14

I've managed to improve the prefetch somewhat:
- It was reading bytes into the prefetch queue while the queue was still full. This went unchecked, discarding bytes from memory while still advancing the prefetcher's idea of where to fetch next.
- It was resetting the CPU state (allowing interrupts for the next instruction etc.) while the instruction was still being fetched.

Somehow, I end up at F000:E064 (according to the prefetch queue location), but the CPU's EIP value is pointing to F000:E069 instead. Something's going wrong here...

Edit: More progress: it was increasing EIP while the PIQ stayed stuck at the correct address! The PIQ was still in the process of fetching data from memory (or waiting for state T3), but the CPU_readOP function was checking EIP and increasing it before checking the prefetch queue for data to return. This caused the EIP register to be increased too far (it increased even when the PIQ didn't have any data to deliver). Now I at least see the CGA starting up again (just for testing, I've switched the emulated monitor to CGA for ease of setup and bug checking against the older version).
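
Both fixes come down to the same rule: a pointer only advances when the byte actually moved. A minimal standalone sketch of that invariant, with hypothetical names (Fetcher, biu_fill, eu_read) that are not the emulator's own, and a flat 16-bit address space assumed for simplicity:

```c
#include <stdint.h>

#define PIQ_SIZE 4

typedef struct {
    uint8_t  data[PIQ_SIZE];
    unsigned head, count;
    uint16_t fetch_ip; /* next address the BIU will prefetch from */
    uint16_t ip;       /* next address the EU will consume */
} Fetcher;

/* BIU side: store a byte only if there is room, and only then advance fetch_ip. */
static int biu_fill(Fetcher *f, const uint8_t *mem)
{
    if (f->count == PIQ_SIZE) return 0; /* full: nothing read, nothing lost */
    f->data[(f->head + f->count) % PIQ_SIZE] = mem[f->fetch_ip];
    f->count++;
    f->fetch_ip++;
    return 1;
}

/* EU side: returns 1 and advances ip only when a byte was delivered. */
static int eu_read(Fetcher *f, uint8_t *out)
{
    if (f->count == 0) return 0; /* stall: ip stays put */
    *out = f->data[f->head];
    f->head = (f->head + 1) % PIQ_SIZE;
    f->count--;
    f->ip++;
    return 1;
}
```

With both guards in place, fetch_ip minus ip always equals the number of queued bytes, which is an easy condition to assert while debugging this kind of drift.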

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-01 @ 20:14

The prefetching seems to be working somewhat now. The CGA is initialized, and the BIOS now seems to be printing NULL/space characters indefinitely? I see nothing printed, but the cursor keeps advancing across the screen, left to right, top to bottom. Maybe the routine that moves the cursor forward is hanging?

Current CPU prefetching (readOP and related; the MMU is only called during prefetching or when there is no PIQ. All opcode fetching functions have readOP in their name, except for tickPrefetch, which calls CPU_fillPIQ to read a byte from memory into the PIQ): https://bitbucket.org/superfury/unipcem ... ?at=master

The ModR/M code handles it in much the same way (modrm_readparams): https://bitbucket.org/superfury/unipcem ... ?at=master

Can you see anything going wrong here?

Edit: The main CPU code fetching routine is CPU_readOP_prefix, which calls modrm_readparams (reading ModR/M parameters) and the three readOP functions to read opcodes and parameters. The function returns 1 when it has to wait for the PIQ to fill, 0 otherwise.

The main CPU step routine (CPU_exec) also calls the main instruction handler (CPU_OP), which calls CPU8086_OPXX (replace XX with the opcode) to execute the instruction (currently in one go). This doesn't directly support idle cycles yet. It just adds the number of memory and execution cycles to consume to the CPU.cycles_OP etc. variables. At the end, tickPrefetch is called to update the BIU. This happens after each step (essentially everything up to that point is the EU working). CPU_exec then returns to the main loop, which updates all hardware accordingly, syncs to real time (after a block of CPU+hardware calls) using the high-resolution clock, and finally updates the display if needed.

tickPrefetch currently prefetches either roughly every 4th (808X) or 2nd (80286+) cycle, or when multiples of 4 or 2 cycles have been idled according to the cycle counts. The latter, older method is still there for compatibility with the 80286 and the old EU instruction handling.
Last edited by superfury on 2017-4-02 @ 02:08, edited 2 times in total.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-02 @ 00:45

I've updated the BIU (tickPrefetch) to count cycles now, although it still handles cycles_MMUR/W and cycles_IO as NOPs for compatibility with the current code. That should make it behave better.

It now either fetches into the PIQ (Prefetch Input Queue, according to Wikipedia) or performs a BIU NOP (when the above-mentioned cycles are non-zero, in multiples of 2 (286+) or 4 (80(1)8X)) every time it ticks T3. See https://www.vogons.org/viewtopic.php?f= ... 9&start=40 .

The EU isn't synchronized, though (besides providing timing of at least 1 cycle to the BIU). It starts a fetched instruction immediately once it has its opcode and parameters. So that's the cycle after the BIU has fetched the last byte (T1), or any other state when the instruction is already fully in the PIQ (depending on how full the PIQ is). 286+ memory wait states are removed at the moment.

Edit: https://bitbucket.org/superfury/unipcem ... ?at=master

Improved the prefetch by moving write cycles to the end, with prefetch cycles in between. Also, opcodes are now only fetched from the PIQ during T3 (80(1)8X) or T1 (286+). Also fixed an infinite loop that needed to be a simple if.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-02 @ 11:50

I've managed to get the BIOS running again after adding some extra support to skip some post-execution steps during EU wait states: previousCSstart (for debugging purposes only), post-execution REP handling, post-execution previousopcode registration, and post-execution CPU_afterexec (which handles the Trap flag and protected-mode debugging traps).

Now, the BIOS starts the memory scan and reports a memory error (error 02) on 64K?

Edit: After fixing the handling of interrupts etc. in the main emulation core, the entire BIOS POSTs without problems and boots. Now 8088 MPH is more accurate than ever (according to its metric cycle count):

356_8088MPH_1%deviation_1702cycles.jpg
8088 MPH, deviating 1% at 1702 cycles!


Now it's apparently more accurate than CAPEx86 (vladstamate's cycle-accurate emulator), according to the cycle count! It deviates 1%, at 1702 metric (PIT) cycles.

Edit: Although the running demo doesn't seem to fully agree :P (the last bit of the car scrolling off-screen, as well as the Kefrens bars. The music also becomes slow.)

Edit: The credits crash now. So there's a problem with the prefetching now, not prefetching in time?

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-02 @ 12:19

OK. So the current prefetching (and related) algorithm goes like this:
- EU: Byte ready to fetch (CPU_readOP)? Fetch it in 1 idle BIU cycle, but only when the BIU is at T3.
- EU: Byte not ready to fetch? Stop the EU and wait 1 cycle for the BIU to fill the queue.
- BIU: Are we at T4? If so, check the memory information. When the EU is supposed to be reading data from memory this cycle (the first X BIU cycles, where X is the MMUR/IO cycles divided by 4), do nothing (dummy read for the EU). Else, if we're in the last X BIU cycles and write cycles remain to be applied, do nothing (dummy write for the EU).
When not applying dummy read/write cycles, fetch a byte into the PIQ when the PIQ isn't full.

Edit: The crash at the credits seems to be remedied by making the PIQ skip BIU fetches while DMA is accessing memory (this previously wasn't handled). This is currently not cycle-accurate, though: it just counts the DMA transfers and lets the PIQ skip that many fetches.
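
In other words, the approximation behaves roughly like the sketch below. BusModel and biu_fetch_slot are hypothetical names for illustration, not the actual UniPCemu functions, and one "slot" stands for one opportunity the BIU gets to prefetch a byte.

```c
typedef struct {
    unsigned dma_slots_pending; /* fetch slots owed to DMA transfers */
    unsigned queued;            /* bytes in the prefetch queue */
    unsigned capacity;          /* queue size, 4 on the 8088 */
} BusModel;

/*
 * One BIU fetch slot. Returns 1 if a byte was prefetched, 0 if the slot
 * went to DMA or the queue was already full.
 */
static int biu_fetch_slot(BusModel *m)
{
    if (m->dma_slots_pending > 0) {
        m->dma_slots_pending--;  /* DMA owns the bus for this slot */
        return 0;
    }
    if (m->queued >= m->capacity) return 0;
    m->queued++;
    return 1;
}
```

This only models how many fetches are lost to DMA, not at which clock they are lost, which is why it is not cycle-accurate.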

Recording of the credits in my new more PIQ-accurate emulation:
https://www.dropbox.com/s/84dg4kxmogxl4 ... 0.zip?dl=0

Edit: Just compiled on MinGW 64-bit again: the credits crash once again. So this might be an actual timing-related problem somehow? Like the moment the application or the demo is started (relative DMA cycle)?

Edit: Probably this is a problem with the REP prefix handling itself (post-execution)? The self-modifying code failing due to the REPeated instruction failing? Or is this because the PIQ is too empty to function (not filled enough during REP MOV)?

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-02 @ 20:09

I've just taken a look at http://www.reenigne.org/blog/8088-pc-sp ... -its-done/ . It doesn't mention any REP instructions, just a POPW [CS:BX] to do the self-modifying part. So the problem is actually in the prefetch timing? Can you see anything going wrong (tickPrefetch function)?

https://bitbucket.org/superfury/unipcem ... ?at=master

Edit: Something must be going wrong here. Opcode 8F should be the self-modifying instruction, taking 17 cycles and reading and writing 1 word each (16 cycles consumed for this), thus leaving 1 cycle in between, plus the EA timing at the start, to prefetch? So the next instructions aren't in the prefetch queue most of the time (only X bytes during EA calculation and the 1 cycle between read and write)?
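
Written out, the bookkeeping behind that estimate looks like this. The figures are the ones quoted in this post, idle_cycles and max_prefetched_bytes are hypothetical helper names, and the model assumes 4 clocks per byte on the 8088's 8-bit bus with no wait states. It is a per-instruction approximation that ignores prefetches started earlier and still in flight, so it is not how the hardware schedules its bus cycles.

```c
/* On the 8088's 8-bit bus each byte transferred costs one 4-clock bus cycle
   (assuming no wait states). */
enum { CYCLES_PER_BUS_ACCESS = 4 };

/* Cycles of an instruction during which it is not using the bus itself. */
static int idle_cycles(int total_cycles, int bytes_read, int bytes_written)
{
    return total_cycles - CYCLES_PER_BUS_ACCESS * (bytes_read + bytes_written);
}

/* Upper bound on bytes the BIU could prefetch in that idle time. */
static int max_prefetched_bytes(int idle)
{
    return idle > 0 ? idle / CYCLES_PER_BUS_ACCESS : 0;
}
```

For the 17-cycle POP with one word read and one word written, idle_cycles(17, 2, 2) gives 1, and max_prefetched_bytes(1) gives 0, which is the shortfall being asked about.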

Re: x86 prefetching and larger immediate parameters?

Postby reenigne » 2017-4-02 @ 20:58

Empirically, on a real 8088, the "mov cl,99" instruction is in the prefetch queue at the time the "pop word[cs:bx]" instruction has an effect. If it wasn't, the effect wouldn't work properly. In fact, I have a special version which puts the "mov cl,99" before the "pop word[cs:bx]" so that it works correctly on DOSBox. This change screws up the timings on real hardware though.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-02 @ 21:19

OK. Is it correct that instructions are only fetched from the prefetch queue in the T3 state? And that the T4 state loads the prefetch queue from memory (BIU)?

Edit: So the prefetch queue must somehow be full when the patching pop executes (2 bytes for the pop, 2 bytes for the mov cl). How is this achieved? pop bx doesn't leave enough room, and neither does the combination with loop?

loop has 4 idle cycles (filling the prefetch queue at least once), and pop bx has one memory access, thus 4 out of 8 cycles idle. So only 2 bytes can be prefetched. So the prefetch queue should be half filled before the end of the loop? Maybe the out instruction doesn't provide enough free time? out xx,al takes 5 cycles, of which 4 are for I/O, leaving only one cycle. Of course every PIQ fetch adds 1 cycle to the time at the beginning, but this isn't used by the BIU. It simply moves that cycle between the reads and writes of the instruction (cycle position-wise). So this would be 4 cycles of output and 2 cycles of possible prefetch in this case.

Re: x86 prefetching and larger immediate parameters?

Postby reenigne » 2017-4-02 @ 21:43

superfury wrote:Ok. Is it correct that instructions are only fetched from the prefetch on the T3 state?


No, the EU can start an instruction on any bus state.

superfury wrote:And T4 state loads the prefetch from memory(BIU)?


The 8088 datasheet says, "The address is emitted from the processor during T1 and data transfer occurs on the bus during T3 and T4. T2 is used primarily for changing the direction of the bus during read operations." Does that answer your question?

superfury wrote:Edit: So the prefetch must be full when the pop to patch executes somehow(2 bytes for the pop, 2 bytes for the mov cl). How is this achieved? pop bx doesn't leave enough room, and the combination with loop?


Instructions don't execute from the prefetch queue (we know this because there are instructions longer than 4 bytes). After the EU removes the first instruction byte from the prefetch queue, the BIU will start fetching another one (if it's not already in the process of doing so) before the instruction itself gets a chance to start doing its own bus operations.
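
A per-clock model of that interaction might look like the sketch below. It is a deliberately simplified illustration under stated assumptions, not the real BIU: Biu, biu_tick and eu_take_byte are hypothetical names, every bus cycle is exactly T1 to T4 with no wait states, and EU bus requests, DMA and refresh are left out.

```c
#define QUEUE_CAPACITY 4

typedef struct {
    int t_state;  /* last T-state executed: 0 = bus idle, 1..3 = T1..T3 */
    int queued;   /* bytes currently in the prefetch queue */
} Biu;

/* Advance the BIU by one clock. Returns 1 on the clock where a byte is queued. */
static int biu_tick(Biu *b)
{
    if (b->t_state == 0) {
        if (b->queued >= QUEUE_CAPACITY) return 0; /* queue full: bus stays idle */
        b->t_state = 1;                            /* this clock is T1 */
        return 0;
    }
    b->t_state++;                                  /* this clock is T2, T3 or T4 */
    if (b->t_state < 4) return 0;
    b->queued++;                                   /* T4 done: byte enters the queue */
    b->t_state = 0;                                /* bus released */
    return 1;
}

/* The EU taking a byte frees a slot, which lets the BIU start again. */
static int eu_take_byte(Biu *b)
{
    if (b->queued == 0) return 0;
    b->queued--;
    return 1;
}
```

Starting from a full queue the bus stays idle. As soon as eu_take_byte removes a byte, the next biu_tick begins T1, and the queue is full again four clocks later, independent of what the instruction itself goes on to do.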

superfury wrote:loop has 4 idle cycles(filling prefetch at least once, pop bx has one memory access, thus 4 out of 8 cycles idle. Thus only 2 bytes can be prefetched. So the prefetch should be half filled before the end of the loop? Maybe the out instruction doesn't provide enough free time?


I'll see if I can get a bus sniffer dump of this code for you tomorrow so you can see what happens when.

Re: x86 prefetching and larger immediate parameters?

Postby superfury » 2017-4-03 @ 05:52

I've just moved the PIQ fetch cycles to the start of the instruction (after DMA cycles). Also changed instruction fetching from the BIU to start on any cycle. Now it gives 1399 cycles instead during 8088 MPH startup.
Edit: Strange, how could it be a number that isn't divisible by four, seeing as the PIT only ticks every 4 cycles?

Edit: I've just moved the entire BIU-related functionality (all basic EIP-related functionality, PIQ functionality and the BIU tick function (now renamed CPU_tickBIU instead of CPU_tickPrefetch)) and added some theoretical support for the EU to give it requests for 8/16/32-bit memory accesses and have it handle them accordingly.

https://bitbucket.org/superfury/unipcem ... ?at=master
Last edited by superfury on 2017-4-03 @ 14:16, edited 1 time in total.
