UniPCemu cycle accurate 8088 implementation


Re: UniPCemu cycle accurate 8088 implementation

Postby reenigne » 2017-5-03 @ 12:19

superfury wrote: When it has nothing to do (no BIU memory/bus requests and the prefetch buffer is full), it will execute a NOP cycle (bus being idle, now reporting "BIU --" instead of a T-state).


So what's going on in your latest log between lines 3260 and 3277? There are four bus cycles (16 CPU cycles, counting up T1..T4) but no apparent bus activity going on. Compare this to a similar JMP in the sniffer log, where there are 6 idle CPU cycles before the bus starts fetching instruction bytes at the destination.
reenigne
Oldbie
 
Posts: 509
Joined: 2006-11-30 @ 05:13
Location: Cornwall, UK

Re: UniPCemu cycle accurate 8088 implementation

Postby vladstamate » 2017-5-03 @ 12:20

I wanted to say thank you to both reenigne and Superfury for the work in untangling the correct cycle behavior of 8088 (and its XT bus). I am incorporating a lot of that theoretical information in CAPE.
vladstamate
Oldbie
 
Posts: 959
Joined: 2015-8-23 @ 01:43

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2017-5-03 @ 12:29

That's actually a jump being executed (JMP 026C). It uses 15 CPU EU cycles, of which 15 cycles have the BIU disabled (cycles_OP is loaded with 15, and the BIU idle cycle count is loaded with the same value as cycles_OP to make it idle around, preventing fetches from starting on T1). So it's effectively the 16 idle EU cycles of the jump, according to documentation, during which fetch operations are prevented from starting on T1.

Here's another small log, which turns those idle BIU cycles into "--" entries (although the T-state still increases at the moment):
https://www.dropbox.com/s/x8lcbnwwyqo0x ... 5.zip?dl=0

Are the T-states to be kept the same until the stall completes, like with the idle/T1 'state' emulation done now?
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby reenigne » 2017-5-03 @ 13:04

superfury wrote:That's actually a jump being executed(JMP 026C). It uses 15 CPU EU cycles, of which 15 cycles BIU disabled


15 cycles seems too long to disable the BIU here. As I said, a real 8088 only idles the bus for 6 cycles here, so prefetches can't be disabled for more than 8.

superfury wrote:Are the T-states to be kept the same until the stall completes, like with the idle/T1 state?


Yes, I don't think there's any difference here. Either the bus is actually doing an operation (in which case it's going up from T1..T4 or S0..S4 or both) or it's not and it's idle.
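To make the "either stepping T1..T4 or idle" point concrete, here is a minimal C sketch of such a bus state machine. All names (`Biu`, `biu_tick`, `stall`, `request_pending`) are hypothetical and are not taken from UniPCemu. It also assumes that a transfer already in flight runs through T4 before the bus can idle, and that the stall counter counts down on every clock, which may not match real hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not UniPCemu's actual BIU code. */
typedef enum { BUS_IDLE, BUS_T1, BUS_T2, BUS_T3, BUS_T4 } BusState;

typedef struct {
    BusState state;      /* State occupied during the previous clock */
    unsigned stall;      /* Clocks left during which new accesses may not start */
    int request_pending; /* Non-zero when a bus access is waiting to start */
} Biu;

/* Advance by one CPU clock and return the state occupied during that clock. */
static BusState biu_tick(Biu *biu)
{
    assert(biu != NULL);

    /* Assumption: the stall counter runs down on every clock. */
    int stalled = biu->stall > 0;
    if (stalled)
        biu->stall--;

    switch (biu->state) {
    case BUS_T1: biu->state = BUS_T2; break; /* A started transfer always */
    case BUS_T2: biu->state = BUS_T3; break; /* runs through to T4.       */
    case BUS_T3: biu->state = BUS_T4; break;
    default:
        /* After T4 or while idle: start a new access only if nothing blocks it.
           While idle, the T-state does not advance. */
        if (!stalled && biu->request_pending) {
            biu->request_pending = 0;
            biu->state = BUS_T1;
        } else {
            biu->state = BUS_IDLE;
        }
        break;
    }
    return biu->state;
}
```

With a transfer in T2, a pending fetch and a 6-clock stall, this yields T3, T4, four idle clocks and then T1: the idle clocks never land between T3 and T4.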
reenigne
Oldbie
 
Posts: 509
Joined: 2006-11-30 @ 05:13
Location: Cornwall, UK

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2017-5-03 @ 13:34

I've modified the BIU to take those idle cycles without increasing the T-state.
I've modified opcode 0xEB to use 16 execution cycles, of which the first 6 cycles idle the BIU, keeping the same T-state.

This is the new log:
https://www.dropbox.com/s/9qk4rzcm67vxe ... 8.zip?dl=0

Edit: What about the other (un)conditional jump instructions and call instructions? Do they delay the BIU 6 cycles as well?
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby reenigne » 2017-5-03 @ 14:11

superfury wrote:I've modified the BIU to take those idle cycles without increasing the T-state.
I've modified opcode 0xEB to use 16 execution cycles, of which the first 6 cycles idle the BIU, keeping the same T-state.


That looks a bit better. Next problem: the bus going idle only happens after a T4 state completes (it doesn't interrupt the current transfer). As you have it, the idle is between the T3 and T4 states (e.g. line 3155).

Another problem: the OUT instruction needs 3 extra CPU cycles of delay. On two of them the bus is idle before the port bus cycle starts (this is part of the cycle count of the "OUT ib,accum" instruction). The remaining one is a wait state between the T3 and T4 cycles of the port bus access, and is added by the 5150/5160 motherboard for port accesses to motherboard devices. Btw, I don't see the addresses and data bytes for port and DMA accesses in your logs - it would be useful to be able to see those as well. Likewise for the special bus accesses related to interrupts (not that that's important for the 8088MPH mod player!)

Also, as I mentioned previously, the way the bus is interrupted for DMAs is wrong. From your logs, it appears that the 5 DMA S-cycles always occur between the T4 state of the previous bus access and the T1 of the next. However, the 8088 doesn't have a pin for the DMA controller to tell it "let me take control of the bus after the current access is complete". Instead, DMAC takes control by adding up to 6 wait states (for normal RAM) to the current bus transfer if necessary. I've attached a file showing snippets of sniffer logs for all the possible relationships between a DMA bus access and a CPU bus access.

Attachment: waits.txt (3.66 KiB)
reenigne
Oldbie
 
Posts: 509
Joined: 2006-11-30 @ 05:13
Location: Cornwall, UK

Re: UniPCemu cycle accurate 8088 implementation

Postby reenigne » 2017-5-03 @ 14:34

superfury wrote:Edit: What about the other (un)conditional jump instructions and call instructions? Do they delay the BIU 6 cycles as well?


Taken conditional jumps and LOOPs (including JCXZ): 6 cycles.
Near/short JMP: 6 cycles.
Indirect JMP (e.g. "JMP CX"): 3 cycles.
Indirect CALL (e.g. "CALL CX") and near CALL: 10 cycles, of which the last 4 are the prefetch of the instruction at the destination.
Far JMP: 4 cycles.
mov [iw],accum: 2 cycles.
Far CALL: 5 cycles before the first stack store, 9 cycles before the second stack store (note that the prefetch of the destination instruction takes the last 4 of these 9 cycles).
OUT DX,accum and IN accum,DX: no delay except for the 1 cycle wait state
PUSH rw, PUSH segreg, PUSHF: 2 cycles before stack operation.
MOVSB, MOVSW: 3 cycles between load and store.
REP MOVSB, REP MOVSW: same, also 6 cycles between each load/store pair (0 between halves of a word load/store).
REP STOSB, REP STOSW: 6 cycles between each store (0 between halves of a word store).
REP LODSB, REP LODSW: 9 cycles between each load (0 between halves of a word load).
RET: 3 cycles between stack store and first prefetch at destination.
RET iw: 2 cycles before stack store, 4 cycles between stack store and first prefetch at destination.
XLATB: 2 cycles before load, 2 cycles after.
ADD B[SI],AL: 2 cycles before read, 3 cycles before write.
ADD AL,B[SI]: 2 cycles before read.
CMP [SI],accum: 2 cycles before read.

I've attached a file showing sniffer logs of all these and more (but not an exhaustive list). Note that some may be different for other bus states.
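One possible way to carry a few of the simpler entries from the list above into an emulator is a small lookup table, sketched here in C. The enum and function names are hypothetical, and the sketch assumes each number means "idle bus cycles before the instruction's first bus access", which is only one reading of the list. Entries with more than one delay (far CALL, RET iw, the REP string instructions) would need more than a single number and are left out.

```c
#include <assert.h>

/* Hypothetical instruction classes; only single-delay entries are covered. */
typedef enum {
    DELAY_JCC_TAKEN,    /* Taken conditional jumps and LOOPs (including JCXZ) */
    DELAY_JMP_NEAR,     /* Near/short JMP */
    DELAY_JMP_INDIRECT, /* e.g. JMP CX */
    DELAY_JMP_FAR,      /* Far JMP */
    DELAY_MOV_MEM_ACC,  /* mov [iw],accum */
    DELAY_PUSH,         /* PUSH rw, PUSH segreg, PUSHF */
    DELAY_COUNT
} DelayClass;

/* Idle cycle counts copied from the list in this post. */
static const unsigned char bus_idle_cycles[DELAY_COUNT] = {
    [DELAY_JCC_TAKEN]    = 6,
    [DELAY_JMP_NEAR]     = 6,
    [DELAY_JMP_INDIRECT] = 3,
    [DELAY_JMP_FAR]      = 4,
    [DELAY_MOV_MEM_ACC]  = 2,
    [DELAY_PUSH]         = 2,
};

/* Return the number of idle bus cycles for an instruction class. */
static unsigned bus_delay_for(DelayClass c)
{
    assert((int)c >= 0 && (int)c < DELAY_COUNT);
    return bus_idle_cycles[c];
}
```

Keeping the numbers in one table makes it easier to compare them against the sniffer logs than scattering them across the instruction handlers.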

sniffer_timings.txt
(334.33 KiB) Downloaded 13 times
reenigne
Oldbie
 
Posts: 509
Joined: 2006-11-30 @ 05:13
Location: Cornwall, UK

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2017-5-03 @ 16:41

reenigne wrote:That looks a bit better. Next problem: the bus going idle only happens after a T4 state completes (it doesn't interrupt the current transfer). As you have it, the idle is between the T3 and T4 states (e.g. line 3155).


I've now implemented the bus-idle logic so that the bus is released at the T1 state:
https://www.dropbox.com/s/ijg7td0v6kkg5 ... 2.zip?dl=0

Current BIU code:
https://bitbucket.org/superfury/unipcem ... ?at=master
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2017-5-20 @ 18:06

I'm now retesting the 808X core, but it seems something is going wrong with various instructions. One thing that immediately pops out is the DIV instruction not faulting on 8-bit overflow (a result of 101h isn't giving an error with resultbits being 8?).

Code:
void CPU8086_internal_DIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle, byte *applycycles)
{
   uint_32 temp, temp2, currentquotient; //Remaining value and current divisor!
   byte shift; //The shift to apply! No match on 0 shift is done!
   temp = val; //Load the value to divide!
   *applycycles = 1; //Default: apply the cycles normally!
   if (divisor==0) //Not able to divide?
   {
      *quotient = 0;
      *remainder = temp; //Unable to comply!
      *error = 1; //Divide by 0 error!
      return; //Abort: division by 0!
   }

   if (CPU_apply286cycles()) /* 80286+ cycle timings apply instead? */
   {
      SHLcycle = ADDSUBcycle = 0; //Don't apply the cycle counts for this instruction!
      *applycycles = 0; //Don't apply the cycles anymore!
   }

   temp = val; //Load the remainder to use!
   *quotient = 0; //Default: we have nothing after division!
   nextstep:
   //First step: calculate shift so that (divisor<<shift)<=remainder and ((divisor<<(shift+1))>remainder)
   temp2 = divisor; //Load the default divisor for x1!
   if (temp2>temp) //Not enough to divide? We're done!
   {
      goto gotresult; //We've gotten a result!
   }
   currentquotient = 1; //We're starting with x1 factor!
   for (shift=0;shift<(resultbits+1);++shift) //Check for the biggest factor to apply(we're going from bit 0 to maxbit)!
   {
      if ((temp2<=temp) && ((temp2<<1)>temp)) //Found our value to divide?
      {
         CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 more SHL cycle for this!
         break; //We've found our shift!
      }
      temp2 <<= 1; //Shift to the next position!
      currentquotient <<= 1; //Shift to the next result!
      CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 SHL cycle for this! Assuming parallel shifting!
   }
   if (shift==(resultbits+1)) //We've overflown? We're too large to divide!
   {
      *error = 1; //Raise divide by 0 error due to overflow!
      return; //Abort!
   }
   //Second step: subtract divisor<<n from remainder and increase result with 1<<n.
   temp -= temp2; //Subtract divisor<<n from remainder!
   *quotient += currentquotient; //Increase result(divided value) with the found power of 2 (1<<n).
   CPU[activeCPU].cycles_OP += ADDSUBcycle; //We're taking 1 subtract and 1 addition cycle for this(ADD/SUB register take 3 cycles)!
   goto nextstep; //Start the next step!
   //Finished when remainder<divisor or remainder==0.
   gotresult: //We've gotten a result!
   if (temp>((1<<resultbits)-1)) //Modulo overflow?
   {
      *error = 1; //Raise divide by 0 error due to overflow!
      return; //Abort!      
   }
   *remainder = temp; //Give the modulo! The result is already calculated!
   *error = 0; //We're having a valid result!
}


What would be the correct thing to do in this case? Simply add a protection against overflow like the final temp modulo overflow check?

Edit: Fixed simply:
Code:
void CPU8086_internal_DIV(uint_32 val, word divisor, word *quotient, word *remainder, byte *error, byte resultbits, byte SHLcycle, byte ADDSUBcycle, byte *applycycles)
{
   uint_32 temp, temp2, currentquotient; //Remaining value and current divisor!
   byte shift; //The shift to apply! No match on 0 shift is done!
   temp = val; //Load the value to divide!
   *applycycles = 1; //Default: apply the cycles normally!
   if (divisor==0) //Not able to divide?
   {
      *quotient = 0;
      *remainder = temp; //Unable to comply!
      *error = 1; //Divide by 0 error!
      return; //Abort: division by 0!
   }

   if (CPU_apply286cycles()) /* 80286+ cycle timings apply instead? */
   {
      SHLcycle = ADDSUBcycle = 0; //Don't apply the cycle counts for this instruction!
      *applycycles = 0; //Don't apply the cycles anymore!
   }

   temp = val; //Load the remainder to use!
   *quotient = 0; //Default: we have nothing after division!
   nextstep:
   //First step: calculate shift so that (divisor<<shift)<=remainder and ((divisor<<(shift+1))>remainder)
   temp2 = divisor; //Load the default divisor for x1!
   if (temp2>temp) //Not enough to divide? We're done!
   {
      goto gotresult; //We've gotten a result!
   }
   currentquotient = 1; //We're starting with x1 factor!
   for (shift=0;shift<(resultbits+1);++shift) //Check for the biggest factor to apply(we're going from bit 0 to maxbit)!
   {
      if ((temp2<=temp) && ((temp2<<1)>temp)) //Found our value to divide?
      {
         CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 more SHL cycle for this!
         break; //We've found our shift!
      }
      temp2 <<= 1; //Shift to the next position!
      currentquotient <<= 1; //Shift to the next result!
      CPU[activeCPU].cycles_OP += SHLcycle; //We're taking 1 SHL cycle for this! Assuming parallel shifting!
   }
   if (shift==(resultbits+1)) //We've overflown? We're too large to divide!
   {
      *error = 1; //Raise divide by 0 error due to overflow!
      return; //Abort!
   }
   //Second step: subtract divisor<<n from remainder and increase result with 1<<n.
   temp -= temp2; //Subtract divisor<<n from remainder!
   *quotient += currentquotient; //Increase result(divided value) with the found power of 2 (1<<n).
   CPU[activeCPU].cycles_OP += ADDSUBcycle; //We're taking 1 subtract and 1 addition cycle for this(ADD/SUB register take 3 cycles)!
   goto nextstep; //Start the next step!
   //Finished when remainder<divisor or remainder==0.
   gotresult: //We've gotten a result!
   if (temp>((1<<resultbits)-1)) //Modulo overflow?
   {
      *error = 1; //Raise divide by 0 error due to overflow!
      return; //Abort!      
   }
   if (*quotient>((1<<resultbits)-1)) //Quotient overflow?
   {
      *error = 1; //Raise divide by 0 error due to overflow!
      return; //Abort!      
   }
   *remainder = temp; //Give the modulo! The result is already calculated!
   *error = 0; //We're having a valid result!
}


This makes it divide correctly again and fixes all related bugs.
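For cross-checking the fault behaviour only, a tiny reference routine can be handy. The sketch below is not the shift-and-subtract loop above: it uses C's native `/` and `%` operators, carries no cycle accounting, and the name `div_checked` is made up for illustration. It just shows the two conditions that should raise the divide error: a zero divisor, and a quotient that doesn't fit in `result_bits` bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference helper: returns 0 on success, 1 when the divide
   error should be raised. The outputs are left untouched on error. */
static int div_checked(uint32_t dividend, uint16_t divisor, unsigned result_bits,
                       uint16_t *quotient, uint16_t *remainder)
{
    assert(result_bits == 8 || result_bits == 16);

    if (divisor == 0)
        return 1; /* Division by zero */

    uint32_t limit = (UINT32_C(1) << result_bits) - 1; /* 0xFF or 0xFFFF */
    uint32_t q = dividend / divisor;
    uint32_t r = dividend % divisor;

    if (q > limit)
        return 1; /* Quotient overflow, e.g. 0x0202 / 2 = 0x101 in 8 bits */

    *quotient = (uint16_t)q;
    *remainder = (uint16_t)r; /* Always smaller than the divisor, so it fits */
    return 0;
}
```

Running both routines over the same inputs and comparing the error flag and results is one cheap way to catch regressions like the 101h case.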
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2019-5-18 @ 12:20

I was just messing around with the DMA timings. I noticed that the bus is released on S0 (at the CPU's T4 state), which the DMA takes, immediately starting to process S0 on the next cycle. As far as I could then see from the DMA timing information, HLDA is released at T4, but not taken by the DMA until T1 (so one state later). So I changed the CPU to release the bus at T4 while setting a flag that tells the DMA controller to delay. The DMA then ticks its cycle and finds that the CPU has released the bus, taking control of it (so in effect, HLDA and HOLD are set and reset as on a real CPU). The DMA controller then waits for the next cycle to check again. At that cycle (the second S0 state), it finds the bus acquired, but the flag is set, so it clears the flag and waits another cycle. On the next cycle, it finds the bus acquired and the flag cleared, does its checks and hardware work, and proceeds to the T1 state properly.
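The handshake described above can be sketched roughly like this in C. The struct and function names are hypothetical and this is not the actual UniPCemu code; it only models the three DMA ticks between the CPU releasing the bus and the DMA transfer starting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared state between the CPU and the DMA controller. */
typedef struct {
    int bus_released; /* Set by the CPU at T4 */
    int bus_acquired; /* Set once the DMA controller has taken the bus */
    int delay_flag;   /* Makes the DMA controller wait one extra cycle */
} DmaBus;

/* CPU side: release the bus at T4 and ask the DMA controller to delay. */
static void cpu_release_bus_at_t4(DmaBus *bus)
{
    assert(bus != NULL);
    bus->bus_released = 1;
    bus->delay_flag = 1;
}

/* DMA side: one tick. Returns 1 on the tick where the transfer may start. */
static int dma_tick(DmaBus *bus)
{
    assert(bus != NULL);
    if (!bus->bus_acquired) {
        if (bus->bus_released) { /* First tick: take control of the bus */
            bus->bus_released = 0;
            bus->bus_acquired = 1;
        }
        return 0;
    }
    if (bus->delay_flag) {       /* Second tick: clear the flag and wait */
        bus->delay_flag = 0;
        return 0;
    }
    return 1;                    /* Third tick: bus held, flag clear, go on */
}
```

After `cpu_release_bus_at_t4()`, the first two calls to `dma_tick()` return 0 and the third returns 1, which gives the one-state delay between the release and the DMA actually using the bus.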

I now ran the 8088MPH demo again, saw the delorean failing when scrolling off screen halfway, saw the raster racing fail(like usually), but when it ran the credits...

The credits didn't crash, running on, playing its music! :D
Any idea what's the cause?

Edit: A capture of the new DMA emulation combined with 8088MPH: https://www.dropbox.com/s/2w89ztpugf6ap ... H.wav?dl=0
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby Alegend45 » 2019-5-18 @ 16:14

If you want 8088MPH to run well, maybe you should reference 86Box's emulation. AFAIK, the latest code runs it perfectly.
Alegend45
Newbie
 
Posts: 75
Joined: 2012-6-23 @ 18:18

Re: UniPCemu cycle accurate 8088 implementation

Postby Scali » 2019-5-18 @ 16:22

Alegend45 wrote:If you want 8088MPH to run well, maybe you should reference 86Box's emulation. AFAIK, the latest code runs it perfectly.


Not perfectly, but it doesn't crash on any effect, and they are all recognizable.
Timing is still off quite a bit in certain parts.
Scali
l33t
 
Posts: 4364
Joined: 2014-12-13 @ 14:24

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2019-5-18 @ 17:11

Well, it has the same timing source as I used (reenigne's findings in his emulator code). As far as I can remember, the only possible differences were the handling of REP-prefixable instructions (MOVS etc.) and the HLT instruction. Is HLT even used in 8088MPH?

Oddly enough, even with the credits running, those two other places (raster racing and the delorean car) still show issues: the delorean fades out depending on timings, and the raster effect messes up its vertical (foreground) timings and all of its background and screen alignment timings (the wall effect seems fine, though).
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby reenigne » 2019-5-18 @ 17:55

superfury wrote:Is HLT even used in 8088MPH?


Yes, in the Kefrens bars effect. Also in the loader, but it's not time-critical there.
reenigne
Oldbie
 
Posts: 509
Joined: 2006-11-30 @ 05:13
Location: Cornwall, UK

Re: UniPCemu cycle accurate 8088 implementation

Postby Scali » 2019-5-18 @ 17:57

reenigne wrote:
superfury wrote:Is HLT even used in 8088MPH?


Yes, in the Kefrens bars effect. Also in the loader, but it's not time-critical there.


Also in the fade in/out routines of the Sprite parts, but not time-critical there either.
Scali
l33t
 
Posts: 4364
Joined: 2014-12-13 @ 14:24

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2019-5-18 @ 18:16

Odd that exactly those two are failing atm. One thing I notice as well is noise everywhere outside an infinity-shaped area (with the left circle being larger) during the vectorballs part (so an 8 turned sideways, like Oo), with the IBM logo in vectorballs scraping past the edge (as if it's scraping away the noise). It's like a background of white noise, with the vectorballs clearing the parts of the display they have passed over.
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby Alegend45 » 2019-5-18 @ 19:27

Scali wrote:
Alegend45 wrote:If you want 8088MPH to run well, maybe you should reference 86Box's emulation. AFAIK, the latest code runs it perfectly.


Not perfectly, but it doesn't crash on any effect, and they are all recognizable.
Timing is still off quite a bit in certain parts.

By "the latest code" I'm referring to code that hasn't been committed yet.
Alegend45
Newbie
 
Posts: 75
Joined: 2012-6-23 @ 18:18

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2019-5-18 @ 20:02

Any idea why the vectorballs background isn't fully black? The area that the vectorballs move over is black, but everything else has static noise(like a b/w TV)?

Edit: What about the REPable instructions? Are they not used in the credits, but used in those failing parts of the demo?
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby superfury » 2019-5-19 @ 13:02

I'm just wondering. What would be the best way to go and implement those timings from xtce.h from reenigne's repository?

Looking at my own source code, I at least implemented the base timings that are in the instruction handlers themselves(those ALU generic functions and all other instructions).

Looking further, I might not have implemented the timings in the busInit() function. Perhaps those are the timings UniPCemu is missing (which is why its reported PIT cycle count in 8088 MPH comes out too low)?

Edit: Looking a bit further, it seems all those timings concerning "_accessNumber = " are the timings that are probably missing from UniPCemu(except perhaps the CS/IP-related timings, which don't seem to match at all, perhaps because it's based on one of the earlier replies in this thread instead). So perhaps I would need to take all those timings and add them to UniPCemu's timings for said instruction?
Last edited by superfury on 2019-5-19 @ 13:10, edited 1 time in total.
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: UniPCemu cycle accurate 8088 implementation

Postby Scali » 2019-5-19 @ 13:04

superfury wrote:Any idea why the vectorballs background isn't fully black? The area that the vectorballs move over is black, but everything else has static noise(like a b/w TV)?


The vectorballs are double-buffered.
Also, the routine does not clear the entire screen, but only draws black squares in the position of each ball to clear.
It assumes the screen starts all-black (it does a rep stosw after switching to the special 224x140 double-buffered display mode to ensure that it's black). If you have random garbage in video memory at the start, then it might explain the flickering (page flipping) and the fact that you can see the black squares cleaning up the balls.
Scali
l33t
 
Posts: 4364
Joined: 2014-12-13 @ 14:24
