So, essentially, all timing is based on the CPU cycles spent each instruction/HLT state. Or is there an error in the timing somewhere?
Well, there must be an error, because you see drift.
A CGA screen should take exactly 76*262 = 19912 PIT ticks. This does not seem to be the case, which is why things drift away.
Aside from getting the number of ticks right, for our effects to work you also need the vsync and hsync timing to be correct. That is, the PIT is reset relative to the vsync: the code assumes the PIT is then in sync with the display, so that the counter value corresponds to a specific place on screen. That means your vsync events need to occur at the same times as they do on real hardware.
CPU timing is not that critical for these effects.
exact number is a floating point division to be accurate
That's not necessarily the most accurate way to go about it. If you time everything in hdots (1 hdot is 1/4 of a colour carrier cycle, 157500000 hdots is exactly 11 seconds) then you can do it all with integers and get perfect accuracy. One CPU clock cycle is 3 hdots, one PIT cycle is 12 hdots, one CRTC character period is 8 or 16 hdots.
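To make the integer-hdot bookkeeping concrete, here is a minimal sketch (hypothetical names, not code from any emulator discussed here) using the ratios given above; note that a CGA frame of 912*262 hdots divides evenly by the 12-hdot PIT period:

```c
#include <stdint.h>

/* All periods expressed in hdots (1 hdot = 1/4 of a colour carrier cycle).
   157500000 hdots is exactly 11 seconds, so the hdot rate is 157500000/11 Hz. */
#define HDOTS_PER_CPU_CYCLE  3            /* ~4.77 MHz CPU clock   */
#define HDOTS_PER_PIT_CYCLE  12           /* ~1.19 MHz PIT clock   */
#define HDOTS_PER_CGA_FRAME  (912 * 262)  /* one full CGA frame    */

/* PIT cycles per CGA frame: exact, no floating point needed. */
int64_t pit_ticks_per_frame(void)
{
    /* 912*262 = 238944 hdots, which is an exact multiple of 12. */
    return HDOTS_PER_CGA_FRAME / HDOTS_PER_PIT_CYCLE;
}
```

Because every period is an integer number of hdots, the accumulated error stays exactly zero no matter how long the emulation runs.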
superfury wrote:
So, essentially, all timing is based on the CPU cycles spent each instruction/HLT state. Or is there an error in the timing somewhere?
The PIT/CRTC desynchronization looks too severe to be just floating-point errors (unless you're losing a lot of accuracy somewhere by doing something silly like adding very small numbers to very large ones). So you may have another timing bug. Just to check, you should see exactly 19912 PIT cycles occur in the time of one CGA frame (912*262 hdots).
Is this CPU completely accurate? Are the (I)DIV/(I)MUL timings accurate? If so, can I fix my 8088 BIU and (I)DIV/(I)MUL instructions based on it? That might fix the timing that is currently 4% off at 4.77MHz.
Reenigne? Scali?
Btw, all timing is based on 1000000000.0 (ns/second, double-precision floating point) divided by the clock in use (~4.77MHz for the CPU and DMA, actually ~14M/3.0; 1.19MHz for the PIT, essentially ~14M/12.0; ~14MHz for the CGA pixel clock). This number is added to each device's timer (the CPU adds to the master clock instead, which is synchronized to realtime to set the emulation speed).
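The scheme described above boils down to computing a nanoseconds-per-tick value for each clock. A sketch of those constants (hedged: `ns_per_tick` is an illustrative helper, not the actual emulator code), which also reproduces the 200-cycle instruction time worked out later in the thread:

```c
#include <math.h>

/* Nominal NTSC-derived clocks, as described in the post above. */
#define MHZ14      ((15.75 / 1.1) * 1000000.0)  /* ~14318181.8 Hz        */
#define CPU_CLOCK  (MHZ14 / 3.0)                /* ~4.77 MHz CPU clock   */
#define PIT_CLOCK  (MHZ14 / 12.0)               /* ~1.19 MHz PIT clock   */

/* Duration of one tick of a clock, in nanoseconds. */
static double ns_per_tick(double hz)
{
    return 1000000000.0 / hz;
}
```

For example, one CPU cycle at 4.77MHz is about 209.52ns, one PIT tick is about 838.10ns, and a 200-cycle instruction lasts about 41904.76ns.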
It does not appear to be, based on a cursory inspection. Specifically, there doesn't seem to be a "1 cycle per set bit" penalty for the multiplies. In general, I'd also expect a cycle-exact emulator to do a lot more cycle counting than this seems to.
superfury wrote:
~14M/3.0
When you say "~14M" do you mean "14000000", "14318182" or "157500000/11"?
superfury wrote:
1.19M
Similarly, by "1.19M" do you mean "1190000", "1193182" or "13125000/11"?
#define MHZ14 ((15.75/1.1)*1000000.0)
//The clock speed of the 8086 (~14.31818MHz divided by 3)!
#define CPU808X_CLOCK (MHZ14/3.0f)

double DMA_Frequency = (1000000000.0 / (MHZ14/8.732575)); //DMA tick time, based on ISA clock(is supposed to be divided by 3.0), originally divided by 8.732575, which gives close to 1.6MB/s!
//The clock speed of the PIT (14.31818MHz divided by 12)!
#define TIME_RATE (MHZ14/12.0f)
float getCGAMDAClock(VGA_Type *VGA)
{
	float result=0.0f; //The calculated clock speed! Default: not used!
	if (CGAMDAEMULATION_ENABLED_CRTC(VGA)) //Are we enabled?
	{
		if (CGAEMULATION_ENABLED_CRTC(VGA)) //CGA emulation enabled?
		{
			//15.75MHz originally used, but seems to be 14.31818MHz according to reenigne.org and https://pineight.com/mw/index.php?title=Dot_clock_rates
			result = (float)(MHZ14); //Special CGA compatibility mode: change our refresh speed to match it according to CGA specifications! Pixel speed is always 14MHz!
		}
		else if (MDAEMULATION_ENABLED_CRTC(VGA)) //MDA emulation enabled?
		{
			result = 16257000.0f; //16.257MHz pixel clock!
		}
	}
	return result; //Default: No CGA/MDA clock used!
}
These are all the clocks used, that were mentioned in my last post.
Okay, MHZ14 and TIME_RATE seem to be nominally correct, at least. So either your CRTC frame isn't taking exactly 912*262 hdots, or you've got floating-point precision problems, or there's something else afoot. The first I think you should be able to check. As for the second, you're using "double variable counting nanoseconds passed" - how large do the values in these timer variables get?
When used, all clocks (except the CPU timer, which is synchronized to a high-resolution realtime clock to make the emulation run at normal speed) perform four steps:
First, the device's timer is increased by the number of nanoseconds the CPU has spent on its current instruction.
Then the timer is divided by the tick period, which is the number of nanoseconds per event: processing a pixel (or retrace, or whatever the current pixel clock is supposed to do), a PIT tick (decreasing counters), audio output (depends on the audio device), transferring a byte of data (DMA), etc. The result is rounded down to get the number of ticks to process. For example, at 1000Hz the period is 1000000ns, so the timer is divided by 1000000 and rounded down.
Then the processed time (the rounded-down tick count multiplied by the period, so 1000000ns for 1 tick in the example) is subtracted from the timer, leaving the remainder as the starting point for the next update; this keeps the timer in sync with the source time (the CPU timer).
Finally, the calculated ticks are used to tick the device (a loop that processes one tick at a time and decrements the count until it's finished).
For a simple example of this, look at pit.c, which does what's described above, based on the CPU time passed(just like all other hardware does).
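A minimal sketch of that accumulate/divide/subtract scheme (hypothetical names, not the actual pit.c code):

```c
/* Accumulates elapsed CPU time into *acc and returns how many whole
   device ticks have elapsed; the fractional remainder stays in *acc
   so no time is ever lost between calls. */
static unsigned long device_tick(double *acc, double tick_ns, double elapsed_ns)
{
    *acc += elapsed_ns;                                    /* step 1: add CPU time      */
    unsigned long ticks = (unsigned long)(*acc / tick_ns); /* step 2: whole ticks       */
    *acc -= (double)ticks * tick_ns;                       /* step 3: keep remainder    */
    return ticks;                                          /* step 4: caller processes  */
}
```

For a 1000Hz device (tick_ns = 1000000.0), passing 2500000ns of CPU time yields 2 ticks and leaves 500000ns in the accumulator for next time.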
The global timing(The timing the CPU synchronized to, as does all the hardware by extension of the CPU timings) gets as large as the amount of nanoseconds the emulator runs(which adds 1000000000.0 each second emulated).
The timer for each hardware device only ever grows as large as the longest CPU execution time for one instruction. For, say, 200 cycles (an extremely long (I)MUL/(I)DIV instruction, which are the longest-running instructions afaik, along with the DIV0 exception), that is (1000000000.0/(MHZ14/3.0))*200 = (1000000000/(((15.75/1.1)*1000000)/3))*200 = 41904.7619047619ns at the 4.77MHz clock (calculated with the Windows 10 calculator, which afaik also uses double precision).
Just broken in the debugger while the effect is running:
The VGA Horizontal Total is 40 character clocks (so 320 pixels are rendered horizontally in total). The horizontal x counter is 913 at the end of each scanline, so 912 pixel clocks are rendered (each clock takes one pixel clock and increments the counter). Also, one extra clock is taken by the horizontal retrace: positions 0-911 are normal display clocks, while clock 912 resets the counter and does nothing (wasting a clock).
The Vertical timing is 1, so it's working on a scanline basis?
The vertical line counter is set to 224 when each frame renders.
The display resolution of each frame is 752x246 pixels(including overscan etc., which is determined by the horizontal and vertical retrace starting).
The vertical counter at Vertical Total is 262.
The clocks runs at 4x PIT speed, so (262*913)(time of a frame in CGA pixel clocks)/4=59801.5 PIT ticks.
This would be a whole number if the totalling and retracing didn't take 1 pixel clock to execute (so subtract 1 pixel clock for every scanline and 1 pixel clock for the complete screen).
Edit: I've modified the HTotal and VTotal signals to execute the next clock(The first clock of the next Scanline's pixel) immediately. So now it shouldn't skip dot clocks on horizontal/vertical overflow. But the black bars still appear.
Edit: The Delorean car doesn't seem to disappear anymore now:)
The global timing(The timing the CPU synchronized to, as does all the hardware by extension of the CPU timings) gets as large as the amount of nanoseconds the emulator runs(which adds 1000000000.0 each second emulated).
I was afraid of that. This means that the timing of your emulator will get less accurate the longer it runs. If you're careful to use double-precision for all calculations with these values, you might get away with it - after 10 minutes, your timing accuracy is still going to be something in the range of 1/1000000 of one hdot, and it'll take 20 years for the precision to reach 1 hdot. If you use single-precision (float) then after 10 minutes it's only going to be accurate to about 1 scanline (and will probably stop working entirely before then).
Well, almost, there's one difference:
The global timing is only used to slow down the CPU to real time (i.e. it only has the effect of inserting fewer delays after each block of CPU instructions executed).
All hardware timing depends only on the currently executed instruction, so it won't be affected the way you describe. Essentially the emulation works in 3 steps (inner execution loop):
1. Execute a CPU instruction and count its cycles. The cycle count is multiplied by the duration of one CPU cycle in nanoseconds (settable by the user: Dosbox clocks, or 4.77MHz). This gives the duration of the current instruction in nanoseconds.
2. Every time a CPU instruction has been executed, the duration of that instruction is added to each of the other (hardware) timers and used to time their functions (like the pixel clock etc., as explained with the PIT ticks).
3. Finally, the instruction time in nanoseconds is added to the realtime (nanosecond) counter to track the current point in time of the emulated machine.
The outer execution loop (executed after a block of instructions, just like Dosbox has its cycles loop around the inner loop) only checks the actual time elapsed, by reading a high-resolution clock, and waits for the high-resolution clock to catch up with the CPU time (this synchronizes the block with realtime). After that it updates keyboard and mouse input, renders any changed display output to the screen (prerendered at the current resolution), and delays a bit every second to allow other threads to run (debugger, BIOS Settings menu threads). Then it starts the next block of the inner loop.
The outer loop (which also synchronizes to realtime via the delay(0); call):
byte BIOSMenuAllowed = 1; //Are we allowed to open the BIOS menu?
	//CPU execution, needs to be before the debugger!
	uint_64 currentCPUtime = getnspassed_k(&CPU_timing); //Current CPU time to update to!
	uint_64 timeoutCPUtime = currentCPUtime+TIMEOUT_TIME; //We're timed out this far in the future (1ms)!

	double instructiontime,timeexecuted=0.0f; //How much time did the instruction last?
	byte timeout = TIMEOUT_INTERVAL; //Check every 10 instructions for timeout!
	for (;last_timing<currentCPUtime;) //CPU cycle loop for as many cycles as needed to get up-to-date!
	{
		if (debugger_thread)
		{
			if (threadRunning(debugger_thread)) //Are we running the debugger?
			{
				return 1; //OK, but skipped!
			}
		}
		if (BIOSMenuThread)
		{
			if (threadRunning(BIOSMenuThread)) //Are we running the BIOS menu and not permanently halted? Block our execution!
			{
				if ((CPU[activeCPU].halt&2)==0) //Are we allowed to be halted entirely?
				{
					return 1; //OK, but skipped!
				}
				BIOSMenuAllowed = 0; //We're running the BIOS menu! Don't open it again!
			}
		}
		if ((CPU[activeCPU].halt&2)==0) //Are we running normally(not partly ran without CPU from the BIOS menu)?
		{
			BIOSMenuThread = NULL; //We don't run the BIOS menu anymore!
		}

		if (allcleared) return 0; //Abort: invalid buffer!

		interruptsaved = 0; //Reset PIC interrupt to not used!
		if (!CPU[activeCPU].registers) //We need registers at this point, but have none to use?
		{
			return 0; //Invalid registers: abort, since we're invalid!
		}
		if (CPU[activeCPU].halt) //Halted?
		{
			if (romsize) //Debug HLT?
			{
				MMU_dumpmemory("bootrom.dmp"); //Dump the memory to file!
				return 0; //Stop execution!
			}

			if (CPU[activeCPU].halt & 0xC) //CGA wait state is active?
			{
				if ((CPU[activeCPU].halt&0xC) == 8) //Are we to resume execution now?
				{
					CPU[activeCPU].halt &= ~0xC; //We're resuming execution!
					goto resumeFromHLT; //We're resuming from HLT state!
				}
				goto skipHaltRestart; //Count cycles normally!
			}
			else if (CPU[activeCPU].registers->SFLAGS.IF && PICInterrupt() && ((CPU[activeCPU].halt&2)==0)) //We have an interrupt? Clear Halt State when allowed to!
			{
				CPU[activeCPU].halt = 0; //Interrupt->Resume from HLT
				goto resumeFromHLT; //We're resuming from HLT state!
			}
			else
			{
				skipHaltRestart:
				if (DosboxClock) //Execute using Dosbox clocks?
				{
					CPU[activeCPU].cycles = 1; //HLT takes 1 cycle for now!
				}
				else //Execute using actual CPU clocks!
				{
					CPU[activeCPU].cycles = 1; //HLT takes 1 cycle for now, since it's unknown!
				}
			}
			if (CPU[activeCPU].halt==1) //Normal halt?
			{
				//Increase the instruction counter every instruction/HLT time!
				cpudebugger = needdebugger(); //Debugging information required? Refresh in case of external activation!
				if (cpudebugger) //Debugging?
				{
					debugger_beforeCPU(); //Make sure the debugger is prepared when needed!
					debugger_setcommand("<HLT>"); //We're a HLT state, so give the HLT command!
				}
				debugger_step(); //Step debugger if needed, even during HLT state!
			}
		}
		else //We're not halted? Execute the CPU routines!
		{
			resumeFromHLT:
			if (CPU[activeCPU].registers && doEMUsinglestep) //Single step enabled?
			{
				if (getcpumode() == (doEMUsinglestep - 1)) //Are we the selected CPU mode?
				{
					switch (getcpumode()) //What CPU mode are we to debug?
					{
					case CPU_MODE_REAL: //Real mode?
						singlestep |= ((CPU[activeCPU].registers->CS == (singlestepaddress >> 16)) && (CPU[activeCPU].registers->IP == (singlestepaddress & 0xFFFF))); //Single step enabled?
						break;
					case CPU_MODE_PROTECTED: //Protected mode?
					case CPU_MODE_8086: //Virtual 8086 mode?
						singlestep |= ((CPU[activeCPU].registers->CS == singlestepaddress >> 32) && (CPU[activeCPU].registers->EIP == (singlestepaddress & 0xFFFFFFFF))); //Single step enabled?
						break;
					default: //Invalid mode?
						break;
					}
				}
			}

			HWINT_saved = 0; //No HW interrupt by default!
			CPU_beforeexec(); //Everything before the execution!
			if (!CPU[activeCPU].trapped && CPU[activeCPU].registers) //Only check for hardware interrupts when not trapped!
			{
				if (CPU[activeCPU].registers->SFLAGS.IF) //Interrupts available?
				{
					if (PICInterrupt()) //We have a hardware interrupt ready?
					{
						HWINT_nr = nextintr(); //Get the HW interrupt nr!
						HWINT_saved = 2; //We're executing a HW(PIC) interrupt!
						if (!((EMULATED_CPU <= CPU_80286) && REPPending)) //Not 80386+, REP pending and segment override?
						{
							CPU_8086REPPending(); //Process pending REPs normally as documented!
						}
						else //Execute the CPU bug!
						{
							CPU_8086REPPending(); //Process pending REPs normally as documented!
							CPU[activeCPU].registers->EIP = CPU_InterruptReturn; //Use the special interrupt return address to return to the last prefix instead of the start!
						}
						call_hard_inthandler(HWINT_nr); //get next interrupt from the i8259, if any!
					}
				}
			}
			cpudebugger = needdebugger(); //Debugging information required? Refresh in case of external activation!
			MMU_logging = debugger_logging(); //Are we logging?
			CPU_exec(); //Run CPU!

			//Increase the instruction counter every instruction/HLT time!
			debugger_step(); //Step debugger if needed!

			CB_handleCallbacks(); //Handle callbacks after CPU/debugger usage!
		}

		//Update current timing with calculated cycles we've executed!
		instructiontime = CPU[activeCPU].cycles*CPU_speed_cycle; //Increase timing with the instruction time!
		last_timing += instructiontime; //Increase CPU time executed!
		timeexecuted += instructiontime; //Increase CPU executed time executed this block!
		tickPIT(instructiontime); //Tick the PIT as much as we need to keep us in sync!
		updateDMA(instructiontime); //Update the DMA timer!
		updateMouse(instructiontime); //Tick the mouse timer if needed!
		stepDROPlayer(instructiontime); //DRO player playback, if any!
		if (BIOS_Settings.useAdlib) updateAdlib(instructiontime); //Tick the adlib timer if needed!
		updateATA(instructiontime); //Update the ATA timer!
		tickParallel(instructiontime); //Update the Parallel timer!
		if (BIOS_Settings.useLPTDAC) tickssourcecovox(instructiontime); //Update the Sound Source / Covox Speech Thing if needed!
		updateVGA(instructiontime); //Update the VGA timer!
		if (--timeout==0) //Timed out?
		{
			timeout = TIMEOUT_INTERVAL; //Reset the timeout to check the next time!
			if (getnspassed_k(&CPU_timing) >= timeoutCPUtime) break; //Timeout? We're not fast enough to run at full speed!
		}
	} //CPU cycle loop!

	//Slowdown to requested speed if needed!
	for (;getnspassed_k(&CPU_timing) < last_timing;) delay(0); //Update to current time every instruction according to cycles passed!

	updateKeyboard(timeexecuted); //Tick the keyboard timer if needed!

	//Check for BIOS menu!
	if (psp_keypressed(BUTTON_SELECT)) //Run in-emulator BIOS menu requested?
	{
		if (!is_gamingmode() && !Direct_Input && BIOSMenuAllowed) //Not gaming/direct input mode and allowed to open it(not already started)?
		{
			BIOSMenuThread = startThread(&BIOSMenuExecution,"BIOSMenu",NULL); //Start the BIOS menu thread!
			delay(0); //Wait a bit for the thread to start up!
		}
	}
The last_timing variable synchronizes the CPU clock (actually the whole emulation) to realtime (a high-resolution clock). The CPU_speed_cycle variable contains the duration of one CPU cycle in nanoseconds, depending on how the emulator is configured; in the default configuration it's set to the duration of one cycle at 4.77MHz. So if anything slows down, it's only the complete emulation (which is based on the CPU), not the hardware. The hardware always runs as fast as the CPU does, since each device uses its own counter, which is advanced by the instructiontime variable: the time spent on the current instruction (CPU instruction cycles times the duration of one cycle in nanoseconds). The counters within the hardware are increased by the same instructiontime value, but those counters reset themselves periodically (they wrap around their tick period in nanoseconds, so a device running at 1000Hz wraps back to 0.0 every 1000000.0 nanoseconds).
So the only thing that can actually accumulate is the emulator's own clock (since it can't use that kind of wrapping: it needs the total running time in nanoseconds to synchronize itself to the Windows/Linux/PSP high-resolution clock).
The code that slows down emulation to realtime:
	//Slowdown to requested speed if needed!
	for (;getnspassed_k(&CPU_timing) < last_timing;) delay(0); //Update to current time every instruction according to cycles passed!
Note, the emulator timing will go wrong once it has been running longer than getnspassed_k and/or last_timing can represent in their results. But that only happens once full precision can no longer hold the numbers, which can be a long time? It's currently used at two points: the main loop (to synchronize to actual time) and the CPU percentage shown in the top left corner of the video output. getnspassed_k is essentially a call to the normal getnspassed (which updates the container's timepoint to the current time), except that getnspassed_k discards those changes to keep the starting point intact.
If I changed the getnspassed_k loop to a normal getnspassed that decreases both itself and last_timing (while increasing an extra variable by that amount for displaying the CPU speed), this problem would be solved, and the emulation could run virtually forever without it occurring.
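A sketch of that proposed change (hypothetical names, not the actual emulator code): consume the processed time each block so neither counter grows without bound, and keep a separate, non-critical accumulator only for the speed display:

```c
static double pending_ns = 0.0;   /* emulated time not yet matched by real time */
static double displayed_ns = 0.0; /* running total, only for the speed display  */

/* Called once per block: the difference stays small while the emulation
   keeps up with realtime, so it never loses precision; only the display
   total grows, and an error there is harmless. */
static void account_block(double emulated_ns, double real_ns)
{
    pending_ns += emulated_ns - real_ns;
    displayed_ns += emulated_ns;
}
```

The slowdown loop would then wait until pending_ns drops to zero instead of comparing two ever-growing absolute counters.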
Well, after all that work, at least something has improved with recent commits(according to the CGA Compatibility Tester) 😀
Although the problem with the car is back, apparently. The only real failure in the Compatibility Tester (besides timing not being 100%) is the Start Address reprogramming test. The first screen is OK, but all other screens are vertically shifted down/up by 1 line (when scrolling left) or horizontally shifted right/left by 1/3 of a screen (when scrolling up). Does that mean there's a problem in my Start Address calculations?
Edit: Btw, I've modified the core to handle timing the same way the hardware is synchronized. I've also adjusted the CPU speed percentage to use this new method (much like the hardware does). That should fix the 1000000000.0/second accuracy issue while keeping everything synchronized to realtime, so the 20-year limit on the main emulation timer shouldn't apply anymore. It should now run at full accuracy virtually forever (this applies to the high-precision timer too, since it was built to handle overflows where the readings become reversed, so the time returned is still correct: instead of newtime-oldtime (the normal difference), it computes newtime+((~0)-oldtime), the difference when carried into bit 64 of the 64-bit time).
The only thing that might eventually cause problems is the 32-bit IRQ0 timer, which will overflow after about 4 billion (2^32) times 55ms, so about 7.47 years (or, with a 16-bit counter, about 1 hour; 1 hour and 0.87 seconds to be exact). Although that's a limit of the BIOS, not of the emulator itself.
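For reference, with unsigned 64-bit counters the wrap-around case needs no special handling at all, because unsigned subtraction is already modular (a sketch with hypothetical names; note that the idiom `new + ((~0) - old)` is off by one from `new - old` modulo 2^64):

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 64-bit counter.
   Unsigned subtraction is performed modulo 2^64, so a single wrap
   between the two readings is handled automatically. */
static uint64_t elapsed_ticks(uint64_t oldtime, uint64_t newtime)
{
    return newtime - oldtime;
}
```

For example, a reading taken just before the counter wraps and another just after still yields the correct small difference.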
I was just looking at the direct NTSC capture of the final version of 8088 MPH's Kefrens Bars effect when I noticed that the 'chinese wall'-like effect (which displays as a single line growing wider and narrower on the VGA emulation) actually overflows one or two scanlines onto a black background (it looks like some sort of 'ghosting' of the bottommost scanline output). This happens in x86EMU too. I always thought that was a byproduct of not being 100% cycle accurate yet (about 96% there, 4% lower than required according to 8088 MPH, at 1604 cycles). Could this be a problem in 8088MPH itself? You can clearly see the 'ghosting' on the first black scanline at the bottom of the output (the raster bars are black for that scanline).
If it happens on the capture, then it happens on real hardware! The bottom scanline or two are displayed in a different palette - is that what you're talking about? I forget exactly why now. It might be some left-over debugging code that was never bothersome enough to remove.
What I mean is that the bottommost scanline repeats the bottom scanline of that chinese-wall effect (which moves left and right), but always with a black background instead of the vertical raster bars (palette index 0), making it look like there's one scanline too many, with the bottom sticking out of the active display area (since it happens in x86EMU too, this should be the bottommost scanline on the screen). Maybe it's row 201?
Looking at it again, either there's a 201st scanline that duplicates scanline 199, or scanline 199 always duplicates scanline 198. Either way, that duplicate scanline always has a black background and border, which leads me to believe that either a scanline that should be zeroed isn't (scanline 200), or the background/foreground routine is one scanline short (scanline 199).
Just have been messing around with the video controller's rendering again.
It now handles vertical and horizontal total in 0 clocks(so programming it for 80x25 character clocks will actually take that amount of time).
One thing I immediately noticed is that some of the effects were originally causing what looked like vertical scanline glitches: specifically the effects that make scanlines 'buzz off', as you've called it, like the flower girl effect and the CGA new-vs-old effect at the beginning.
Then I looked at the CGA documentation again and noticed something: UniPCemu was applying the start map address when it finished handling vertical total, which is after it had already loaded said address into the first scanline of the new frame during the vertical total handling itself... Whoops!
Having fixed that little bug, the CGA timing should be more accurate now: a screen takes exactly horizontal total times (vertical total plus the vertical total adjustment) character clocks to render, and the start address is loaded at the end of it (instead of during the next clock, which is on the next frame, and was thus loaded at the end of that frame, way too late).
Of course that also improves the VGA's method(which loads it during vertical retrace instead), since the VGA modes now take a proper 800 horizontal clocks instead of stopping at the old horizontal retrace clock. 😁
It really needs a genuine IBM CGA card. I think there might be one or two clone CGA cards which have been reported to work as well but it definitely won't work on EGA/VGA/SVGA/later cards.