VOGONS


First post, by superfury

Rank l33t++

I'm running UniPCemu's cycle core of the 80386 (mostly cycle-accurate, except for the new 80386+ instructions and things like task switching and segment descriptor loading) on a stock 4.0GHz i7 CPU. But when I run the Compaq Deskpro 386 at 16MHz (the same speed used on the Inboard 386 XT, while the Inboard 386 AT uses a 32MHz CPU clock instead), it runs at only 20% speed (100% speed is required to play games at normal speed).

The hardware always runs at a base clock of 14.31818MHz, except for hardware that uses its own clocks. That includes video cards (even the CGA, which runs its own 14.31818MHz clock instead of the general clock used by the CPU, for simple compatibility with the VGA and SVGA emulation) and sound card output (running at the 14MHz clock converted to normal time in nanoseconds, which gives a basic timing for outputting samples to the renderer at a fixed rate like 44.1kHz, depending on the sound card). Some hardware uses realtime clocks instead: the Sound Blaster recording clock is modified to use the actual time the emulator has been running, so that recording happens in realtime without distorting the recorded input by running at a variable rate (the rate would otherwise depend on the CPU emulation being able to match realtime speed, which isn't always the case, especially with the heavier CPUs like the 80386+ at 16MHz+).

Is it normal for a cycle-accurate 16MHz CPU emulation to be so heavy on an Intel i7@4.0GHz CPU? Or does that mean my emulator is badly optimized for some reason?

It's essentially running three heavy clocks on the system in that situation: a 16MHz CPU clock (an integer clock, which is converted to the 14.31818MHz clock most PC-compatible hardware is based on); a video clock that runs off the CPU clock converted to nanosecond units (double floating point) and runs at different speeds (e.g. 25/28MHz VGA, 14.31818MHz CGA, the MDA clock or the ET3000/ET4000 SVGA clocks, which can be set by software in the (S)VGA cases); and finally a realtime clock that's used directly by specific hardware (e.g. 44.1kHz Sound Blaster output, CMOS timing, floppy disk controller (TODO, but planned for some future version supporting physical disk movements for more accuracy), 44.1kHz Game Blaster output, ATA/ATAPI controllers, joystick timings, modem timings, parallel port timings, PIT sound output, PS/2 keyboard timing, PS/2 mouse timing, Sound Source/Covox Speech Thing output, UART timings).
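
As a rough illustration of the clock conversion described above, here is a minimal sketch of turning ticks of one clock into ticks of another using only integer math. It assumes 14318180 Hz for the 14.31818MHz base clock, and the names `ClockConverter`, `clock_init` and `clock_convert` are hypothetical, not taken from UniPCemu. The remainder is carried between calls so no fractional ticks get lost.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not UniPCemu code: converts ticks of a source clock
 * into ticks of a destination clock with integer math only. */
typedef struct {
    uint64_t src_hz;    /* source clock, e.g. 16000000 for the CPU */
    uint64_t dst_hz;    /* destination clock, e.g. 14318180 (assumed) */
    uint64_t remainder; /* leftover numerator, always < src_hz */
} ClockConverter;

static void clock_init(ClockConverter *c, uint64_t src_hz, uint64_t dst_hz)
{
    assert(src_hz > 0);
    c->src_hz = src_hz;
    c->dst_hz = dst_hz;
    c->remainder = 0;
}

/* Returns how many whole destination ticks elapse during src_ticks,
 * keeping the fractional part in c->remainder for the next call. */
static uint64_t clock_convert(ClockConverter *c, uint64_t src_ticks)
{
    uint64_t total = src_ticks * c->dst_hz + c->remainder;
    c->remainder = total % c->src_hz;
    return total / c->src_hz;
}
```

Calling `clock_convert` once per CPU cycle or once per block of cycles should add up to the same number of base-clock ticks, since the remainder is never dropped.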

Profiling shows that about 35% is spent in the CPU EU/BIU emulation and 20% in the video card emulation (plain VGA in this case). The remaining units take barely any time, up to 7% each, depending on the hardware.

Is it normal for a 16MHz 80386 cycle-based emulation to be this heavy? Or am I simply optimizing it wrong in some way?

Profiler output from Compaq Deskpro 386 POSTing:

Flat profile:

Each sample counts as 0.01 seconds.
     % cumulative     self                self    total
  time    seconds  seconds      calls   s/call   s/call  name
 36.28      66.76    66.76                               _mcount_private
 20.46     104.40    37.64                               __fentry__
  3.57     110.97     6.57  259658824     0.00     0.00  updateVGA
  3.39     117.20     6.23    2621155     0.00     0.00  DoEmulator
  2.47     121.75     4.55  259633524     0.00     0.00  CPU_tickBIU
  1.95     125.33     3.58  319094823     0.00     0.00  VGA_ActiveDisplay_Text
  1.86     128.76     3.43                               floor
  1.59     131.68     2.92  555702049     0.00     0.00  getnspassed
  1.53     134.50     2.82  259637130     0.00     0.00  CPU_exec
  1.26     136.81     2.31  259661765     0.00     0.00  update8042
  1.22     139.05     2.24  259650694     0.00     0.00  updateATA
  1.07     141.02     1.97  259637810     0.00     0.00  updateGameBlaster
  0.98     142.83     1.81                               floorf
  0.83     144.36     1.53  304588206     0.00     0.00  checkMMUaccess
  0.79     145.82     1.46  304586596     0.00     0.00  CPU_MMU_checklimit
  0.71     147.12     1.30  259659161     0.00     0.00  debugger_step
  0.66     148.33     1.21  259661051     0.00     0.00  needdebugger
  0.65     149.52     1.19   77454927     0.00     0.00  DMA_StateHandler_SI
  0.64     150.70     1.18  452976015     0.00     0.00  fifobuffer_freesize
  0.64     151.88     1.18  259654778     0.00     0.00  tickPIT
  0.62     153.02     1.14  177706134     0.00     0.00  CPU_fillPIQ
  0.61     154.15     1.13  229797672     0.00     0.00  BIOS_readhandler
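
A note on reading the listing: the two top entries, `_mcount_private` and `__fentry__`, look like the call-counting hooks that gprof-style instrumentation adds to every function, so they are likely profiler overhead rather than emulator code. Assuming that is the case, the shares of the real functions can be rescaled as in the sketch below. The helper name `share_without_overhead` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: rescales a "% time" entry after removing the
 * profiler's own overhead from the total. Both arguments are percentages
 * of total run time as printed in the "% time" column. */
static double share_without_overhead(double entry_pct, double overhead_pct)
{
    assert(overhead_pct >= 0.0 && overhead_pct < 100.0);
    return entry_pct * 100.0 / (100.0 - overhead_pct);
}
```

With the figures from the listing, the assumed overhead is 36.28 + 20.46 = 56.74%, which would put `updateVGA` (3.57%) at roughly 8.25% of the remaining time.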

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 4, by vladstamate

Rank Oldbie
superfury wrote:

Is it normal for a 16MHz 80386 cycle-based emulation to be this heavy? Or am I simply optimizing it wrong in some way?

Yes it is. That is what I am getting too. This is one of the reasons I am stalled in adding support for new stuff in CAPE. I am getting similar speeds: barely maintaining 16MHz on my Core i7 4790.

At 16MHz you have to do 16 million ticks per second of the CPU (EU), the BIU and the graphics, as well as other things like prefetches and so on.

At 4GHz you have 250 Core i7 cycles to emulate one 16MHz clock cycle. That is not a lot considering how much work there is to do.
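
The budget arithmetic can be written out as a small sketch. It assumes the host core runs flat out at its nominal clock, and the function names are made up for illustration. The second helper scales the budget by the emulation speed actually reached, such as the 20% reported in the first post.

```c
#include <assert.h>
#include <stdint.h>

/* Host cycles available per emulated cycle when running at 100% speed. */
static uint64_t host_cycles_budget(uint64_t host_hz, uint64_t guest_hz)
{
    assert(guest_hz > 0);
    return host_hz / guest_hz;
}

/* Host cycles spent per emulated cycle when the emulator only reaches
 * speed_percent of realtime (assumes the host core is fully busy). */
static uint64_t host_cycles_spent(uint64_t budget, unsigned speed_percent)
{
    assert(speed_percent > 0);
    return budget * 100u / speed_percent;
}
```

For a 4GHz host and a 16MHz guest the budget is 250 host cycles per emulated cycle. At 20% speed the same calculation gives 1250.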

I am working on some optimizations at the moment, but since I am ticking every component on every cycle, there is not much I can do.

Sure, I can cheat: I have to wait 7 cycles for something? Why not do cycles+=7? But then my emulator would not be cycle-based, and then there is no point, as we already have PCem/DOSBox/etc. that do a better job than I do.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 2 of 4, by superfury

Rank l33t++

Essentially UniPCemu is a hybrid of the two: EU cycles work in blocks (e.g. n cycles to wait, then once those elapse another n cycles to wait, etc.). Usually n=1 while waiting for BIU transactions to start/complete. n is 4 for NOP, 150+(sum of steps taken) during (I)DIV instructions, and X+(1 to 8/16/32 bits set) during (I)MUL. Others vary depending on the instruction. The BIU ticks off those timings one by one, processing and advancing its T-state accordingly. So it's not as much on a single-cycle level as your emulator, but for BIU transactions (and DMA) it is single-cycle. The EU waits in 1-cycle periods while waiting for the BIU to become ready or for BIU results. Otherwise, it tells the BIU to process n cycles instead (depending on the instruction's EU timings as currently used). So most instructions (e.g. ADD) work like this: request data transfer (1-cycle wait times), request answer (1-cycle wait times), execute and store the result in temp storage (n cycles, depending on the instruction), request BIU store (1-cycle wait times), request BIU result code (1-cycle wait times).

There are some exceptions to the rule, adding BIU delays (e.g. RET instructions) or BIU access delays (jumps).
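
To make the scheme concrete, here is a toy model of it. This is not UniPCemu's implementation: all names are hypothetical, and the 4 T-states per bus transfer is an arbitrary assumption picked for the example. The EU polls in 1-cycle waits while a transfer is pending and otherwise runs a block of n cycles, while the BIU is ticked once for every cycle either way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed length of one bus transfer, for illustration only. */
enum { ASSUMED_TSTATES_PER_TRANSFER = 4 };

typedef struct {
    bool     busy;   /* a transfer is in progress */
    unsigned tstate; /* T-states completed for the current transfer */
} Biu;

/* The BIU advances exactly one T-state per call. */
static void biu_tick(Biu *b)
{
    if (b->busy && ++b->tstate == ASSUMED_TSTATES_PER_TRANSFER) {
        b->busy = false;
        b->tstate = 0;
    }
}

static void biu_request(Biu *b)
{
    assert(!b->busy);
    b->busy = true;
    b->tstate = 0;
}

/* Runs one memory-operand instruction and returns the cycles it took:
 * operand read, exec_cycles of EU work, then the result write. */
static uint64_t run_instruction(Biu *b, uint32_t exec_cycles)
{
    uint64_t cycles = 0;

    biu_request(b);                 /* request data transfer */
    while (b->busy) {               /* EU waits one cycle at a time */
        biu_tick(b);
        cycles++;
    }
    for (uint32_t i = 0; i < exec_cycles; i++) { /* EU block of n cycles */
        biu_tick(b);                /* the BIU still ticks off each cycle */
        cycles++;
    }
    biu_request(b);                 /* request BIU store */
    while (b->busy) {
        biu_tick(b);
        cycles++;
    }
    return cycles;
}
```

With these assumptions an instruction with 3 EU cycles takes 4 + 3 + 4 = 11 cycles, and the BIU is ticked on every one of them, which is where the per-cycle cost comes from.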

So it's taking about 250*5=1250 i7 clock cycles to emulate an 80386+ EU/BIU clock (with hardware emulation) at ~15% CPU usage, according to the Windows task manager. So that's pretty optimized already, is it not? Or can it be optimized further without losing accuracy? Has anyone got tips on optimizing the emulated CPU or video card? Much is already optimized in various ways (including (un)likely statements), but is such a thing even possible to do (in plain C)?


Reply 3 of 4, by vladstamate

Rank Oldbie
superfury wrote:

Essentially UniPCemu is a hybrid of the two: EU cycles work in blocks (e.g. n cycles to wait, then once those elapse another n cycles to wait, etc.). Usually n=1 while waiting for BIU transactions to start/complete.

Yes, that is fine. In fact I am trying to do just that. The problem is that not many instructions have more than 1 cycle of "waiting to execute".

superfury wrote:

So it's taking about 250*5=1250 i7 clock cycles to emulate an 80386+ EU/BIU clock (with hardware emulation) at ~15% CPU usage, according to the Windows task manager.

Why do you multiply by 5? Also, the 15% is for the whole CPU, which has multiple cores. So at most you can get 25% utilization, since I presume your code is mostly single-threaded (where it matters).


Reply 4 of 4, by superfury

Rank l33t++

I multiplied by 5 to get the number of cycles the i7 spends emulating one 80386 cycle (plus hardware updates, rendering etc.) with UniPCemu. Since the emulator runs at 20% of realtime speed, it needs 5 times the duration of a 16MHz 80386 clock cycle to update the state for one such cycle (hence the slowdown to 20% speed, as shown by the internal show CPU speed setting). So each second, only 0.2 seconds are emulated with the current optimizations.

The Windows task manager's process status never shows more than 15% on 4 cores (single-threaded). That's probably due to up to 1 0us delay each cycle (there's no delay when behind realtime, i.e. when the emulator indicates <100% emulation speed).
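
The pacing rule mentioned above (only delay when not behind realtime) and the reported speed percentage could look roughly like this. It is a minimal sketch with hypothetical names, assuming both times are kept in nanoseconds since the start of emulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the emulated clock is ahead of wall-clock time, so the main
 * loop may yield. When behind (speed < 100%) it should keep running. */
static bool should_delay(uint64_t emulated_ns, uint64_t real_ns)
{
    return emulated_ns > real_ns;
}

/* Emulation speed as a whole percentage of realtime. */
static unsigned speed_percent(uint64_t emulated_ns, uint64_t real_ns)
{
    assert(real_ns > 0);
    return (unsigned)(emulated_ns * 100u / real_ns);
}
```

With 0.2 emulated seconds per real second, `speed_percent` reports 20 and `should_delay` stays false, so under this rule no delay would be taken.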

Indeed, it's mostly single-threaded. The only two other threads are the Settings/Internal POST thread and the Debugger thread. Each of them runs in its own thread so that it can easily wait for and use pressed keys within loops (without hanging the main thread, which handles SDL input and output).

All other things are handled in the main thread.
