The main problem I'm having with optimizing my emulator is that I can't find any more ways to make it run faster than it currently does. I see it topping out at about 200-300KIPS in IPS clocking mode on my 4GHz i7-4790K, but can't seem to get much more out of it. How can Dosbox manage to reach 3MIPS with only a few % CPU usage, while mine is at ~15% (the limit, it seems) with only 20%-30% of that speed reached?
Quite a few precalculated tables are already used, for both memory and segment descriptors etc.
Does anybody have some tips on how to get it faster? I notice much of the bottleneck (according to profiling) is the PIQ filling from emulated RAM, where the basic resolving of the memory block and its physical location seems to take up the most time (see mmu/mmu.c's directrb/directwb functions). This is especially the case with memory reads, even though the three most recent memory accesses (data read, instruction read and data write) are cached separately.
Do you cache the instruction decoding? I found that makes a big difference.
Be careful with the profiler: sampling can give hints as to where it spends time, but it won't take into account jumps the CPU can't predict.
For example, I followed what Dosbox does pretty closely when it comes to memory reads in BoxedWine.
If you have a memory page that is either a structure with function pointers for reads/writes or a c++ class with virtual functions, this will be really slow. Instead I have a separate array of memory offsets from the page's memory to the system memory and if it is a normal read or a write that isn't being watched by the code cache, then it becomes a simple if statement, and since this if statement usually passes, it works well with branch prediction.
Here is my readd, I tried to avoid the c++ virtual call on the page by doing a check if it is a simple offset.
Even if that approach doesn't work for you, you need to be careful with function pointers and virtual functions when it comes to memory reads/writes.
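To make the idea concrete, here is a minimal sketch in C of a read with a predictable fast-path check in front of the slow handler. This is not BoxedWine's actual readd; page_read_base, slow_readd and the tiny 16-page address space are all hypothetical names and sizes chosen for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_COUNT 16                      /* tiny address space, illustration only */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Hypothetical per-page table: host pointer when the page is plain RAM,
   NULL when the access must go through a slow handler. */
static uint8_t *page_read_base[PAGE_COUNT];

/* Slow path: stands in for a per-page virtual call or function pointer. */
static uint32_t slow_readd(uint32_t address)
{
    (void)address;
    return 0xFFFFFFFFu;                    /* pretend unmapped memory reads all ones */
}

static uint32_t readd(uint32_t address)
{
    uint32_t page = address >> PAGE_SHIFT;
    uint32_t offset = address & PAGE_MASK;

    /* Fast path: plain RAM page and the dword stays inside the page.
       This branch is almost always taken, so it predicts well. */
    if (page < PAGE_COUNT && page_read_base[page] != NULL && offset <= PAGE_MASK - 3) {
        const uint8_t *p = page_read_base[page] + offset;
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }
    return slow_readd(address);            /* rare: MMIO, watched or split access */
}
```

The point is only that the common case never reaches the indirect call; what the slow path does is up to the emulator.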
With a 32-bit build that uses just c++ (no dynamic generated asm at runtime), I get on my i7 6700 about 215 MIPS and the Quake 2 test is 7.0 fps (640x480 software)
If I remove the if statement on all my reads/writes I get 185MIPS and the Quake 2 test is 5.9 fps. So that one if statement on my reads/writes improved performance by about 15%.
With a simple JIT dynamic cpu, my performance goes up to about 400MIPS with a quake 2 test result of 14.0 fps. The JIT looks for code blocks that are called a lot, then generates ASM and replaces the old block with the custom one. This saves functions calls (one virtual function call per instruction) and I can check if CPU flags can be ignored since another instruction in the same block will overwrite it before it is used.
If it were me, I would have compiler directives defined for logging and not run-time checks.
Well, the logging checks aren't compiler-determined. They're affected by emulation settings that the user can change at run-time (e.g. breakpoint address, different modes etc.). When the debugger is disabled, they barely (if at all) show up in the profiler.
I also split the RAM into 64K chunks for the different RAM areas (0-640K, 1M-15M, 16M-3G, 4G+ (unused)). Whenever an access moves to a new chunk, some information is precalculated for it: accessibility (memory holes vs. RAM), the RAM array offset of the backing storage and some administration flags. This is stored in a cache, essentially a 3-entry array with one entry for each kind of access (instructions, data reads, data writes), so the three kinds are buffered separately until a new chunk is requested. All following accesses use the cached values instead (barely taking any time).
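Roughly, that 3-entry cache could look like the sketch below. The struct layout, the names and the simplified memory map (one hole at 0xA0000-0xFFFFF, RAM everywhere else) are assumptions for illustration, not the emulator's actual code.

```c
#include <stdint.h>
#include <stddef.h>

enum access_kind { ACCESS_INSTRUCTION, ACCESS_DATA_READ, ACCESS_DATA_WRITE, ACCESS_KINDS };

struct chunk_cache_entry {
    uint32_t chunk;       /* address >> 16 of the cached 64K chunk */
    int      valid;
    int      is_ram;      /* 0 for memory holes / MMIO */
    size_t   ram_offset;  /* offset of the chunk in the backing RAM array */
};

static struct chunk_cache_entry chunk_cache[ACCESS_KINDS];
static unsigned slow_lookups;              /* counts slow resolutions, for demonstration */

/* Hypothetical slow resolver: hole at chunks 0x0A-0x0F, RAM elsewhere. */
static void resolve_chunk(uint32_t chunk, struct chunk_cache_entry *e)
{
    ++slow_lookups;
    e->chunk = chunk;
    e->valid = 1;
    if (chunk >= 0x0A && chunk <= 0x0F) {
        e->is_ram = 0;
        e->ram_offset = 0;
    } else {
        e->is_ram = 1;
        /* RAM above the hole is stored directly after the first 640K. */
        e->ram_offset = (chunk < 0x0A) ? ((size_t)chunk << 16)
                                       : (((size_t)chunk - 6) << 16);
    }
}

/* One cache entry per access kind; only a chunk change hits the slow path. */
static const struct chunk_cache_entry *lookup_chunk(uint32_t address, enum access_kind kind)
{
    struct chunk_cache_entry *e = &chunk_cache[kind];
    uint32_t chunk = address >> 16;
    if (!e->valid || e->chunk != chunk)
        resolve_chunk(chunk, e);
    return e;
}
```

Keeping the entries per access kind means a code fetch in one chunk and a data write in another don't evict each other.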
It's the applyMemoryHoles function that takes the most time.
That's part of the problem: memory is byte-addressable. Unaligned word/dwords are used after all(with memory wrapping around).
And even then, the BIU emulation reduces all memory accesses to byte ones, only joining them together after reading all 2 (word) or 4 (dword) of them. Actually, that happens during fetching: for reads, each byte is shifted to its byte position and then OR-ed into the result; for writes, the value is shifted and then written a byte at a time.
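As a sketch of that shift-and-OR approach (the 64K wrapping memory array and the function names here are made up for the example):

```c
#include <stdint.h>

static uint8_t memory[0x10000];            /* hypothetical 64K of wrapping memory */

static uint8_t read_byte(uint32_t address)
{
    return memory[address & 0xFFFFu];      /* addresses wrap around at 64K */
}

static void write_byte(uint32_t address, uint8_t value)
{
    memory[address & 0xFFFFu] = value;
}

/* Read `count` (1, 2 or 4) bytes, least significant byte first. */
static uint32_t read_bytes_le(uint32_t address, unsigned count)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < count; ++i)
        result |= (uint32_t)read_byte(address + i) << (8 * i);
    return result;
}

/* Write `count` (1, 2 or 4) bytes, least significant byte first. */
static void write_bytes_le(uint32_t address, uint32_t value, unsigned count)
{
    for (unsigned i = 0; i < count; ++i)
        write_byte(address + i, (uint8_t)(value >> (8 * i)));
}
```

Because every byte goes through the same byte accessor, unaligned and wrapping accesses need no special cases, at the cost of up to four lookups per access.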
The same counts for the random addresses(byte-aligned) of the DMA controller. Although it's broken up by the memory unit itself(memory_direct[r/w][w/d] functions).
The PIQ fetching/filling routine works in byte-sized units as well, although it's optimized for multiple-byte data: it first checks the largest range (the full PIQ) against the CS descriptor (limit etc.) and paging, reducing the precalculated end position by one byte at a time until both the start and end bytes are readable. Then it fetches all of those (as far as possible) in one fell swoop in a loop, without segmentation/paging checks.
There would be a problem with accesses that were done at address 9ffff+(word) and 9fffd(dword), because the last byte(s) are in a memory mapped I/O device(VGA) instead of RAM. Or any possible PCI device that's at such an address.
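A wider access could only take a RAM fast path if every byte of it lands in RAM. A minimal sketch of such a check, assuming conventional memory simply ends at 0xA0000 (the constant and the function name are illustrative only):

```c
#include <stdint.h>

#define CONVENTIONAL_RAM_END 0xA0000u      /* first byte of the VGA window */

/* Returns 1 when all `size` bytes starting at `address` are below the
   VGA window, 0 when any of them falls into memory mapped I/O. */
static int access_is_all_ram(uint32_t address, uint32_t size)
{
    return address < CONVENTIONAL_RAM_END &&
           size <= CONVENTIONAL_RAM_END - address;
}
```

With this, a word at 9ffff or a dword at 9fffd is rejected and has to be split into byte accesses.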
The current hot path for the emulator booting and running various applications(Winter games, Jazz Jackrabbit, Doom 2) at 3MIPS speed is:
DoEmulator: 90.32% (remainder is updating input/output).
Within that: coreHandler: 90.30%
Within that: CPU_exec: 57.33% (updateVGA is 9.33%, updateDMA: 3.91%, updateAudio: 3.02%)
Within CPU_exec: CPU_readOP_prefix: 38.73%, CPU_OP: 14.15%.
Within CPU_readOP_prefix: CPU_readOP: 31.89%, modrm_readparams: 3.58%, CPU_readOPw: 0.73%, CPU_readOPdw: 0.31%
Within CPU_readOP(also called by the CPU_readOPw etc.): BIU_dosboxTick: 29.10%, checkMMUaccess: 2.26%, readfifobuffer: 0.50%, CODE_SEGMENT_DESCRIPTOR_D_BIT: 0.16%.
Within BIU_dosboxTick: BIU_directrb: 14.41%, writefifobuffer: 3.07%, fifobuffer_freesize: 2.02%, checkMMUaccess: 1.99%.
Within BIU_directrb: MMU_INTERNAL_directrb_realaddr: 13.85%, other lines pretty much 0%.
Within MMU_INTERNAL_directrb_realaddr: MMU_INTERNAL_directrb: 5.26%, OPTROM_readhandler: 2.03%, BIOS_readhandler: 1.51%, VGAmemIO_rb: 0.94%.
Within MMU_INTERNAL_directrb_realaddr: applyMemoryHoles: 1.17%, the is_debugging check after reading MMU.memory[realaddress] into the result: 1.11%.
Within applyMemoryHoles: Loading originaladdress at the start: 0.16%, the maskaddress comparison (line 420): 0.36%, both line 434 and 482: 0.13%.
That's the basic things the Visual Studio Community 2019 profiler tells me.
That's at commit 5a2d18d198c5ea77fd22fec05a59e2cf263df75b (of 2019-10-04 19:07).
Managed to optimize it a bit by reordering the is_debugging variable to be loaded after the specialreadcycle label and after loading the result from MMU.memory[realaddress].
That seems to pretty much optimize the is_debugging stuff further away, from 1.11%+0.02% (=1.13%) down to 0.63%. The whole function now takes 3.19%, which is another 5.26-3.19=2.07% less (almost doubled in speed).
Last edited by superfury on 2019-10-17, 08:42. Edited 1 time in total.
I spend a lot of time optimizing. It is a very slow process; sometimes I would spend a lot of time going down a path I thought would help and it wouldn't. The biggest thing I learned was that if statements that are usually false or usually true (they don't flip randomly) are very fast because of branch prediction. Indirect jumps, like virtual function calls and function pointers where not all instances of the object have the same function pointer, are very slow.
And if you happen to ever target Emscripten, avoid case statements 😀
One other thing that helps a lot: if you have case statements that the compiler can't automatically turn into jump tables, make the jump tables manually. (anything that is NOT in exact numerical order)
You can reduce needed CPU cycles by around 20% that way.
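A minimal sketch of what a hand-made jump table can look like in C, for case values that are sparse and out of order. The handlers and the three code values are invented for the example and not taken from DOSBox.

```c
#include <stdint.h>

typedef int (*handler_fn)(int operand);

static int handle_increment(int x) { return x + 1; }
static int handle_double(int x)    { return x * 2; }
static int handle_negate(int x)    { return -x; }
static int handle_unknown(int x)   { (void)x; return 0; }

/* One slot per possible 8-bit code, so dispatch is a single indexed call. */
static handler_fn dispatch_table[256];

static void init_dispatch_table(void)
{
    for (int i = 0; i < 256; ++i)
        dispatch_table[i] = handle_unknown;   /* default for unlisted codes */
    dispatch_table[0x03] = handle_increment;
    dispatch_table[0x80] = handle_double;
    dispatch_table[0xF7] = handle_negate;
}

static int dispatch(uint8_t code, int operand)
{
    return dispatch_table[code](operand);
}
```

Whether this beats the compiler's own switch lowering depends on the compiler and the case distribution, so it is worth measuring before and after.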
This is something I am planning on doing on my own DOSBOX fork that I want to start at some point.... when I have time 🤣
I actually did that on a few of the DOSBOX source files years ago but then the code got lost... I actually had it up on sourceforge for quite a while but then deleted the project because all I got were nasty comments about how I should have provided a compiled version.
That 20% CPU cycles reduction was with only a couple of the source files optimized with jump tables. I started going through all the source files a few months ago and have a partial list and I can optimize it a whole lot more than I did back then.
Well, currently RAM reads are about 50-70% slower than BIOS ROM reads.
I'm already using many lookup tables, even precalculating many using prefetched pointers (e.g. looking up the instruction handler from its full lookup table (using the 0F opcode flag, the opcode itself and the operand size) and storing said function address into the current instruction handler pointer, to use during the various instruction phases).
Each instruction function essentially does things in 3 phases: load, execute, store (each as needed). It keeps track of where it left off on the last call, aborting after each stall (e.g. waiting for the BIU, but only in cycle-accurate mode). In the fastest (IPS clocking) mode, it won't abort. Instead, it processes all cycles through the BIU in a loop until the BIU is finished. So the BIU ticking is done in the instruction handler instead of in parallel to it.
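For reference, a handler lookup of that shape could be sketched as below. The table dimensions follow the description (0F flag, opcode, operand size), but the stub handlers, the names and the last_executed marker are hypothetical.

```c
#include <stdint.h>

typedef void (*op_handler)(void);

static int last_executed;                  /* set by the stub handlers, for demonstration */
static void op_nop(void)     { last_executed = 1; }
static void op_movzx16(void) { last_executed = 2; }
static void op_movzx32(void) { last_executed = 3; }
static void op_invalid(void) { last_executed = -1; }

/* Indexed as [0F prefix flag][opcode][32-bit operand size flag]. */
static op_handler handler_table[2][256][2];
static op_handler current_handler;         /* looked up once per instruction */

static void init_handlers(void)
{
    for (int p = 0; p < 2; ++p)
        for (int op = 0; op < 256; ++op)
            for (int s = 0; s < 2; ++s)
                handler_table[p][op][s] = op_invalid;
    handler_table[0][0x90][0] = op_nop;        /* NOP */
    handler_table[0][0x90][1] = op_nop;
    handler_table[1][0xB6][0] = op_movzx16;    /* 0F B6: MOVZX r16, r/m8 */
    handler_table[1][0xB6][1] = op_movzx32;    /* 0F B6: MOVZX r32, r/m8 */
}

/* Resolve the handler once; later phases just call current_handler(). */
static void select_handler(int is_0f, uint8_t opcode, int operand_size_32)
{
    current_handler = handler_table[is_0f & 1][opcode][operand_size_32 & 1];
}
```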
The main performance hits are mostly RAM/ROM reads and of course protected mode checks(on each LSB and MSB of any MMU access, so skipping the middle 2 bytes of a 32-bit access). The protected mode checks are mostly precalculated(limits, if to reverse(top-down segments), present bit), but still take a lot of execution time.
For fully 32-bit code, you could elide the limit/access checks if the segment has a base of 0, is writable and has a limit of 0xFFFFFFFF.
Not entirely. The limit still faults when an access at FFFFFFFD+(dword) or FFFFFFFF(word) is made. Those overflow the limit even in that situation(the linear address wraps back to 0MB, but that can't happen in practice, due to the CPU still #GP(0)/#SS(0) faulting on the access). Everyone seems to think that the CPU can only generate 32-bit addresses, but it's only like that externally(i.e. paging and hardware). During the limit checks it's actually 33-bit(just truncated when passed through the paging unit). How else would a CPU fault on said addresses with #GP(0)/#SS(0) faults?
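In code, the point about the 33rd bit comes down to doing the end-of-access calculation in a wider type. A minimal sketch for an expand-up segment with an inclusive limit (the function name is made up, and top-down segments are left out):

```c
#include <stdint.h>

/* Returns 1 if `size` bytes at `offset` fit within an expand-up segment
   whose last valid offset is `limit` (inclusive), 0 if it would fault. */
static int within_limit(uint32_t offset, uint32_t size, uint32_t limit)
{
    /* 64-bit arithmetic keeps the carry that a 32-bit add would lose. */
    uint64_t last = (uint64_t)offset + size - 1;
    return size != 0 && last <= limit;
}
```

With a 32-bit add, FFFFFFFD+3 would wrap to 0 and pass the check; with the wider type it compares as 100000000 against FFFFFFFF and fails, as described above.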
Edit: Just did make a small speedup in the paging checks of loading GDT/LDT/IDT descriptors. It now only checks paging against reads for the first and last byte(ignoring everything in between). Writes are unaffected(as GDT/LDT only have one byte written(the access rights byte)).
That saves 6 memory checks out of 8 for each descriptor fetched from an MMU device (so that counts for ROMs, RAM and MMIO devices alike: everything responding to the memory accesses by the CPU and BIU).
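The reason this is safe is that an 8-byte descriptor can touch at most two 4K pages, and those are the pages holding its first and last byte. As a sketch (page_readable is a hypothetical stand-in for the real paging check, here with page 5 marked not present):

```c
#include <stdint.h>

static unsigned page_checks;               /* counts paging checks, for demonstration */

/* Hypothetical paging check: every page is readable except page 5. */
static int page_readable(uint32_t linear)
{
    ++page_checks;
    return (linear >> 12) != 5;
}

/* An 8-byte descriptor spans at most two pages, so checking the first
   and last byte covers every page it can touch. */
static int descriptor_readable(uint32_t linear)
{
    return page_readable(linear) && page_readable(linear + 7);
}
```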