Improving x86 emulation speed(optimizations)?

Emulation of old PCs, PC hardware, or PC peripherals.

Improving x86 emulation speed(optimizations)?

Postby superfury » 2019-10-08 @ 19:31

The main problem I'm having with optimizing my emulator is that I can't find any more ways to make it run faster than it currently does. I see it reaching a top of about 200-300 KIPS in IPS clocking mode on my 4GHz i7-4790K, but can't seem to get much more out of it. How can DOSBox manage to reach 3 MIPS with only a few % CPU usage, while mine is at ~15% (the limit, it seems) with only 20%-30% of that speed reached?

Quite a few precalculated tables are already used, for both memory and segment descriptors etc.

Anybody have some tips on how to get it faster? According to the profiler, much of the bottleneck is the PIQ filling from emulated RAM, where the basic resolving of the memory block and its physical location seems to take up the most time (see mmu/mmu.c's directrb/directwb functions). This is especially the case with memory reads, even though the three most recent memory accesses (data read, instruction read and data write) are cached separately.

Anyone?
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: Improving x86 emulation speed(optimizations)?

Postby cyclone3d » 2019-10-08 @ 20:32

You could remove the logging checks.

If it were me, I would have compiler directives defined for logging and not run-time checks.
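The idea of a compile-time logging switch might look roughly like this (a minimal sketch; `EMU_ENABLE_LOGGING`, `EMU_LOG` and `fetchOpcode` are illustrative names, not anything from the actual emulator):

```cpp
#include <cstdio>

// Compile-time logging switch: define EMU_ENABLE_LOGGING at build time to
// turn logging on. When it's off, the macro expands to nothing, so there
// is no run-time check at all on the hot path.
#ifdef EMU_ENABLE_LOGGING
#define EMU_LOG(...) std::fprintf(stderr, __VA_ARGS__)
#else
#define EMU_LOG(...) ((void)0) // compiles away entirely
#endif

// Example call site: with logging compiled out, this reduces to the fetch.
inline int fetchOpcode(int addr) {
    EMU_LOG("fetch at %04X\n", addr);
    return addr & 0xFF; // stand-in for the real opcode fetch
}
```

The trade-off versus a run-time flag is that switching logging on requires a rebuild, which is why run-time-configurable debuggers like superfury's can't always use this.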
cyclone3d
l33t
 
Posts: 3364
Joined: 2015-4-08 @ 06:06
Location: Huntsville, AL USA

Re: Improving x86 emulation speed(optimizations)?

Postby danoon » 2019-10-08 @ 21:23

Do you cache the instruction decoding? I found that makes a big difference.
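A decoded-instruction cache in the spirit of this suggestion might be sketched like this (all names and the 64K index are illustrative, and a real decoder would store operands and a handler pointer, not just an opcode):

```cpp
#include <cstdint>

static uint8_t ram[1 << 16]; // stand-in for emulated RAM

// Cached result of decoding one instruction at a given address.
struct DecodedInsn {
    bool valid = false;
    uint8_t opcode = 0;
    uint8_t length = 0; // a real decoder also stores operands, handler, etc.
};

static DecodedInsn decodeCache[1 << 16]; // indexed by address in this sketch

static const DecodedInsn& decode(uint32_t addr) {
    DecodedInsn& d = decodeCache[addr & 0xFFFF];
    if (!d.valid) {       // decode once; later executions reuse the result
        d.opcode = ram[addr & 0xFFFF];
        d.length = 1;     // a real decoder computes the full length
        d.valid = true;
    }
    return d;
}

// Call this from the memory write path so self-modifying code re-decodes.
static void invalidateDecode(uint32_t addr) {
    decodeCache[addr & 0xFFFF].valid = false;
}
```

The catch is exactly the invalidation: every write has to check whether it hits cached code, which is why the write path needs to stay cheap.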

Be careful with the profiler: sampling can give hints about where the time is spent, but it won't take into account jumps the CPU can't predict.

For example, I followed what DOSBox does pretty closely when it comes to memory reads in BoxedWine.

If each memory page is either a structure with function pointers for reads/writes or a C++ class with virtual functions, this will be really slow. Instead I keep a separate array of memory offsets from the page's memory to the system memory. If it is a normal read, or a write that isn't being watched by the code cache, the access becomes a simple if statement, and since that if statement usually passes, it works well with branch prediction.

Here is my readd; I tried to avoid the C++ virtual call on the page by first checking whether it is a simple offset.

Code: Select all
inline U32 readd(U32 address) {
    if ((address & 0xFFF) < 0xFFD) {
        int index = address >> 12;
        if (Memory::currentMMUReadPtr[index])
            return *(U32*)(&Memory::currentMMUReadPtr[index][address & 0xFFF]);
        return Memory::currentMMU[index]->readd(address);
    } else {
        return readb(address) | (readb(address+1) << 8) | (readb(address+2) << 16) | (readb(address+3) << 24);
    }
}


Even if that approach doesn't work for you, you need to be careful with function pointers and virtual functions when it comes to memory reads/writes.

With a 32-bit build that uses just C++ (no dynamically generated ASM at runtime), I get about 215 MIPS on my i7 6700, and the Quake 2 test is 7.0 fps (640x480 software).

If I remove the if statement on all my reads/writes, I get 185 MIPS and the Quake 2 test is 5.9 fps. So that one if statement on my reads/writes improved performance by about 15%.

With a simple JIT dynamic CPU, my performance goes up to about 400 MIPS with a Quake 2 test result of 14.0 fps. The JIT looks for code blocks that are called a lot, then generates ASM and replaces the old block with the custom one. This saves function calls (one virtual function call per instruction), and I can skip computing CPU flags when another instruction in the same block will overwrite them before they are used.
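The hot-block dispatch part of such a JIT might be sketched like this (illustrative names and threshold, and the "compiled code" is just a stub here; the actual code generation is the hard part and is omitted):

```cpp
#include <cstdint>
#include <unordered_map>

using BlockFn = void (*)();

// Per-block bookkeeping: count executions until the block is hot enough
// to be worth compiling, then cache the generated code.
struct Block {
    uint32_t execCount = 0;
    BlockFn compiled = nullptr; // null until the block gets hot
};

static std::unordered_map<uint32_t, Block> blocks;
static const uint32_t JIT_THRESHOLD = 50; // illustrative tuning knob

static void interpretBlock(uint32_t eip) { (void)eip; /* step instructions */ }
static void nativeStub() {} // stands in for generated host code
static BlockFn jitCompile(uint32_t) { return &nativeStub; }

static void runBlock(uint32_t eip) {
    Block& b = blocks[eip];
    if (b.compiled) {    // fast path: run the generated code directly
        b.compiled();
        return;
    }
    if (++b.execCount >= JIT_THRESHOLD)
        b.compiled = jitCompile(eip); // hot block: compile once
    interpretBlock(eip);
}
```

A real implementation also has to invalidate compiled blocks on self-modifying code, which is where the code-cache write watching mentioned above comes in.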
danoon
Member
 
Posts: 154
Joined: 2011-1-04 @ 19:12

Re: Improving x86 emulation speed(optimizations)?

Postby superfury » 2019-10-08 @ 22:22

cyclone3d wrote:You could remove the logging checks.

If it were me, I would have compiler directives defined for logging and not run-time checks.


Well, the logging checks aren't compiler-determined. They're affected by emulation settings that are changeable by the user at run-time (e.g. breakpoint address, different modes etc.). When the debugger is disabled, they barely (if at all) show up on the profiler.

I also split the RAM into 64K chunks for the different RAM areas (0-640K, 1M-15M, 16M-3G, 4G+ (unused)). When an access moves to a new chunk (instruction fetches, data reads and data writes are buffered separately until a new chunk is requested), the emulator precalculates some information for that 64K block: accessibility (for memory holes vs RAM), the RAM array offset of the backing storage, and some administration flags. That is stored in a small cache, essentially a 3-entry array with one entry each for instruction fetches, data reads and data writes. All following accesses use the cached values instead, barely taking any time.
It's the applyMemoryHoles function that takes the most time.

https://bitbucket.org/superfury/unipcem ... #lines-413
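The scheme described above might be sketched roughly like this (a simplified illustration, not the actual UniPCemu code; the struct, field names and the hole layout are made up, with `mapChunk` standing in for the applyMemoryHoles role):

```cpp
#include <cstdint>

static uint8_t ram[1 << 20]; // backing RAM for the sketch (1MB)

// Precalculated info for the most recently used 64K chunk, kept per
// access type so instruction fetches don't evict the data-read entry.
struct ChunkInfo {
    uint32_t chunkBase = UINT32_MAX; // 64K-aligned guest base, invalid at start
    int32_t ramOffset = -1;          // offset into backing RAM, -1 = hole
};

static ChunkInfo chunkCache[3]; // 0 = instruction, 1 = data read, 2 = data write

// Slow path: map a guest chunk to backing RAM, treating 640K-1M as a
// memory hole (VGA/MMIO) in this sketch.
static int32_t mapChunk(uint32_t base) {
    if (base >= 0xA0000 && base < 0x100000) return -1;
    return (int32_t)base; // identity mapping for the sketch
}

static bool readByte(uint32_t addr, int accessType, uint8_t* out) {
    ChunkInfo& c = chunkCache[accessType];
    uint32_t base = addr & ~0xFFFFu;
    if (c.chunkBase != base) { // chunk changed: precalculate once
        c.chunkBase = base;
        c.ramOffset = mapChunk(base);
    }
    if (c.ramOffset < 0) return false; // hole: defer to MMIO handlers
    *out = ram[c.ramOffset + (addr & 0xFFFFu)];
    return true;
}
```

On the hot path this is just one compare and one indexed load, which is presumably why the remaining cost shows up in the miss-path resolution rather than the cached accesses.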
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: Improving x86 emulation speed(optimizations)?

Postby danoon » 2019-10-09 @ 00:21

That is a lot of code for each read and a 32-bit read/write will call all that code 4 times. Is there a way to do all the checks just once for each 32-bit and 16-bit read/write?
danoon
Member
 
Posts: 154
Joined: 2011-1-04 @ 19:12

Re: Improving x86 emulation speed(optimizations)?

Postby superfury » 2019-10-09 @ 05:08

That's part of the problem: memory is byte-addressable. Unaligned words/dwords are used after all (with memory wrapping around).

And even then, the BIU emulation reduces all memory accesses to byte-sized ones, only joining them together after reading all 2 (word) or 4 (dword) of them. This actually happens during fetching: for reads it shifts each byte to its position and ORs it into the result; for writes it shifts and then writes one byte at a time.
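That shift-and-OR assembly amounts to something like this (a minimal sketch; `readByte`, `readDword` and the tiny `mem` array are stand-ins, not the actual BIU code):

```cpp
#include <cstdint>

// Tiny stand-in for emulated memory, little-endian dword 0x12345678 at 0.
static uint8_t mem[16] = {0x78, 0x56, 0x34, 0x12};

// Stand-in for the BIU's byte-sized memory access.
static uint8_t readByte(uint32_t addr) { return mem[addr & 0xF]; }

// A dword read issued as four byte reads, each shifted to its byte
// position and OR-ed into the result (little-endian, as on x86).
static uint32_t readDword(uint32_t addr) {
    uint32_t result = 0;
    for (uint32_t i = 0; i < 4; ++i)
        result |= (uint32_t)readByte(addr + i) << (8 * i);
    return result;
}
```

The upside is that each byte can independently hit RAM, a hole, or MMIO, which handles the 9FFFD/9FFFF boundary cases below for free; the downside is four full access checks per dword.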

The same goes for the random (byte-aligned) addresses of the DMA controller, although those are broken up by the memory unit itself (the memory_direct[r/w][w/d] functions).

The PIQ fetching/filling routine works in byte-sized units as well, although it's optimized for multi-byte data: it first checks the largest range (the full PIQ) against the CS descriptor (limit etc.) and paging, reducing the precalculated end position by one byte until both the start and end bytes are readable, then fetches all those bytes in a loop in one fell swoop, without further segmentation/paging checks.

There would be a problem with accesses done at addresses 9FFFF+ (word) and 9FFFD+ (dword), because the last byte(s) are in a memory-mapped I/O device (VGA) instead of RAM, or in any PCI device that happens to be mapped at such an address.
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: Improving x86 emulation speed(optimizations)?

Postby superfury » 2019-10-10 @ 19:50

The current hot path for the emulator booting and running various applications (Winter Games, Jazz Jackrabbit, Doom 2) at 3 MIPS speed is:
DoEmulator: 90.32% (remainder is updating input/output).
Within that: coreHandler: 90.30%
Within that: CPU_exec: 57.33% (updateVGA is 9.33%, updateDMA: 3.91%, updateAudio: 3.02%)
Within CPU_exec: CPU_readOP_prefix: 38.73%, CPU_OP: 14.15%.
Within CPU_readOP_prefix: CPU_readOP: 31.89%, modrm_readparams: 3.58%, CPU_readOPw: 0.73%, CPU_readOPdw: 0.31%
Within CPU_readOP(also called by the CPU_readOPw etc.): BIU_dosboxTick: 29.10%, checkMMUaccess: 2.26%, readfifobuffer: 0.50%, CODE_SEGMENT_DESCRIPTOR_D_BIT: 0.16%.
Within BIU_dosboxTick: BIU_directrb: 14.41%, writefifobuffer: 3.07%, fifobuffer_freesize: 2.02%, checkMMUaccess: 1.99%.
Within BIU_directrb: MMU_INTERNAL_directrb_realaddr: 13.85%, other lines pretty much 0%.
Within MMU_INTERNAL_directrb_realaddr: MMU_INTERNAL_directrb: 5.26%, OPTROM_readhandler: 2.03%, BIOS_readhandler: 1.51%, VGAmemIO_rb: 0.94%.
Within MMU_INTERNAL_directrb_realaddr: applyMemoryHoles: 1.17%, the is_debugging check after reading MMU.memory[realaddress] into the result: 1.11%.
Within applyMemoryHoles: Loading originaladdress at the start: 0.16%, the maskaddress comparison (line 420): 0.36%, both line 434 and 482: 0.13%.

Those are the basics the Visual Studio Community 2019 profiler tells me.

That's at commit 5a2d18d198c5ea77fd22fec05a59e2cf263df75b (of 2019-10-04 19:07).
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: Improving x86 emulation speed(optimizations)?

Postby superfury » 2019-10-10 @ 21:39

Managed to optimize it a bit by reordering the is_debugging variable to be loaded after the specialreadcycle label and after loading the result from MMU.memory[realaddress].

That seems to pretty much optimize the is_debugging stuff further away, from 1.11%+0.02% (=1.13%) down to 0.63%. The whole function now takes 3.19%, which is another 5.26-3.19=2.07% less (almost doubled in speed).
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: Improving x86 emulation speed(optimizations)?

Postby danoon » 2019-10-11 @ 16:14

I spent a lot of time optimizing; it is a very slow process. Sometimes I would go a long way down a path I thought would help, and it wouldn't. The biggest thing I learned was that an if statement that is usually false or usually true (it doesn't flip randomly) is very fast because of branch prediction. Indirect jumps, like virtual function calls and function pointers where not all instances of the object have the same target, are very slow.

And if you happen to ever target Emscripten, avoid case statements :happy:
danoon
Member
 
Posts: 154
Joined: 2011-1-04 @ 19:12

Re: Improving x86 emulation speed(optimizations)?

Postby cyclone3d » 2019-10-11 @ 17:34

One other thing that helps a lot: if you have case statements that the compiler can't automatically turn into jump tables (anything that is NOT in exact numerical order), make the jump tables manually.

You can reduce needed CPU cycles by around 20% that way.
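A manual jump table of this kind might look roughly like this (an illustrative sketch, not actual DOSBox code; the opcodes and handler names are made up). Sparse, out-of-order case values that would otherwise compile to a compare chain become a single indexed call through a dense 256-entry table:

```cpp
#include <cstdint>

typedef void (*Handler)();

static int lastOp = -1; // records which handler ran, for the sketch
static void opNop() { lastOp = 0x90; }
static void opHlt() { lastOp = 0xF4; }
static void opBad() { lastOp = -2; } // undefined opcode

static Handler opTable[256];

// Fill the table once at startup: every opcode slot gets a handler, so
// dispatch never needs a range check or a compare chain.
static void initTable() {
    for (int i = 0; i < 256; ++i) opTable[i] = opBad;
    opTable[0x90] = opNop; // sparse case values still fill a dense table
    opTable[0xF4] = opHlt;
}

static void dispatch(uint8_t opcode) {
    opTable[opcode](); // one table-indexed indirect call
}
```

Note the trade-off against danoon's point above: this is still an indirect call, so it pays a branch-misprediction cost when the opcode mix is unpredictable, but it beats a long compare chain for sparse case values.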

This is something I am planning to do on my own DOSBox fork that I want to start at some point... when I have time :lol:

I actually did that on a few of the DOSBox source files years ago, but then the code got lost. I had it up on SourceForge for quite a while but then deleted the project because all I got were nasty comments about how I should have provided a compiled version.

That 20% reduction in CPU cycles was with only a couple of the source files optimized with jump tables. I started going through all the source files a few months ago and have a partial list; I can optimize it a whole lot more than I did back then.
cyclone3d
l33t
 
Posts: 3364
Joined: 2015-4-08 @ 06:06
Location: Huntsville, AL USA

