How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Developer's Forum, for discussion of bugs, code, and other developmental aspects of DOSBox.

Post by superfury » 2019-4-08 @ 07:49

How does DOSBox manage to get so much performance out of its CPU core when it's in interpreter (non-dynamic-recompilation) mode? I can't get UniPCemu past 25% of full speed (30% at best, if lucky) when emulating a 3 MIPS CPU.

Anyone got tips on how to improve performance?

I know it uses a big lookup table for its paging lookups, but pretty much the same performance should be achievable with the way UniPCemu handles that: a simple linked list with pointers to the first and last entry for speedy lookups. It has only 32 entries, divided into 4 sets of 8 entries each, with the set selected by two bits of the address (bits 12-13).
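A minimal sketch of that 4-set, 8-entry layout in C, for illustration only. It is not UniPCemu's code: it swaps the linked list for a small array per set that is kept in most-recently-used order, and all names (`Tlb`, `tlb_lookup`, `tlb_insert`, and so on) are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define TLB_SETS 4   /* set is selected by linear address bits 12-13 */
#define TLB_WAYS 8   /* entries per set, 32 entries in total */

typedef struct {
    uint32_t tag;    /* linear page number (address >> 12) */
    uint32_t frame;  /* physical page base address */
    bool valid;
} TlbEntry;

typedef struct {
    TlbEntry e[TLB_SETS][TLB_WAYS]; /* index 0 of each set is the most recently used */
} Tlb;

static void tlb_flush(Tlb *t) { memset(t, 0, sizeof *t); }

static unsigned tlb_set(uint32_t linear) { return (linear >> 12) & (TLB_SETS - 1); }

/* Returns true and stores the physical address on a hit. */
static bool tlb_lookup(Tlb *t, uint32_t linear, uint32_t *phys)
{
    TlbEntry *set = t->e[tlb_set(linear)];
    uint32_t tag = linear >> 12;
    for (int i = 0; i < TLB_WAYS; i++) {
        if (set[i].valid && set[i].tag == tag) {
            TlbEntry hit = set[i];
            /* Move the hit to the front so hot pages are found first. */
            memmove(&set[1], &set[0], (size_t)i * sizeof set[0]);
            set[0] = hit;
            *phys = hit.frame | (linear & 0xFFFu);
            return true;
        }
    }
    return false;
}

/* Inserts at the front; the least recently used entry falls off the end. */
static void tlb_insert(Tlb *t, uint32_t linear, uint32_t frame)
{
    TlbEntry *set = t->e[tlb_set(linear)];
    memmove(&set[1], &set[0], (TLB_WAYS - 1) * sizeof set[0]);
    set[0] = (TlbEntry){ .tag = linear >> 12, .frame = frame & ~0xFFFu, .valid = true };
}
```

A miss in `tlb_lookup` would be followed by a page table walk and a `tlb_insert`; the worst case per lookup is 8 comparisons within one set.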
superfury
l33t
 
Posts: 3228
Joined: 2014-3-08 @ 11:25
Location: Netherlands

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by Qbix » 2019-4-08 @ 07:53

Are you already using a lazy flags system (i.e. only calculating the flags when they are needed)?
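As a rough illustration of what "lazy flags" means, here is a minimal sketch in C. It assumes only 8-bit ADD and SUB and leaves out the materialized flags register that a real core also needs; every name in it (`LazyFlags`, `do_add8`, `get_cf`, and so on) is made up for this example and is not taken from DOSBox, Bochs or UniPCemu.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { OP_ADD8, OP_SUB8 } LazyOp;

/* The instruction handler only records what happened. */
typedef struct {
    LazyOp op;
    uint8_t a, b, res;
} LazyFlags;

static uint8_t do_add8(LazyFlags *lf, uint8_t a, uint8_t b)
{
    lf->op = OP_ADD8; lf->a = a; lf->b = b; lf->res = (uint8_t)(a + b);
    return lf->res;
}

static uint8_t do_sub8(LazyFlags *lf, uint8_t a, uint8_t b)
{
    lf->op = OP_SUB8; lf->a = a; lf->b = b; lf->res = (uint8_t)(a - b);
    return lf->res;
}

/* Flags are derived from the recorded operands only when something asks. */
static bool get_zf(const LazyFlags *lf) { return lf->res == 0; }
static bool get_sf(const LazyFlags *lf) { return (lf->res & 0x80) != 0; }

static bool get_cf(const LazyFlags *lf)
{
    if (lf->op == OP_ADD8) return lf->res < lf->a;  /* carry out of bit 7 */
    return lf->a < lf->b;                           /* borrow on SUB */
}

static bool get_of(const LazyFlags *lf)
{
    if (lf->op == OP_ADD8)  /* operands share a sign that the result lacks */
        return ((lf->a ^ lf->res) & (lf->b ^ lf->res) & 0x80) != 0;
    /* SUB: operands differ in sign and the result's sign differs from a */
    return ((lf->a ^ lf->b) & (lf->a ^ lf->res) & 0x80) != 0;
}
```

The arithmetic handlers then cost a few stores, and code that needs a flag (a conditional jump, PUSHF) pays for the calculation only at that point.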
Qbix
DOSBox Author
 
Posts: 10916
Joined: 2002-11-27 @ 14:50
Location: Fryslan

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by superfury » 2019-4-08 @ 11:28

Not currently. The main issue is that the debugger also reads the flags after each instruction. The flag calculations don't show up in the profiling, though (for a large part they use the same methods as Bochs does, except not lazily).

UniPCemu arithmetic flags (not all of them, only the common ones that aren't instruction-specific): https://bitbucket.org/superfury/unipcem ... pu/flags.c

The profiling does tell me that most of the time (about 8% of CPU execution time, with the CPU using about 30%) is spent on memory accesses. Of that, about 0.8% is spent in the RAM access itself; the remainder is the loop that asks any connected hardware ((S)VGA and BIOS in this case) whether it responds to the memory access instead of RAM. That ~0.8% is spent purely on determining memory holes. This is optimized somewhat by using a simple memory structure containing some flags about memory holes and the physical memory Big Page address (essentially, the structure points to a 64K chunk of RAM). That Big Page address is precalculated somewhat and looked up the first time a Big Page chunk is addressed (or when a different chunk is addressed).

All memory accessed (as chunks) is remapped to a linear memory array without any memory holes: 0-640K is direct mapped, 1MB-15MB is mapped directly after that (at 640K+), and 16MB-4GB is mapped at (640K+14MB)+. This is based on a chunk number (like a paging frame number) which is translated to a physical chunk number (like a physical frame). So essentially it works like paging, but for 64K ranges instead (in software, using unchanging static addresses instead of dynamically mapped addresses).

The MMU handling: https://bitbucket.org/superfury/unipcem ... uhandler.c

All byte accesses from the CPU go through MMU_INTERNAL_directrb_realaddr and MMU_INTERNAL_directwb_realaddr. Those functions first ask the hardware to respond (MMU_IO_readhandler and MMU_IO_writehandler); if no hardware responds, they forward the call to MMU_INTERNAL_directrb and MMU_INTERNAL_directwb for the RAM access. The latter also contain the RAM module's 80C00000 register handling for the Compaq Deskpro 386 memory hardware, which is a special case since it's for the MMU only (not used by any hardware).
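The "ask the hardware first, then fall back to RAM" flow can be sketched in C roughly as follows. This is an illustration under assumptions, not the code from mmuhandler.c: the handler signature, `register_read_handler`, `rom_read`, `read_byte`, the tiny RAM array and the 0xFF value for unbacked addresses are all hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* A device returns true and fills *value when it claims the address. */
typedef bool (*ReadHandler)(uint32_t addr, uint8_t *value);

#define MAX_HANDLERS 8
static ReadHandler handlers[MAX_HANDLERS];
static size_t handler_count;
static uint8_t ram[0x1000];  /* tiny stand-in for the emulated RAM array */

static bool register_read_handler(ReadHandler h)
{
    if (handler_count >= MAX_HANDLERS) return false;
    handlers[handler_count++] = h;
    return true;
}

/* Example device: a 16-byte ROM at 0x800-0x80F that always reads 0xEA. */
static bool rom_read(uint32_t addr, uint8_t *value)
{
    if (addr < 0x800 || addr > 0x80F) return false;
    *value = 0xEA;
    return true;
}

static uint8_t read_byte(uint32_t addr)
{
    uint8_t value;
    for (size_t i = 0; i < handler_count; i++)
        if (handlers[i](addr, &value)) return value;  /* a device claimed it */
    return (addr < sizeof ram) ? ram[addr] : 0xFF;    /* RAM, else assumed 0xFF */
}
```

In this shape every byte access pays for the handler loop before it reaches RAM, which is consistent with where the profile above puts the time.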

The memory reads in particular impact the emulation a lot, because many instruction bytes are read to fill the CPU's PIQ (prefetch input queue), usually linearly.
Last edited by superfury on 2019-4-08 @ 18:48, edited 1 time in total.

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by BloodyCactus » 2019-4-08 @ 15:29

DOSBox cheats: it does not check things. E.g. in protected mode, writing through a CS override to a segment that is execute-only works in DOSBox; on real hardware it obviously causes an exception.
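For reference, the kind of per-write check being described could look like the sketch below. It is neither DOSBox nor UniPCemu code; it only decodes the standard x86 descriptor access-rights byte, and the function and macro names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bits of the access-rights byte of an x86 segment descriptor. */
#define AR_PRESENT 0x80u
#define AR_S       0x10u  /* 1 = code or data segment, 0 = system segment */
#define AR_EXEC    0x08u  /* 1 = code segment */
#define AR_RW      0x02u  /* data: writable, code: readable */

/* True when a protected-mode write through this segment is allowed. */
static bool segment_writable(uint8_t ar)
{
    if (!(ar & AR_PRESENT) || !(ar & AR_S)) return false;
    if (ar & AR_EXEC) return false;  /* code segments are never writable */
    return (ar & AR_RW) != 0;        /* data segment with the W bit set */
}
```

An emulator that enforces this raises an exception when the check fails; one that skips it saves a branch on every write.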
--/\-[ Stu : Bloody Cactus :: http://kråketær.com :: http://mega-tokyo.com ]-/\--
BloodyCactus
Oldbie
 
Posts: 900
Joined: 2016-2-03 @ 13:34
Location: Lexington VA

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by superfury » 2019-4-08 @ 18:47

Won't the lack of such a basic protection mechanism in DOSBox (catching writes to read-only segments) break software?

Also, UniPCemu 'cheats' just a little bit when prefetching to fill the PIQ as far as possible. It checks the first byte to fetch, then the last one (to allow page faults to happen), working back in virtual memory (taking the segment limit or 4GB as the maximum offset) until no page fault is encountered, so it honours the page barrier (4KB chunks) when the next page isn't available. It does so by first checking whether the next page (if any) is valid to fetch from, for the remainder of the PIQ bytes to fetch. If that page isn't paged in (or page faults when it's supposed to), it rounds the fetch down to the end of the current page. The EU (which still handles those faults) then page faults when it tries to access past that barrier, after which the remainder of the PIQ (usually the full PIQ or something close to it), which couldn't be fetched earlier because the page was unpaged, gets filled.

So an instruction that straddles a page barrier (the edge of a 4KB chunk of linear address space) will be prefetched up to offset FFFh (or until the PIQ is full). Then, when the EU tries to check and fetch x000h (and onwards, depending on how full the PIQ is), the EU handles loading the page into the TLB (and the page fault, if any) and calls the prefetch routine explained above again, which fetches bytes x000h+ into the PIQ (since the page is paged in now).
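The page-barrier rounding described above can be sketched in C as below. It is a simplified illustration, not UniPCemu's routine: it assumes the current page is already known to be fetchable, that the PIQ wants at most one page of bytes, and it ignores the segment limit and 4GB wrap. `PageOkFn`, `prefetch_len` and the two stand-in predicates are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 0x1000u

/* Hypothetical predicate: can this linear page be fetched without a fault? */
typedef bool (*PageOkFn)(uint32_t linear_page);

/* Stand-ins for a real TLB/page table check, used for demonstration. */
static bool page_always_ok(uint32_t page) { (void)page; return true; }
static bool page_zero_only(uint32_t page) { return page == 0; }

/* Number of bytes that may be prefetched starting at `linear`, given
   `wanted` free bytes in the PIQ. */
static uint32_t prefetch_len(uint32_t linear, uint32_t wanted, PageOkFn page_ok)
{
    uint32_t to_page_end = PAGE_SIZE - (linear & (PAGE_SIZE - 1));
    if (wanted <= to_page_end) return wanted;        /* stays within this page */
    if (page_ok((linear >> 12) + 1)) return wanted;  /* next page is fetchable */
    return to_page_end;                              /* stop at the page barrier */
}
```

For example, with 16 free PIQ bytes at linear address 0FFCh and an unpaged next page, only the 4 bytes up to FFFh are fetched; the fetch at 1000h is left for the EU to fault on.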

The memory access scheme (the MMU handlers) is optimized somewhat for linear accesses within 64KB memory areas (see the source code of mmuhandler.c), within the memory hole and mapping function. It uses a small cache data structure to store the most recent read and write memory translations. Each entry contains the frame (64K) address to subtract for the RAM area if it isn't a memory hole (this maps memory to the actual RAM array without memory holes in it, so no memory is wasted on the holes; the memory behind each hole follows directly after it instead), the memory hole status, and which memory hole it is (for the special Compaq BIOS ROM shadow area at (FF)E0000-(FF)FFFFF).
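A one-entry "most recent translation" cache of this kind might look like the following C sketch. It is only an illustration: the structure fields, `slow_translate`, `translate` and the single 640K-1M hole used by the slow path are assumptions, not the layout in mmuhandler.c.

```c
#include <stdint.h>
#include <stdbool.h>

#define CHUNK_SHIFT 16  /* 64KB chunks */

typedef struct {
    uint32_t chunk;     /* guest address >> 16 that this entry describes */
    uint32_t subtract;  /* amount to subtract to get the flat RAM offset */
    bool is_hole;       /* true when the chunk is not backed by RAM */
    bool valid;
} ChunkCache;

static unsigned slow_lookups;  /* counts slow-path calls, for demonstration */

/* Slow path with a hypothetical layout: one hole at 640K-1M. */
static void slow_translate(uint32_t addr, ChunkCache *c)
{
    slow_lookups++;
    c->chunk = addr >> CHUNK_SHIFT;
    c->is_hole = (addr >= 0xA0000u && addr < 0x100000u);
    c->subtract = (addr >= 0x100000u) ? 0x60000u : 0;
    c->valid = true;
}

/* Returns false for a hole; otherwise stores the flat RAM offset. */
static bool translate(ChunkCache *c, uint32_t addr, uint32_t *offset)
{
    if (!c->valid || c->chunk != (addr >> CHUNK_SHIFT))
        slow_translate(addr, c);  /* only when a different chunk is touched */
    if (c->is_hole) return false;
    *offset = addr - c->subtract;
    return true;
}
```

Keeping one `ChunkCache` for reads and another for writes means a linear instruction fetch and a data write in a different chunk do not evict each other.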

So 0-640K is direct mapped, 640K-1M of the RAM array is at 1M+, 1M to almost 16M comes after that, and the remainder of RAM is at 16M-3G and after 4GB (for any more RAM). So there are 3 memory holes (the 1M hole, the 16M hole and the 4G hole), each moving the memory behind it further back. 1M+ subtracts (1M-640K), 16M+ subtracts ((1M-640K)+(16M-15M)) and 4G+ subtracts ((1M-640K)+(16M-15M)+(4G-3G)) respectively, to map the block to the actual allocated emulated RAM array.
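Written out as C, the subtraction scheme looks roughly like this. The sketch reads the three holes as 640K-1M, 15M-16M and 3G-4G, which is an interpretation of the figures above rather than something checked against the source, and `guest_to_flat` is a hypothetical name.

```c
#include <stdint.h>
#include <stdbool.h>

#define KB 1024ull
#define MB (1024ull * KB)
#define GB (1024ull * MB)

/* Sizes of the three holes (assumed ranges, see the text above). */
#define HOLE1 (1 * MB - 640 * KB)  /* 640K-1M */
#define HOLE2 (16 * MB - 15 * MB)  /* 15M-16M */
#define HOLE3 (4 * GB - 3 * GB)    /* 3G-4G */

/* Maps a guest physical address to an offset in a hole-free RAM array.
   Returns false when the address falls inside one of the holes. */
static bool guest_to_flat(uint64_t addr, uint64_t *flat)
{
    if (addr < 640 * KB) { *flat = addr; return true; }
    if (addr < 1 * MB) return false;
    if (addr < 15 * MB) { *flat = addr - HOLE1; return true; }
    if (addr < 16 * MB) return false;
    if (addr < 3 * GB) { *flat = addr - (HOLE1 + HOLE2); return true; }
    if (addr < 4 * GB) return false;
    *flat = addr - (HOLE1 + HOLE2 + HOLE3);
    return true;
}
```

With these values, guest address 1M lands at array offset 640K and guest address 16M lands at 640K+14M, so the array has no gaps.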
Last edited by superfury on 2019-4-08 @ 19:10, edited 1 time in total.

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by BloodyCactus » 2019-4-08 @ 19:06

Well-behaved software shouldn't do it. DOSBox is built to run games, not to be a cycle-level simulation.

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by superfury » 2019-4-08 @ 19:17

Well, UniPCemu isn't DOSBox (it doesn't try to cut corners for performance by breaking compatibility).

Still, it's kind of strange that DOSBox takes ~0% CPU for 3 MIPS at 100% speed, while UniPCemu barely keeps up (25%-30% of the 3 MIPS target speed, taking ~17% of an Intel i7-4790K @ 4.0GHz).

Most of the time seems to go to those memory accesses themselves (the hot path: reading memory and I/O devices), with memory taking ~0.7% of CPU cycles to read (compared to ~50% for the CPU in total, 15% for the VGA and ~97% for the emulator in total), according to the Visual Studio Community 2018 profiler.

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by hail-to-the-ryzen » 2019-4-09 @ 04:02

First compare against PCem instead, since its design is likely more similar to UniPCemu's. Also, test first with a benchmark that taxes the major emulated components, and then test the individual parts.
hail-to-the-ryzen
Member
 
Posts: 331
Joined: 2017-3-09 @ 01:34

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by Gene Wirchenko » 2019-4-09 @ 18:59

So the upshot is cheating^Wclever optimisation. Well done.
Gene Wirchenko
Member
 
Posts: 148
Joined: 2005-7-14 @ 23:35
Location: Kamloops, BC, Canada

Re: How does DOSBox manage to get so much performance in its interpreting CPU emulation?

Post by superfury » 2019-4-09 @ 20:52

Simply put, does anyone have good advice on optimizing x86 CPU emulation? No matter what, the current bottleneck still seems to be the innermost CPU handling, in particular the applymemoryholes function combined with the invalid-memory 'if' statement at its call site. All the other parts barely count (if at all), according to the profiler.