VOGONS


Optimization Flags for Building DOSBox

First post, by priestlyboy

Rank Oldbie

What optimization flags do you use to build DOSBox for Windows?
I'm curious since I don't do anything except
"./configure --enable-core-inline" which sets the CFLAGS and CXXFLAGS to -g -O2.

I'm wondering what the best optimization would be to support everyone.

I'm looking at the GCC website curiously reading everything, 🤣.

Also, do any of you guys use MinGW? If you do, are you using the 3.4.0 build from MinGW? Is it any better than 3.3.3?

Thanks guys, Just trying to give the best support there is. 😀

Ieremiou
----------
Helping Debug DOSBox.

Reply 2 of 31, by priestlyboy

Rank Oldbie

Oh OK, sorry. Um, but anyway, my question: what are the best optimization flags and options to compile DOSBox with?

Specifically, what do you execute to configure/compile DOSBox?

Ieremiou
----------
Helping Debug DOSBox.

Reply 3 of 31, by Darkfalz

Rank Member

I'm not sure but I do remember the settings Harekiet gave me did make a much faster version than the default options. So they are quite important. Interestingly I found that -march=pentium4 slowed things down (in MAME they are a big help).

Reply 7 of 31, by DoomWarrior

Rank Newbie

That's why everybody should compile DOSBox for their own processor 😉

Use -mcpu=i686 or -mcpu=athlon 😉
It generates code optimized for i686 BUT that still runs on i386, i486 and i586 systems!

But -mcpu=i686 vs. i386 (and also -march) only speeds up emulation by ~2-5%, so it's really not necessary. There are other interesting speed-up options for x86...

For example -mmmx, -msse and so on. But keep in mind not everybody has an SSE system!
Also -mno-push-args, -m128bit-long-double, -ffast-math, -fomit-frame-pointer and -funroll-loops.

If you try DOSBox on non-x86, -mcpu is much more important (for example, on a SPARC the difference between V7 and V8 is about 20%).
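
To illustrate the point about not everybody having SSE: a launcher could check the host CPU at run time and pick between a generic build and an SSE build. This is only a sketch. The build names are hypothetical, and `__builtin_cpu_supports` is a GCC/Clang builtin from much newer compilers than the 3.3/3.4 versions discussed in this thread.

```cpp
#include <string>

// Returns true if the host CPU reports SSE support.
// Assumption: GCC 4.8+ or Clang on x86; on anything else we conservatively say "no".
bool host_has_sse() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();                        // safe to call more than once
    return __builtin_cpu_supports("sse") != 0;
#else
    return false;
#endif
}

// Picks which binary to start. Both names are made up for this example.
std::string choose_build(bool has_sse) {
    return has_sse ? "dosbox-sse" : "dosbox-generic";
}
```

Calling `choose_build(host_has_sse())` would then select the SSE build only on machines that can actually run it.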

Reply 8 of 31, by `Moe`

Rank Oldbie

If we are talking about gcc, it already does a fairly good job at enabling best optimizations. One very important optimization is: use the latest gcc you can lay your hands on. Current version is 3.4.1, which is a big improvement over 3.3, which is a big improvement over ...

Flags still needed/useful:

-O3 (max optimizations)
-funroll-all-loops (unroll loops with static and variable iteration counts)
--param max-unrolled-insns=60 (keep unrolling cache-friendly)
-march=<your cpu> (see above: verify it really helps)
-fomit-frame-pointer (frees up one register for optimizations)
-ffast-math (probably doesn't make a difference, there's not much fp-math anyways)

You specifically don't need -mmmx/-msse, as -march already enables these as appropriate.

Unfortunately, the optimizer in gcc 3.4 has changed, so loop unrolling may have better settings now. These flags were verified on gcc 3.3 to be the best I could get.
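
For anyone unsure what the unrolling flags actually do, here is a hand-written illustration of the transformation. It is a sketch only: the compiler unrolls on its own internal representation, and the factor of 4 and the function names are arbitrary choices for this example.

```cpp
#include <cstddef>

// Plain loop: one add plus one compare-and-branch per element.
long sum_plain(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Same loop unrolled by 4: fewer branches, but more code.
// The extra code size is why a limit such as max-unrolled-insns matters on small caches.
long sum_unrolled4(const int* data, std::size_t n) {
    long total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    for (; i < n; ++i) {          // leftover elements when n is not a multiple of 4
        total += data[i];
    }
    return total;
}
```

Both functions return the same result; only the loop overhead and the code size differ.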

You may also try profile guided optimization. It basically works like this: add -fprofile-arcs to the cflags, compile, keep the source, run some stuff. Recompile with -fbranch-probabilities instead of -fprofile-arcs. Enjoy another 3-7% (not tested yet).

Oh, and stop these jokes about people using P2 or lower to use dosbox. I do. Happily. Playable. With Hq2x. 😎

Reply 9 of 31, by priestlyboy

Rank Oldbie

Note: if you compile with gcc 3.4.1, it no longer recognizes the PACK call in MinGW. I've tested it.
For me at least, I'm sticking with gcc 3.3.3 for another reason: the size of the executable keeps growing dramatically with each change in GCC.

But yes, those optimizations I've known of. Haven't tested any of them because I had no idea what they were for.

Yeah, -O3 may be the max optimization level, but it may also be destabilizing, and it takes much longer to compile with.

Ieremiou
----------
Helping Debug DOSBox.

Reply 10 of 31, by `Moe`

Rank Oldbie

With 3.3, -O3 is no longer destabilizing. I've been compiling my whole system with -O3 without any problems. I've been doing that since gcc-2.8, so I know what you mean. These days it works, including loop unrolling, which has also been a problematic setting in older gccs.

Compile time, well... you don't really mind a minute longer at compile time when it means a couple of cycles more when running, do you? 😀

I hope your first impression of gcc-3.4 is tunable. The optimizer enhancements sound promising, but my P2 suffers from big code size due to small caches.

Reply 11 of 31, by priestlyboy

Rank Oldbie

I agree. What would you say would be the best optimizations that wouldn't break stability for compiling DOSBox and that would still be usable for comps across the board?

Just trying to read ALL the freaking options on this page, "Options That Control Optimization" (for GCC 3.4.1), makes my brain sweat.
For all you Intel and AMD builders: "Intel 386 and AMD x86-64 Options".

Ieremiou
----------
Helping Debug DOSBox.

Reply 12 of 31, by `Moe`

Rank Oldbie

Well, I have checked out gcc-3.4.1 now, and it seems the profile feedback optimizations are really important. Some switches have been renamed, and the docs say -funroll-loops is enabled when using profile data. gcc avoided any unrolling optimizations before, you had to enable them separately. I interpret this as: Loop unrolling is useful when using profile feedback, but not so much otherwise.

I have not benched anything yet, but my subjective impression is that the following CXXFLAGS improve things, but barely noticeably:

-O3 -fmerge-all-constants -funroll-loops -ffast-math -funswitch-loops

-march=... is important, but difficult to use for compilations across the board. You should definitely use -mtune=<most-common-cpu>, however.

If you have the patience, add -fprofile-generate to the switches, compile, run dosbox with a few different games, using different video modes, cpu instructions, scalers and sound cards. (I'd love to hear which ones you take. I tried Settlers for protected mode+svga+gus and MOM for hq2x+sblaster. Didn't have other ideas yet, no useful real mode program yet). After running the sample programs, reconfigure, replace -fprofile-generate with -fprofile-use, make clean, make.
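
The two-pass build described above can be tried on a toy file before spending time on a full DOSBox build. This is a sketch under assumptions: the file and binary names are made up, and a small main() that calls the function on representative data is assumed but not shown.

```cpp
// Two-pass profile-guided build with GCC 3.4 or later (names are hypothetical):
//   g++ -O3 -fprofile-generate pgo_demo.cpp -o pgo_demo   // 1. instrumented build
//   ./pgo_demo                                             // 2. training run, writes profile data
//   g++ -O3 -fprofile-use pgo_demo.cpp -o pgo_demo        // 3. rebuild using the profile
#include <cstddef>

// Counts the values below a limit. If nearly all training values are small,
// the profile tells the compiler which side of the branch is the hot one.
std::size_t count_below(const int* data, std::size_t n, int limit) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < limit) {
            ++hits;
        }
    }
    return hits;
}
```

The same idea applies to DOSBox: the training run only helps for code paths that the sample games actually exercise, which is why varied games, video modes and sound cards matter.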

Unfortunately, a real dosbox benchmark is still missing. I can bench gcc using a synthetic benchmark, but I'd love to see something like "unlimited cpu cycles" with a display of how many cycles are actually executed, same for video frames drawn. CPU usage stats in top or task manager are too unreliable to detect improvements of a few percent. Yet, a few percent here, a few percent there, it all sums up.

Reply 14 of 31, by gulikoza

Rank Oldbie

The coreswitch patch on my page will show the number of emulated cycles in the titlebar. It also adds a timesync mode where DOSBox will only emulate the number of cycles it can, thus making benchmarks produce valid results (and allowing cycles to be set unreasonably high without overloading the host machine).

Qbix: What would be the procedure to add this patch to cvs? submit it to sourceforge? I have got quite accustomed to on-the-fly coreswitching, timesync is useful as well (for example the menus in Tie fighter will not work with the cycles as high as during the flight scenes...one keypress and no more stuttering), and since Ieremiou is building his version with it as well I'd say it's pretty stable since there were no complaints 😁

Reply 16 of 31, by `Moe`

Rank Oldbie

Well, I've tried the coreswitch patch against a quickly-hacked own benchmark indicator, and it seems the coreswitch patch costs quite a bit of performance, even when turned off.
What my patch does is compare PIC_Ticks to GetTicks(); it is called less often, yet seems to be a good indicator of overcycling. Correct me if I'm wrong, however. I've removed the coreswitch patch for now.

Reply 17 of 31, by `Moe`

Rank Oldbie

Just a quick update: I have used my own benchmark indicator to check CXXFLAGS. I've basically turned up cycles until the dosbox-internal timer ran at just above 99% of real time, which probably means the limit for real-time execution, with some slack. I've used Frontier: First Encounters as a test, as it has a nice automatic (and CPU-intensive) intro animation with sound. I've been alternating between SB and GUS, with Frameskip 1 and no scaler.
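
A minimal sketch of such an indicator, assuming it boils down to comparing emulated time against real time over a measurement window. The function names and the millisecond units are assumptions for illustration and are not taken from the actual patch.

```cpp
#include <cstdint>

// Ratio of emulated time to real time over one measurement window.
// 1.0 means the emulator keeps up; lower means it is falling behind.
double timer_ratio(std::uint64_t emulated_ms, std::uint64_t real_ms) {
    if (real_ms == 0) {
        return 1.0;               // nothing measured yet, assume we keep up
    }
    return static_cast<double>(emulated_ms) / static_cast<double>(real_ms);
}

// True when the internal timer drops below the chosen share of real time.
// The 0.99 default mirrors the "just above 99%" criterion from the post.
bool is_overcycling(std::uint64_t emulated_ms, std::uint64_t real_ms,
                    double threshold = 0.99) {
    return timer_ratio(emulated_ms, real_ms) < threshold;
}
```

Raising the cycles setting until `is_overcycling` starts returning true gives a rough upper limit for a given build.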

On my P2-333, I can get 3500 (probably a bit less) cycles when using gcc-3.3.3 with the above mentioned flags. I get 4000 cycles with 3.4.1 and the wild guess posted above. But the shocking thing is: After running the intro with -fprofile-generate, then recompiling with -fprofile-use, I get 5000 cycles, that's a 25% increase!

Granted, my measurement code is far from exact, but even if profile optimization was only 10%, it would still be a huge gain. So use it!
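
As a quick check of the arithmetic on the cycle counts quoted above (3500, 4000 and 5000, all rough measurements), the relative gains can be computed like this. The helper name is made up for the example.

```cpp
// Relative gain in percent when going from `before` to `after`.
// Assumes `before` is non-zero.
double percent_gain(double before, double after) {
    return (after - before) / before * 100.0;
}
```

With these numbers, `percent_gain(4000, 5000)` gives 25, `percent_gain(3500, 4000)` gives roughly 14.3, and `percent_gain(3500, 5000)` gives roughly 42.9 for both steps combined.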

Reply 19 of 31, by Qbix

Rank DOSBox Author

In the timesync patch, code like this is used:

Bits lastticks = getticks();

if (timesync && lastticks != blah)
    cycles = 0;

Faster and better is:

if (timesync && getticks() != blah)
    cycles = 0;

as C++ only executes getticks() if timesync is true, and not always as in the first case (getticks is a relatively heavy call).
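
The short-circuit behaviour of && can be checked with a small stand-alone sketch. The stub below only counts how often the "heavy" call runs; none of the names are from the DOSBox source.

```cpp
#include <cstdint>

int g_tick_calls = 0;                 // how many times the stub was called

// Hypothetical stand-in for a relatively expensive tick query.
std::uint32_t get_ticks_stub() {
    ++g_tick_calls;
    return 42;                        // fixed value, enough for the demo
}

// Variant 1: the tick call always runs, even when timesync is off.
bool should_stop_eager(bool timesync, std::uint32_t expected) {
    std::uint32_t last_ticks = get_ticks_stub();
    return timesync && last_ticks != expected;
}

// Variant 2: && short-circuits, so the call is skipped when timesync is false.
bool should_stop_lazy(bool timesync, std::uint32_t expected) {
    return timesync && get_ticks_stub() != expected;
}
```

With timesync off, the eager variant still bumps `g_tick_calls` on every call, while the lazy variant leaves it untouched.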

Water flows down the stream
How to ask questions the smart way!