VOGONS

First post, by frobme

Rank Member

EDIT: Some sort of error on my part appears to have generated unusually bad GCC run time results. Since I use GCC myself all the time, I'm at a complete loss as to why, but I'm editing this post to reflect the second run results which are competitive with the other compilers.

Hola. Recently for my own education and to test some optimizations I was working on I set out to profile the performance of various posted builds and different compilers/environments I could verify myself.

This is by no means a comprehensive set of benchmarks. The performance characteristics of DOS games vary widely and many will exercise functionality not covered by my test, which could result in edge cases that behave contrary to these results. You have been warned, mileage may vary, blah blah.

I chose to use Doom and the famous -timedemo as my reference test. As I find more highly repeatable benchmark situations that run in DosBox, I will try to add them here (most of the "system benchmark" type of apps don't get along with DosBox, due to esoteric hardware calls). So for all of these tests, I ran "doom.exe -timedemo demo3".

My system is a Core 2 Extreme X6800 at 2.93 GHz with 2GB of RAM and a GeForce 8800 GTX 512MB. It is not overclocked, and for the purposes of these tests I ran WinXP 32-bit.

I compared the following CVS versions:

The AEP Emulation daily CVS build from 01/07/2007, referred to here as "AEP"
Gulikoza's CVS build from 11/17/2006, referred to here as "Gulikoza"
YKHwong's CVS build from 01/05/2007, referred to here as "YKHwong"

My own from scratch GCC-3.4.5 build, built with "-march=pentium4 -O3 -fomit-frame-pointer" and stripped, referred to here as "gcc-3"

My own from scratch GCC-4.1.1 build, built with "-march=pentium4 -O3 -fomit-frame-pointer" and stripped, referred to here as "gcc-4"

Both GCC builds were built under MSYS/MinGW. If you're wondering, yes that means I built GCC-4.1.1 myself, and yes most of the posted instructions on how to do that are wrong =). I also built 4.3, but it was unstable and I haven't tested it yet due to that.

An Intel ICC compiler build, version 9.1.33, built with VisStudio as the IDE but with ICC doing all the compiling and linking. It was optimized specifically for Core Duo. Referred to as "ICC". I did not do a profile-guided optimization pass for this compiler.

My own Visual Studio 2005 builds, in three flavors:
- Built with the free edition of Visual Studio that anybody can download (referred to as "VisStudio-free")
- Built with commercial VS2005 SP1 using standard release optimizations, including whole program optimization (referred to as "VisStudio-regular")
- Built with commercial VS2005 SP1 using release optimizations as before, but with one pass of profile-guided optimization, a two-pass system that requires you to run the program first (referred to as "VisStudio-PGO")

All of the tests were done with the same version of SDL.dll, a 1.2.11 version that was built from scratch using VisStudio 2005.

As stated, I used original DOS Doom (1.9s) and -timedemo demo3. This is self running and exits with a "realtics" value, showing how many "real time" ticks occurred for the fixed number of "game ticks" that occur in the demo. Of course within DosBox they aren't real real time ticks, but it still works as a reference. If you'd like to see a description of Doom as a benchmark and see a huge chart of real world machine results, go here. You can run the demo yourself and see where your DosBox virtual machine compares to other real hardware.

For each build of DosBox, I would run the timedemo 3 times, and average the results. DosBox settings were always the same, windowed mode, "surface" display, 0 frameskip, normal 2x scaler, dynamic core, max cycles (so it runs as fast as it can), 16MB virtual machine, SB16 audio. The only modification to the Doom setup was to turn off audio - not because it was a problem, but running 30 something tests with it on was going to drive me crazy, sorry.
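For anyone who wants to check the figures below: the fps values appear to follow from Doom's fixed 35 Hz tick rate, i.e. fps = 35 x gametics / realtics. Here is a minimal Python sketch of that conversion. It assumes demo3 in Doom 1.9 runs for 2134 gametics, a number that is not stated in this post but that reproduces the reported fps values to within rounding. The function names are made up for illustration.

```python
# Sketch: convert Doom -timedemo "realtics" into frames per second.
TICRATE = 35            # Doom game ticks per second (fixed by the engine)
DEMO3_GAMETICS = 2134   # ASSUMPTION: length of demo3 in Doom 1.9, not stated in the post


def realtics_to_fps(realtics, gametics=DEMO3_GAMETICS):
    """A timedemo renders one frame per game tick, so fps = 35 * gametics / realtics."""
    return TICRATE * gametics / realtics


def average_realtics(runs):
    """Mean of several timedemo runs, rounded to the nearest whole tick."""
    return round(sum(runs) / len(runs))


runs = [1038, 1062, 1019]                    # the three ICC runs from the results
avg = average_realtics(runs)
print(avg, round(realtics_to_fps(avg), 1))   # prints: 1040 71.8
```

Lower realtics means a faster run, which is why the fastest builds have the smallest averages.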

I used the incredibly sweet DosBox Game Launcher from Ronald Blankendaal to make all the launching less burdensome. Thanks for an awesome front end Ron.

Results

AEP:               1104,1060,1048  avg 1070  fps 69.8
ICC:               1038,1062,1019  avg 1040  fps 71.8
VisStudio-free:    1003,996,961    avg 987   fps 75.7
VisStudio-regular: 973,976,989     avg 979   fps 76.3
YKHwong:           960,949,1015    avg 975   fps 76.6
gcc-3:             916,1012,1013   avg 980   fps 76.2
gcc-4:             925,928,922     avg 925   fps 80.8
Gulikoza:          889,886,936     avg 904   fps 82.6
VisStudio-PGO:     850,890,900     avg 880   fps 84.9

As a comparison case, I ran the same test against my retro machine. This is a P3 1 GHz with 768MB RAM and a GeForce4 Ti 4600, running DOS 6.2 with nothing resident except QEMM for memory management:

P3/1GHz GeForce4 Ti 4600:  627,626,626  avg 626  fps 119.3

Conclusions

Due to some error on my part, the initial gcc builds looked poor, but this has since been solved (I still don't know what was happening; I used the same environment and settings for the second builds). Because of this I'll further investigate a gcc-4.3 build and a profile-guided GCC build and see what those results are.

The results (now) are all closely packed together. The PGO VS2005 build had the highest FPS, but keep in mind that the PGO training run was Doom itself, so the optimizer targeted the functions Doom exercises most and did what it could with them (PGO reported 50 functions in DosBox as targets, less than 1.5% of the total functions).
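One way to read "closely packed" is to compare the run-to-run spread of each build with the gaps between builds. The Python sketch below does that with the realtics from the Results list. It is only an illustration with three runs per build, not a proper statistical test, and the helper names are invented here.

```python
# Sketch: compare run-to-run spread with the gaps between builds,
# using the realtics from the Results list (lower is faster).
RESULTS = {
    "AEP":               [1104, 1060, 1048],
    "ICC":               [1038, 1062, 1019],
    "VisStudio-free":    [1003, 996, 961],
    "VisStudio-regular": [973, 976, 989],
    "YKHwong":           [960, 949, 1015],
    "gcc-3":             [916, 1012, 1013],
    "gcc-4":             [925, 928, 922],
    "Gulikoza":          [889, 886, 936],
    "VisStudio-PGO":     [850, 890, 900],
}


def spread(runs):
    """Difference between the slowest and fastest run of one build."""
    return max(runs) - min(runs)


def ranges_overlap(a, b):
    """True if the [min, max] realtics intervals of two builds intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


for name, runs in RESULTS.items():
    print(f"{name:18s} min {min(runs):4d}  max {max(runs):4d}  spread {spread(runs):3d}")

# Builds whose interval overlaps the fastest build's interval cannot be
# cleanly separated from it with only three runs each.
fastest = RESULTS["VisStudio-PGO"]
tied = [name for name, runs in RESULTS.items() if ranges_overlap(runs, fastest)]
print("overlap with VisStudio-PGO:", tied)
```

The gcc-3 runs alone vary by 97 realtics, which is about half of the 190-realtic gap between the slowest and fastest averages, so small differences between neighbouring builds should not be over-read.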

I did several one-off tests of various other optimizations, but I haven't listed the results since they were not interesting. Mostly DosBox is resistant to large performance variance due to optimization features, beyond the obvious ones the compiler makes easily. So if you are spending a lot of time playing with esoteric GCC flags like -funroll-all-loops and such, you're most likely wasting your time.

And there you have it. Hope people get something out of it, I spent way too much time on it already =). Suggestions etc are welcome. Yes, I know there need to be more test cases to validate results. Feed me repeatable benchmark conditions that are game-like and run in DosBox, and I'll put them through paces.

Couple of side notes: the fastest display surface on my machine is "opengl" (windowed), which resulted in 119.3 FPS in this test. It's too much work to test an x64 build with Visual Studio, because Microsoft literally removed the _asm keyword from the compiler for x64, which would require a considerable restructuring of the DosBox code that I'm not up for right now (you can still compile assembly, it just has to be in a module by itself). Besides, pretty much all of that code should be rewritten for x64 if you're going to the trouble of a native build. Something for later, and it wouldn't have a major performance impact anyway.

Enjoy.

Last edited by frobme on 2007-01-09, 19:43. Edited 3 times in total.

Reply 1 of 23, by Qbix

Rank DOSBox Author

hmm a bit weird.
aep and ykhwong (and I think gulikoza as well)
all use mingw/GCC as their compiler as well.

Water flows down the stream
How to ask questions the smart way!

Reply 2 of 23, by gulikoza

Rank Oldbie

My releases are gcc compiled, so don't throw gcc away so easily 😉. I'm also not doing any PGO, so I guess it beats VC too 🤣

There has also been some work done on the x86_64 core, when the platform is more widespread, dosbox should be ready...

http://www.si-gamer.net/gulikoza

Reply 4 of 23, by frobme

Rank Member
gulikoza wrote:

My releases are gcc compiled, so don't throw gcc away so easily 😉. I'm also not doing any PGO, so I guess it beats VC too 🤣

There has also been some work done on the x86_64 core, when the platform is more widespread, dosbox should be ready...

I'd be interested in your compile setup then. I didn't do anything unusual; the gcc-3 builds are using the stock as-shipped MinGW version of gcc, on a stock tool chain. What are your CFLAGS for an average compile, if you don't mind me asking? I tried several variations, changing -march and flipping between -O2 and -O3, etc. (occasionally GCC will actually produce faster code with -O2).

I could run it through an exhaustive regression with acovea, but it's super rare for acovea to yield more than a few percent improvement. I'm not sure what could be accounting for the discrepancy. The EXEs are all stripped and are roughly in the size range of the VisStudio outputs.

Also, I did do runs with other SDL.DLLs, so that wouldn't seem to account for it. My configure is basically --enable-core-inline, nothing else seemed necessary.

-frob

Reply 5 of 23, by gulikoza

Rank Oldbie

Nothing fancy, MinGW latest gcc candidate (3.4.5), CFLAGS is a collection I gathered on the forum, nothing special I think:

-mtune=i686 -march=i586 -O2 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -funroll-loops --param max-unrolled-insns=60

Indeed O2 seems slightly faster 😀
I also have some graphic optimizations, my build will skip redrawing the screen if nothing has changed, but ykhwong has that too I think. You can also try to get my source tree, see if that is the difference compared to plain cvs.

Reply 6 of 23, by frobme

Rank Member

gulikoza wrote:

Nothing fancy, MinGW latest gcc candidate (3.4.5), CFLAGS is a collection I gathered on the forum, nothing special I think:

-mtune=i686 -march=i586 -O2 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -funroll-loops --param max-unrolled-insns=60

I'll try these flags, but that isn't my problem; I just rebuilt the entire set with gcc-3, gcc-4 again and I'm getting results more in line with yours. There appears to have been some fundamental issue with my initial pass on GCC; I'm not sure what, as I'm quite sure I used appropriate CFLAGS/CXXFLAGS, and the executables are the right size. In any case I'll edit the post in a moment, must be my error somewhere.

-Frob

Reply 7 of 23, by frobme

Rank Member

Building with these CFLAGS/CXXFLAGS:

-mtune=i686 -march=i586 -O2 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -funroll-loops --param max-unrolled-insns=60

and using GCC-4.1.1 (what I had up in the environment when I went to test =) ) resulted in these numbers:

gcc-4,gulikoza:    957,960,1019    avg 979  fps 76.6

If you have code that's not checked into the current CVS repository limiting screen draws, that could easily account for the discrepancy.

Reply 8 of 23, by XTale

Rank Newbie

yeah - got the slowest one 😀

well, i'm not doing any optimizations at all 😉
just a
./configure host=i586-mingw32 && make

But my builds aren't for real daily use... more for daily testing (that's why they are updated daily too)

http://www.aep-emu.de - AEP Emulation Page
http://cvscompile.aep-emu.de - DosBox CVS builds

Reply 9 of 23, by ykhwong

Rank Oldbie

The following flag is what I use.

CXXFLAGS="-s -O2 -pipe -fprofile-use -fomit-frame-pointer -mtune=i686 -march=i586" ./configure --enable-core-inline

I used to have gcc 4.1.x for mingw. It was slower than the gcc 3.4.5 for me.
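To see at a glance how the flag sets quoted in this thread differ, here is a small Python sketch that splits each set and compares them. The flag strings are copied from the posts above; the helper names are invented for illustration, and the script only compares text, it says nothing about which set is faster.

```python
import shlex

# Flag sets quoted in this thread (copied verbatim from the posts).
FLAGS = {
    "frobme":   "-march=pentium4 -O3 -fomit-frame-pointer",
    "gulikoza": "-mtune=i686 -march=i586 -O2 -pipe -fomit-frame-pointer "
                "-mno-push-args -ffast-math -funroll-loops "
                "--param max-unrolled-insns=60",
    "ykhwong":  "-s -O2 -pipe -fprofile-use -fomit-frame-pointer "
                "-mtune=i686 -march=i586",
}


def flag_set(text):
    """Split a flag string into a set, keeping '--param NAME=VALUE' as one item."""
    tokens = shlex.split(text)
    flags, i = set(), 0
    while i < len(tokens):
        if tokens[i] == "--param" and i + 1 < len(tokens):
            flags.add(f"--param {tokens[i + 1]}")
            i += 2
        else:
            flags.add(tokens[i])
            i += 1
    return flags


sets = {who: flag_set(text) for who, text in FLAGS.items()}


def unique_to(who):
    """Flags that appear only in one person's set."""
    others = set().union(*(s for w, s in sets.items() if w != who))
    return sets[who] - others


print("shared by all:", sorted(set.intersection(*sets.values())))
for who in sets:
    print(f"only in {who}'s set:", sorted(unique_to(who)))
```

Of the three sets, only ykhwong's includes -fprofile-use.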

Reply 10 of 23, by frobme

Rank Member
ykhwong wrote:

The following flag is what I use.

CXXFLAGS="-s -O2 -pipe -fprofile-use -fomit-frame-pointer -mtune=i686 -march=i586" ./configure --enable-core-inline

I used to have gcc 4.1.x for mingw. It was slower than the gcc 3.4.5 for me.

I'll check out a reference build with your flags as well, it will probably just turn in similar numbers to the downloaded version though.

It's quite possible gcc 4.1.X is slower than the 3-series, as they went through some really large changes in code base between those compilers. I'm interested in getting 4.3.X working, if only to experiment, since it has a nocona (Core Duo) code emitter, just to see if it makes any difference. More often than not, building for a specific processor has little effect though; so much so that Microsoft just threw the option out and always produces blended code now.

I was a little surprised at ICC's relatively middle showing here, as it's historically been an excellent optimizer. Testing across three compilers like this with very similar results suggests that there isn't much in code generation that is going to make DosBox faster; rather, if it gets further optimizations they will come from architectural changes or simple improvements in the approach of the code itself, not how it's compiled.

-Frob

Reply 11 of 23, by leileilol

Rank l33t++

Try benchmarking with Quake by starting it up with default everything (320x200, default viewsize with both status bars visible), in the start.bsp start position, with a timerefresh.

I usually test all of my computers this way over the years (P2 233 got 56fps, k6 500 gets 73fps, p100 gets 23fps, etc)

long live PCem

Reply 12 of 23, by `Moe`

Rank Oldbie

Note that -O2 is not terribly useful with PGO. If you use -fprofile-use (and I assume that you generated a profile with -fprofile-generate before), the really useful PGO-enabled optimizations are enabled by -O3, not -O2.

It's rather easy to explain why gcc-4 may seem slower than gcc-3, and O3 slower than O2 on some CPUs: Cache size. gcc-4's optimizer is way better, really. Unfortunately, many optimizations increase code size, and this can be a significant amount. So for normal builds, the gain of much better CPU optimization can easily be killed by more cache thrashing, especially on low-cache CPUs like the Sempron, Celeron and some Intel Cores. The same reasoning applies to O2 vs. O3. gcc doesn't take cache size into account while optimizing (although there are some options to influence code size).

The solution to this conflict is in fact PGO: with PGO, gcc is able to lay out code so that often-used code parts are near each other, leading to less cache thrashing (recent binutils are needed for that as well).

And another explanation, regarding the "bad performance with first build" mystery: When no profile information is present, gcc generates a random profile. This can in fact lead to identical builds that are not identical at all. The reason for this is empirical: The gcc folks simply tried it and found it generates noticeably better code most of the time.

Reply 13 of 23, by frobme

Rank Member
`Moe` wrote:

The solution to this conflict is in fact PGO: with PGO, gcc is able to lay out code so that often-used code parts are near each other, leading to less cache thrashing (recent binutils are needed for that as well).

And another explanation, regarding the "bad performance with first build" mystery: When no profile information is present, gcc generates a random profile. This can in fact lead to identical builds that are not identical at all. The reason for this is empirical: The gcc folks simply tried it and found it generates noticeably better code most of the time.

Sensible, but my tested GCC builds to date haven't been PGO'd =). Actually I went to run those through last night and ended up with a problem; even using the candidate 3.4.5 compiler, I can build the instrumentation pass fine, but the subsequent rebuild using the profile crashes in GCC. I didn't have a chance to test 4.1.1 yet. It's a shame the MinGW chain is so temperamental; I should have just immediately set up a cross compile environment on my Linux machine, but I get perverse about doing things the hard way sometimes...

-Frob

Reply 14 of 23, by `Moe`

Rank Oldbie

Yeah, that's what I mean: If you are doing a non-PGO build, it actually behaves a bit like a PGO build with random profile data. The reasoning is something like this: In some situations some decision must be taken; branches, for example, cannot be left out, so gcc has to guess which branch is more likely. Traditionally, branches are laid out just like the programmer did them. By using random statistical data and following data flow from that, it is actually possible to make better predictions in many cases. But since the data is random, it is unlikely but perfectly possible that some builds behave noticeably worse.

By the way, gcc-4 is able to do full-program optimization, it would be an interesting experiment to check if that makes a significant difference. That's some work, however, since for that, gcc must compile the whole of dosbox in one go.

BTW, from what I have read, PGO on the 3.x series wasn't nearly as effective or stable as on the 4.x series.

Reply 15 of 23, by frobme

Rank Member
`Moe` wrote:

Yeah, that's what I mean: If you are doing a non-PGO build, it actually behaves a bit like a PGO build with random profile data. The reasoning is something like this: In some situations some decision must be taken; branches, for example, cannot be left out, so gcc has to guess which branch is more likely. Traditionally, branches are laid out just like the programmer did them. By using random statistical data and following data flow from that, it is actually possible to make better predictions in many cases. But since the data is random, it is unlikely but perfectly possible that some builds behave noticeably worse.

By the way, gcc-4 is able to do full-program optimization, it would be an interesting experiment to check if that makes a significant difference. That's some work, however, since for that, gcc must compile the whole of dosbox in one go.

BTW, from what I have read, PGO on the 3.x series wasn't nearly as effective or stable as on the 4.x series.

Ah, gotcha. I guess enough people don't follow the "stay on the path" recommendations for branch prediction logic flow for it to be useful to always guess one direction or another.

Sadly PGO doesn't seem to work for GCC-4.1.1 either; I can build and instrument fine, run the instrumented build, it creates a .gcda file etc. But when I rebuild the next pass with "-fprofile-use" I get an internal compiler error on both 3.4.5 and 4.1.1 (in different spots).

The whole program thing looks interesting, but it's very unfriendly to current makefile style. I suppose I could kludge some kind of huge cpp file together =)

-Frob

Reply 16 of 23, by wd

Rank DOSBox Author

Traditionally, branches are laid out just like the programmer did them.

Actually this fact is used in several places to force faster code. ih8regs did
some testing on larger scope rearranging, which gained some speed.
The compiler works against this by using randomization in the prediction.
Of course pgo will still get it correct then, but it is not an option as games
vary largely in what code is actually used and which predictions are optimal.

Reply 17 of 23, by `Moe`

Rank Oldbie

wd, right, but I am with the gcc team here: code layout should follow logic and intentions of the programmer, not behaviour of one target machine (who knows, maybe in 5 years, branches are faster exactly opposite from today's branches). That's why we have the GCC_UNLIKELY macro (which I feel is misnamed, because I bet MSVC has something similar which might be used). That way, program logic and optimization are well visible and separated from each other.

On the other hand, and this is a very important point IMHO, a software that is released roughly once per year does not have any excuse not being profile optimized, making all these points moot. My personal builds always use profiling, I set up a simple shell script that runs dosbox on 5 quite different games with quite different hardware usage patterns, all that takes 1.5 times longer than a simple build and works perfectly well. 10 minutes longer for 10% more performance? Yes please.
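The kind of script described above could look roughly like the following Python sketch. It only builds a dry-run list of commands for GCC's two-pass cycle (-fprofile-generate, training runs, -fprofile-use) and executes nothing. The training program names are placeholders, and the binary path `src/dosbox`, the use of DOSBox's `-exit` switch, and the assumption that `make clean` leaves the generated .gcda profile files in place are all guesses to verify on your own setup.

```python
# Sketch: a two-pass GCC profile-guided build expressed as a dry-run plan.
# Nothing is executed here; each entry could be passed to subprocess.run().
BASE_FLAGS = "-O3 -fomit-frame-pointer"   # ASSUMPTION: example flags, adjust to taste

# ASSUMPTION: hypothetical training set; no specific games are named in the thread.
TRAINING_PROGRAMS = ["game1.exe", "game2.exe", "game3.exe", "game4.exe", "game5.exe"]


def pgo_plan(programs, base_flags=BASE_FLAGS):
    """Return the commands for an instrument / train / rebuild cycle."""
    def build(extra):
        flags = f"{base_flags} {extra}"
        return [
            ["./configure", "--enable-core-inline",
             f"CFLAGS={flags}", f"CXXFLAGS={flags}"],
            ["make", "clean"],   # ASSUMPTION: does not delete the .gcda files
            ["make"],
        ]

    plan = build("-fprofile-generate")        # pass 1: instrumented build
    plan += [["src/dosbox", prog, "-exit"]    # training runs write the .gcda files
             for prog in programs]
    plan += build("-fprofile-use")            # pass 2: rebuild using the profile
    return plan


for command in pgo_plan(TRAINING_PROGRAMS):
    print(" ".join(command))
```

Interactive games would still need someone (or a recorded demo) to drive them during the training runs, so a real script has to decide how each program is exercised before it exits.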

frobme, I have no idea why it crashes for you. You are using mingw, right? Maybe that's something they didn't support correctly right now. With gcc-4, you should be able to use profile data from a linux compile on windows. gcc-3 is very picky about profile data, so if at all, gcc-4 can do that.

Reply 18 of 23, by wd

Rank DOSBox Author

It's not really possible to find a low number of games which, when pgo'd,
would not have a lot of other games behave quite badly. There are simply too
many factors, like graphics modes (if you have one game that requires
the custom memory access and which outweighs all other memory access,
you'll get quite bad performance in most other games).
If really going that way you'd end up with several builds which are optimized
for different processor classes (and maybe even game classes). Think the
regular releases are ok speed-wise, and if you really need more speed you
can choose special architecture flags and pgo for one game. But this is
open for discussion, maybe you really have found a good set of games
that results in pgo builds which are faster for most games 😀

Reply 19 of 23, by `Moe`

Rank Oldbie

Well, I think even using just a text-mode program for PGO is better than none at all, since there's a lot in DOSBox that can be optimized and which is not graphics mode dependent. It should be easy to pick 5 "typical" programs for their age, programs that do not pull off weird tricks. Then you'd end up with a build for the average program which behaves as well for strange programs as the current builds, but much better for the bulk. Using a weird program doesn't mean anything else will be slow either, because much of the optimization will still be common for any program.