VOGONS


Reply 20 of 31, by krcroft

Rank Oldbie

Very interesting jmarsh,

This helps me understand it better: pixel-perfect adds some extra processing, but at a level the Pi3 can handle as long as there are no palette changes.

Combining palette changes with pixel-perfect requires significantly more CPU cycles per frame than the Pi3 can deliver, and thus causes a severe slowdown. Remove pixel-perfect, and the Pi3 has just enough headroom to absorb the palette-change cost during the Wolfenstein 3D intro (when running DOSBox at cycles=10000).

This also explains why I see a severe bog-down in Abuse during the brief hallway lighting-change events, even without pixel-perfect: because Abuse requires cycles=30000, which is at the edge of what the Pi3 can give. Any extra demand and things slow down.

Last edited by krcroft on 2020-01-09, 02:24. Edited 1 time in total.

Reply 21 of 31, by jmarsh

Rank Oldbie

Pixel perfect shouldn't be adding much; all it does is duplicate pixels. It's just byte-copying without any processing, unlike other scalers. But you can get the GPU to do it for you by using aspect=true, output=openglnb and an output resolution that is a multiple of both 4:3 and the original resolution.
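One reading of that last condition, sketched in C++: pick the smallest 4:3 resolution whose width and height are whole multiples of the game's native width and height. The function and struct names are made up for illustration; this is not DOSBox code, and it only does the arithmetic.

```cpp
#include <cassert>
#include <numeric>  // std::gcd, std::lcm (C++17)

struct Resolution {
    int width;
    int height;
};

// Smallest 4:3 resolution whose width is an integer multiple of src_w and
// whose height is an integer multiple of src_h (hypothetical helper).
Resolution smallest_4x3_multiple(int src_w, int src_h) {
    assert(src_w > 0 && src_h > 0);
    // Write the target as (4k, 3k). 4k must be divisible by src_w and
    // 3k by src_h, so k must be a multiple of both values below.
    const int k_for_width = src_w / std::gcd(src_w, 4);
    const int k_for_height = src_h / std::gcd(src_h, 3);
    const int k = std::lcm(k_for_width, k_for_height);
    return {4 * k, 3 * k};
}
```

For a 320x200 game this gives 1600x1200 (5x horizontally, 6x vertically); a 640x480 game already satisfies the condition at 640x480.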

Reply 22 of 31, by krcroft

Rank Oldbie

On the Pi3, pixel-perfect adds enough to crush the frame rate during the Wolf3D palette fade in the intro, even with cycles=10000. With the patch off, the palette fades are smooth; with the patch on, the frame rate collapses. openglnb isn't available on the Pi3 because it requires an OpenGL rendering backend, whereas the Pi's hardware supports OpenGL ES.

Reply 24 of 31, by krcroft

Rank Oldbie

Yeah; the problem is on the software / driver chain side: there's no set of headers and libraries that will let DOSBox SVN configure and build with --enable-opengl (at least on <= Pi3+; I think the Pi4 offers some kind of OpenGL-compatibility mode, but I don't have one). That's why OpenGL ES patches like this exist https://www.raspberrypi.org/forums/viewtopic.php?t=110957, as do SDL2 versions like the DOSBox fork in RetroPie, https://github.com/RetroPie/RetroPie-Setup/bl … /dosbox-sdl2.sh.

Reply 25 of 31, by cyclone3d

Rank l33t++

The Fade-in/outs may not have the problem... but DOSBox in general does.

I messed with it years ago and was able to reduce the CPU cycles needed by DOSBox by 20% just by optimizing a couple of the files that had case statements in them.

Nobody seemed to care, and all I got were complaints on my Sourceforge project because I didn't supply a compiled version of DOSBox. Because of that I deleted the project on Sourceforge and, for the most part, haven't bothered with it since.

I did start looking at it a few months ago and found that the case statements were just as bad as they were years ago. I even started looking at all the files and making a list of what source files to work on.

So... as far as optimization goes at the compiler level, if you have a case statement with values that are all in sequential order, such as 1, 2, 3, 4 and so on, the compiler can compile it into a jump table.

However, this is not the case with most case statements in DOSBox.

Most of them are either out of order or have large gaps in values... or both.

Case statements like that do not get optimized by compilers.

It requires a lot of reworking of the code, adding some more variables, etc., to build the jump tables by hand, but it really is worth it IMO.

The program will end up using a bit more RAM due to having a bunch of manually defined arrays and more variables but it really is no big deal.
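To make the idea concrete, here is a minimal C++ sketch of the kind of rework described: a switch with sparse, out-of-order case values next to a hand-built 256-entry table of function pointers that gives the same results. The opcode values and handler names are invented for illustration and are not taken from the DOSBox source. Whether the table version is actually faster depends on what the compiler already does with the switch, so it needs to be measured.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical handlers; real ones would do actual work.
int op_load()  { return 1; }
int op_store() { return 2; }
int op_jump()  { return 3; }
int op_bad()   { return -1; }

// Sparse, out-of-order case values, like the ones described above.
int dispatch_switch(std::uint8_t op) {
    switch (op) {
        case 0xC4: return op_jump();
        case 0x07: return op_load();
        case 0x5A: return op_store();
        default:   return op_bad();
    }
}

using Handler = int (*)();

// Hand-built table: one slot per possible byte value.
// Unused slots point at op_bad, which plays the role of "default".
std::array<Handler, 256> make_table() {
    std::array<Handler, 256> table{};
    table.fill(op_bad);
    table[0x07] = op_load;
    table[0x5A] = op_store;
    table[0xC4] = op_jump;
    return table;
}

// Dispatch is a single indexed load followed by an indirect call.
int dispatch_table(std::uint8_t op) {
    static const std::array<Handler, 256> table = make_table();
    assert(table[op] != nullptr);
    return table[op]();
}
```

The extra RAM mentioned above is visible here: 256 pointers for three real cases.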

Yamaha modified setupds and drivers
Yamaha XG repository
YMF7x4 Guide
Aopen AW744L II SB-LINK

Reply 26 of 31, by krcroft

Rank Oldbie

Nice, cyclone3d! I remember other posts of yours mentioning your previous optimization work in DOSBox, with significant results.

I'll be appreciative of any speed-ups: my Pi3 does around 30k cycles and my 400 MHz PowerPC-7400 is now up to 4.6k cycles thanks to jmarsh's recent dynamic recompiler for PPC.

Reply 27 of 31, by jmarsh

Rank Oldbie

The only way that could possibly be true is if you were talking about the huge switch/case tables for the opcode decoding in the normal/full cores (nothing else comes close to occupying 20% of execution time)... which is why the dynamic cores exist.

Reply 28 of 31, by cyclone3d

Rank l33t++

The cores are the main thing I worked on before.

However, when you have a bunch of other stuff that also uses case statements that can't be optimized by the compiler, the overhead for those case statements that are used a lot can still add up.

I've also noticed a few obvious bugs / syntax mistakes just by looking through the code.

Maybe I'll get a chance to actually work on it again sometime.

Are the dynamic cores as stable as the normal/full cores? The Wiki says that that setting may cause instability or not work at all with some games.


Reply 29 of 31, by dreamer_

Rank Member

Yeah, big switch-case blocks look weird at first sight, but in practice, for modern compilers, they are not a problem - just remember to compile with -O3 for release builds and it's irrelevant.

A few weeks ago I did some rudimentary benchmarking with the Linux perf command to learn where the hot spots might be (testing with Quake timedemos and various DOSBox configurations), and the only area that stood out was emulated memory access (~10% of execution time, while the other "hot" areas each take roughly 2-3%). There's a slight chance it was caused by unaligned memory access on x86_64, but more likely that code simply tends to trigger cache invalidation.

As for slowdowns during fade-ins/fade-outs - for me, it was quite noticeable in 0.74-3 e.g. in Mortal Kombat 2 and 3 (fade-out between rounds), but newer release builds with -O3 and with SDL2 work perfectly for me.

| ← Ceci n'est pas une pipe
dosbox-staging

Reply 30 of 31, by cyclone3d

Rank l33t++

-O3 still doesn't turn bad case statements into jump tables unless something has drastically changed in the last year.

You end up with giant compiled if-else chains, which add a lot of unneeded overhead.


Reply 31 of 31, by jmarsh

Rank Oldbie

If anything the dynamic cores are more stable since they handle page faults and other exceptions more gracefully.

I just checked with a 12-year-old compiler (VS2008): nearly all switch/case statements use a direct table (one lookup) or an indirect table (two lookups). If you can give an example of one you think can be optimised, I'd be happy to check.
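For anyone unfamiliar with the "indirect table" form, here is a rough C++ sketch of the two-lookup scheme written out by hand. It illustrates the general technique only; it is not actual VS2008 output, and the case values and function names are invented. The first table maps each value in the covered range to a small body index, and the second table maps that index to the code to run.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical case bodies.
int body_default() { return -1; }
int body_a() { return 10; }
int body_b() { return 20; }
int body_c() { return 30; }

using Body = int (*)();

// Second lookup: one entry per distinct case body (slot 0 is the default).
constexpr std::array<Body, 4> kBodies = {body_default, body_a, body_b, body_c};

// First lookup: one byte per value in the covered range [kLow, kHigh].
constexpr int kLow = 3;
constexpr int kHigh = 40;

std::array<std::uint8_t, kHigh - kLow + 1> make_index() {
    std::array<std::uint8_t, kHigh - kLow + 1> index{};  // zero means default
    index[3 - kLow] = 1;
    index[17 - kLow] = 2;
    index[18 - kLow] = 2;  // two case labels sharing one body
    index[40 - kLow] = 3;
    return index;
}

// Range check, then byte table, then pointer table.
int dispatch_indirect(int value) {
    static const auto index = make_index();
    if (value < kLow || value > kHigh) return body_default();
    const std::size_t slot = index[value - kLow];
    assert(slot < kBodies.size());
    return kBodies[slot]();
}

// The switch this stands in for.
int switch_reference(int value) {
    switch (value) {
        case 3:  return body_a();
        case 17:
        case 18: return body_b();
        case 40: return body_c();
        default: return body_default();
    }
}
```

The byte table keeps the gaps cheap: a gap costs one byte instead of one full pointer, which is why sparse case values need not force an if-else chain.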