VOGONS


First post, by `Moe`

User metadata
Rank Oldbie

Hi!

While working on the hq2x scaler, I learned that DOSBox renders the screen line by line, probably emulating the internals of the VGA chip quite closely. Unfortunately, the hq2x scaler needs more memory than the other scalers, so I am seeing lots of L1 cache misses. In the standalone implementation of hq2x that I use as a base, this is not as bad, since the loop runs tighter.

So my questions:

Is there a good reason for calling the scaler line-by-line instead of on a whole buffer at a time?

Could this be changed so that the scaler gets a full screen at a time?

Or else, if there are serious technical issues, could a shortcut be made for the game-typical 320x200x8 mode, just like my code does? (In hq2x, having a separate 320x200 implementation gained 2 fps due to better compiler optimization.)
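
To make the question concrete, here is a minimal sketch of the two calling styles. It is not hq2x and not DOSBox code: the kernel is plain 2x pixel doubling, and the names `scale2x_line`, `scale2x_frame` and `scale2x_320x200` are hypothetical. The per-line function is what a line-by-line renderer would call; the templated whole-frame loop fixes the dimensions at compile time, which is the kind of 320x200 shortcut the post describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in kernel: plain 2x pixel doubling, NOT hq2x.
// Expands one source line into two destination lines of 2*width pixels each.
inline void scale2x_line(const std::uint8_t* src, std::uint8_t* dst,
                         std::size_t width) {
    std::uint8_t* row0 = dst;
    std::uint8_t* row1 = dst + 2 * width;
    for (std::size_t x = 0; x < width; ++x) {
        row0[2 * x] = row0[2 * x + 1] = src[x];
        row1[2 * x] = row1[2 * x + 1] = src[x];
    }
}

// Whole-frame variant. W and H are compile-time constants, so the compiler
// can unroll and fold address arithmetic for one fixed mode.
template <std::size_t W, std::size_t H>
void scale2x_frame(const std::uint8_t* src, std::uint8_t* dst) {
    for (std::size_t y = 0; y < H; ++y) {
        // Each source line y fills destination lines 2y and 2y+1.
        scale2x_line(src + y * W, dst + (2 * y) * (2 * W), W);
    }
}

// Convenience wrapper for the common 320x200x8 mode (assumed shortcut).
inline std::vector<std::uint8_t> scale2x_320x200(
        const std::vector<std::uint8_t>& src) {
    assert(src.size() == 320u * 200u);
    std::vector<std::uint8_t> dst(640u * 400u);
    scale2x_frame<320, 200>(src.data(), dst.data());
    return dst;
}
```

Whether the fixed-size loop is actually faster depends on the compiler and the kernel; the 2 fps figure above is for hq2x, not for this toy kernel.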

Reply 1 of 5, by Qbix

User metadata
Rank DOSBox Author

We used to draw the screen all at once.

We switched to line-by-line drawing to be more compatible (e.g. palette changes during drawing), and for some other reasons.

But Harekiet can answer this better.
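
A small sketch of why mid-frame palette changes favour line-by-line drawing. This is an illustration under assumptions, not DOSBox code: `Palette`, `render_line` and `render_per_line` are hypothetical names, and the "program" changing palette entry 0 at a given scanline is simulated by a parameter.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Palette = std::array<std::uint32_t, 256>;

// Convert one line of 8-bit palette indices to 32-bit colour,
// using the palette as it is at the moment the line is drawn.
inline void render_line(const std::uint8_t* indices, std::uint32_t* out,
                        std::size_t width, const Palette& pal) {
    for (std::size_t x = 0; x < width; ++x) out[x] = pal[indices[x]];
}

// Hypothetical demo: palette entry 0 is rewritten when drawing reaches
// change_line. Lines above it keep the old colour, lines from change_line
// on get the new one. Converting the whole frame in one go after the
// change would show only the new colour.
inline std::vector<std::uint32_t> render_per_line(
        const std::vector<std::uint8_t>& frame, std::size_t width,
        std::size_t height, Palette pal, std::size_t change_line,
        std::uint32_t new_colour) {
    assert(frame.size() == width * height);
    std::vector<std::uint32_t> out(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        if (y == change_line) pal[0] = new_colour;  // simulated palette write
        render_line(frame.data() + y * width, out.data() + y * width,
                    width, pal);
    }
    return out;
}
```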


Reply 2 of 5, by `Moe`

User metadata
Rank Oldbie

Ah, and another one. Is there a compelling reason for the OPL3 code to use integer math? Would FP math (or rather, SSE) break anything? I've read a comment which talks about losing precision intentionally, so I'm not sure whether that would be an alternative...

(Background: the sine table lookup in chan_calc is responsible for lots of L1 cache misses, oprofile tells me. Since hq2x uses lots of mem, the opl code is affected by it. As part of my Pentium3 optimization, I'd like to try moving both pieces to SSE, hopefully eliminating a few lookup tables and thus increasing speed.)
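
As a rough sketch of the trade-off being asked about: a table lookup costs a memory access per sample, while a computed sine costs a few multiplies but no table. The code below is not the OPL3 code; the table size, the 32-bit phase accumulator and the names `sine_lookup` and `sine_approx` are assumptions. The approximation is a well-known parabolic fit with an error on the order of 0.001, so its output is not bit-identical to a table, which matters if the original loses precision on purpose.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTableBits = 10;                 // assumed table size
constexpr std::size_t kTableSize = std::size_t{1} << kTableBits;

// Integer sine table, built once; one full period in kTableSize entries.
inline const std::array<std::int16_t, kTableSize>& sine_table() {
    static const std::array<std::int16_t, kTableSize> table = [] {
        std::array<std::int16_t, kTableSize> t{};
        const double two_pi = 6.283185307179586;
        for (std::size_t i = 0; i < kTableSize; ++i) {
            t[i] = static_cast<std::int16_t>(
                std::lround(32767.0 * std::sin(two_pi * i / kTableSize)));
        }
        return t;
    }();
    return table;
}

// Table variant: the top bits of the phase accumulator index the table.
inline std::int16_t sine_lookup(std::uint32_t phase) {
    return sine_table()[phase >> (32 - kTableBits)];
}

// Table-free variant: parabolic approximation of sin(x) on [-pi, pi),
// followed by one refinement step. Approximate, not bit-exact.
inline float sine_approx(std::uint32_t phase) {
    const float pi = 3.14159265f;
    // Reinterpret the phase as signed so that it maps onto [-pi, pi).
    const float x =
        static_cast<float>(static_cast<std::int32_t>(phase)) / 2147483648.0f * pi;
    float y = (4.0f / pi) * x - (4.0f / (pi * pi)) * x * std::fabs(x);
    y = 0.225f * (y * std::fabs(y) - y) + y;
    return y;
}
```

The arithmetic in `sine_approx` has no branches or loads, so it is the kind of code that maps onto SSE registers; whether it beats a cached table in practice would need profiling.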

Reply 3 of 5, by Harekiet

User metadata
Rank DOSBox Author

We used to draw the screen all at once when it was still running in another thread, but then we made the graphics update part of the main thread, and if you don't want to delay the rest of the machine too much, you split it up into lines.
And the compelling reason why OPL3 uses integer math is that it works and I don't really care to rewrite it 😀
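
The scheduling argument can be sketched as follows. This is a toy model with hypothetical names (`run_frame_interleaved`, `run_frame_at_once`, `trace`), not the DOSBox main loop: in a single thread, drawing one line between CPU slices keeps the emulated machine running, whereas drawing the whole frame in one go stalls it for the full frame.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Single-threaded loop: one slice of emulated CPU work, then one video line.
inline void run_frame_interleaved(int lines,
                                  const std::function<void()>& run_cpu_slice,
                                  const std::function<void(int)>& draw_line) {
    assert(lines >= 0);
    for (int y = 0; y < lines; ++y) {
        run_cpu_slice();
        draw_line(y);
    }
}

// Same amount of work, but all video lines are drawn in one block,
// so nothing else runs until the frame is finished.
inline void run_frame_at_once(int lines,
                              const std::function<void()>& run_cpu_slice,
                              const std::function<void(int)>& draw_line) {
    assert(lines >= 0);
    for (int y = 0; y < lines; ++y) run_cpu_slice();
    for (int y = 0; y < lines; ++y) draw_line(y);
}

// Records the order of events: 'C' for a CPU slice, 'L' for a drawn line.
inline std::string trace(bool interleaved, int lines) {
    std::string log;
    const auto cpu = [&log] { log += 'C'; };
    const auto line = [&log](int) { log += 'L'; };
    if (interleaved) {
        run_frame_interleaved(lines, cpu, line);
    } else {
        run_frame_at_once(lines, cpu, line);
    }
    return log;
}
```

For three lines, `trace(true, 3)` yields `CLCLCL` and `trace(false, 3)` yields `CCCLLL`.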

Reply 4 of 5, by mirekluza

User metadata
Rank DOSBox Moderator

@MOE: I hate to spoil your fun, but I think you should not specialize so much on one processor if you want your work to be integrated into DOSBox at some point in the future (as opposed to remaining a separate patch).
DOSBox runs on different processors (not only Intel). Some are Intel-compatible (Athlon), some are not (e.g. there is a version of DOSBox for Apple, etc.).

Making a specialized version for just *one* type of Intel processor is IMHO not the way forward. Personally, I doubt it will help very much anyway (but I admit I may be wrong here). DOSBox has one processor-dependent part (the dynamic core), but it is specifically marked and generally kept separate (and DOSBox does not need it).
Adding other parts that depend on a processor type (or even on Intel in general) does not help much.

Mirek

Reply 5 of 5, by `Moe`

User metadata
Rank Oldbie

Harekiet: thanks a lot. I'm still not sure where to go from here, but I will do some testing to see if it could be worth the effort.

Is there anything I should watch out for when throwing SSE at different code? Most importantly, I'd like to use the Intel compiler syntax, i.e. no assembler, but built-in functions that each result in a given instruction. GCC understands those functions as well, so it would work cross-OS; I just don't know whether Borland/MS/... understand them. (Rationale: the compiler knows much better than I do how to order instructions and allocate registers.)
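
A minimal sketch of that intrinsic style, with a scalar fallback so the same source still builds where SSE is not available. The function `mix_add` and the macro `SKETCH_HAVE_SSE` are hypothetical; the intrinsics themselves (`_mm_set1_ps`, `_mm_loadu_ps`, `_mm_mul_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from `<xmmintrin.h>`, which GCC and MSVC both provide. Borland support is not covered here.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time detection (GCC/Clang define __SSE__; MSVC uses _M_X64 or
// _M_IX86_FP). On other targets the scalar loop below does all the work.
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Computes dst[i] = a[i] + gain * b[i] for i in [0, n).
inline void mix_add(float* dst, const float* a, const float* b,
                    float gain, std::size_t n) {
    assert(n == 0 || (dst && a && b));
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    const __m128 g = _mm_set1_ps(gain);          // broadcast gain to 4 lanes
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);   // unaligned loads are allowed
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, _mm_mul_ps(vb, g)));
    }
#endif
    // Scalar tail, and the full loop on non-SSE targets.
    for (; i < n; ++i) dst[i] = a[i] + gain * b[i];
}
```

With GCC on 32-bit x86 the SSE path needs `-msse` (or a suitable `-march`) to be enabled; a build that should run on both SSE and non-SSE CPUs would need a runtime check instead of this compile-time switch.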

mirekluza: Look at the hq2x code. It is cross-platform. And fast. And yet, DOSBox isn't fast enough to run on a 733MHz box (my new testing platform) or even 333MHz (my laptop, an important target for me), so I wanna do something about it. And face it, people are using Intel a lot, so my work will definitely not be wasted. And no one is talking about making a version specific to one CPU. SSE is available on all CPUs from P3/Athlon-Tbred upwards. The P2 optimizations were made in a way so that later CPUs or even other architectures gain as well. I ain't that short-sighted.