Reply 120 of 227, by Kronuz
Study on the contiguous/non-contiguous access dilemma for DOSBox graphics optimization.
Because of the complexity of some programs that run under DOSBox, and because of the way DOSBox was designed, it writes to the "video memory" on a line-by-line basis. The DOSBox scalers are called once for each line in the frame; each call processes the whole line and modifies the memory corresponding to that single line in the screen surface (the screen surface is a "buffer" of memory that's kept so that the video card can later put what's in it on the screen). The scalers used to process the whole frame (line by line) whether or not a line had changed since the last frame, and some scalers can be very, very slow (e.g. Hq2x and even AdvMame3x).
What I do to optimize DOSBox and improve the overall speed of the scalers is analyze each line before letting the scalers "process" it, so that I know which parts of it really changed since the last frame; then I tell the scalers which parts of the line need to be processed and updated. As we know, once a scaler has read a pixel it needs to "expand" it, so that the very same pixel becomes four (2x) or nine (3x) pixels on your "scaled" screen. So when a scaler writes a pixel to the screen surface, it has to write those four or nine new pixels: one at one position on the screen, others at the position(s) right next to it, and the rest at the positions right below them.
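To make that concrete, here is a rough C++ sketch of the idea. It is not the actual DOSBox code: the names findChangedSpans and scale2xSpan are made up for illustration, and I'm assuming 8-bit pixels and a plain pixel-doubling 2x scaler.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: returns half-open [begin, end) pixel ranges where the
// current source line differs from the same line in the previous frame.
std::vector<std::pair<std::size_t, std::size_t>>
findChangedSpans(const std::uint8_t* prev, const std::uint8_t* cur,
                 std::size_t width) {
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t x = 0;
    while (x < width) {
        if (prev[x] == cur[x]) { ++x; continue; }   // unchanged pixel, skip it
        std::size_t begin = x;
        while (x < width && prev[x] != cur[x]) ++x; // extend the changed run
        spans.emplace_back(begin, x);
    }
    return spans;
}

// Hypothetical plain 2x expansion of one span: every source pixel becomes a
// 2x2 block. dst points at the upper of the two output rows; pitch is the
// distance in bytes between output rows.
void scale2xSpan(const std::uint8_t* src, std::size_t begin, std::size_t end,
                 std::uint8_t* dst, std::size_t pitch) {
    for (std::size_t x = begin; x < end; ++x) {
        const std::uint8_t p = src[x];
        dst[2 * x]             = p;  // same row
        dst[2 * x + 1]         = p;  // right next to it
        dst[pitch + 2 * x]     = p;  // row below (the non-contiguous write)
        dst[pitch + 2 * x + 1] = p;
    }
}
```

The two writes at dst[pitch + ...] are the "below" writes that the rest of this post is about.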
One problem arises when you want to write pixels to the positions "below": writing to the "screen surface", when the screen surface is located in video memory, is very slow when you don't write pixels in a contiguous way (that is, if you don't write to the pixels directly to the left or to the right of the last one written). The old scalers solved this problem by having "write cache" lines that temporarily held the pixels below the line currently being written until that line was finished, which allowed those pixels to be written later, when their access could be made contiguous. This worked because, no matter what, the whole line was processed, so when it was the write cache's turn to be written to the screen surface in a contiguous way, it was simply copied from the write cache to the screen surface.
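Roughly, the old "write cache" scheme looks like the sketch below. Again, this is only an illustration under my own assumptions (8-bit pixels, plain 2x doubling, a made-up function name scale2xLineCached), not the real scaler code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical full-line 2x scaler using one write cache line.
// surface is the start of the screen surface, pitch is bytes per surface row,
// srcY is the source line number, cache is scratch space in system memory.
void scale2xLineCached(const std::uint8_t* src, std::size_t width,
                       std::uint8_t* surface, std::size_t pitch,
                       std::size_t srcY, std::vector<std::uint8_t>& cache) {
    cache.resize(2 * width);
    std::uint8_t* top = surface + (2 * srcY) * pitch;
    for (std::size_t x = 0; x < width; ++x) {
        const std::uint8_t p = src[x];
        top[2 * x]       = p;  // upper row: written left to right, contiguous
        top[2 * x + 1]   = p;
        cache[2 * x]     = p;  // lower row: parked in the write cache for now
        cache[2 * x + 1] = p;
    }
    // The whole line was processed, so the whole cache is valid and can be
    // copied to the lower row in one contiguous pass.
    std::memcpy(top + pitch, cache.data(), 2 * width);
}
```

Note that the final memcpy is only safe because every pixel of the cache line was filled in during the loop.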
After my optimizations this was no longer possible: the "write cache" no longer contained the whole line, just the parts that had changed, so one could not simply copy the whole "write cache" anymore. One needed to copy just the parts that changed (everything else in the "write cache" is garbage). So I decided to get rid of the "write cache lines" and to access the screen surface directly, even though that means using the sometimes "slow" non-contiguous access.
I have analyzed our options for further optimizing access to the surface that the scalers modify to update the video in DOSBox, and I've come to the point where we have two options, neither of which is likely to improve the speed:
1) put back the "write cache lines" (WCL from now on) that I removed from the scalers.
2) use a software "shadow copy" (SC from now on) of the whole screen when the screen surface is in hardware video memory.
The first option would surely improve the access speed per write when the surface really is in hardware memory (that is, in non-windowed mode and only on new video cards), since access to the video memory would end up being linear, or contiguous, again. Unfortunately, as I explained earlier, the scalers would only fill in the parts of the line that changed since the last frame. So we would have to either copy from "somewhere" what was in the same line in the previous frame, so that the scalers "complete" it with the updated parts and the whole WCL can later be copied to the hardware screen surface, *or* copy from the WCL to the hardware screen surface just the parts that the scalers actually wrote (and changed). This would lead to one of these scenarios: doing a memory copy from "somewhere" (could be the hardware surface), working directly in a shadow copy (which is our next option), or adding further complexity and more checks in order to copy just the changed parts of the WCL to the hardware surface. Any of these scenarios would most likely slow down the use of a "write cache", and would also render the "just write what really changed" optimization useless at least half of the time.
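The last scenario (copying only the changed parts of the WCL) could look something like this sketch. The name flushSpans and the span representation are my own assumptions for illustration; the point is the extra bookkeeping and the per-span copies it takes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Half-open [begin, end) range of changed pixels, in source coordinates.
using PixelSpan = std::pair<std::size_t, std::size_t>;

// Hypothetical flush of a partially filled write cache line: only the scaled
// ranges that the scaler really wrote are copied; the rest of cacheLine is
// garbage and must not reach the surface. Returns the number of bytes copied.
std::size_t flushSpans(const std::uint8_t* cacheLine,
                       const std::vector<PixelSpan>& changed,
                       std::size_t scale, std::uint8_t* surfaceRow) {
    std::size_t copied = 0;
    for (const PixelSpan& s : changed) {
        const std::size_t offset = s.first * scale;
        const std::size_t length = (s.second - s.first) * scale;
        std::memcpy(surfaceRow + offset, cacheLine + offset, length);
        copied += length;
    }
    return copied;
}
```

Each span costs a separate small copy plus the loop overhead, on top of the work the scaler already did to fill the cache, which is the added complexity I mean.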
Our second option, using a SC, would always make it faster to access the memory in a non-contiguous way (which is basically how the scalers need to access it), since the access would be to system memory and not to video memory, where non-contiguous access is slower. Moreover, letting the scalers use a SC would make it unnecessary to lock/unlock the video memory every time, which on some systems can be slow and is totally pointless if nothing really changed since the last frame. Unfortunately, having the SC would end up being slower for one reason: it's impossible, or very expensive (slow), to detect all the smallest regions of the screen that really changed since the last frame so that only those parts are blitted (copied from the SC to the screen surface). The computer would end up "blitting" a lot of parts that really didn't change... killing the whole purpose of having the SC.
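Here is a small sketch of what I mean, assuming (my assumption, to keep the bookkeeping cheap) that the SC only tracks changes per row. The ShadowCopy type and its members are hypothetical and use 8-bit pixels.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical shadow copy in system memory with per-row dirty flags.
struct ShadowCopy {
    std::size_t width, height;
    std::vector<std::uint8_t> pixels;
    std::vector<bool> dirtyRow;

    ShadowCopy(std::size_t w, std::size_t h)
        : width(w), height(h), pixels(w * h, 0), dirtyRow(h, false) {}

    // Scalers write here; non-contiguous access is cheap in system memory.
    void write(std::size_t x, std::size_t y, std::uint8_t value) {
        pixels[y * width + x] = value;
        dirtyRow[y] = true;
    }

    // Copies every dirty row to the surface and returns the bytes blitted.
    std::size_t blit(std::uint8_t* surface, std::size_t pitch) {
        std::size_t copied = 0;
        for (std::size_t y = 0; y < height; ++y) {
            if (!dirtyRow[y]) continue;
            std::memcpy(surface + y * pitch, &pixels[y * width], width);
            dirtyRow[y] = false;
            copied += width;
        }
        return copied;
    }
};
```

With an 8-pixel-wide SC, changing two pixels on two different rows makes blit copy 16 bytes, so most of what gets blitted did not change. Tracking smaller regions would reduce that, but at the price of the more expensive detection described above.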
Two factors need to be considered as well. 1) Not all "screen surfaces" are really "hardware screen surfaces" (using video memory, where non-contiguous access is slow). As I said, for a start, only full-screen modes are eligible for hardware surfaces, and not all video cards support them (though most new ones do); also, some output modes often do not use hardware screen surfaces (e.g. OpenGL). 2) Even though direct non-contiguous access to video memory is slower, the optimized scalers only access, for the most part, very small parts of the screen surface on each frame, which probably makes this relatively slower access insignificant.
So, after much thinking about this issue, I believe it's probably better, easier and faster for the computer to keep the overhead of non-contiguous access to the (sometimes) hardware screen surface than to use either of the two options I could think of for solving this specific problem.
I guess, at this moment, there's not much left to optimize in the DOSBox scalers.
Please share your comments and thoughts.
Kronuz.