Graphics performance boost

Reply 120 of 227, by Kronuz

Posted on 2005-12-01, 07:17

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

Study on the contiguous/non-contiguous access dilemma for DOSBox graphics optimization.

Due to the complexity of some programs that run under DOSBox, and due to the way DOSBox was designed, it writes to the "video memory" on a line-by-line basis. The DOSBox scalers are called once for each line in the frame to process the whole line and modify the memory corresponding to that single line in the screen surface (the screen surface is a "buffer" of memory that's kept so that the video card can later write its contents to the screen). The scalers used to process the whole frame (line by line) whether or not a line had changed since the last frame, and some scalers can be very, very slow (e.g. Hq2x and even AdvMame3x).

What I do to optimize DOSBox and improve the overall speed of the scalers is this: before letting the scalers "process" a line, I analyze it to find out which parts of it really changed since the last frame, and then I tell the scalers which parts of the line need to be processed and updated. Scalers, as we know, once they've read a pixel, need to "expand" it so that that very same pixel becomes four (2x) or nine (3x) pixels in your "scaled" screen. So when the scalers write a pixel to the screen surface, they need to write the four or nine new pixels: one to one position on the screen, others to the position(s) right next to it, and the rest to the positions right below them.
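To make the idea concrete, here is a minimal Python sketch (not DOSBox code). It assumes lines are plain lists of pixel values and uses simple pixel doubling in place of a real scaler such as AdvMame2x or Hq2x, which also look at neighbouring pixels. The names `changed_spans` and `scale2x_spans` are made up for this illustration.

```python
def changed_spans(prev_line, cur_line):
    """Return [start, end) index pairs where cur_line differs from prev_line."""
    spans = []
    start = None
    for x, (old, new) in enumerate(zip(prev_line, cur_line)):
        if old != new:
            if start is None:
                start = x          # a changed run begins here
        elif start is not None:
            spans.append((start, x))  # the run ended just before x
            start = None
    if start is not None:
        spans.append((start, len(cur_line)))
    return spans


def scale2x_spans(cur_line, spans, surface, y):
    """Pixel-double only the changed spans of source line y into surface.

    Each source pixel becomes a 2x2 block: two pixels on destination row 2*y
    and two more directly below on row 2*y + 1.
    """
    for start, end in spans:
        for x in range(start, end):
            pixel = cur_line[x]
            for dy in (0, 1):
                row = surface[2 * y + dy]
                row[2 * x] = pixel
                row[2 * x + 1] = pixel


prev_line = [0, 0, 0, 0, 0, 0]
cur_line = [0, 7, 7, 0, 0, 9]
surface = [[0] * 12 for _ in range(2)]  # one source line -> two output rows

spans = changed_spans(prev_line, cur_line)
scale2x_spans(cur_line, spans, surface, y=0)
```

Only the two changed runs are scaled; everything else on the surface keeps whatever the previous frame left there.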

One problem arises when you want to write pixels to the positions "below": writing to the "screen surface", when the screen surface is located in video memory, is very slow when you don't write pixels in a contiguous way (that is, if you don't write to the pixels directly to the left or to the right of the last written one). The old scalers solved this problem by having "write cache" lines that temporarily kept the pixels below the current line until the line ended, so that those pixels could be written later, when their access could be made contiguous. This worked because, no matter what, the whole line was processed, and at the end, when it was the write cache's turn to be written to the screen surface in a contiguous way, it was simply copied from the write cache to the screen surface.
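The difference between the two write orders can be sketched in Python as follows. This is a toy model, not the old scaler code: `LoggedSurface` is a hypothetical stand-in for the screen surface that merely counts how often consecutive writes land on a different row, as a rough proxy for non-contiguous access. It says nothing about real video memory timings.

```python
class LoggedSurface:
    """Toy stand-in for a screen surface that counts row switches on write."""

    def __init__(self, width, height):
        self.rows = [[0] * width for _ in range(height)]
        self.row_switches = 0
        self._last_row = None

    def write(self, row, x, pixel):
        if self._last_row is not None and row != self._last_row:
            self.row_switches += 1
        self._last_row = row
        self.rows[row][x] = pixel


def scale2x_direct(line, surface, y):
    """Write each 2x2 block immediately, hopping between two rows per pixel."""
    for x, pixel in enumerate(line):
        for dy in (0, 1):
            surface.write(2 * y + dy, 2 * x, pixel)
            surface.write(2 * y + dy, 2 * x + 1, pixel)


def scale2x_write_cache(line, surface, y):
    """Write the upper row directly; hold the lower row in a write cache line."""
    cache = [0] * (2 * len(line))
    for x, pixel in enumerate(line):
        surface.write(2 * y, 2 * x, pixel)
        surface.write(2 * y, 2 * x + 1, pixel)
        cache[2 * x] = pixel
        cache[2 * x + 1] = pixel
    # Flush the cached lower row in one contiguous pass.
    for x, pixel in enumerate(cache):
        surface.write(2 * y + 1, x, pixel)


line = [1, 2, 3, 4]
direct = LoggedSurface(8, 2)
cached = LoggedSurface(8, 2)
scale2x_direct(line, direct, 0)
scale2x_write_cache(line, cached, 0)
```

Both variants produce the same two output rows; the direct version switches rows seven times for these four pixels, the cached version only once.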

After my optimizations this was no longer possible, since the "write cache" no longer contained the whole line, just the parts that had changed. One could not simply copy the whole "write cache" anymore; one needed to copy just the parts that changed (everything else in the "write cache" is garbage). So I decided to get rid of the "write cache lines" and to access the screen surface directly (even though that means using the sometimes "slow" non-contiguous access).

I have analyzed our options to further optimize access to the surface that the scalers modify to update the video in DOSBox, and I've come to the point where we have two options, neither of which is likely to improve the speed:
1) put back the "write cache lines" (WCL from now on) that I removed from the scalers.
2) use a software "shadow copy" (SC from now on) of the whole screen when the screen surface is in hardware video memory.

The first option would surely improve the access speed per write when the surface really is in hardware memory (that is, in non-windowed mode and only on new video cards), since access to the video memory would end up being linear (contiguous) again. Unfortunately, as I explained earlier, the scalers would only fill the parts of the line that changed since the last frame. So we would have to either copy from "somewhere" what was in the same line on the previous frame, so that the scalers "complete" it with the updated parts and the whole WCL can later be copied to the hardware screen surface, *or* copy from the WCL to the hardware screen surface just the parts that the scalers actually wrote (and changed). This would lead to one of these scenarios: doing a memory copy from "somewhere" (which could be the hardware surface), working directly in a shadow copy (which is our next option), or adding further complexity and more checks in order to copy just the changed parts of the WCL to the hardware surface. Any of these scenarios would most likely slow down the "write cache" approach, and would also render the "just write what really changed" optimization useless at least half of the time.
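A small Python sketch of why the WCL can no longer be copied wholesale once the scalers fill only the changed parts. The `GARBAGE` marker and both flush functions are invented for this illustration; a real write cache would simply hold stale data from an earlier line.

```python
GARBAGE = -1  # marks write-cache slots the scaler never filled this frame


def flush_whole_line(cache, surface_row):
    """Old behaviour: copy the entire write cache line to the surface."""
    surface_row[:] = cache


def flush_spans(cache, surface_row, spans):
    """Copy only the [start, end) spans the scaler actually wrote."""
    for start, end in spans:
        surface_row[start:end] = cache[start:end]


previous_row = [5, 5, 5, 5, 5, 5, 5, 5]  # what the surface showed last frame
cache = [GARBAGE] * 8
spans = [(2, 4)]                          # the scaler only touched pixels 2..3
cache[2:4] = [8, 8]

naive = list(previous_row)
flush_whole_line(cache, naive)            # overwrites good pixels with garbage

partial = list(previous_row)
flush_spans(cache, partial, spans)        # keeps the untouched pixels intact
```

The span-wise flush is correct, but it is exactly the "further complexity and more checks" mentioned above: one extra loop and one small copy per changed span, on top of the scaler's own work.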

Our second option, using a SC, would always make non-contiguous memory access faster (and that is basically how the scalers need to access memory), since the access would go to system memory and not to video memory, where non-contiguous access is slower. What's more, letting the scalers use a SC would make it unnecessary to lock/unlock the video memory every time, which can be slow on some systems and is totally pointless if nothing really changed since the last frame. Unfortunately, a SC would end up being slower overall for one reason: it's impossible, or very expensive (slow), to detect all the smallest regions of the screen that really changed since the last frame so that only those parts are blitted (copied from the SC to the screen surface). The computer would end up "blitting" a lot of parts that didn't really change... defeating the whole purpose of having the SC.
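The over-blitting problem can be illustrated with a short Python sketch. It assumes the cheapest possible change tracking, a single bounding rectangle around everything that changed; the function names and the tiny 16x8 "screen" are made up for the example.

```python
def dirty_bounding_rect(changed_pixels):
    """Return (x, y, w, h) enclosing every changed (x, y) pixel, or None."""
    if not changed_pixels:
        return None
    xs = [x for x, _ in changed_pixels]
    ys = [y for _, y in changed_pixels]
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1)


def blit_rect(shadow, screen, rect):
    """Copy one rectangle from the shadow copy to the screen; return pixel count."""
    x, y, w, h = rect
    for row in range(y, y + h):
        screen[row][x:x + w] = shadow[row][x:x + w]
    return w * h


WIDTH, HEIGHT = 16, 8
shadow = [[0] * WIDTH for _ in range(HEIGHT)]
screen = [[0] * WIDTH for _ in range(HEIGHT)]

# Two small, far-apart changes, e.g. a cursor and a status digit.
changed = [(1, 1), (14, 6)]
for x, y in changed:
    shadow[y][x] = 9

rect = dirty_bounding_rect(changed)
copied = blit_rect(shadow, screen, rect)
```

Here two changed pixels cause 84 pixels to be copied. Tracking finer regions would reduce that, but the tracking itself then costs time, which is the trade-off described above.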

Two more factors need to be considered. 1) Not all "screen surfaces" are really "hardware screen surfaces" (using video memory, where non-contiguous access is slow). As I said, for a start, only full-screen modes are eligible for hardware surfaces, and not all video cards support them (though most new ones do); and some output modes often do not use hardware screen surfaces at all (e.g. OpenGL). 2) Even though direct non-contiguous access to video memory is slower, the optimized scalers access, for the most part, only very small parts of the screen surface on each frame, which probably makes this relatively slower access insignificant.

So, after much thinking on this issue, I believe it's probably better, easier and faster for the computer to keep the overhead of non-contiguous access to the (sometimes) hardware screen surface than to use either of the two options I could think of for solving this specific problem.

I guess, at this moment, there's not much left to optimize in DOSBox scalers.

Please share your comments and thoughts.

Kronuz.

Last edited by Kronuz on 2005-12-01, 07:51. Edited 2 times in total.

Kronuz
"Time is of the essence"

Reply 121 of 227, by Qbix

Posted on 2005-12-01, 07:22

Rank: DOSBox Author
Posts: 11323
Joined: 2002-11-27, 14:50
Location: Fryslan

please contact harekiet.
He was inspired by your work and has started creating something similar to yours (I think some of your code is used), but he has tied it directly to the vga emulation. Further, he has taken into account the proper write order (write cache).

I could explain more of it, but it would be simpler if you got in touch with him. Just yell on irc a few times. 😀

Water flows down the stream
How to ask questions the smart way!

Reply 122 of 227, by DosFreak

Posted on 2005-12-01, 07:30

Rank: l33t++
Posts: 13548
Joined: 2002-06-30, 16:35
Location: Milliways

I guess, at this moment, there's not much left to optimize in DOSBox scalers

heh. Now all we need is a guy like you who is an expert at improving dynamic core speed as much as you improved scaler speed.... 😉 (It's never going to happen as dramatically, tho, but at least a couple % increases would be nice)

How To Ask Questions The Smart Way
Make your games work offline

Reply 123 of 227, by Kronuz

Posted on 2005-12-01, 07:32

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

Qbix wrote:
please contact harekiet.
He was inspired by your work and has started creating something similar to yours (I think some of your code is used), but he has tied it directly to the vga emulation. Further, he has taken into account the proper write order (write cache).

I could explain more of it, but it would be simpler if you got in touch with him. Just yell on irc a few times. 😀

I did, Qbix; that's when the whole dilemma started, and we even had a little "debate" about it 😉 This is a follow-up to his comments and the reasons why I think it would be better not to use a "write cache" of any kind 😀 I posted it here to see what other people with more experience think about it.

Though I might be wrong in some of my premises, because I’m very new to the modern graphics world, I think in essence my reasoning is pretty sound. I just want to know what others think (including you, Harekiet, Moe, gulikoza, and everybody.)

😀

Reply 124 of 227, by Qbix

Posted on 2005-12-01, 07:37

Rank: DOSBox Author
Posts: 11323
Joined: 2002-11-27, 14:50
Location: Fryslan

Alright 😀 Well, the write cache was for the AGP bus, as I recall. But I will stay away from this discussion until I've seen more opinions about it.

Reply 125 of 227, by Kronuz

Posted on 2005-12-01, 07:40

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

Oh, btw Qbix, if you talk to Harekiet, tell him to check my new patch; it has some improvements in the way I check the cache for changes, which should make it a bit faster than version 7 😉

(Just in case I can't talk to him)

[Tell him to get it here at the forums, since sourceforge seems to be down and I couldn't put the patch there]

Reply 126 of 227, by eL_PuSHeR

Posted on 2005-12-01, 10:51

Rank: l33t++
Posts: 6570
Joined: 2003-06-20, 16:39

Kronuz wrote:
eL_PuSHeR wrote:
Hey, Kronuz. The latest UPack 0.38ß packs your dosbox.exe to just 470,3KB and it seems to be working well.

I couldn't find the 0.38ß version of UPack... all the official pages seem to be down, all I could find is version 0.30ß and it had the same problem as with UPX 🙁

http://dwing.wex.cn/ is working fine for me. Anyway UPACK is small, so...

PS - Kronuz, Qbix, Harekiet, Gulikoza, Moe et al. - You are doing a great job. Congratulations. 😎

PS2 - For UPACK, I use the following command-line parameters:

-c6 -f255 -red -rai

Works fine for your Nov 30th release too.

Intel i7 5960X
Gigabyte GA-X99-Gaming 5
8 GB DDR4 (2100)
8 GB GeForce GTX 1070 G1 Gaming (Gigabyte)

Reply 127 of 227, by eL_PuSHeR

Posted on 2005-12-01, 15:52

Rank: l33t++
Posts: 6570
Joined: 2003-06-20, 16:39

Kronuz, I've been testing dosbox.mode1.exe using scaler=advmame2x and output=overlay, and it is even capable of maintaining correct speed for protected mode games on my Sempron 3000 (I tried Psycho Pinball from Codemasters). Nice work!

Reply 128 of 227, by Kronuz

Posted on 2005-12-01, 16:54

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

eL_PuSHeR wrote:
Kronuz, I've been testing dosbox.mode1.exe using scaler=advmame2x and output=overlay, and it is even capable of maintaining correct speed for protected mode games on my Sempron 3000 (I tried Psycho Pinball from Codemasters). Nice work!

That's great! 😀 Indeed, mode1 should be faster with some output modes, but overlay should be the same speed in both modes. mode1 should, though, increase speed in the surface, opengl and ddraw output modes, and of those I would probably go for either surface or opengl.

Reply 129 of 227, by eL_PuSHeR

Posted on 2005-12-01, 19:17

Rank: l33t++
Posts: 6570
Joined: 2003-06-20, 16:39

Have to check that. Anyway, overlay seems to be the best suited mode for my ATi Radeon 9600. 😎

Well, I got some heavy artifacts using DDRAW (the DOSBox prompt font becomes garbage) and it's slow.

Openglnb is also slow but it seems to display properly.

Surface displays properly and it seems faster than DDRAW or OpenGLnb.

Still, Overlay is the best option for my card.

Reply 130 of 227, by HunterZ

Posted on 2005-12-01, 19:55

Rank: l33t++
Posts: 6171
Joined: 2003-01-31, 19:04
Location: Seattle

Actually, it seems that SDL doesn't really use hardware (YUV?) overlays (at least not on ATI) but rather uses surface mode to emulate them. As a result, it's probably faster but less pretty to use surface.

Reply 131 of 227, by `Moe`

Posted on 2005-12-01, 20:21

Rank: Oldbie
Posts: 1169
Joined: 2004-04-29, 01:06
Location: Oldenburg, Germany

Kronuz, SDL has the SDL_SWSURFACE flag to request a software surface. That already does the "right" thing, i.e. creating a SC when needed. (Partial) blitting is done via SDL_UpdateRect(), which is potentially hardware-accelerated. If there's indeed a measurable performance bottleneck (I'd love to see numbers), then IMHO that's the way to go.
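As a rough illustration of the bookkeeping partial blitting would need, here is a Python sketch (not SDL or DOSBox code) that merges per-line changed spans into `(x, y, w, h)` rectangles, the shape of argument an SDL 1.2 `SDL_UpdateRect()` call takes. The function `spans_to_rects` and its input format are assumptions made for this example, and it only merges spans that are identical on consecutive lines.

```python
def spans_to_rects(line_spans):
    """Merge identical spans on consecutive lines into (x, y, w, h) rectangles.

    line_spans maps a line number to a list of [start, end) spans.
    """
    rects = []
    open_rects = {}  # (start, end) -> [first_line, last_line]
    for y in sorted(line_spans):
        current = set(line_spans[y])
        # Close rectangles whose span did not continue onto this line.
        for span in list(open_rects):
            first, last = open_rects[span]
            if span not in current or last != y - 1:
                rects.append((span[0], first, span[1] - span[0], last - first + 1))
                del open_rects[span]
        # Extend rectangles that continue, and open new ones.
        for span in current:
            if span in open_rects:
                open_rects[span][1] = y
            else:
                open_rects[span] = [y, y]
    for span, (first, last) in open_rects.items():
        rects.append((span[0], first, span[1] - span[0], last - first + 1))
    return sorted(rects, key=lambda r: (r[1], r[0]))


line_spans = {
    10: [(4, 12)],
    11: [(4, 12)],
    12: [(4, 12), (40, 44)],
    20: [(4, 12)],
}
rects = spans_to_rects(line_spans)
```

Each resulting rectangle would become one update call, so the number of rectangles per frame is itself a cost worth measuring.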

Reply 132 of 227, by Kronuz

Posted on 2005-12-01, 20:47

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

`Moe` wrote:
Kronuz, SDL has the SDL_SWSURFACE flag to request a software surface. That already does the "right" thing, i.e. creating a SC when needed. (Partial) blitting is done via SDL_UpdateRect(), which is potentially hardware-accelerated. If there's indeed a measurable performance bottleneck (I'd love to see numbers), then IMHO that's the way to go.

That's what I meant by "not all screen surfaces are really hardware screen surfaces" 😉 Sometimes, even when using SDL_HWSURFACE, you really get a shadow copy... and that's one of my reasons for not using a shadow copy to solve the problem either 😀

Reply 133 of 227, by Kronuz

Posted on 2005-12-02, 02:00

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

Okay, this is my latest build. It fixes many issues and is now even a little bit faster; it's also compiled with heavy optimization, so it should be about 10-20% faster than my last binary. The aspect correction issues are almost gone; there are probably still some performance issues to fix in that area, but nothing big.

There's one thing, though, that still needs to be fixed. As this latest patch really, really updates only the parts of the screen that changed, and only when they actually change, you'll notice that some output modes (e.g. surface) do not update the contents of the window when you move another window over DOSBox's own window (or when you move the DOSBox window off the desktop area and back in). This is NOT a bug in the patch; it's the expected behavior. However, I understand that it's not acceptable and that it's an issue that needs to be fixed. All that's needed is to force a complete or partial redraw of the sections of the window that just became visible after being covered by another window or coming back from outside the desktop. That's not as easy as it sounds if you want to keep portability, so we'll have to wait and see what can be done about this without breaking portability.

This issue, however, even has a positive side! Using the surface output mode, you can now move a window over DOSBox and watch efficiency in action 😉 You'll also notice that I'm still doing occasional full redraws of the screen, so from time to time the whole screen is redrawn. You'll be able to see how the other output modes behave differently, DDraw being the most Windows-friendly (with it, you can't see any trace that the window is not being fully redrawn every time).
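The combination of occasional full redraws and a forced redraw after the window is uncovered could look roughly like this Python sketch. It is only an illustration: the `RedrawPolicy` class, the `notify_exposed` hook and the interval of 4 frames are hypothetical, and the patch's real interval and event handling are not stated here.

```python
class RedrawPolicy:
    """Decide per frame whether to redraw everything or only changed parts."""

    def __init__(self, full_redraw_interval):
        self.full_redraw_interval = full_redraw_interval
        self.frames_since_full = 0
        self.exposed = False

    def notify_exposed(self):
        """Call when the window was uncovered or moved back on screen."""
        self.exposed = True

    def next_frame(self):
        """Return 'full' or 'partial' for the frame about to be drawn."""
        self.frames_since_full += 1
        if self.exposed or self.frames_since_full >= self.full_redraw_interval:
            self.exposed = False
            self.frames_since_full = 0
            return "full"
        return "partial"


policy = RedrawPolicy(full_redraw_interval=4)  # interval chosen arbitrarily
decisions = [policy.next_frame() for _ in range(5)]
policy.notify_exposed()
decisions.append(policy.next_frame())
```

The hard part, as noted above, is not this logic but getting a portable "window was exposed" notification to feed into it.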

Please check this new patch and let me know what you think, what speed improvements you've noticed, and any problems you come across.

Qbix: Attachment removed

Reply 134 of 227, by Kronuz

Posted on 2005-12-02, 02:36

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

I hadn't really checked before, but also compare the CPU usage of DOSBox with and without my patch when you're playing a game 😉

I shouldn't be the one to say it, but it's AMAZING!! I can now play Warcraft 2 smoothly, like in the old days. I'm running it at 20,000 cycles and with 0% CPU usage!! Before, I couldn't even begin to play without the sound being choppy and the game being really slow... all that along with the loud noise of my CPU fan at full power and the CPU usage at maximum.

Reply 135 of 227, by GreatBarrier86

Posted on 2005-12-02, 02:58

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

This isn't a criticism but are you going to keep your dosbox exe uncompressed? I'd vote to keep it like that.

IBM ThinkPad X40
1.2Ghz with 2MB L2 cache
1.0GB DDR2 SDRAM
Intel 852/855GME Graphics Media Accelerator with 64MB
12'' LCD Screen
SoundMax Integrated Audio

Peas Pobie!

Reply 136 of 227, by GreatBarrier86

Posted on 2005-12-02, 02:59

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

Also, I noticed one bug. I have the conf set to open DOSBox in fullscreen, and when it does, the spot where the mouse is on the screen when it opens in fullscreen is corrupted. It looks like it didn't update that spot on the screen.

Reply 137 of 227, by neowolf

Posted on 2005-12-02, 03:46

Rank: Member
Posts: 115
Joined: 2005-08-31, 06:59

I hate to be bothersome, but are there any plans to switch to a name other than Rect? It seems like it'd be a good idea to go ahead and adjust it slightly before it starts to get merged down the line. It's easy to switch around while it's a patch, but I'd hate to have to dig in for it. 😀

"Omne ignotum pro magnifico"

Reply 138 of 227, by DosFreak

Posted on 2005-12-02, 03:49

Rank: l33t++
Posts: 13548
Joined: 2002-06-30, 16:35
Location: Milliways

Kronuz wrote:
I really hadn't checked, but also compare the CPU usage of DOSBox with and without my patch when you're playing a game 😉

I shouldn't be the one to say it, but it's AMAZING!! I can now play Warcraft 2 smoothly, like in the old days. I'm running it at 20,000 cycles and with 0% CPU usage!! Before, I couldn't even begin to play without the sound being choppy and the game being really slow... all that along with the loud noise of my CPU fan at full power and the CPU usage at maximum.

Yeah, your patch really makes those later DOS games playable in DosBox. I'd like to try this out on older systems tho, but I have none handy. Don't really feel like fiddling around with my BIOS either 'cause I'm lazy. I'd like to see the XBOX users try this out and see what their reports are. They are probably the most popular user-base using the same processor speed out there (well, assuming they haven't upgraded the processor anyway). It would be interesting to see if they can play the games that were previously unplayable.

XBOX Compatibility List
http://forums.xbox-scene.com/lofiversion/inde … hp/t313261.html

Looking at their list, it looks like all the DPMI games are the slow ones. Assuming that DosBox at 733mhz is enough for these games with the latest patch, they should at least be able to play these games in 320x modes. For the non-protected-mode games they can probably use scalers now. Weird. Imagine the HQ2X scaler on a POS SD TV! 😉

Reply 139 of 227, by DosFreak

Posted on 2005-12-02, 05:12

Rank: l33t++
Posts: 13548
Joined: 2002-06-30, 16:35
Location: Milliways

Just benchmarked Quake 1 using your latest build

320x200
Demo1
16.8fps

Amazing. You can actually play Quake 1 in DosBox now.

Can't really remember what res I played it at on my DX4/100 tho. I'm thinking 320x, which is why I played Duke a lot, because I could play it in 640x smoothly.

So, on an Athlon XP 2800+:
Quake 1 @ 320x200 = 15fps
Duke3D @ 640x480 = 15fps

I can't remember my benchmark scores back on my DX4, but I highly doubt that I was getting 30fps in either Quake 1 or Duke3D. I'm betting that the above scores are on par with DX4/100 performance.

Anyone have any DX4/100 benchmarks for Quake1/Duke3D handy? I think I might have some tucked away somewhere that I found on the net awhile back so I'll look.

EDIT

Found some benchmarks in the ol' "FastiVid" utility.

Duke Nukem 3D (640x480, fps) 14
Doom Benchmark (fps) 38

The documentation doesn't mention processor speed, but it does mention the "Intel Aurora motherboard" (82450 chipset), which supported 150 to 200mhz.

The above benchmarks match pretty closely to the performance that I'm getting.

NOTES: The FPS counter in Doom 1 doesn't work right, so you need to go by the FPS counter in the title bar.
The Duke Nukem 3D "DNRATE" matches the FPS rate in the title bar.

Hopefully his benchmarks were taken running around the level which is what I'm doing. Haven't bothered to actually record/playback any demos yet.

I remember back when Quake 1 first came out and the box said it required a "Pentium". I laughed, as I always did at "System Requirements" that declared that a game required a "Pentium", when I already had a 486DX4/100 with a VLB Cirrus Logic card. I remember Quake 1 being very very system heavy....very very brown....and very very boring. So after beating it I quickly went back to replaying Duke Nukem 3D and third party maps and Lan matches and didn't think of Quake again until Quake 2......and it was still brown.....so I played Unreal a lot instead. Then Quake 3 came out and it was still brown, but hey, there's some more color! But guess what, UT was more fun for single-player use (which is what I play mostly), so I went back to UT. Then Doom 3 came out and I beat it, but I felt bored throughout that experience and well....UT 2004 F'in sucks for single-player, so I didn't have any UT to turn to except the original UT. 😀
