Graphics performance boost

Reply 20 of 227, by GreatBarrier86

Posted on 2005-11-25, 04:24

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

I wonder if perchance you could fix some problems in ST JR and ST 25th. Whenever a scaler other than normal2x is on, the graphics become exceptionally cartoony. I'm not sure if anyone aside from me has had this problem, but maybe someone knows what might cause it and how to fix it.

Also, how does this differ from the smartupdate patch from ykhwong? Didn't his only update the parts of the screen that changed?

Reply 21 of 227, by neowolf

Posted on 2005-11-25, 05:25

Rank: Member
Posts: 115
Joined: 2005-08-31, 06:59

This seems to be vastly superior to the smart update patch. The SUP, as I shall call it, is a great idea, but its change detection isn't very good. On most of my games it has problems. This patch, however, now has no problems at all for me and gives quite a noticeable boost!

I'm not sure if this is what you're talking about or not, but the scalers themselves tend to blur and define images, resulting in a somewhat cartoony look at times that's just wrong on some games. This is just a side effect of what they do, though. I believe the advmame scalers are implementations of Scale2X, right?

Reply 22 of 227, by Kronuz

Posted on 2005-11-25, 06:34

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

GreatBarrier86 wrote:
Also, how does this differ from the smartupdate patch from ykhwong? Didn't his only update the parts of the screen that changed?

I checked the smartupdate patch. The problem with it is that it should only work with some games (most likely others won't show anything on the screen at all, if I read the source code correctly). Other than that, the smart update checks whether changes were made to ANY part of the video memory: if changes occurred, it just lets the scalers do their work; otherwise the scalers aren't called at all. The problem I see with this approach is that most (if not all) games DO update something most of the time, so the patch is probably even making some games (those that managed to work) a bit slower. It looks like the patch was designed for text-only applications and very old games; however, it seems it could be modified and used alongside my own patch to improve speed a bit more for those kinds of applications.
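A rough sketch of the whole-frame check described above may help show why it rarely pays off. This is not code from the smartupdate patch or from DOSBox; `FrameChangeDetector` and `frameChanged` are hypothetical names, and keeping a full copy of the previous frame is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical whole-frame change detector; the names are illustrative and
// are not taken from the smartupdate patch or from DOSBox.
struct FrameChangeDetector {
    std::vector<std::uint8_t> previous;  // copy of the last frame that was scaled

    // Returns true when the scalers would have to run for this frame.
    bool frameChanged(const std::uint8_t* vram, std::size_t size) {
        assert(vram != nullptr && size > 0);
        if (previous.size() == size &&
            std::memcmp(previous.data(), vram, size) == 0) {
            return false;  // nothing changed anywhere: skip the scalers
        }
        previous.assign(vram, vram + size);  // remember this frame
        return true;  // one changed byte is enough to rescale everything
    }
};
```

With a check like this, a game that animates even a single pixel per frame pays for the comparison and then for the full rescale anyway, which matches the concern that such games could end up slightly slower.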

neowolf wrote:
This seems to be vastly superior to the smart update patch. The SUP, as I shall call it, is a great idea, but its change detection isn't very good. On most of my games it has problems. This patch, however, now has no problems at all for me and gives quite a noticeable boost!

I'm not sure if this is what you're talking about or not, but the scalers themselves tend to blur and define images, resulting in a somewhat cartoony look at times that's just wrong on some games. This is just a side effect of what they do, though. I believe the advmame scalers are implementations of Scale2X, right?

I have to agree with neowolf on both of his observations. As for the scalers: yes, the AdvMameNx scalers in DOSBox are simply Scale2x and Scale3x. And as for the output looking cartoony, well, that's the whole point of those scalers; however, I agree the look is not always optimal for all games (especially the ones with digitized stuff)... But hey, isn't Judgment Rites supposed to look cartoonish?

Kronuz
"Time is of the essence"

Reply 23 of 227, by GreatBarrier86

Posted on 2005-11-25, 06:44

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

JR is supposed to look cartoonish, but this is strange. The cartoonish look is weird. I tried taking a screenshot of it, but when I open up the PNG, it looks normal.

I wish they would remake those 2 games for XP. That would rock!

IBM ThinkPad X40
1.2Ghz with 2MB L2 cache
1.0GB DDR2 SDRAM
Intel 852/855GME Graphics Media Accelerator with 64MB
12'' LCD Screen
SoundMax Integrated Audio

Peas Pobie!

Reply 24 of 227, by DosFreak

Posted on 2005-11-25, 06:58

Rank: l33t++
Posts: 13536
Joined: 2002-06-30, 16:35
Location: Milliways

Try Alt+Print Screen instead of using the DOSBox screen capture utility.

How To Ask Questions The Smart Way
Make your games work offline

Reply 25 of 227, by Kronuz

Posted on 2005-11-25, 07:00

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

Put it in a window and use the "Print Screen" button on the keyboard, then paste that into Paint and post it somewhere... that's all I can think of 😜

Kronuz
"Time is of the essence"

Reply 26 of 227, by GreatBarrier86

Posted on 2005-11-25, 07:15

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

These are examples of each. Can you tell the difference?

It looks to me like advmame3x antialiases the picture. Neo, could that be explained by the blurring you mentioned?

Reply 27 of 227, by neowolf

Posted on 2005-11-25, 07:28

Rank: Member
Posts: 115
Joined: 2005-08-31, 06:59

Yep. That's exactly what the filter is supposed to be doing. Basically it's doing guesswork on how the image should look scaled up; that's what all scaler filters do. Scale2x tends to give a softer image. You might wanna try HQ2X. It's a lot more CPU intensive (but with this patch not nearly as much!), but it tends to stay a lot sharper than Scale2x or (my typical favorite in other emulators) 2xSaI. Check out the links for examples of both in action. You can see that Scale2x is doing just what you're concerned about. If you don't like it, give HQ2X a try or stick to normal.

http://scale2x.sourceforge.net/
http://www.hiend3d.com/hq2x.html

Reply 28 of 227, by GreatBarrier86

Posted on 2005-11-25, 07:36

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

Wait a sec. I just thought of something. I think I figured out why it doesn't look good. Could it be because when the devs originally created these games, they didn't draw the models to be rendered by such a strong filter, so they didn't add an extraordinary amount of detail to the models?

Thoughts?

I'll try hq2x! thanks

Reply 29 of 227, by GreatBarrier86

Posted on 2005-11-25, 07:39

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

OK. Somebody put that patch into a CVS build. I can barely even run hq2x. I'm running a 1.2 GHz IBM thin 'n light, so it's not the fastest. But I will say that it did look great.

Also, what do those numbers mean and why are the "old" numbers higher than the "new" numbers?

Reply 30 of 227, by neowolf

Posted on 2005-11-25, 07:57

Rank: Member
Posts: 115
Joined: 2005-08-31, 06:59

Once it's in a build it should be a lot more usable. I couldn't DREAM of using HQ2X without that patch.

(1.42GHz Mac mini)

Reply 31 of 227, by GreatBarrier86

Posted on 2005-11-25, 07:59

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

What makes hq2x so CPU intensive?

Also, what kind of comp does it require? Dual 3.6 GHz Pentium Ds with HT? (If they even make that)

Reply 32 of 227, by DosFreak

Posted on 2005-11-25, 08:04

Rank: l33t++
Posts: 13536
Joined: 2002-06-30, 16:35
Location: Milliways

Well, OpenGLHQ depends more on graphics card power than on processing power.

Hq2x depends on CPU processing power.

Both offer pretty much the same graphics quality.

EDITED due to my confused self as Moe noted below

With my Athlon XP 2800+ and X800 AGP, OpenGLHQ uses a bit more CPU than the other output methods, but not enough to make much of a difference.

Last edited by DosFreak on 2005-11-29, 06:17. Edited 1 time in total.

Reply 33 of 227, by GreatBarrier86

Posted on 2005-11-25, 08:07

Rank: Newbie
Posts: 78
Joined: 2005-03-10, 19:02
Location: Atlanta, GA

Well dang. I've got an 852/855 Intel POS graphics "accelerator." Hopefully, once I can use this new patch, I'll be able to use hq2x.

Reply 34 of 227, by Qbix

Posted on 2005-11-25, 08:30

Rank: DOSBox Author
Posts: 11323
Joined: 2002-11-27, 14:50
Location: Fryslan

btw
if I'm ever going to "accept" the patch I want the original renderers to be selectable with a define at compile time.
This is for the systems with little memory like the Pocket PC ports of DOSBox.

Further, I wondered why you changed the prototype to use a void* instead of a Bit8u*.

Water flows down the stream
How to ask questions the smart way!

Reply 35 of 227, by Kronuz

Posted on 2005-11-25, 08:58

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

Qbix wrote:
btw
if I'm ever going to "accept" the patch I want the original renderers to be selectable with a define at compile time.
This is for the systems with little memory like the Pocket PC ports of DOSBox.

That's something easy to do 😀 That's why I didn't delete the old scalers; I just commented them out. Could you please suggest a good name for the define, or shall I just use my judgement?

Qbix wrote:

Further, I wondered why you changed the prototype to use a void* instead of a Bit8u*.

Oh, I didn't change it, the vesa16 patch did (I just wanted to make sure I was programming something that would work with 16 bits or 32 bits too). When I'm done with the patch I'll strip away all those extra "features" and I'll just leave the optimized scalers (including my Normal3x and TVHq2x) and that define you mentioned, to deactivate the optimizations (which indeed take about 2 MB of memory).
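For illustration, a compile-time switch of the kind requested could look roughly like this. The macro name `RENDER_ORIGINAL_SCALERS` is hypothetical (no name is settled in the thread), and sizing the cache as one full source frame is an assumption made only for the sketch; the post itself only says the optimizations take about 2 MB.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro name; nothing in the thread fixes the real one.
// Building with -DRENDER_ORIGINAL_SCALERS selects the low-memory path.
#if defined(RENDER_ORIGINAL_SCALERS)
constexpr bool kOptimizedScalers = false;
#else
constexpr bool kOptimizedScalers = true;
#endif

// Bytes reserved for the scaler's software cache. Sizing it as one full
// source frame is an assumption made for this sketch only.
std::size_t scalerCacheBytes(std::size_t width, std::size_t height,
                             std::size_t bytesPerPixel) {
    assert(bytesPerPixel == 1 || bytesPerPixel == 2 || bytesPerPixel == 4);
    return kOptimizedScalers ? width * height * bytesPerPixel : 0;
}
```

A port for a memory-constrained device would define the macro in its build flags, so the extra buffer is never allocated and the original scalers are used unchanged.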

Kronuz
"Time is of the essence"

Reply 36 of 227, by `Moe`

Posted on 2005-11-25, 17:51

Rank: Oldbie
Posts: 1169
Joined: 2004-04-29, 01:06
Location: Oldenburg, Germany

Kronuz wrote:
`Moe` wrote:
Optimizing the render loop for Hq2x makes quite a big difference.

Moe, could you elaborate a bit more on this? how exactly can the Hq2x be optimized if it's in its own loop?

My last hq2x modifications changed the render procedure: it used to be called once per frame, thus I could use clever loop arrangement to get noticeably more speed because I was exploiting modern CPUs' implicit prefetch. Moreover, hq2x is heavy on the CPU cache (easily uses all the available L1 cache), so if it renders one frame at a time, the cache is "warm" most of the time and works much better than now. Currently, hq2x is called once per scan line and must use temporary buffers to work around that, plus the external render code uses some of the available L1 cache as well, plus when a frame is rendered in 4 parts, the L1 cache is trashed 4 times.

All this is not of much importance to other scalers, but due to the higher complexity of hq2x, it matters much more here. All this was verified using "oprofile" on Linux (an awesome profiler which uses the performance counters and can tell where the CPU has L1 cache misses and many other sources of slowdown, not just "number of CPU cycles spent").
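The difference between the two calling conventions can be sketched with a toy filter. This is not DOSBox or hq2x code: `combine`, `filterFrame` and `LineFilter` are hypothetical names, and the "maximum of three vertically adjacent pixels" kernel merely stands in for a real scaler that needs the rows above and below.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pixel = std::uint8_t;

// Toy neighbourhood filter standing in for a real scaler kernel.
inline Pixel combine(Pixel up, Pixel mid, Pixel down) {
    return std::max({up, mid, down});
}

// Frame at a time: rows above and below are read straight from the source.
std::vector<Pixel> filterFrame(const std::vector<Pixel>& src,
                               std::size_t w, std::size_t h) {
    assert(w > 0 && h > 0 && src.size() == w * h);
    std::vector<Pixel> dst(w * h);
    for (std::size_t y = 0; y < h; ++y) {
        const Pixel* up = &src[(y > 0 ? y - 1 : y) * w];       // clamp at top
        const Pixel* mid = &src[y * w];
        const Pixel* down = &src[(y + 1 < h ? y + 1 : y) * w]; // clamp at bottom
        for (std::size_t x = 0; x < w; ++x)
            dst[y * w + x] = combine(up[x], mid[x], down[x]);
    }
    return dst;
}

// Line at a time: each call sees one scan line only, so earlier lines must be
// copied into temporary buffers, and row y-1 is emitted only once row y arrives.
class LineFilter {
public:
    explicit LineFilter(std::size_t w) : width_(w) { assert(w > 0); }

    void pushLine(const Pixel* line, std::vector<Pixel>& out) {
        if (count_ > 0) emit(line, out);      // finish the previous row
        older_.swap(newest_);                 // previous row becomes "older"
        newest_.assign(line, line + width_);  // copy the incoming row
        ++count_;
    }
    void finish(std::vector<Pixel>& out) {    // flush the last row
        if (count_ > 0) emit(newest_.data(), out);
    }

private:
    void emit(const Pixel* down, std::vector<Pixel>& out) {
        const std::vector<Pixel>& up = (count_ > 1) ? older_ : newest_;
        for (std::size_t x = 0; x < width_; ++x)
            out.push_back(combine(up[x], newest_[x], down[x]));
    }
    std::size_t width_;
    std::size_t count_ = 0;  // number of lines pushed so far
    std::vector<Pixel> older_, newest_;
};
```

Both variants produce the same pixels, but the line-based one pays for an extra copy of every scan line and for per-call bookkeeping, which is the kind of overhead the post attributes to the once-per-line interface.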

Reply 37 of 227, by `Moe`

Posted on 2005-11-25, 18:16

Rank: Oldbie
Posts: 1169
Joined: 2004-04-29, 01:06
Location: Oldenburg, Germany

GreatBarrier86 wrote:
What makes hq2x so CPU intensive?

Also, what kind of comp does it require? Dual 3.6 GHz Pentium Ds with HT? (If they even make that)

Older patches worked on my old 300MHz laptop with a fair amount of frameskip, however. Newer releases got (a lot) slower, as already explained. My 1.4 Athlon (non-XP, non-64) runs hq2x quite nicely.

To understand why it is CPU intensive, I'll compare hq2x to advmame2x, which is the closest in spirit: both look at surrounding pixels to find out if there's a line anywhere. If they find out that there is indeed a straight line, they draw that line "sharp". Advmame2x does this via code like "if (pixel1 == pixel2) ...", so it only catches exactly equal color pixels/lines. Moreover, "if" queries are relatively slow on today's CPUs (think "branch prediction failure").
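For reference, the exact-equality rule described above can be written out as a small Scale2x routine, following the algorithm as published on the Scale2x site. This is a sketch rather than DOSBox's implementation: the function name `scale2x`, the 8-bit palette-index pixel type and the clamping at the image border are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pixel = std::uint8_t;  // palette index; equality is an exact match

// Scale a w x h image to 2w x 2h with the Scale2x rule.
std::vector<Pixel> scale2x(const std::vector<Pixel>& src,
                           std::size_t w, std::size_t h) {
    assert(w > 0 && h > 0 && src.size() == w * h);
    std::vector<Pixel> dst(4 * w * h);
    auto at = [&](std::size_t x, std::size_t y) { return src[y * w + x]; };
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            // Centre pixel E and its four neighbours, clamped at the border.
            const Pixel E = at(x, y);
            const Pixel B = at(x, y > 0 ? y - 1 : y);      // above
            const Pixel H = at(x, y + 1 < h ? y + 1 : y);  // below
            const Pixel D = at(x > 0 ? x - 1 : x, y);      // left
            const Pixel F = at(x + 1 < w ? x + 1 : x, y);  // right
            Pixel e0 = E, e1 = E, e2 = E, e3 = E;
            if (B != H && D != F) {      // only where an edge may pass through
                if (D == B) e0 = D;      // top-left
                if (B == F) e1 = F;      // top-right
                if (D == H) e2 = D;      // bottom-left
                if (H == F) e3 = F;      // bottom-right
            }
            Pixel* out = &dst[(2 * y) * (2 * w) + 2 * x];
            out[0] = e0;
            out[1] = e1;
            out[2 * w] = e2;
            out[2 * w + 1] = e3;
        }
    }
    return dst;
}
```

Every decision here is an exact comparison followed by a branch, which is why the scaler only reacts to pixels of identical color and why its cost is dominated by branch behaviour rather than arithmetic.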

Hq2x does a similar thing, but there are three differences in the general principle and one difference in how I implemented it: 1) hq2x uses slightly different decisions when checking if there's a straight line somewhere (this is the reason for the rounded-vs-cornered look in advmame vs. hq2x); 2) it interpolates (a kind of antialiasing, in fact); 3) it doesn't check pixels for equality, but uses a fuzzy threshold, which takes some time to calculate; 4) it doesn't use "if" queries but a big lookup table that fits typical L1 caches _exactly_. (1) is irrelevant, (2) and (3) burn a lot of CPU cycles, (4) is the reason for the cache problems mentioned.
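Points (3) and (4) can be sketched as follows. This is an illustration under stated assumptions, not the hq2x source: the integer luma/chroma conversion, the threshold values and the names `toYuv`, `differs` and `patternIndex` are all chosen for the example. The idea shown is that each of the eight neighbours contributes one bit saying "differs noticeably from the centre", and the resulting value from 0 to 255 would select an entry in a precomputed table of blend rules.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

struct Rgb { int r, g, b; };  // each channel in 0..255
struct Yuv { int y, u, v; };

// Illustrative thresholds (assumed values, not taken from any source tree).
constexpr int kThresholdY = 48;
constexpr int kThresholdU = 7;
constexpr int kThresholdV = 6;

// Cheap integer approximation of a luma/chroma split.
inline Yuv toYuv(Rgb c) {
    return {(c.r + c.g + c.b) / 4,
            128 + (c.r - c.b) / 4,
            128 + (2 * c.g - c.r - c.b) / 8};
}

// Fuzzy comparison: small differences count as "same colour".
inline bool differs(Rgb a, Rgb b) {
    const Yuv p = toYuv(a);
    const Yuv q = toYuv(b);
    return std::abs(p.y - q.y) > kThresholdY ||
           std::abs(p.u - q.u) > kThresholdU ||
           std::abs(p.v - q.v) > kThresholdV;
}

// One bit per neighbour; the result (0..255) is meant to index a table of
// blend rules instead of walking a long chain of "if" tests.
std::uint8_t patternIndex(Rgb centre, const Rgb (&neighbours)[8]) {
    assert(centre.r >= 0 && centre.r <= 255);
    unsigned pattern = 0;
    for (int i = 0; i < 8; ++i)
        pattern |= static_cast<unsigned>(differs(centre, neighbours[i])) << i;
    return static_cast<std::uint8_t>(pattern);
}
```

Compared with an exact equality test, every neighbour costs a colour-space conversion and three threshold checks, and the table lookup trades branches for memory traffic, which is where the L1 cache pressure comes from.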

DosFreak, Hq2x does not use the video card at all. You are confusing this with OpenGL-HQ, which should look exactly the same, but runs on a (directx9-ish) video card.

Reply 38 of 227, by Kronuz

Posted on 2005-11-25, 18:23

Kronuz Offline

Rank Member

Rank: Member
Posts: 103
Joined: 2005-11-23, 19:27

`Moe` wrote:
Kronuz wrote:
`Moe` wrote:
Optimizing the render loop for Hq2x makes quite a big difference.

Moe, could you elaborate a bit more on this? how exactly can the Hq2x be optimized if it's in its own loop?

My last hq2x modifications changed the render procedure: it used to be called once per frame, thus I could use clever loop arrangement to get noticeably more speed because I was exploiting modern CPUs' implicit prefetch. Moreover, hq2x is heavy on the CPU cache (easily uses all the available L1 cache), so if it renders one frame at a time, the cache is "warm" most of the time and works much better than now. Currently, hq2x is called once per scan line and must use temporary buffers to work around that, plus the external render code uses some of the available L1 cache as well, plus when a frame is rendered in 4 parts, the L1 cache is trashed 4 times.

All this is not of much importance to other scalers, but due to the higher complexity of hq2x, it matters much more here. All this was verified using "oprofile" on Linux (an awesome profiler which uses the performance counters and can tell where the CPU has L1 cache misses and many other sources of slowdown, not just "number of CPU cycles spent").

It's interesting to know that DOSBox used to call the scalers once per frame instead of once per line. I suppose not much had to change to alter that behavior (only moving a loop from one place to another).
I don't have much experience with L1 cache and I'm not sure how it works, so right now I can only hope I'm writing L1-cache-friendly code... (it would be nice to learn about it, but I haven't found a good article explaining it)

At any rate, the time spent in the scalers dropped dramatically for the hq2x scaler as soon as I implemented the changes now in my patch. If only the loop and the use of the "cache of previous lines" (not the hardware L1 cache, but the software cache of the last couple of lines) were optimized before in your hq2x, then I suppose I have already optimized it back as much as is now possible with my current patch. I do think it could be improved some more if scalers were to draw full frames instead of lines, avoiding a couple of needless comparisons in every single line and some expensive calls.

Kronuz
"Time is of the essence"

Reply 39 of 227, by `Moe`

Posted on 2005-11-25, 19:31

Rank: Oldbie
Posts: 1169
Joined: 2004-04-29, 01:06
Location: Oldenburg, Germany

L1 cache (or L2 cache, for that matter) is quite straightforward:

Imagine you load a value from memory into a CPU register. 3 cases can happen:

1) The value is not in any cache. The CPU needs to access the system bus/RAM chips and wait for the data. Worth about 40(!) CPU clock cycles on current CPUs.
2) The value is in the L2 cache. No bus access necessary, but still ~15 cycles.
3) The value is in the L1 cache. 3-5 cycles.

In each case, L1 and L2 cache then contain the loaded value. Some other (older) value will be thrown out, if necessary.
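To put rough numbers on the three cases, here is a small cost model using the cycle counts quoted above. It is only a back-of-the-envelope sketch: the function name `averageLoadCycles` is made up, 4 cycles is assumed as the midpoint of the "3-5" range, and real CPUs overlap loads in ways this ignores.

```cpp
#include <cassert>
#include <cmath>

// Cycle counts as quoted in the post; 4.0 is an assumed midpoint of "3-5".
constexpr double kL1Cycles = 4.0;
constexpr double kL2Cycles = 15.0;
constexpr double kRamCycles = 40.0;

// Average cost of one load, given the fraction of loads served by each level.
double averageLoadCycles(double l1Fraction, double l2Fraction,
                         double ramFraction) {
    assert(std::fabs(l1Fraction + l2Fraction + ramFraction - 1.0) < 1e-9);
    return l1Fraction * kL1Cycles + l2Fraction * kL2Cycles +
           ramFraction * kRamCycles;
}
```

With 90% of loads hitting L1, 8% hitting L2 and 2% going to RAM, the average is 3.6 + 1.2 + 0.8 = 5.6 cycles per load; if half of the loads go to RAM and half hit L1, it rises to 22 cycles.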

You can see, memory-intensive code is highly dependent on cache. The CPU uses automatic prefetch to load 32 or even 64 bytes at a time, so when you access a byte, chances are high that nearby bytes are already in the L1 cache.

A clever loop arrangement tries to access bytes that are near each other, and arranges calculation and loading so that while the CPU is waiting the above 40 cycles, it has some number-crunching on already loaded values to do during that time. HT technology is basically an attempt at doing exactly this: while one thread may access memory, another one can do some math. That's why well-optimized apps do not profit from HT; they already see to that themselves.
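The "bytes that are near each other" part can be shown with two loops over the same image. This is a generic sketch with hypothetical function names, not code from DOSBox; both functions compute the same sum and differ only in the order in which memory is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Walks memory in address order: consecutive accesses fall into the same
// cache line, so most loads are served from L1.
std::uint64_t sumRowMajor(const std::vector<std::uint8_t>& img,
                          std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    std::uint64_t sum = 0;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            sum += img[y * w + x];
    return sum;
}

// Same result, but each access jumps w bytes ahead, so on a large image
// nearly every load touches a different cache line.
std::uint64_t sumColumnMajor(const std::vector<std::uint8_t>& img,
                             std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    std::uint64_t sum = 0;
    for (std::size_t x = 0; x < w; ++x)
        for (std::size_t y = 0; y < h; ++y)
            sum += img[y * w + x];
    return sum;
}
```

On a frame much larger than the L1 cache, the row-major version is typically the faster of the two, even though both execute the same number of additions.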
