Reply 40 of 227, by Kronuz
Moe wrote:
L1 cache (or L2 cache, for that matter) is quite straightforward:
Imagine you load a value from memory into a CPU register. 3 cases can happen:
1) The value is not in any cache. The CPU needs to access the system bus/RAM chips and wait for the data. Worth about 40(!) CPU clock cycles on current CPUs.
2) The value is in the L2 cache. No bus access necessary, but still ~15 cycles.
3) The value is in the L1 cache. 3-5 cycles.
In each case, the L1 and L2 caches then contain the loaded value. Some other (older) value will be thrown out, if necessary.
You can see that memory-intensive code is highly dependent on the cache. The CPU uses automatic prefetch to load 32 or even 64 bytes at a time, so when you access a byte, chances are high that nearby bytes are already in the L1 cache.
A clever loop arrangement tries to access bytes that are near each other, and arranges calculation and loading so that while the CPU is waiting out those 40 cycles, it has some number-crunching to do on already-loaded values. HT technology is basically an attempt at doing exactly this: while one thread is accessing memory, another one can do some math. That's why well-optimized apps do not profit from HT; they already see to that themselves.
Thanks Moe 😀
It's funny, because that's what I thought L1 and L2 cache were, but I didn't know it for sure, and I didn't know it made such a big difference (40 cycles, that's a lot)... that also goes for code, right? I.e. code is loaded into the cache as well, I guess.
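Just to check that I've understood the locality part, here is a rough sketch of what I take "access bytes that are near each other" to mean (the buffer, its size and the stride are made up purely for illustration, they're not from my patch):

```c
#include <stddef.h>

#define BUF_SIZE (1024 * 1024)   /* made-up 1 MB buffer */
static unsigned char buf[BUF_SIZE];

/* Sequential pass: each 64-byte cache line is fetched once, and the
   following accesses to that same line are L1 hits. */
unsigned sum_sequential(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < BUF_SIZE; i++)
        sum += buf[i];
    return sum;
}

/* Strided pass: the same bytes, but visited 4 KB apart, so nearly
   every access lands on a cache line that hasn't been loaded yet. */
unsigned sum_strided(void)
{
    unsigned sum = 0;
    for (size_t start = 0; start < 4096; start++)
        for (size_t i = start; i < BUF_SIZE; i += 4096)
            sum += buf[i];
    return sum;
}
```

Both loops read exactly the same bytes, but the strided one should be much slower because it throws away the prefetched neighbours every time, if I'm following you correctly.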
Also, I suppose that when the CPU caches memory (the nearby bytes, as you said), it caches the following bytes, not the previous ones. That would explain a performance degradation I saw while profiling my patch when, instead of going forward in the line looking for differences, I was going backwards; for instance I was trying 'mem--; for(int i=val; i>0; i--) mem;' because it logically should be faster than 'for(int i=0; i<val; i++) mem;', but it isn't so. I suppose this is due to the fact that when the CPU loads 'mem', it probably loads roughly 'mem' through 'mem + 64' into the cache, but it doesn't know anything about the bytes that come before that; hence on every iteration it has to access the bus, as you said, to load the byte before the current one... otherwise I can't explain the performance degradation of that code (rough sketch of the two scans below). What do you think?
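For reference, this is roughly the shape of the two scans I'm comparing; the function names and the separate old/new line buffers are invented here just to make it self-contained, they're not the actual patch code:

```c
/* Forward scan: find the first byte where the old and new line differ.
   Walks upwards, the same direction the cache line gets filled/prefetched. */
static int first_diff(const unsigned char *oldline,
                      const unsigned char *newline, int len)
{
    for (int i = 0; i < len; i++)
        if (oldline[i] != newline[i])
            return i;
    return -1;   /* lines are identical */
}

/* Backward scan: find the last byte where the lines differ.
   Walks downwards, against the direction the prefetch (presumably) runs. */
static int last_diff(const unsigned char *oldline,
                     const unsigned char *newline, int len)
{
    for (int i = len - 1; i >= 0; i--)
        if (oldline[i] != newline[i])
            return i;
    return -1;   /* lines are identical */
}
```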
Kronuz
"Time is of the essence"