size optimization. \ VOGONS

First post, by ih8registrations

Posted on 2007-09-23, 19:50

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

I've only gone through a little so far: dma.cpp and mixer.cpp. For instance, dma.o is 821 bytes smaller, which means less cache thrashing. The size drop for each change is noted in the comments. Edit: the changes in mouse aren't size optimizations, just some other changes that slipped into the patch, mostly applying the bound optimization you can see in mixer.cpp.

Reply 1 of 12, by ih8registrations

Posted on 2007-09-24, 14:04

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

More changes; a notable one shaves ~800 bytes off the adlib/OPL handling.

Reply 2 of 12, by wd

Posted on 2007-09-24, 17:46

Rank: DOSBox Author
Posts: 10813
Joined: 2003-12-03, 21:23

For the

 	if (CaptureState & (CAPTURE_WAVE|CAPTURE_VIDEO)) {
+		MIXER_Capture(needed);

change, would it have the same or a similar effect if GCC_UNLIKELY were used?

Reply 3 of 12, by ih8registrations

Posted on 2007-09-24, 19:47

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

Maybe, if it had been predicting the capture path, since GCC_UNLIKELY would then be telling the CPU to speculatively process the end of mixdata instead.

Reply 4 of 12, by ih8registrations

Posted on 2007-09-25, 13:46

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

I would say then, not really; that just covers prefetch. If you're a bit rusty on caching, like me: caches range from direct mapped to fully associative, with n-way associative in between. Memory locations map to particular cache lines. With direct mapped it's one to one; with fully associative, any address can be mapped to any cache line; and with n-way, n addresses are mapped to a particular cache line. With n-way, if I understand correctly, they go by page/high bits in mapping addresses. I've been trying to find a reference that gives a non-pseudo example of address calculation, to no avail. There's also the question of where the function starts within a cache line (not necessarily at the beginning?), and of determining the cache line reached when skipping past the capture handling, in order to calculate how many cache lines the function's common path would occupy.

Reply 5 of 12, by ih8registrations

Posted on 2007-09-25, 16:22

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

The beginning of mixdata is 274 bytes, the end is 276 bytes, the function call is four, passing Bitu needed is four, and cache lines are 64 bytes. If one were to assume the start of the function is the start of a cache line, the last cache line filled for the beginning of mixdata includes not only the function call but also the instructions past it. With the call, the cache line fills would be consecutive, so it would be (274+4+4+276)/64 = 558/64 = 8.72, or 9 cache lines. The inlined capture handling is 125 bytes, which would have the common path skip down past the capture handling and start a new cache line fill: 274/64 = 4.28, or 5 cache lines, plus 276/64 = 4.31, or 5 cache lines; 10 cache lines in total.
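
The arithmetic above can be checked with a small sketch. It uses the same simplifying assumptions as the post (64-byte lines, function starting on a line boundary); the helper name lines_touched and the sketch_ constants are made up for illustration and are not part of DOSBox.

```cpp
#include <cstddef>

// Number of cache lines touched by `bytes` of contiguous code that starts
// `offset` bytes into the code (only offset % line_size matters).
constexpr std::size_t lines_touched(std::size_t bytes,
                                    std::size_t line_size = 64,
                                    std::size_t offset = 0) {
    return bytes == 0 ? 0
                      : (offset % line_size + bytes + line_size - 1) / line_size;
}

// Byte counts quoted in the post.
constexpr std::size_t sketch_begin = 274;   // beginning of mixdata
constexpr std::size_t sketch_end = 276;     // end of mixdata
constexpr std::size_t sketch_call = 4;      // function call
constexpr std::size_t sketch_arg = 4;       // passing Bitu needed
constexpr std::size_t sketch_capture = 125; // inlined capture handling

// Capture handling moved out of line: one consecutive 558-byte run.
constexpr std::size_t sketch_with_call =
    lines_touched(sketch_begin + sketch_call + sketch_arg + sketch_end);

// Capture handling inlined: the common path is two separate runs.
constexpr std::size_t sketch_inlined =
    lines_touched(sketch_begin) + lines_touched(sketch_end);

static_assert(sketch_with_call == 9, "558 bytes span 9 lines");
static_assert(sketch_inlined == 10, "274 and 276 bytes span 5 lines each");
```

Passing the real starting offset of the second run, lines_touched(sketch_end, 64, sketch_begin + sketch_capture), still gives 5 lines under these assumptions, so the total stays at 10.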

Reply 6 of 12, by wd

Posted on 2007-09-25, 16:43

Rank: DOSBox Author
Posts: 10813
Joined: 2003-12-03, 21:23

Cache associativity, cache line size, total cache size and line replacement strategies
differ quite a bit across the various processor types and their cache levels,
so you'd better not assume too much about these parameters.
Maybe Moe wants to say more about that.

Reply 7 of 12, by ih8registrations

Posted on 2007-09-25, 16:55

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

Regardless, it improves the chances of avoiding an extra cache fill. What's the argument for keeping it inlined?

Reply 8 of 12, by wd

Posted on 2007-09-25, 17:11

Rank: DOSBox Author
Posts: 10813
Joined: 2003-12-03, 21:23

None. I was just assuming that gcc would rearrange the code so that
the part after the if() is moved way out of sight, which would be more
effective than a call.

Reply 9 of 12, by ih8registrations

Posted on 2007-09-25, 18:18

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

That's possible.

Reply 10 of 12, by ih8registrations

Posted on 2007-09-26, 13:42

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

+pic

There's a bug in this one.

Last edited by ih8registrations on 2007-09-26, 16:32. Edited 1 time in total.

Reply 11 of 12, by `Moe`

Posted on 2007-09-26, 14:02

Rank: Oldbie
Posts: 1169
Joined: 2004-04-29, 01:06
Location: Oldenburg, Germany

About caches:

Imagine a non-associative 4KB cache: a memory location can't be put just anywhere in the cache, but only at the location that matches its lowest 12 bits. So address 0x0000 and address 0x1000 share the same cache location. If you load a contiguous area, like a piece of linear code, that doesn't matter, since it is, by definition, linear. But if 0x0000 is in the cache and 0x1000 is to be loaded, then 0x0000 must be removed, even if the rest of the cache is unused.

Now imagine a 16KB cache, 4-way associative. You can think of it as 4 non-associative caches, 4KB each. If 0x0000 is loaded and 0x1000 is to be loaded, 0x1000 is put in the second "way". 0x2000 and 0x3000 can also be loaded, and only when 0x4000 is loaded will one of the 4 previous locations be thrown out.
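
The two cases above can be played through with a toy model. This is only a sketch under assumptions that are not stated in the thread: 64-byte lines and least-recently-used replacement (real CPUs differ). SketchCache and its members are invented names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for a slot that holds no line yet (assumed never to be a real tag).
const std::uint64_t kSketchEmptyTag = ~static_cast<std::uint64_t>(0);

// Toy set-associative cache: `ways` direct-mapped caches of `way_bytes` each.
class SketchCache {
public:
    SketchCache(std::size_t way_bytes, std::size_t ways, std::size_t line_bytes = 64)
        : line_bytes_(line_bytes), sets_(way_bytes / line_bytes), ways_(ways),
          tags_(sets_ * ways, kSketchEmptyTag), stamp_(sets_ * ways, 0), clock_(0) {}

    // Returns true on a hit. On a miss the line is loaded into an unused way,
    // or replaces the least recently used way of its set.
    bool access(std::uint64_t addr) {
        const std::uint64_t line = addr / line_bytes_;
        const std::size_t set = static_cast<std::size_t>(line % sets_);
        const std::uint64_t tag = line / sets_;
        std::size_t victim = set * ways_;
        for (std::size_t w = 0; w < ways_; ++w) {
            const std::size_t i = set * ways_ + w;
            if (tags_[i] == tag) {
                stamp_[i] = ++clock_;
                return true;
            }
            if (stamp_[i] < stamp_[victim]) victim = i; // unused slots have stamp 0
        }
        tags_[victim] = tag;
        stamp_[victim] = ++clock_;
        return false;
    }

private:
    std::size_t line_bytes_;
    std::size_t sets_;
    std::size_t ways_;
    std::vector<std::uint64_t> tags_;
    std::vector<std::uint64_t> stamp_;
    std::uint64_t clock_;
};
```

With SketchCache(4096, 1), accessing 0x0000 and then 0x1000 evicts 0x0000 even though the rest of the cache is empty. With SketchCache(4096, 4), the addresses 0x0000, 0x1000, 0x2000 and 0x3000 all stay resident, and only a fifth conflicting address such as 0x4000 pushes one of them out.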

4-way is quite common. If your code+data fits into 64KB (even if split across 2-4 distinct memory blocks), practically every CPU since the Pentium 2 will fit it into its L1 cache. Use a Linux machine and OProfile to get a detailed profile of which locations suffer the most cache misses.

GCC will obey the GCC_UNLIKELY flag and put that code path out of the way, arranging code flow for minimal branching. If you use PGO, you don't even need GCC_UNLIKELY: the profile data will be used to decide which code path is the hottest (that was the very first use of PGO). Use as new a gcc as you can get, as the PGO features are still quite new and new profile-guided optimization steps are added all the time.
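
As a rough illustration of such a hint: the sketch below assumes the macro boils down to GCC's __builtin_expect, which is how hints of this kind are usually written; the actual definition of GCC_UNLIKELY is not shown in this thread. Every SKETCH_/sketch_ name is made up and stands in for the real mixer code.

```cpp
#include <cstdint>

// Assumption: an "unlikely" hint in the spirit of GCC_UNLIKELY.
// Falls back to the plain condition on compilers without __builtin_expect.
#if defined(__GNUC__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

const std::uint32_t SKETCH_CAPTURE_WAVE = 0x01;  // hypothetical flag values
const std::uint32_t SKETCH_CAPTURE_VIDEO = 0x02;

static int sketch_capture_calls = 0;

// Hypothetical stand-in for the rarely used capture handling.
static void sketch_capture(std::uint32_t needed) {
    (void)needed;
    ++sketch_capture_calls;
}

// The hint tells the compiler the capture branch is cold, so it is free to
// lay that branch out away from the fall-through (common) path.
std::uint32_t sketch_mix(std::uint32_t capture_state, std::uint32_t needed) {
    if (SKETCH_UNLIKELY(capture_state & (SKETCH_CAPTURE_WAVE | SKETCH_CAPTURE_VIDEO))) {
        sketch_capture(needed);
    }
    return needed * 2; // placeholder for the common mixing work
}
```

The hint changes only code layout and static prediction, not behavior: the capture branch still runs whenever the flags are set.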

Reply 12 of 12, by ih8registrations

Posted on 2007-09-26, 20:40

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

The bug is an issue with pointing to a function inside a C++ namespace. It pukes on oplwrite = OPL2::YM3812Write / THEOPL3::YMF262Write in adlib.cpp. Shows up with Master of Orion. Oh, how I love C++ and OOP... Anyone know how to get past this retardedness?
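
For reference, pointing at a function inside a namespace is legal C++ as long as the target is a free function whose signature matches the pointer type. The sketch below shows only that general pattern and makes no claim about the actual cause of the bug; the SketchOPL2/SketchOPL3 namespaces, the Write functions and their three-int signature are assumptions made for illustration.

```cpp
// Two hypothetical namespaces, each with a free function of the same signature.
namespace SketchOPL2 {
int Write(int which, int port, int value) { return which + port + value; }
}

namespace SketchOPL3 {
int Write(int which, int port, int value) { return 2 * (which + port + value); }
}

// One pointer type that fits both functions.
typedef int (*SketchWriteFn)(int which, int port, int value);

// Pick the implementation once, then call through the pointer.
SketchWriteFn sketch_select_write(bool use_opl3) {
    return use_opl3 ? &SketchOPL3::Write : &SketchOPL2::Write;
}
```

If the two targets differed in signature or linkage, or one were a non-static member function, the assignment would need a wrapper instead of a plain pointer.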
