VOGONS


unaligned handlers (all of them)


First post, by ih8registrations

User metadata
Rank Oldbie

The idea of those handlers is to avoid the penalty of odd memory accesses, correct? If you consider what the penalties are and look at what the handlers do, it's apparent they're counterproductive. The penalty for an odd-addressed word access is an additional access. The penalty for a dword access with an address not divisible by four is also an additional access. If you take a look at the handlers, for example:

if ( addr & 1 ) {
	*val =
		(readHandler<Bit8u>( addr+0 ) << 0 ) |
		(readHandler<Bit8u>( addr+1 ) << 8 );
} else {
	*val = host_readw( &vga.mem.linear[((addr&~3)<<2)+(addr&3)] );
}

It still does two memory accesses, but also adds a function call, a move, an addition, a shift, an or, an and, and a jump.

The difference is even bigger with the dword handling:

if ( addr & 3 ) {
	*val =
		(readHandler<Bit8u>( addr+0 ) << 0  ) |
		(readHandler<Bit8u>( addr+1 ) << 8  ) |
		(readHandler<Bit8u>( addr+2 ) << 16 ) |
		(readHandler<Bit8u>( addr+3 ) << 24 );
} else {
	*val = host_readd( &vga.mem.linear[((addr&~3)<<2)+(addr&3)] );
}

The odd dword addressing costs one extra access, versus three additional accesses, function calls, additions, moves, shifts, and ors, plus one additional and, compare, and jump.

Reply 1 of 23, by vasyl

User metadata
Rank Oldbie

There is a second penalty for unaligned memory accesses on some platforms -- a crash. For that we have C_UNALIGNED_MEMORY. The code in question is agnostic and portable, but you are right -- it is not very efficient, at least on targets that allow unaligned memory accesses. I don't expect the compiler to be able to inline read/writeHandlers. So, the best solution would be to make this code respect C_UNALIGNED_MEMORY and use the simpler one-line form if it is set to 1. It may not give as great an optimization as you might expect: most games tend to use aligned accesses anyway.
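
For reference, that define already gates the plain host accessors along these lines (typed from memory, so take it as a sketch of mem.h rather than a verbatim quote):

#if defined(WORDS_BIGENDIAN) || !defined(C_UNALIGNED_MEMORY)
// portable path: build the value byte by byte, works at any alignment
static INLINE Bit16u host_readw(HostPt off) {
	return off[0] | (off[1] << 8);
}
#else
// targets that tolerate unaligned little-endian loads: one plain read
static INLINE Bit16u host_readw(HostPt off) {
	return *(Bit16u *)off;
}
#endif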

Reply 2 of 23, by wd

User metadata
Rank DOSBox Author

Um, the dynamic core has its own handlers, and for the rest you're better
off changing the conditions to (address & 0xfff)<0xffd for architectures
that allow fast little-endian unaligned access.
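
For the dword case that would look roughly like this -- an untested sketch that mirrors the existing word version in paging.h, with the dword-handler names assumed to follow the same naming:

// 0xffd is the first page offset at which a 4-byte access spills into the
// next page (offsets 0xffd..0xfff), so only those take the page-crossing
// fallback; everything else, aligned or not, uses the direct host read.
INLINE bool mem_readd_checked_x86(PhysPt address, Bit32u * val) {
	if ((address & 0xfff) < 0xffd) {
		Bitu index = (address >> 12);
		if (paging.tlb.read[index]) {
			*val = host_readd(paging.tlb.read[index] + address);
			return false;
		} else {
			Bitu uval;
			bool retval = paging.tlb.handler[index]->readd_checked(address, &uval);
			*val = (Bit32u)uval;
			return retval;
		}
	} else return mem_unalignedreadd_checked_x86(address, val);
}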

Reply 3 of 23, by ih8registrations

User metadata
Rank Oldbie

vasyl: yes, I'm aware of the preprocessor directives and am already using them in my mem copy/r/w changes (big-endian handlers devolve to byte accesses, so optimizations for them would just cause more overhead). I didn't give an expectation. I also noted the tendency toward aligned accesses to wd in another thread, which is another reason there's no benefit to checking for the condition. The overhead isn't all done inside the unaligned handlers themselves either: the setup (loading the compare value, the cmp, and the jmp) is done in the regular memory handlers (mem_(read/write)(w/d)_inline).

"Um the dynamic core has its own handlers" Which disproves my assertion how, wd?

Reply 4 of 23, by wd

User metadata
Rank DOSBox Author

Which disproves my assertion how, wd?

Not at all, but I don't see any reason to bother with functions that are
used 0.01% of the time (except for wolf3d of course, which nevertheless
runs at 5000 cycles) compared to other memory access functions.

I suppose that with the current checks for unaligned/big-endian memory access,
all functions can use the <0xfff/0xffd style of page-boundary checking
(see my first reply).

Reply 5 of 23, by ih8registrations

User metadata
Rank Oldbie

As I mentioned, the overhead doesn't just reside in the unaligned handlers themselves. Here's the gprof output for dawn patrol, which doesn't run at full speed on my system.

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
17.96 18.80 18.80 23740620 0.00 0.00 CPU_Core_Dyn_X86_Run()
16.72 36.30 17.50 mem_readw_checked_x86(unsigned int, unsigned short*)
13.28 50.20 13.90 mem_writew_checked_x86(unsigned int, unsigned short)
10.52 61.21 11.01 383891 0.00 0.00 THEOPL3::YMF262UpdateOne(int, short*, int)
7.12 68.66 7.45 10083479 0.00 0.00 RENDER_StartLineHandler(void const*)
4.78 73.66 5.00 417184353 0.00 0.00 MakeCodePage(unsigned int, CodePageHandler*&)
3.65 77.48 3.82 37058324 0.00 0.00 PAGING_LinkPage(unsigned int, unsigned int)
1.78 79.34 1.86 7673278 0.00 0.00 PAGING_ClearTLB()
1.43 80.84 1.50 792893 0.00 0.00 Normal1x_8_32_R(void const*)
1.39 82.29 1.45 152567172 0.00 0.00 mem_readw(unsigned int)
1.15 83.49 1.20 30508656 0.00 0.00 PAGING_MapPage(unsigned int, unsigned int)
0.99 84.53 1.04 CPU_SetSegGeneral(SegNames, unsigned int)
0.98 85.56 1.03 7592004 0.00 0.00 MEM_NextHandleAt(int, unsigned int)
0.93 86.53 0.97 __moddi3
0.90 87.47 0.94 24508508 0.00 0.00 PIC_RunQueue()
0.90 88.41 0.94 20840779 0.00 0.00 vga_read_p3da(unsigned int, unsigned int)
0.71 89.15 0.74 __divdi3
0.69 89.87 0.72 32251920 0.00 0.00 CodePageHandler::writew_checked(unsigned int, unsigned int)
0.64 90.54 0.67 31989488 0.00 0.00 mem_readb(unsigned int)
0.64 91.21 0.67 dyn_helper_divw(unsigned short)
0.64 91.88 0.67 dyn_helper_idivw(short)
0.59 92.50 0.62 CPU_RET(bool, unsigned int, unsigned int)
0.54 93.06 0.56 CPU_CALL(bool, unsigned int, unsigned int, unsigned int)
0.51 93.59 0.53 mem_writed_checked_x86(unsigned int, unsigned int)
0.49 94.10 0.51 55467854 0.00 0.00 mem_writew(unsigned int, unsigned short)
0.49 94.61 0.51 mem_writeb_checked_x86(unsigned int, unsigned char)
0.46 95.09 0.48 7518593 0.00 0.00 EMM_MapSegment(unsigned int, unsigned short, unsigned short)
0.44 95.55 0.46 21190280 0.00 0.00 CodePageHandler::GetHostReadPt(unsigned int)
0.41 95.98 0.43 5710069 0.00 0.00 CPU_Core_Normal_Run()
0.40 96.40 0.42 16353107 0.00 0.00 INT16_Handler()
0.40 96.82 0.42 mem_readd_checked_x86(unsigned int, unsigned int*)

31% of the total consists of mem_xxxxx_checked_x86 calls.

[9]     17.8   17.50    1.15                 mem_readw_checked_x86(unsigned int, unsigned short*) [9]
0.16 0.99 6367002/6367002 InitPageHandler::readw_checked(unsigned int, unsigned int*) [39]
0.00 0.00 6316/6316 mem_unalignedreadw_checked_x86(unsigned int, unsigned short*)

Almost all of the time is spent in the functions themselves, not in functions they call, and those are inlined. The most-used checked functions are called often enough to exhaust the gprof call counter, which means minor improvements there should yield bigger gains overall.

INLINE bool mem_readw_checked_x86(PhysPt address, Bit16u * val) {
	if ((address & 0xfff)<0xfff) {
		Bitu index=(address>>12);
		if (paging.tlb.read[index]) {
			*val=host_readw(paging.tlb.read[index]+address);
			return false;
		} else {
			Bitu uval;
			bool retval;
			retval=paging.tlb.handler[index]->readw_checked(address, &uval);
			*val=(Bit16u)uval;
			return retval;
		}
	} else return mem_unalignedreadw_checked_x86(address, val);
}

As you can see, there's unaligned-handling overhead in these heavily used routines.

Reply 7 of 23, by `Moe`

User metadata
Rank Oldbie

Slightly off-topic: if you're doing serious profiling, you should definitely take a look at oprofile, a kernel-based profiler for Linux, instead of gprof. With it, you can profile regular, unmodified optimized builds. Combined with gcc-4, which has great support for debugging information even in highly optimized builds, it is the most accurate way of obtaining a profile.

Back when I tried to optimize DOSBox for my old 333MHz laptop so that I could play X-Com with hq2x, it proved invaluable and showed me problems (like excessive cache misses) that gprof would never have indicated.
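
If you want to give it a try, a session looks roughly like this (commands from memory, so double-check the flags against the oprofile docs):

# point oprofile at a vmlinux image, or tell it you don't have one
opcontrol --no-vmlinux
opcontrol --start
./dosbox                        # run the workload you want profiled
opcontrol --stop
opreport -l ./dosbox | less     # per-symbol breakdown for the binary
opannotate --source ./dosbox    # annotated source, needs -g in the build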

Reply 8 of 23, by wd

User metadata
Rank DOSBox Author

Most profilers fail to correctly account for the time spent in code generated
by the dynamic core anyway; can oprofile do that? The only one I found useful
so far was the AMD analyzer (I suppose the Intel one would do it as well).

Reply 11 of 23, by `Moe`

User metadata
Rank Oldbie

wd: oprofile can generate annotated assembler listings. Without some sort of debugging symbols, custom generated code can't be resolved to source code, however. GCC 4 has become very good at tracing back optimized and inlined code to the right source location, but dynamically generated code probably must be processed differently (pattern matching?).

If you're just after the dynamic code's runtime in general, then I think oprofile has the right tools to account for it. After all, you can use all of your CPU's performance counters (even multiple in parallel) on the binary.

Reply 12 of 23, by wd

User metadata
Rank DOSBox Author

Nevertheless, the stats ih8regs posted are valid; they show that the word
access functions suck up a good part of the CPU time (even if the percentage
is lower due to the unaccounted time spent in generated code).

Reply 13 of 23, by ih8registrations

User metadata
Rank Oldbie

OK, I see it: the address getting passed in the dyn_* functions. Valgrind might be able to track the dyn code, as it's supposed to be able to handle most self-modifying code, or is it one of those you tried that failed to account for the dyn time?

Reply 14 of 23, by ih8registrations

User metadata
Rank Oldbie

One question is how much of the mem_xxxxx_checked_x86 functions' time is the unaligned overhead. To find out, here's gprof's line-by-line (line instead of function) profiling:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
15.81 6.74 6.74 mem_readw_checked_x86(unsigned int, unsigned short*)
5.02 8.88 2.14 Z23RENDER_StartLineHandlerPKv (render.cpp:105 @ 4ca702)
4.48 10.79 1.91 mem_writew_checked_x86(unsigned int, unsigned short) (paging.h:321 @ 5592e0)
4.22 12.59 1.80 Z20CPU_Core_Dyn_X86_Runv (cache.h:272 @ 423271)
4.10 14.34 1.75 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:141 @ 4232a4)
2.13 15.25 0.91 Z22mem_writew_checked_x86jt (paging.h:324 @ 559311)
1.59 15.93 0.68 ZN7THEOPL315YMF262UpdateOneEiPsi (ymf262.c:693 @ 4bf185)
1.36 16.51 0.58 174113113 0.00 0.00 MakeCodePage(unsigned int, CodePageHandler*&) (decoder.h:53 @ 40c4b0)
1.20 17.02 0.51 Z22mem_writew_checked_x86jt (paging.h:323 @ 55930c)
1.15 17.51 0.49 Z20CPU_Core_Dyn_X86_Runv (cache.h:269 @ 4232e6)
1.15 18.00 0.49 Z20CPU_Core_Dyn_X86_Runv (core_dyn_x86.cpp:280 @ 4232e4)
1.03 18.44 0.44 Z22mem_writew_checked_x86jt (paging.h:322 @ 559300)
0.94 18.84 0.40 Z12MakeCodePagejRP15CodePageHandler (decoder.h:96 @ 40c519)
0.90 19.23 0.39 Z20CPU_Core_Dyn_X86_Runv (cache.h:270 @ 423263)
0.87 19.59 0.37 106855093 0.00 0.00 mem_readw(unsigned int) (memory.cpp:546 @ 4b4790)
0.83 19.95 0.35 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:142 @ 4232b4)
0.82 20.30 0.35 Z15PAGING_LinkPagejj (paging.cpp:359 @ 40a1cd)
0.82 20.65 0.35 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:130 @ 423290)
0.82 21.00 0.35 __moddi3
0.81 21.34 0.34 Z15PAGING_LinkPagejj (paging.cpp:365 @ 40a1fc)
0.80 21.68 0.34 Z20CPU_Core_Dyn_X86_Runv (core_dyn_x86.cpp:297 @ 423297)
0.76 22.01 0.33 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:142 @ 4232ca)
0.73 22.32 0.31 Z15dyn_helper_divwt (helpers.h:26 @ 41018e)
0.73 22.63 0.31 Z22mem_writew_checked_x86jt (paging.h:329 @ 559325)
0.70 22.93 0.30 Z15Normal1x_8_32_RPKv (render_simple.h:55 @ 4df525)
0.70 23.23 0.30 Z20CPU_Core_Dyn_X86_Runv (core_dyn_x86.cpp:280 @ 423285)
0.69 23.52 0.29 Z22mem_writew_checked_x86jt (paging.h:329 @ 559320)
0.69 23.82 0.29 Z22mem_writew_checked_x86jt (paging.h:326 @ 559323)
0.66 24.10 0.28 Z20CPU_Core_Dyn_X86_Runv (cache.h:273 @ 423278)

Using mem_writew_checked_x86 to determine that: adding up all of its line times (not all are shown) gives 5 seconds.

1.03 18.44 0.44 Z22mem_writew_checked_x86jt (paging.h:322 @ 559300)

is the line "if ((address & 0xfff)<0xfff) {".

That's 0.44/5, or 8.8%, of the total function time. 8.8% of the 31% total is ~2.7% of total runtime spent on unaligned overhead, even when no unaligned access is actually being handled.

I'm not sure whether I should add:

0.69 23.52 0.29 Z22mem_writew_checked_x86jt (paging.h:329 @ 559320)

which is the end jump. There are two times for it, which I think are the immediate jump of the fall-through and the exit jump after unaligned handling, so I'm thinking not.

Last edited by ih8registrations on 2007-08-28, 23:52. Edited 2 times in total.

Reply 15 of 23, by `Moe`

User metadata
Rank Oldbie

Oprofile can tell you that, given a CPU with a corresponding performance counter register. (All recent CPUs probably can track these problems, just don't expect too much on a pentium 2/3 😉 )

Reply 17 of 23, by wd

User metadata
Rank DOSBox Author

That's 0.44/5, or 8.8%, of the total function time. 8.8% of the 31% total is
~2.7% of total runtime spent on unaligned overhead, even when no unaligned
access is actually being handled.

Well, but we can't remove them, as they take care of page crossings.
Dunno if there's a faster way to check for that.

Reply 18 of 23, by ih8registrations

User metadata
Rank Oldbie

It needs a rewrite so that it doesn't need the check (qemu doesn't require/do such). I need a better understanding of dosbox's dependency on & use of it to figure out a rewrite.
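
For reference, here's roughly how I understand qemu's softmmu fast path to work -- a sketch with made-up names, not actual dosbox or qemu code. The alignment bits are folded into the TLB compare itself, so an access that might cross a page simply misses the fast path; there's no separate range check at all.

#include <stdint.h>

typedef uint32_t PhysPt;

// illustrative TLB entry; names are hypothetical, not DOSBox's paging.tlb
struct TLBEntry {
	PhysPt   addr_read;   // page-aligned guest address, low 12 bits zero
	uint8_t* host;        // host base so that host + addr points at the data
};

static const unsigned TLB_ENTRIES = 4096;
static TLBEntry tlb[TLB_ENTRIES];

uint32_t slow_readd(PhysPt addr);   // TLB miss / page crossing / mmio path

static inline uint32_t tlb_readd(PhysPt addr) {
	unsigned index = (addr >> 12) & (TLB_ENTRIES - 1);
	// The compare mask keeps the page-number bits plus the low two bits of
	// the address. Since the stored addr_read has those low bits zero, the
	// compare only matches a naturally aligned dword in a mapped page; a
	// misaligned (and therefore possibly page-crossing) access falls
	// through to the slow path without a separate (addr & 0xfff) test.
	if (tlb[index].addr_read == (addr & (~(PhysPt)0xfff | 3))) {
		return *(uint32_t*)(tlb[index].host + addr);   // aligned, direct load is fine
	}
	return slow_readd(addr);
}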