VOGONS


unaligned handlers (all of them)


First post, by ih8registrations

User metadata
Rank Oldbie

The idea of those handlers is to avoid the penalty of odd memory accesses, correct? If you consider what the penalties are and look at what the handlers do, it's apparent they're counterproductive. The penalty for an odd-addressed word access is an additional access. The penalty for a dword access with an address not divisible by four is also an additional access. If you take a look at the handlers, for example:

if ( addr & 1 ) {
	*val =
		(readHandler<Bit8u>( addr+0 ) << 0 ) |
		(readHandler<Bit8u>( addr+1 ) << 8 );
} else {
	*val = host_readw( &vga.mem.linear[((addr&~3)<<2)+(addr&3)] );
}

It still does two memory accesses, but also adds a function call, a move, an addition, a shift, an or, an and, and a jump.

The difference is even bigger with the dword handling:

if ( addr & 3 ) {
	*val =
		(readHandler<Bit8u>( addr+0 ) << 0  ) |
		(readHandler<Bit8u>( addr+1 ) << 8  ) |
		(readHandler<Bit8u>( addr+2 ) << 16 ) |
		(readHandler<Bit8u>( addr+3 ) << 24 );
} else {
	*val = host_readd( &vga.mem.linear[((addr&~3)<<2)+(addr&3)] );
}

The odd dword addressing costs one extra access, versus three additional accesses, function calls, additions, moves, shifts, and ors, plus one additional and, compare, and jump.

Reply 1 of 23, by vasyl

User metadata
Rank Oldbie

There is a second penalty for unaligned memory accesses on some platforms -- a crash. For that we have C_UNALIGNED_MEMORY. The code in question is agnostic and portable, but you are right -- it is not very efficient, at least on targets that allow unaligned memory accesses. I don't expect the compiler to be able to inline read/writeHandlers. So, the best solution would be to make this code respect C_UNALIGNED_MEMORY and use the simpler one-line form if it is set to 1. It may not give as great an optimization as you might expect: most games tend to use aligned accesses anyway.
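
For reference, that define already gates the plain host accessors along these lines (typed from memory, so take it as a sketch of mem.h rather than a verbatim quote):

#if defined(WORDS_BIGENDIAN) || !defined(C_UNALIGNED_MEMORY)
// portable path: build the value byte by byte, works at any alignment
static INLINE Bit16u host_readw(HostPt off) {
	return off[0] | (off[1] << 8);
}
#else
// targets that tolerate unaligned little-endian loads: one plain read
static INLINE Bit16u host_readw(HostPt off) {
	return *(Bit16u *)off;
}
#endif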

Reply 2 of 23, by wd

User metadata
Rank DOSBox Author

Um, the dynamic core has its own handlers, and for the rest you're better
off changing the conditions to (address & 0xfff)<0xffd for architectures
that allow fast little-endian unaligned access.
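
For the dword case that would look roughly like this -- an untested sketch that mirrors the existing word version in paging.h, with the dword-handler names assumed to follow the same naming:

// 0xffd is the first page offset at which a 4-byte access spills into the
// next page (offsets 0xffd..0xfff), so only those take the page-crossing
// fallback; everything else, aligned or not, uses the direct host read.
INLINE bool mem_readd_checked_x86(PhysPt address, Bit32u * val) {
	if ((address & 0xfff) < 0xffd) {
		Bitu index = (address >> 12);
		if (paging.tlb.read[index]) {
			*val = host_readd(paging.tlb.read[index] + address);
			return false;
		} else {
			Bitu uval;
			bool retval = paging.tlb.handler[index]->readd_checked(address, &uval);
			*val = (Bit32u)uval;
			return retval;
		}
	} else return mem_unalignedreadd_checked_x86(address, val);
}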

Reply 3 of 23, by ih8registrations

User metadata
Rank Oldbie

vasyl: yes, I'm aware of the preprocessor directives and am already using them in my mem copy/r/w changes (big-endian handlers devolve to byte accesses, so optimizations for them would just cause more overhead). I didn't give an expectation. I also noted the tendency toward aligned accesses to wd in another thread, which is another reason there's no benefit to checking for the condition. The overhead isn't all done inside the unaligned handlers themselves either: the setup (loading the compare value, the cmp, and the jmp) is done in the regular memory handlers (mem_(read/write)(w/d)_inline).

"Um the dynamic core has its own handlers" Which disproves my assertion how, wd?

Reply 4 of 23, by wd

User metadata
Rank DOSBox Author

Which disproves my assertion how, wd?

Not at all, but I don't see any reason to bother with functions that are
used 0.01% of the time (except for wolf3d of course, which nevertheless
runs at 5000 cycles) compared to other memory access functions.

I suppose that with the current checks for unaligned/big-endian memory access,
all functions can use the <0xfff/0xffd style of page-boundary checking
(see my first reply).

Reply 5 of 23, by ih8registrations

User metadata
Rank Oldbie

As I mentioned, the overhead doesn't just reside in the unaligned handlers themselves. Here's the gprof output for dawn patrol, which doesn't run at full speed on my system.

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
17.96 18.80 18.80 23740620 0.00 0.00 CPU_Core_Dyn_X86_Run()
16.72 36.30 17.50 mem_readw_checked_x86(unsigned int, unsigned short*)
13.28 50.20 13.90 mem_writew_checked_x86(unsigned int, unsigned short)
10.52 61.21 11.01 383891 0.00 0.00 THEOPL3::YMF262UpdateOne(int, short*, int)
7.12 68.66 7.45 10083479 0.00 0.00 RENDER_StartLineHandler(void const*)
4.78 73.66 5.00 417184353 0.00 0.00 MakeCodePage(unsigned int, CodePageHandler*&)
3.65 77.48 3.82 37058324 0.00 0.00 PAGING_LinkPage(unsigned int, unsigned int)
1.78 79.34 1.86 7673278 0.00 0.00 PAGING_ClearTLB()
1.43 80.84 1.50 792893 0.00 0.00 Normal1x_8_32_R(void const*)
1.39 82.29 1.45 152567172 0.00 0.00 mem_readw(unsigned int)
1.15 83.49 1.20 30508656 0.00 0.00 PAGING_MapPage(unsigned int, unsigned int)
0.99 84.53 1.04 CPU_SetSegGeneral(SegNames, unsigned int)
0.98 85.56 1.03 7592004 0.00 0.00 MEM_NextHandleAt(int, unsigned int)
0.93 86.53 0.97 __moddi3
0.90 87.47 0.94 24508508 0.00 0.00 PIC_RunQueue()
0.90 88.41 0.94 20840779 0.00 0.00 vga_read_p3da(unsigned int, unsigned int)
0.71 89.15 0.74 __divdi3
0.69 89.87 0.72 32251920 0.00 0.00 CodePageHandler::writew_checked(unsigned int, unsigned int)
0.64 90.54 0.67 31989488 0.00 0.00 mem_readb(unsigned int)
0.64 91.21 0.67 dyn_helper_divw(unsigned short)
0.64 91.88 0.67 dyn_helper_idivw(short)
0.59 92.50 0.62 CPU_RET(bool, unsigned int, unsigned int)
0.54 93.06 0.56 CPU_CALL(bool, unsigned int, unsigned int, unsigned int)
0.51 93.59 0.53 mem_writed_checked_x86(unsigned int, unsigned int)
0.49 94.10 0.51 55467854 0.00 0.00 mem_writew(unsigned int, unsigned short)
0.49 94.61 0.51 mem_writeb_checked_x86(unsigned int, unsigned char)
0.46 95.09 0.48 7518593 0.00 0.00 EMM_MapSegment(unsigned int, unsigned short, unsigned short)
0.44 95.55 0.46 21190280 0.00 0.00 CodePageHandler::GetHostReadPt(unsigned int)
0.41 95.98 0.43 5710069 0.00 0.00 CPU_Core_Normal_Run()
0.40 96.40 0.42 16353107 0.00 0.00 INT16_Handler()
0.40 96.82 0.42 mem_readd_checked_x86(unsigned int, unsigned int*)

31% of the total consists of mem_xxxxx_checked_x86 calls.

[9]     17.8   17.50    1.15                 mem_readw_checked_x86(unsigned int, unsigned short*) [9]
0.16 0.99 6367002/6367002 InitPageHandler::readw_checked(unsigned int, unsigned int*) [39]
0.00 0.00 6316/6316 mem_unalignedreadw_checked_x86(unsigned int, unsigned short*)

Almost all of the time is spent in the functions themselves, not in functions they call, and those are inlined. The most-used checked functions are called often enough to exhaust the gprof call counter, which means minor improvements there should yield bigger gains overall.

INLINE bool mem_readw_checked_x86(PhysPt address, Bit16u * val) {
	if ((address & 0xfff)<0xfff) {
		Bitu index=(address>>12);
		if (paging.tlb.read[index]) {
			*val=host_readw(paging.tlb.read[index]+address);
			return false;
		} else {
			Bitu uval;
			bool retval;
			retval=paging.tlb.handler[index]->readw_checked(address, &uval);
			*val=(Bit16u)uval;
			return retval;
		}
	} else return mem_unalignedreadw_checked_x86(address, val);
}

As you can see, there's unaligned-handling overhead in these heavily used routines.

Reply 7 of 23, by `Moe`

User metadata
Rank Oldbie

Slightly off-topic: if you're doing serious profiling, you should definitely take a look at oprofile, a kernel-based profiler for Linux, instead of gprof. With it, you can profile regular, unmodified optimized builds. Combined with gcc-4, which has great support for debugging information even in highly optimized builds, it is the most accurate way of obtaining a profile.

Back when I tried to optimize DOSBox for my old 333MHz laptop so that I could play X-Com with hq2x, it proved invaluable and showed me problems (like excessive cache misses) that gprof would never have indicated.
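
If you want to give it a try, a session looks roughly like this (commands from memory, so double-check the flags against the oprofile docs):

# point oprofile at a vmlinux image, or tell it you don't have one
opcontrol --no-vmlinux
opcontrol --start
./dosbox                        # run the workload you want profiled
opcontrol --stop
opreport -l ./dosbox | less     # per-symbol breakdown for the binary
opannotate --source ./dosbox    # annotated source, needs -g in the build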

Reply 8 of 23, by wd

User metadata
Rank DOSBox Author

Most profilers fail to correctly account for the time spent in code generated
by the dynamic core anyway; can oprofile do that? The only one I found useful
so far was the AMD analyzer (I suppose the Intel one would do it as well).

Reply 11 of 23, by `Moe`

User metadata
Rank Oldbie

wd: oprofile can generate annotated assembler listings. Without some sort of debugging symbols, custom generated code can't be resolved to source code, however. GCC 4 has become very good at tracing back optimized and inlined code to the right source location, but dynamically generated code probably must be processed differently (pattern matching?).

If you're just after the dynamic code's runtime in general, then I think oprofile has the right tools to account for it. After all, you can use all of your CPU's performance counters (even multiple in parallel) on the binary.

Reply 12 of 23, by wd

User metadata
Rank DOSBox Author

Nevertheless, the stats ih8regs posted are valid; they show that the word
access functions suck up a good part of the CPU time (even if the percentage
is lower due to the unaccounted time spent in generated code).

Reply 13 of 23, by ih8registrations

User metadata
Rank Oldbie

OK, I see it: the address getting passed in the dyn_* functions. Valgrind might be able to track the dyn code, as it's supposed to be able to handle most self-modifying code, or is it one of those you tried that failed to account for the dyn time?

Reply 14 of 23, by ih8registrations

User metadata
Rank Oldbie

One question is how much of the mem_xxxxx_checked_x86 functions' time is the unaligned overhead. To find out, here's gprof's line-by-line (line instead of function) profiling:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
15.81 6.74 6.74 mem_readw_checked_x86(unsigned int, unsigned short*)
5.02 8.88 2.14 Z23RENDER_StartLineHandlerPKv (render.cpp:105 @ 4ca702)
4.48 10.79 1.91 mem_writew_checked_x86(unsigned int, unsigned short) (paging.h:321 @ 5592e0)
4.22 12.59 1.80 Z20CPU_Core_Dyn_X86_Runv (cache.h:272 @ 423271)
4.10 14.34 1.75 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:141 @ 4232a4)
2.13 15.25 0.91 Z22mem_writew_checked_x86jt (paging.h:324 @ 559311)
1.59 15.93 0.68 ZN7THEOPL315YMF262UpdateOneEiPsi (ymf262.c:693 @ 4bf185)
1.36 16.51 0.58 174113113 0.00 0.00 MakeCodePage(unsigned int, CodePageHandler*&) (decoder.h:53 @ 40c4b0)
1.20 17.02 0.51 Z22mem_writew_checked_x86jt (paging.h:323 @ 55930c)
1.15 17.51 0.49 Z20CPU_Core_Dyn_X86_Runv (cache.h:269 @ 4232e6)
1.15 18.00 0.49 Z20CPU_Core_Dyn_X86_Runv (core_dyn_x86.cpp:280 @ 4232e4)
1.03 18.44 0.44 Z22mem_writew_checked_x86jt (paging.h:322 @ 559300)
0.94 18.84 0.40 Z12MakeCodePagejRP15CodePageHandler (decoder.h:96 @ 40c519)
0.90 19.23 0.39 Z20CPU_Core_Dyn_X86_Runv (cache.h:270 @ 423263)
0.87 19.59 0.37 106855093 0.00 0.00 mem_readw(unsigned int) (memory.cpp:546 @ 4b4790)
0.83 19.95 0.35 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:142 @ 4232b4)
0.82 20.30 0.35 Z15PAGING_LinkPagejj (paging.cpp:359 @ 40a1cd)
0.82 20.65 0.35 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:130 @ 423290)
0.82 21.00 0.35 __moddi3
0.81 21.34 0.34 Z15PAGING_LinkPagejj (paging.cpp:365 @ 40a1fc)
0.80 21.68 0.34 Z20CPU_Core_Dyn_X86_Runv (core_dyn_x86.cpp:297 @ 423297)
0.76 22.01 0.33 Z20CPU_Core_Dyn_X86_Runv (risc_x86.h:142 @ 4232ca)
0.73 22.32 0.31 Z15dyn_helper_divwt (helpers.h:26 @ 41018e)
0.73 22.63 0.31 Z22mem_writew_checked_x86jt (paging.h:329 @ 559325)
0.70 22.93 0.30 Z15Normal1x_8_32_RPKv (render_simple.h:55 @ 4df525)
0.70 23.23 0.30 Z20CPU_Core_Dyn_X86_Runv (core_dyn_x86.cpp:280 @ 423285)
0.69 23.52 0.29 Z22mem_writew_checked_x86jt (paging.h:329 @ 559320)
0.69 23.82 0.29 Z22mem_writew_checked_x86jt (paging.h:326 @ 559323)
0.66 24.10 0.28 Z20CPU_Core_Dyn_X86_Runv (cache.h:273 @ 423278)

Using mem_writew_checked_x86 to determine that: adding up all of its line times (not all are shown) gives 5 seconds.

1.03 18.44 0.44 Z22mem_writew_checked_x86jt (paging.h:322 @ 559300)

is the line "if ((address & 0xfff)<0xfff) {".

That's 0.44/5, or 8.8%, of the total function time. 8.8% of the 31% total is ~2.7% of total runtime spent on unaligned overhead, even when no unaligned access is actually being handled.

I'm not sure whether I should add:

0.69 23.52 0.29 Z22mem_writew_checked_x86jt (paging.h:329 @ 559320)

which is the end jump. There are two times for it, which I think are the immediate jump of the fall-through and the exit jump after unaligned handling, so I'm thinking not.

Last edited by ih8registrations on 2007-08-28, 23:52. Edited 2 times in total.

Reply 15 of 23, by `Moe`

User metadata
Rank Oldbie

Oprofile can tell you that, given a CPU with a corresponding performance counter register. (All recent CPUs probably can track these problems, just don't expect too much on a pentium 2/3 😉 )

Reply 17 of 23, by wd

User metadata
Rank DOSBox Author

That's 0.44/5, or 8.8%, of the total function time. 8.8% of the 31% total is
~2.7% of total runtime spent on unaligned overhead, even when no unaligned
access is actually being handled.

Well, but we can't remove them, as they take care of page crossings.
Dunno if there's a faster way to check for that.

Reply 18 of 23, by ih8registrations

User metadata
Rank Oldbie

It needs a rewrite so that it doesn't need the check (qemu doesn't require/do such). I need a better understanding of dosbox's dependency on & use of it to figure out a rewrite.
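
For reference, here's roughly how I understand qemu's softmmu fast path to work -- a sketch with made-up names, not actual dosbox or qemu code. The alignment bits are folded into the TLB compare itself, so an access that might cross a page simply misses the fast path; there's no separate range check at all.

#include <stdint.h>

typedef uint32_t PhysPt;

// illustrative TLB entry; names are hypothetical, not DOSBox's paging.tlb
struct TLBEntry {
	PhysPt   addr_read;   // page-aligned guest address, low 12 bits zero
	uint8_t* host;        // host base so that host + addr points at the data
};

static const unsigned TLB_ENTRIES = 4096;
static TLBEntry tlb[TLB_ENTRIES];

uint32_t slow_readd(PhysPt addr);   // TLB miss / page crossing / mmio path

static inline uint32_t tlb_readd(PhysPt addr) {
	unsigned index = (addr >> 12) & (TLB_ENTRIES - 1);
	// The compare mask keeps the page-number bits plus the low two bits of
	// the address. Since the stored addr_read has those low bits zero, the
	// compare only matches a naturally aligned dword in a mapped page; a
	// misaligned (and therefore possibly page-crossing) access falls
	// through to the slow path without a separate (addr & 0xfff) test.
	if (tlb[index].addr_read == (addr & (~(PhysPt)0xfff | 3))) {
		return *(uint32_t*)(tlb[index].host + addr);   // aligned, direct load is fine
	}
	return slow_readd(addr);
}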