The idea of those handlers is to avoid the penalty of odd memory accesses, correct? If you consider what the penalties actually are and look at what the handlers do, it's apparent they're counterproductive. The penalty for a word access at an odd address is one additional memory access. The penalty for a dword access at an address not divisible by four is likewise one additional access. Take a look at the handlers, for example:
Odd dword addressing costs one extra access, versus three additional each of: accesses, function calls, additions, moves, shifts, and ors; plus one additional each of: and, compare, and jump.
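To make the comparison concrete, here is a minimal sketch of the two approaches being weighed: a byte-composed read (the shape of the portable handler) versus a direct read that simply lets the host do one unaligned load. The function names are hypothetical, not the actual DOSBox handler code.

```cpp
#include <cstdint>
#include <cstring>

// Byte-composed read: four loads plus shifts and ors, paid on every
// call regardless of whether the address was actually misaligned.
static uint32_t readd_bytewise(const uint8_t* mem, uint32_t addr) {
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

// Direct read: on x86 a misaligned dword load costs at most one extra
// memory access; memcpy compiles down to a single (possibly unaligned)
// load on such targets.
static uint32_t readd_direct(const uint8_t* mem, uint32_t addr) {
    uint32_t v;
    std::memcpy(&v, mem + addr, sizeof(v));
    return v;  // on little-endian hosts this matches the byte-composed form
}
```

On a little-endian host that tolerates unaligned access, both functions return the same value; the difference is purely in the per-call cost the thread is arguing about.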
There is a second penalty for unaligned memory accesses on some platforms: a crash. For that we have C_UNALIGNED_MEMORY. The code in question is agnostic and portable, but you are right, it is not very efficient, at least on targets that allow unaligned memory accesses. I don't expect the compiler to be able to inline the read/write handlers. So the best solution would be to make this code respect C_UNALIGNED_MEMORY and use the simpler one-line form when it is set to 1. It may not yield as great an optimization as you might expect: most games tend to use aligned accesses anyway.
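A rough sketch of what "respect C_UNALIGNED_MEMORY and use the simpler one-line form" could look like. The macro name comes from the post above; the function body and name here are illustrative, not the actual DOSBox source.

```cpp
#include <cstdint>
#include <cstring>

// Assumed to be defined by the build system; 1 on targets where
// unaligned host accesses are safe, 0 where they would fault.
#ifndef C_UNALIGNED_MEMORY
#define C_UNALIGNED_MEMORY 1
#endif

static uint16_t host_readw(const uint8_t* off) {
#if C_UNALIGNED_MEMORY
    // Simple one-line form: one (possibly unaligned) host load.
    uint16_t v;
    std::memcpy(&v, off, sizeof(v));
    return v;
#else
    // Portable byte-wise form for targets that would crash on
    // unaligned access (assembles a little-endian word by hand).
    return (uint16_t)off[0] | ((uint16_t)off[1] << 8);
#endif
}
```

Both branches return the same little-endian value; the `#if` just picks the cheap path where the hardware allows it, which matches the suggestion that aligned-tolerant targets skip the byte juggling entirely.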
Um, the dynamic core has its own handlers, and for the rest you're better off changing the conditions to (address & 0xfff) < 0xffd for architectures that allow fast little-endian unaligned access.
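The condition above is a page-boundary check, under the assumption of 4 KB pages: a dword whose page offset is 0xffd, 0xffe, or 0xfff would straddle into the next page and must take the slow path, while anything below 0xffd can be served by one direct host access. A small sketch (the helper names are made up for illustration):

```cpp
#include <cstdint>

// With 4 KB pages, a 4-byte dword fits entirely inside its page only
// when its page offset is 0x000..0xffc, hence the < 0xffd check.
static bool dword_fits_in_page(uint32_t address) {
    return (address & 0xfff) < 0xffd;
}

// The analogous 2-byte word check: only offset 0xfff straddles the
// page, hence < 0xfff.
static bool word_fits_in_page(uint32_t address) {
    return (address & 0xfff) < 0xfff;
}
```

This is why the thread talks about the "<0xfff/<0xffd style" of checking: one cheap mask-and-compare decides between the direct access and the page-crossing fallback, instead of dispatching every access through alignment handlers.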
vasyl: yes, I'm aware of the preprocessor directives and am already using them in my mem copy/read/write changes (the big-endian handlers devolve to byte accesses, so optimizations for them would just add overhead). I didn't state an expectation. I also noted the tendency toward aligned accesses to wd in another thread, which is another reason there's no benefit to checking for the condition. The overhead isn't all inside the unaligned handlers: the setup of the compare variable, the compare, and the jump are done in the regular memory handlers (mem_(read/write)(w/d)_inline).
"Um the dynamic core has its own handlers" Which disproves my assertion how, wd?
Not at all, but I don't see any reason to bother with functions that are used 0.01% of the time (except for wolf3d of course, which even so runs at 5000 cycles) compared to other memory access functions.
Suppose that with the current checks for unaligned/big-endian memory access, all functions could use the <0xfff/<0xffd style of page-boundary checking (see my first reply).
As I mentioned, the overhead doesn't just reside in the unaligned handlers themselves. Here's the gprof output for Dawn Patrol, which doesn't run at full speed on my system:
Almost all of the time is spent in the functions themselves, not in functions they call, and they're inlined. The most-used checked functions are called often enough to overflow the gprof call counter, which means even minor per-call improvements should yield sizable improvements overall.
Slightly off-topic: if you're doing serious profiling, you should definitely take a look at oprofile instead of gprof; it's a kernel-based profiler for Linux. With it, you can profile regular, unmodified optimized builds. Combined with gcc 4, which has great support for debugging information even in highly optimized builds, it is the most accurate way of obtaining a profile.
Back when I tried to optimize DOSBox for my old 333MHz laptop so that I could play X-Com with hq2x, it proved invaluable and showed me problems (like excessive cache misses) that gprof would never have indicated.
Most profilers fail to correctly account for the time spent in code generated by the dynamic core anyway; can oprofile do that? The only one I found useful so far was the AMD analyzer (I suppose the Intel one would do it as well).
wd: oprofile can generate annotated assembler listings. Without some sort of debugging symbols, custom-generated code can't be resolved to source code, however. GCC 4 has become very good at tracing optimized and inlined code back to the right source location, but dynamically generated code would probably have to be processed differently (pattern matching?).
If you're just after the dynamic code's runtime in general, then I think oprofile has the right tools to account for it. After all, you can use all of your CPU's performance counters (even multiple in parallel) on the binary.
Nevertheless, the stats ih8regs posted are valid; they show that the word access functions consume a good part of the CPU time (even if the percentage is lower due to the unaccounted time spent in generated code).
OK, I see it: the address getting passed in the dyn_* functions. Valgrind might be able to track the dyn code, as it's supposed to handle most self-modifying code, or is it one of those you tried that failed to account for the dyn time?
One question is how much of the mem_xxxxx_checked_x86 functions' time is unaligned overhead. To find out, here's the gprof line-by-line (per line instead of per function) profiling:
which is the end jump. There are two times listed for it, which I think are the time for the immediate jump of the fall-through case and the exit jump after unaligned handling, so I'm thinking not.
Last edited by ih8registrations on 2007-08-28, 23:52. Edited 2 times in total.
Oprofile can tell you that, given a CPU with a corresponding performance counter register. (All recent CPUs can probably track these problems; just don't expect too much on a Pentium 2/3 😉)
It needs to be rewritten so that it doesn't need the check (qemu doesn't require or do any such checking). I need a better understanding of DOSBox's dependency on it, and how it's used, to figure out a rewrite.