I'm still blocked by that issue. Anyway, the point was to do some optimizations to the dynamic recompiler (risc_x86.h): less bloat in the cache and fewer instructions processed, see attached. Optimizations here should reduce CPU spikes, smooth things out, and help in possible thrashing corner cases. There are some spots in decoder.h that could be tightened up the same way, like the dyn_read/write_x functions, which get touched a lot.
Interesting, I would have assumed that the compiler did some of these on its own (given the inline and the greater picture),
but I checked the dynrec core on x64 and noticed that for smaller functions it is "smart", while for the more complex things (gen_function_raw and such, which is inlined itself) it really starts writing one byte at a time and increases the pointer through a move, increment, move-back sequence.
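To make the byte-at-a-time point concrete, here is a minimal sketch of the two emission styles. ToyCache, toy_addb and toy_addd are hypothetical stand-ins written for this example, not the real DOSBox cache code, and the byte values are arbitrary. It only shows that four single-byte stores and one 32-bit store leave the same bytes in the buffer on a little-endian host; whether a given compiler merges the byte stores on its own is exactly what would need checking in the asm.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy stand-in for an emitter's code cache (hypothetical, for illustration only).
struct ToyCache {
    uint8_t buf[16] = {};
    uint8_t* pos = buf;  // current write position
};

// Byte-wise emission: every call stores one byte and bumps the write pointer.
inline void toy_addb(ToyCache& c, uint8_t v) {
    assert(c.pos + 1 <= c.buf + sizeof c.buf);
    *c.pos++ = v;
}

// Dword emission: one 32-bit store and one pointer update.
inline void toy_addd(ToyCache& c, uint32_t v) {
    assert(c.pos + sizeof v <= c.buf + sizeof c.buf);
    std::memcpy(c.pos, &v, sizeof v);  // memcpy keeps unaligned writes well defined
    c.pos += sizeof v;
}

// Emits the same four bytes both ways and reports whether the results match.
// The comparison assumes a little-endian host (x86/x64).
inline bool toy_emitters_agree() {
    ToyCache a, b;
    toy_addb(a, 0x52);
    toy_addb(a, 0x51);
    toy_addb(a, 0x50);
    toy_addb(a, 0xe8);
    toy_addd(b, 0xe8505152u);
    return std::memcmp(a.buf, b.buf, sizeof a.buf) == 0 &&
           (a.pos - a.buf) == (b.pos - b.buf);
}
```

Calling toy_emitters_agree() on x86/x64 should return true; the dword version just gets there with one store and one pointer bump instead of four of each.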
Yeah, my intent was/is to use the gcc option that makes it spit out its assembly step (-S) to see whether or not the generated code actually differs, and know for sure at that point. Sounds like this is how you checked?
But it is a bit messy to read because the code is optimized.
I could have used that gcc option to output it directly, but this is easier given that the object files are normally in my tree.
Yeah, gcc asm output is cryptic, but a before-and-after comparison is mostly enough for me to follow along. Hopefully I'll work out this annoying permissions issue so I can play with this myself. Speaking of cryptic, I find some of the changes I made more readable, less spaghetti, and shorter than the original, but maybe that's just me :)
Another tweak: "if (!dsr2 && (ddr==dsr1) && !imm_size) return;" can be pulled into the "if (!imm && (gsr1->index!=0x5))" path, since there is no need to do the check when imm_size is 1 or 4.
I can compile now; it took a reinstall, Windows 7 was borked. The tweaks work, with a touch-up here and there. Negligible performance change, though the binary is about 1 KB smaller and memory usage dropped by ~100 KB (it varies, I'm just tracking it with Task Manager), which I've been trading for inlining xyz. I need to get asm output going (objdump isn't working for me currently) and profiling to see what's going on.
I'll be interested to see what you come up with.
I did something similar to what you did for the dynrec core and it did get a lot smaller, but I noticed no performance change (which isn't too surprising, as the asm that dosbox executes is unchanged).
Did another tightening pass on the guys in the decoder, changing lines like this:
cache_addd(0x52|(0x50+genreg->index)<<8|0xe850<<16);
to
cache_addd(0xe8505052+(genreg->index<<8));
which gets rid of two ORs and a shift. Chris's 3D bench seems to like it: 1001 vs 957. Margin of error? Like I said, I need to get asm output and profiling going.
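For anyone who wants to convince themselves the two forms produce the same dword, here is a small standalone check. pack_or and pack_add are names made up for this sketch, and the 0..7 index range is my assumption (the usual 32-bit x86 register encodings). The original relies on << binding tighter than |, so it is fully parenthesized here. With an index in that range 0x50+index never carries out of its byte, so the OR and the add give the same result.

```cpp
#include <cassert>
#include <cstdint>

// Original form: three fields ORed together (parenthesized as C++ parses it).
constexpr uint32_t pack_or(uint32_t index) {
    return 0x52u | ((0x50u + index) << 8) | (0xe850u << 16);
}

// Folded form: the constants are merged up front, leaving one shift and one add.
constexpr uint32_t pack_add(uint32_t index) {
    return 0xe8505052u + (index << 8);
}

// Compares both forms over the assumed register index range 0..7.
inline void check_packings() {
    for (uint32_t index = 0; index < 8; ++index) {
        assert(pack_or(index) == pack_add(index));
    }
}
```

This only shows the values match. An optimizing compiler may well fold the constants in the original form by itself, so the before-and-after asm is still the thing to look at before crediting the benchmark difference to this change.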
Reduced the binary by 8 KB now, mostly from replacing repetitive function/method calls in ifs and switches with local variables, so each call happens once. Applied the optimized bounds checking from the mixer to the mouse handler. The optimization in gen_call_function that I had commented out is working now.
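A rough sketch of the kind of change meant by the local-variable rewrite, with made-up names (query_mode, scale_repeated and scale_hoisted are hypothetical and not taken from the DOSBox source). The counter is only there to make the number of calls visible.

```cpp
#include <cassert>

int g_calls = 0;  // counts how often the getter runs (illustration only)

// Hypothetical getter standing in for a repeated function/method call.
int query_mode() {
    ++g_calls;
    return 2;
}

// Before: the getter is called again in every branch condition.
int scale_repeated(int x) {
    if (query_mode() == 0) return x;
    else if (query_mode() == 1) return x * 2;
    else if (query_mode() == 2) return x * 4;
    return 0;
}

// After: call once, keep the result in a local, and switch on it.
int scale_hoisted(int x) {
    const int mode = query_mode();
    switch (mode) {
        case 0: return x;
        case 1: return x * 2;
        case 2: return x * 4;
        default: return 0;
    }
}

// Runs both versions and checks that the hoisted one makes fewer calls.
void demo_hoisting() {
    g_calls = 0;
    assert(scale_repeated(3) == 12);
    const int repeated_calls = g_calls;
    g_calls = 0;
    assert(scale_hoisted(3) == 12);
    assert(g_calls < repeated_calls);
}
```

The rewrite is only safe when the call returns the same value each time and has no side effects the later branches depend on; when the compiler cannot prove that (for example across translation units), it has to keep every call, which is where the size savings would come from.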