VOGONS


Dynamic core optimization

Topic actions

First post, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

I'm still blocked with the issue. Anyway, the point was to do some optimizations to the dynamic recompilation(risc_x86.h,) less bloated in cache and fewer instructions processed, see attached. Optimizations here would diminish cpu spikes, smooth things out, and help in possible thrashing corner cases. There are some spots in decoder.h that could be tightened up as well in the same way like the dyn_read/write_x, which get touched a lot.

Attachments

Reply 1 of 26, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

Interesting, I would have assumed that the compiler did some of these on its own (given the inline and the greater picture)
but I checked the dynrec core x64 and noticed that for smaller functions it is "smart", but for the more complex things (gen_function_raw and such, which is inlined itself), it really starts doing one byte at the time and increase the pointer through a move, increase, move back operation.

Water flows down the stream
How to ask questions the smart way!

Reply 2 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Yeah, my intent was/is to do the gcc option of spitting out its assembly step to see the difference or not in the code it generates, know for sure at that point. Sounds like this is how you checked?

Reply 3 of 26, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

yeah, I used

objdump -Mintel -dS core_dynrec.o > test.asm 

But it is a bit messy to read due to the optimized code.
I could have used that gcc option to output it directly, but this is easier given that the object files are in my tree normally

Water flows down the stream
How to ask questions the smart way!

Reply 4 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Yeah, gcc asm output is cryptic but a before and after compare is enough, mostly, for me to follow along. Hopefully I'll work out this annoying permissions issue to play with this myself. Speaking of cryptic, I find some of the changes I did more readable/less spaghetti, shorter than the original, but maybe that's just me:)

Reply 5 of 26, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

It's easier to read with the -Mintel, but the interlinked source (the S) is sometimes a bit off.

Water flows down the stream
How to ask questions the smart way!

Reply 6 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Another tweak, can pull "if (!dsr2 && (ddr==dsr1) && !imm_size) return;" into the "if (!imm && (gsr1->index!=0x5))" path, no need to do the check for imm_size 1 & 4.

static void gen_lea(DynReg * ddr,DynReg * dsr1,DynReg * dsr2,Bitu scale,Bits imm) {
GenReg * gdr=FindDynReg(ddr);
Bitu imm_size;
Bit8u rm_base=(gdr->index << 3);
Bit8u index;
if (dsr1) {
GenReg * gsr1=FindDynReg(dsr1);
if (!imm && (gsr1->index!=0x5)) {
if (!dsr2 && (ddr==dsr1)) return;
imm_size=0; rm_base+=0x0; //no imm
} else if ((imm>=-128 && imm<=127)) {
imm_size=1;rm_base+=0x40; //Signed byte imm
} else {
imm_size=4;rm_base+=0x80; //Signed dword imm
}
index=gsr1->index;
} else {
imm_size=4;
index=5;
}
if (dsr2) {
GenReg * gsr2=FindDynReg(dsr2);
cache_addw(0x8d|(rm_base+0x4)<<8); //0x8d=LEA | The sib indicator
Bit8u sib=(index+(gsr2->index<<3)+(scale<<6));
cache_addb(sib);
} else {
cache_addw(0x8d|(rm_base+index)<<8); //LEA | dword imm
}
switch (imm_size) {
case 0: break;
case 1:cache_addb(imm);break;
case 4:cache_addd(imm);break;
}
ddr->flags|=DYNFLG_CHANGED;
}

Reply 8 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

I can compile now, took a reinstall, w7 was borked. Tweaks work, /w a touch up here and there, negligible performance change, though the binary is a K smaller and saved ~100k on mem usage(varies, just tracking /w task manager,) which I've been trading for inlining xyz. Need to get asm output going(objdump isn't working for me currently) and profiling to see what's going on.

Reply 9 of 26, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

I'll be interested what you come up with.
I did something similar as you did for the dynrec core and it got a lot smaller indeed, but noticed no performance changes (which isn't too surprising as the asm that dosbox executes is unchanged)

Water flows down the stream
How to ask questions the smart way!

Reply 10 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Refresh my memory on invoking diff to output the correct format and I can upload what I have now, warts and all. I used the svn tar.gz from here: https://www.dosbox.com/wiki/Building_DOSBox_with_MinGW

Reply 11 of 26, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

extract the source a second time (folder name dosbox-org)
and then run in the folder that contains both yoursource and the original source

diff -u dosbox-org/src/cpu/core_dynamic/decoder.h yoursource/src/cpu/core_dynamic/decoder.h > mypatch.txt

Water flows down the stream
How to ask questions the smart way!

Reply 12 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Changes have been more than to just decoder.h, but confined to core_dyn_x86 dir.

Attachments

  • Filename
    mypatch.txt
    File size
    40.51 KiB
    Downloads
    42 downloads
    File license
    Fair use/fair dealing exception

Reply 13 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

did another tightening to the guys in decoder with these:

cache_addd(0x52|(0x50+genreg->index)<<8|0xe850<<16);
to
cache_addd(0xe8505052+(genreg->index<<8));

getting rid of two ors and a shift. chris's 3d bench liked it, it seems, 1001 vs 957. error of margin? like I said, I need to get asm output and profiling going.

Reply 14 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Reduced the binary by 8k now, mostly from changing ifs & switches /w repetitive function/method calls with local vars and calling once. Applied the optimized bound checking from the mixer to the mouse handler. The optimization in gen_call_function I had commented out working now.

Attachments

  • Filename
    my.diff
    File size
    117.62 KiB
    Downloads
    37 downloads
    File license
    Fair use/fair dealing exception

Reply 15 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

From the asm gcc is generating, they are faster, not just a size reduction of the binary. so yeah, fewer cache_addx is betta.

static void gen_return(BlockReturn retcode) {
gen_protectflags();
if (retcode==0) {
cache_addd(0xc3c03359); //POP ECX, the flags
// cache_addw(0xc033); //MOV EAX, 0
// cache_addb(0xc3); //RET
} else {
cache_addw(0xb859); //POP ECX, the flags
// cache_addb(0xb8); //MOV EAX, retcode
cache_addd(retcode);
cache_addb(0xc3); //RET
}
}

new:
__ZL10gen_return11BlockReturn.part.5:
movl __ZL5cache+16, %eax
movl $-1010814119, (%eax)
addl $4, %eax
movl %eax, __ZL5cache+16
ret
__ZL10gen_return11BlockReturn:
subl $4, %esp
cmpb $0, __ZL6x86gen
jne L247
L244: // -1xmov,2xlea
testl %eax, %eax
je L248
movl __ZL5cache+16, %edx
movl $-18343, %ecx
movl %eax, 2(%edx)
leal 7(%edx), %eax
movw %cx, (%edx)
movl %eax, __ZL5cache+16
movb $-61, 6(%edx)
addl $4, %esp
ret
L248: // +1xadd,1xjmp, -2xmov,1xlea,1xadd
addl $4, %esp
jmp __ZL10gen_return11BlockReturn.part.5
L247:
movl %eax, (%esp)
call __ZL16gen_protectflagsv.part.2
movl (%esp), %eax
jmp L244

old:
__ZL10gen_return11BlockReturn:
.cfi_startproc
subl $4, %esp
cmpb $0, __ZL6x86gen
jne L266
L262:
movl __ZL5cache+16, %edx
testl %eax, %eax
leal 1(%edx), %ecx
movl %ecx, __ZL5cache+16
movb $89, (%edx)
je L267
movl __ZL5cache+16, %edx
Show last 27 lines
	leal	1(%edx), %ecx
movl %ecx, __ZL5cache+16
movb $-72, (%edx)
movl __ZL5cache+16, %edx
movl %eax, (%edx)
leal 4(%edx), %eax
leal 1(%eax), %edx
movl %edx, __ZL5cache+16
movb $-61, (%eax)
addl $4, %esp
ret
L267:
movl __ZL5cache+16, %eax
movl $-16333, %edx
movw %dx, (%eax)
addl $2, %eax
leal 1(%eax), %edx
movl %edx, __ZL5cache+16
movb $-61, (%eax)
addl $4, %esp
ret
L266:
movl %eax, (%esp)
call __ZL16gen_protectflagsv.part.2
movl (%esp), %eax
jmp L262

Reply 17 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Added a cache_add3, which adds three bytes to the cache doing a dword move and inc the pointer by 3, replacing the addb + addw for those cases.

Attachments

  • Filename
    m3.diff
    File size
    157.91 KiB
    Downloads
    38 downloads
    File license
    Fair use/fair dealing exception

Reply 19 of 26, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Oh yeah? Well I've got cache_addq & cache_add7 working preliminarily and currently implementing the optimizations 😀

edit: Well, they work, but it seems like doom bench is getting slower as I add 64 bit moves.

Last edited by awgamer on 2018-08-19, 23:59. Edited 2 times in total.