Dynamic core optimization \ VOGONS

First post, by awgamer

Posted on 2018-03-29, 01:19

awgamer Offline

Rank: Oldbie
Posts: 805
Joined: 2014-07-26, 07:42

I'm still blocked by the issue. Anyway, the point was to optimize the dynamic recompiler (risc_x86.h) so that it is less bloated in cache and processes fewer instructions; see attached. Optimizations here would diminish CPU spikes, smooth things out, and help in possible thrashing corner cases. Some spots in decoder.h could be tightened up in the same way, such as the dyn_read/write_x functions, which get touched a lot.

Reply 1 of 26, by Qbix

Posted on 2018-03-29, 13:51

Qbix Offline

Rank: DOSBox Author
Posts: 11324
Joined: 2002-11-27, 14:50
Location: Fryslan

Interesting. I would have assumed that the compiler did some of these on its own (given the inlining and the bigger picture), but I checked the x64 dynrec core and noticed that it is only "smart" for the smaller functions. For the more complex things (gen_function_raw and such, which is itself inlined), it really does start emitting one byte at a time and incrementing the pointer through a move, increment, move-back sequence.
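One plausible explanation for that byte-at-a-time pattern, sketched under assumptions: if the emit helpers write through a byte pointer that is stored in a global, each byte store could alias the global itself (on typical platforms uint8_t is unsigned char, which may alias anything), so the compiler has to reload and write back the pointer around every store. The `Cache` struct and the `emit_b`/`emit_d` names below are hypothetical stand-ins, not DOSBox's actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the code cache: a global write position.
struct Cache {
    std::uint8_t* pos;
};
static Cache cache;

// One byte per call: load cache.pos, store a byte, write the incremented
// pointer back. Consecutive calls generally cannot be merged because the
// byte store might alias cache.pos.
static inline void emit_b(std::uint8_t v) { *cache.pos++ = v; }

// Four bytes per call: one pointer load, one 32-bit store, one write-back.
static inline void emit_d(std::uint32_t v) {
    std::memcpy(cache.pos, &v, sizeof v);  // safe for unaligned positions
    cache.pos += sizeof v;
}

// Emits the same four bytes both ways and reports whether the buffers
// match. Assumes a little-endian host, as on x86.
static bool same_bytes() {
    std::uint8_t a[4] = {0}, b[4] = {0};
    cache.pos = a;
    emit_b(0x59); emit_b(0x33); emit_b(0xc0); emit_b(0xc3);
    cache.pos = b;
    emit_d(0xc3c03359u);
    return std::memcmp(a, b, 4) == 0;
}
```

Both paths leave identical bytes in the buffer; the difference is only in how many loads and stores of the position pointer the emitter itself performs.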

Water flows down the stream
How to ask questions the smart way!

Reply 2 of 26, by awgamer

Posted on 2018-03-29, 14:34

Yeah, my intent was/is to use the gcc option that outputs its assembly, to see whether the generated code actually differs and know for sure at that point. Sounds like this is how you checked?

Reply 3 of 26, by Qbix

Posted on 2018-03-29, 15:31

yeah, I used

objdump -Mintel -dS core_dynrec.o > test.asm

But it is a bit messy to read because the code is optimized.
I could have used that gcc option to output it directly, but this is easier, given that the object files are normally already in my tree.

Reply 4 of 26, by awgamer

Posted on 2018-03-29, 16:19

Yeah, gcc asm output is cryptic, but a before-and-after comparison is mostly enough for me to follow along. Hopefully I'll work out this annoying permissions issue so I can play with this myself. Speaking of cryptic, I find some of the changes I made more readable, less spaghetti, and shorter than the original, but maybe that's just me :)

Reply 5 of 26, by Qbix

Posted on 2018-03-29, 16:41

It's easier to read with -Mintel, but the interleaved source (the -S) is sometimes a bit off.

Reply 6 of 26, by awgamer

Posted on 2018-03-29, 17:24

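Another tweak: the check "if (!dsr2 && (ddr==dsr1) && !imm_size) return;" can be pulled into the "if (!imm && (gsr1->index!=0x5))" path, since there is no need to do it when imm_size is 1 or 4. Inside that path imm_size is always 0, so the "!imm_size" term drops out.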

static void gen_lea(DynReg * ddr,DynReg * dsr1,DynReg * dsr2,Bitu scale,Bits imm) {
	GenReg * gdr=FindDynReg(ddr);
	Bitu imm_size;
	Bit8u rm_base=(gdr->index << 3);
	Bit8u index;
	if (dsr1) {
		GenReg * gsr1=FindDynReg(dsr1);
		if (!imm && (gsr1->index!=0x5)) {
			if (!dsr2 && (ddr==dsr1)) return;
			imm_size=0;	rm_base+=0x0;			//no imm
		} else if ((imm>=-128 && imm<=127)) {
			imm_size=1;rm_base+=0x40;			//Signed byte imm
		} else {
			imm_size=4;rm_base+=0x80;			//Signed dword imm
		}
		index=gsr1->index;
	} else {
		imm_size=4;
		index=5;
	}
	if (dsr2) {
		GenReg * gsr2=FindDynReg(dsr2);
		cache_addw(0x8d|(rm_base+0x4)<<8);	//0x8d=LEA | The sib indicator
		Bit8u sib=(index+(gsr2->index<<3)+(scale<<6));
		cache_addb(sib);
	} else {
		cache_addw(0x8d|(rm_base+index)<<8);	//LEA | dword imm
	}
	switch (imm_size) {
	case 0:	break;
	case 1:cache_addb(imm);break;
	case 4:cache_addd(imm);break;
	}
	ddr->flags|=DYNFLG_CHANGED;
}
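
For anyone following along, the displacement-size choice in gen_lea maps onto the x86 ModRM "mod" field: 00 means no displacement, 01 a signed 8-bit displacement, and 10 a 32-bit one. Base register 5 (EBP) cannot use mod=00, because that encoding means "disp32 with no base". Here is a minimal standalone sketch of just that decision; `DispChoice` and `pick_disp` are hypothetical names for illustration and are not part of DOSBox.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: displacement size in bytes, plus the ModRM
// "mod" field already shifted into bits 7..6.
struct DispChoice {
    unsigned size;
    std::uint8_t mod_bits;
};

// Picks the shortest encoding for [base + imm]. With base index 5 (EBP),
// mod=00 would be read as "disp32, no base", so a displacement byte is
// emitted even when imm is zero.
constexpr DispChoice pick_disp(unsigned base_index, std::int32_t imm) {
    return (imm == 0 && base_index != 5) ? DispChoice{0, 0x00}
         : (imm >= -128 && imm <= 127)   ? DispChoice{1, 0x40}
                                         : DispChoice{4, 0x80};
}
```

With EBP as the base and a zero offset, the sketch falls through to the one-byte form, which corresponds to the `gsr1->index!=0x5` test in the posted code.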

Reply 7 of 26, by awgamer

Posted on 2018-03-29, 21:30

cinched up decoder.h

Reply 8 of 26, by awgamer

Posted on 2018-08-08, 15:12

I can compile now; it took a reinstall, as Windows 7 was borked. The tweaks work, with a touch-up here and there. The performance change is negligible, though the binary is about 1 KB smaller and memory usage dropped by ~100 KB (it varies; I'm just tracking it with Task Manager), which I've been trading for inlining xyz. I still need to get asm output going (objdump isn't working for me currently) and profiling set up to see what's going on.

Reply 9 of 26, by Qbix

Posted on 2018-08-08, 15:17

I'll be interested to see what you come up with.
I did something similar to what you did, but for the dynrec core, and it did get a lot smaller. I noticed no performance change, though (which isn't too surprising, as the asm that DOSBox executes is unchanged).

Reply 10 of 26, by awgamer

Posted on 2018-08-08, 15:26

Refresh my memory on invoking diff to output the correct format and I can upload what I have now, warts and all. I used the svn tar.gz from here: https://www.dosbox.com/wiki/Building_DOSBox_with_MinGW

Reply 11 of 26, by Qbix

Posted on 2018-08-08, 15:49

Extract the source a second time (folder name dosbox-org), then run this in the folder that contains both yoursource and the original source:

diff -u dosbox-org/src/cpu/core_dynamic/decoder.h yoursource/src/cpu/core_dynamic/decoder.h > mypatch.txt

Reply 12 of 26, by awgamer

Posted on 2018-08-08, 16:13

The changes touch more than just decoder.h, but they are confined to the core_dyn_x86 directory.

Reply 13 of 26, by awgamer

Posted on 2018-08-08, 17:30

Did another tightening of the functions in decoder.h, changing calls like these:

cache_addd(0x52|(0x50+genreg->index)<<8|0xe850<<16);
to
cache_addd(0xe8505052+(genreg->index<<8));

That gets rid of two ORs and a shift. Chris's 3D bench liked it, it seems: 1001 vs 957. Margin of error? Like I said, I need to get asm output and profiling going.
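
A quick way to check that the folded constant is equivalent, as a minimal sketch with hypothetical helper names (`packed_or`, `packed_add`, `forms_agree`): the OR form and the ADD form give the same dword as long as 0x50+index still fits in one byte, which holds for register indices 0..7. On a little-endian host the dword lands in the cache as the bytes 52, 50+index, 50, e8.

```cpp
#include <cassert>
#include <cstdint>

// Original form: byte-sized fields combined with shifts and ORs.
constexpr std::uint32_t packed_or(std::uint32_t index) {
    return 0x52u | ((0x50u + index) << 8) | (0xe850u << 16);
}

// Folded form: constant parts pre-combined, only the index shifted in.
constexpr std::uint32_t packed_add(std::uint32_t index) {
    return 0xe8505052u + (index << 8);
}

// Compares the two forms for every 3-bit register index.
static bool forms_agree() {
    for (std::uint32_t index = 0; index < 8; ++index) {
        if (packed_or(index) != packed_add(index)) return false;
    }
    return true;
}
```

The fields occupy disjoint bits, so adding the shifted index cannot carry into the neighbouring byte, and addition can stand in for the OR.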

Reply 14 of 26, by awgamer

Posted on 2018-08-12, 19:55

The binary is now 8 KB smaller, mostly from replacing repeated function/method calls inside ifs and switches with a single call whose result is kept in a local variable. I also applied the optimized bounds checking from the mixer to the mouse handler. The optimization in gen_call_function that I had commented out is working now.

Reply 15 of 26, by awgamer

Posted on 2018-08-14, 07:09

Judging from the asm gcc is generating, the changes are faster too, not just a size reduction of the binary. So yeah, fewer cache_addx calls is betta.

static void gen_return(BlockReturn retcode) {
	gen_protectflags();
	if (retcode==0) {
		cache_addd(0xc3c03359);			//POP ECX, the flags
//		cache_addw(0xc033);		//MOV EAX, 0
//		cache_addb(0xc3);			//RET
	} else {
		cache_addw(0xb859);			//POP ECX, the flags
//		cache_addb(0xb8);		//MOV EAX, retcode
		cache_addd(retcode);
		cache_addb(0xc3);			//RET
	}
}

new:
__ZL10gen_return11BlockReturn.part.5:
	movl	__ZL5cache+16, %eax
	movl	$-1010814119, (%eax)
	addl	$4, %eax
	movl	%eax, __ZL5cache+16
	ret
__ZL10gen_return11BlockReturn:
	subl	$4, %esp
	cmpb	$0, __ZL6x86gen
	jne	L247
L244:  // -1xmov,2xlea
	testl	%eax, %eax
	je	L248
	movl	__ZL5cache+16, %edx
	movl	$-18343, %ecx
	movl	%eax, 2(%edx)
	leal	7(%edx), %eax
	movw	%cx, (%edx)
	movl	%eax, __ZL5cache+16
	movb	$-61, 6(%edx)
	addl	$4, %esp
	ret
L248:	// +1xadd,1xjmp, -2xmov,1xlea,1xadd
	addl	$4, %esp
	jmp	__ZL10gen_return11BlockReturn.part.5
L247:
	movl	%eax, (%esp)
	call	__ZL16gen_protectflagsv.part.2
	movl	(%esp), %eax
	jmp	L244

old:
__ZL10gen_return11BlockReturn:
	.cfi_startproc
	subl	$4, %esp
	cmpb	$0, __ZL6x86gen
	jne	L266
L262:
	movl	__ZL5cache+16, %edx
	testl	%eax, %eax
	leal	1(%edx), %ecx
	movl	%ecx, __ZL5cache+16
	movb	$89, (%edx)
	je	L267
	movl	__ZL5cache+16, %edx
	leal	1(%edx), %ecx
	movl	%ecx, __ZL5cache+16
	movb	$-72, (%edx)
	movl	__ZL5cache+16, %edx
	movl	%eax, (%edx)
	leal	4(%edx), %eax
	leal	1(%eax), %edx
	movl	%edx, __ZL5cache+16
	movb	$-61, (%eax)
	addl	$4, %esp
	ret
L267:
	movl	__ZL5cache+16, %eax
	movl	$-16333, %edx
	movw	%dx, (%eax)
	addl	$2, %eax
	leal	1(%eax), %edx
	movl	%edx, __ZL5cache+16
	movb	$-61, (%eax)
	addl	$4, %esp
	ret
L266:
	movl	%eax, (%esp)
	call	__ZL16gen_protectflagsv.part.2
	movl	(%esp), %eax
	jmp	L262
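
A note for reading the listings: gcc's AT&T output prints immediates as signed decimals, so the byte patterns from the C++ source are not obvious at a glance. Converting each immediate back to an unsigned value of the same width recovers them. The small check below is a sketch with hypothetical helper names (`as_u32`, `as_u16`, `as_u8`). The bytes 59, 33 c0 and c3 are POP ECX, XOR EAX,EAX and RET, and b8 is the opcode for MOV EAX, imm32.

```cpp
#include <cassert>
#include <cstdint>

// Conversion to an unsigned type is modular, so these casts are well
// defined and recover the original bit patterns.
constexpr std::uint32_t as_u32(std::int32_t v) { return static_cast<std::uint32_t>(v); }
constexpr std::uint16_t as_u16(std::int16_t v) { return static_cast<std::uint16_t>(v); }
constexpr std::uint8_t  as_u8(std::int8_t v)   { return static_cast<std::uint8_t>(v); }

// New code: one dword store for the retcode==0 case.
static_assert(as_u32(-1010814119) == 0xc3c03359u, "59 33 c0 c3 in one store");
// New code: one word store for POP ECX plus the MOV EAX opcode.
static_assert(as_u16(-18343) == 0xb859u, "59 b8 in one store");
// Old code: separate byte and word stores.
static_assert(as_u8(89) == 0x59u, "POP ECX");
static_assert(as_u8(-72) == 0xb8u, "MOV EAX, imm32 opcode");
static_assert(as_u16(-16333) == 0xc033u, "33 c0");
static_assert(as_u8(-61) == 0xc3u, "RET");
```

Read this way, the new retcode==0 path performs a single 32-bit store and one pointer update, where the old path performs three stores with a pointer update after each.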

Reply 16 of 26, by awgamer

Posted on 2018-08-15, 18:00

10k now.

Reply 17 of 26, by awgamer

Posted on 2018-08-19, 13:38

Added a cache_add3, which adds three bytes to the cache by doing a dword move and incrementing the pointer by 3, replacing the addb + addw pair in those cases.
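
Roughly what such a helper might look like, as a sketch only: the real cache_add3 is not shown in this thread, so `emit_pos`, `emit_3` and `check_emit_3` are hypothetical names, and the code assumes a little-endian host and at least four writable bytes at the current position.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the emitter's write position (not DOSBox's
// actual cache structure).
static std::uint8_t* emit_pos = nullptr;

// Writes the low three bytes of v with a single 32-bit store, then
// advances by 3. The fourth byte written is scratch: the next emit
// overwrites it.
static inline void emit_3(std::uint32_t v) {
    std::memcpy(emit_pos, &v, 4);
    emit_pos += 3;
}

// Emits three bytes followed by one byte and checks the resulting layout.
static bool check_emit_3() {
    std::uint8_t buf[8] = {0};
    emit_pos = buf;
    emit_3(0x00c03359u);   // bytes 59 33 c0, plus a scratch 00
    *emit_pos++ = 0xc3;    // next emit lands on the scratch byte
    return emit_pos == buf + 4 &&
           buf[0] == 0x59 && buf[1] == 0x33 &&
           buf[2] == 0xc0 && buf[3] == 0xc3;
}
```

The trade-off is that every call writes one byte past what it logically emits, so the buffer needs at least one byte of slack at the end.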

Reply 18 of 26, by Qbix

Posted on 2018-08-19, 14:07

I did that in my own tree as well. Guess we had similar thoughts 😀

Reply 19 of 26, by awgamer

Posted on 2018-08-19, 14:34

Oh yeah? Well, I've got cache_addq & cache_add7 working preliminarily and am currently implementing the optimizations 😀

Edit: Well, they work, but it seems like the Doom bench is getting slower as I add 64-bit moves.

Last edited by awgamer on 2018-08-19, 23:59. Edited 2 times in total.
