Got gcov to work in msys. \ VOGONS


First post, by ih8registrations

Posted on 2009-07-27, 17:08

ih8registrations Offline

Rank Oldbie

Rank: Oldbie
Posts: 931
Joined: 2003-07-25, 17:20

Browsing the output, I noticed that in the while loop below, enabled doesn't appear to change inside the loop (unless handler(todo) can change it?), so the check could be pulled out of the loop with "if (enabled) while (needed>done) {.."

function _ZN12MixerChannel3MixEj called 782296 returned 100% blocks executed 100%
   782296:  146:void MixerChannel::Mix(Bitu _needed) {
   782296:  147:        needed=_needed;
  1126140:  148:        while (enabled && needed>done) {
   343844:  149:                Bitu todo=needed-done;
   343844:  150:                todo*=freq_add;
   343844:  151:                if (todo & MIXER_REMAIN) {
   141971:  152:                        todo=(todo >> MIXER_SHIFT) + 1;
        -:  153:                } else {
   201873:  154:                        todo=(todo >> MIXER_SHIFT);
        -:  155:                }
   343844:  156:                handler(todo);
        -:  157:        }

To use gcov, export CFLAGS and CXXFLAGS set to "-fprofile-arcs -ftest-coverage" with no optimization, then run configure. The compile will bork on the INLINEs; remove them (about 5-6 of them) until it compiles successfully. Run dosbox with whatever game, then run "gcov mixer.cpp". The output will be in mixer.cpp.gcov, and so on for other source files.

Reply 1 of 23, by HunterZ

Posted on 2009-07-27, 17:14

HunterZ Offline

Rank l33t++

Rank: l33t++
Posts: 6171
Joined: 2003-01-31, 19:04
Location: Seattle

On the other hand, if enabled does need to be checked and if (needed>done) changes more than enabled does, then their order within the while() statement should possibly be flipped around anyways for extra efficiency via short-circuit evaluation.

Reply 2 of 23, by ih8registrations

Posted on 2009-07-27, 17:33

Enabled is checked, and it's an &&, no short circuit involved.

The inner if/else can go too; gcov shows it alternating between both branches, and removing it also saves half a cache line in the process.

// 784 KB (803,055 bytes) before
// 784 KB (803,023 bytes) after
void MixerChannel::Mix(Bitu _needed) {
	needed=_needed;
	if (enabled)
	while (needed>done) {
		Bitu todo=needed-done;
		todo*=freq_add;
		todo=(todo >> MIXER_SHIFT) + ((todo & MIXER_REMAIN)!=0);
/*
		if (todo & MIXER_REMAIN) {
			todo=(todo >> MIXER_SHIFT) + 1;
		} else {
			todo=(todo >> MIXER_SHIFT);
		}
*/
		handler(todo);
	}
}

gcov output:

function _ZN12MixerChannel3MixEj called 585050 returned 100% blocks executed 100%
   585050:  148:void MixerChannel::Mix(Bitu _needed) {
   585050:  149:        needed=_needed;
   585050:  150:        if (enabled)
   574725:  151:        while (needed>done) {
   287068:  152:                Bitu todo=needed-done;
   287068:  153:                todo*=freq_add;
   287068:  154:                todo=(todo >> MIXER_SHIFT) + ((todo & MIXER_REMAIN)!=0);
        -:  155:/*
        -:  156:                if (todo & MIXER_REMAIN) {
        -:  157:                        todo=(todo >> MIXER_SHIFT) + 1;
        -:  158:                } else {
        -:  159:                        todo=(todo >> MIXER_SHIFT);
        -:  160:                }
        -:  161:*/
   287068:  162:                handler(todo);
        -:  163:        }
        -:  164:}
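
The branchless rounding in the rewritten Mix above relies on a comparison converting to 0 or 1. A minimal sketch that checks it against the if/else form follows; the shift and mask values are assumptions chosen for illustration (the thread does not show the real MIXER_SHIFT and MIXER_REMAIN definitions), and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only; not taken from the DOSBox headers.
constexpr std::uint32_t kShift  = 14;
constexpr std::uint32_t kRemain = (1u << kShift) - 1;

// Original shape: branch on whether any fractional bits are set.
std::uint32_t round_up_branchy(std::uint32_t todo) {
    if (todo & kRemain) return (todo >> kShift) + 1;
    return todo >> kShift;
}

// Rewritten shape: (expr != 0) converts to 0 or 1, so 1 is added
// exactly when there is a remainder.
std::uint32_t round_up_branchless(std::uint32_t todo) {
    return (todo >> kShift) + ((todo & kRemain) != 0);
}

// Returns true when both forms agree for every value below limit.
bool forms_agree(std::uint32_t limit) {
    for (std::uint32_t t = 0; t < limit; ++t) {
        if (round_up_branchy(t) != round_up_branchless(t)) return false;
    }
    return true;
}

// Small self-check covering a few multiples of the mask range.
void self_check() {
    assert(forms_agree(1u << 17));
}
```

Whether the compiler already emits the same code for both forms depends on the optimization level, so the size difference measured above may not carry over to an optimized build.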

Reply 3 of 23, by HunterZ

Posted on 2009-07-27, 18:29

ih8registrations wrote:
Enabled is checked, and it's an &&, no short circuit involved.

Short-circuit evaluation is most certainly used for the && operator in C++. Specifically, if the first condition is false then the second is not evaluated.

Sounds like it doesn't matter here though after all.
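
For reference, a small sketch of the short-circuit rule described above. The counter and helper names are hypothetical and exist only to make the evaluation observable.

```cpp
#include <cassert>

// Hypothetical counter: records how often the right-hand operand runs.
int rhs_calls = 0;

bool rhs(bool value) {
    ++rhs_calls;
    return value;
}

// Returns how many times the right-hand side of && was evaluated
// for the given left-hand value.
int count_rhs_evaluations(bool lhs) {
    rhs_calls = 0;
    bool result = lhs && rhs(true);
    assert(result == lhs);  // with rhs(true), the result equals lhs
    return rhs_calls;
}

void self_check() {
    assert(count_rhs_evaluations(false) == 0);  // right side skipped
    assert(count_rhs_evaluations(true) == 1);   // right side evaluated
}
```

In the Mix loop both operands are plain reads, so the order mostly affects which comparison gets skipped, not correctness.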

Reply 4 of 23, by ih8registrations

Posted on 2009-07-27, 21:14

I misspoke. Yes, it doesn't matter here.

331812562:  451:bool PIC_RunQueue(void) {
        -:  452:	/* Check to see if a new milisecond needs to be started */
331812562:  453:	CPU_CycleLeft+=CPU_Cycles;
331812562:  454:	CPU_Cycles=0;
331812562:  455:	if (CPU_CycleLeft<=0) {
   333454:  456:		return false;
        -:  457:	}
        -:  458:	/* Check the queue for an entry */
331479108:  459:	Bits index_nd=PIC_TickIndexND();
331693760:  460:	while (pic_queue.next_entry && (pic_queue.next_entry->index*CPU_CycleMax<=index_nd)) {
   214652:  461:		PICEntry * entry=pic_queue.next_entry;
   214652:  462:		pic_queue.next_entry=entry->next;
   214652:  463:		(entry->pic_event)(entry->value);
        -:  464:		/* Put the entry in the free list */
   214652:  465:		entry->next=pic_queue.free_entry;
   214652:  466:		pic_queue.free_entry=entry;
        -:  467:	}
        -:  468:	/* Check when to set the new cycle end */
331479108:  469:	if (pic_queue.next_entry) {
331479108:  470:		Bits cycles=(Bits)(pic_queue.next_entry->index*CPU_CycleMax-index_nd);
331479108:  471:		if (!cycles) cycles=1;
331479108:  472:		if (cycles<CPU_CycleLeft) {
 93038444:  473:			CPU_Cycles=cycles;
        -:  474:		} else {
238440664:  475:			CPU_Cycles=CPU_CycleLeft;
        -:  476:		}
    #####:  477:	} else CPU_Cycles=CPU_CycleLeft;
331479108:  478:	CPU_CycleLeft-=CPU_Cycles;
331479108:  479:	if 	(PIC_IRQCheck)	PIC_runIRQs();
331479108:  480:	return true;
        -:  481:}

This:

331812562:  455:	if (CPU_CycleLeft<=0) {
   333454:  456:		return false;

and this:

331479108:  472:		if (cycles<CPU_CycleLeft) {
 93038444:  473:			CPU_Cycles=cycles;
        -:  474:		} else {
238440664:  475:			CPU_Cycles=CPU_CycleLeft;
        -:  476:		}

could use GCC_UNLIKELY, though the second case could use more testing to see if the scenario flips.
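
A sketch of what such a hint looks like on the first case. The macro below is an assumed definition built on the GCC/Clang builtin __builtin_expect; the real GCC_UNLIKELY in the DOSBox headers may be defined differently, and run_queue_sketch is a hypothetical stand-in, not the real PIC_RunQueue.

```cpp
#include <cassert>

// Assumed definition for illustration; falls back to a plain
// expression on compilers without the builtin.
#if defined(__GNUC__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical stand-in for the early-out at the top of PIC_RunQueue.
bool run_queue_sketch(long cycles_left) {
    if (SKETCH_UNLIKELY(cycles_left <= 0)) {
        // Taken 333454 times out of 331812562 calls in the listing above.
        return false;
    }
    return true;
}

void self_check() {
    assert(!run_queue_sketch(0));
    assert(run_queue_sketch(5));
}
```

The hint only affects code layout and branch prediction assumptions, never the result, which is why a wrong guess on the second case would cost speed but not correctness.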

Reply 5 of 23, by wd

Posted on 2009-07-27, 21:28

wd Offline

Rank DOSBox Author

Rank: DOSBox Author
Posts: 10813
Joined: 2003-12-03, 21:23

This code is entered rarely compared to the cpu emulation.

Reply 6 of 23, by Qbix

Posted on 2009-07-28, 06:31

Qbix Offline

Rank DOSBox Author

Rank: DOSBox Author
Posts: 11324
Joined: 2002-11-27, 14:50
Location: Fryslan

These are relatively safe optimizations. I have no problem with a few extra GCC_UNLIKELYs here and there, and the mixer Mix optimizations don't hurt either. It's not where most of the time is spent, but it is called quite often.

Water flows down the stream
How to ask questions the smart way!

Reply 7 of 23, by wd

Posted on 2009-07-28, 08:40

Then add it, it won't do any harm.

Reply 8 of 23, by Qbix

Posted on 2009-07-28, 09:40

I think the pcspeaker callback could technically turn off a channel, and maybe adlib as well, so that change with enabled might not be entirely correct. However, at least the pcspeaker should work fine being called a few times while disabled.
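
A minimal sketch of that concern, using a hypothetical miniature channel rather than the real MixerChannel: when the handler itself clears enabled, the loop that tests enabled on every pass stops early, while the hoisted version keeps calling the handler until needed is reached.

```cpp
#include <cassert>

// Hypothetical miniature of a mixer channel; all names are illustrative.
struct ChannelSketch {
    bool enabled = true;
    unsigned needed = 0;
    unsigned done = 0;
    unsigned handler_calls = 0;
    unsigned disable_after = 0;  // handler clears `enabled` on this call (0 = never)

    void handler(unsigned todo) {
        ++handler_calls;
        done += todo;
        if (disable_after != 0 && handler_calls == disable_after) enabled = false;
    }

    // Check inside the loop: stops once the handler disables the channel.
    void mix_checked_each_pass(unsigned want) {
        needed = want;
        while (enabled && needed > done) handler(1);
    }

    // Check hoisted out of the loop: runs until `needed` is reached.
    void mix_hoisted(unsigned want) {
        needed = want;
        if (enabled) {
            while (needed > done) handler(1);
        }
    }
};

unsigned calls_checked(unsigned want, unsigned disable_after) {
    ChannelSketch c;
    c.disable_after = disable_after;
    c.mix_checked_each_pass(want);
    return c.handler_calls;
}

unsigned calls_hoisted(unsigned want, unsigned disable_after) {
    ChannelSketch c;
    c.disable_after = disable_after;
    c.mix_hoisted(want);
    return c.handler_calls;
}

void self_check() {
    assert(calls_checked(10, 0) == calls_hoisted(10, 0));  // same if never disabled
    assert(calls_checked(10, 3) < calls_hoisted(10, 3));   // differ if disabled mid-loop
}
```

So the two forms are only interchangeable for handlers that never touch enabled, or that tolerate a few extra calls while disabled.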

Reply 9 of 23, by ih8registrations

Posted on 2009-07-28, 10:50

Meh, I'd leave (enabled && needed>done) as is if hoisting the check would muck up behavior badly enough; it's just that at first glance it doesn't look like it would.

@wd yeah, Mix is pretty far down the list. It was just the first thing I ran into, plus my penchant for optimizing just because. That said, gprof shows there often isn't a localized bottleneck; the load is spread out, so short of rewriting things with better methods (like doing 32/64-bit reads/writes for byte and word accesses, or optimizing the generated code rather than translating it straight, like converting movsb to movsd), the only way left to get improvement is to shore up everything.

Reply 10 of 23, by wd

Posted on 2009-07-28, 12:21

Some work on the threading ideas would (imo) be more interesting even
if the results may not be useful (but hopefully insightful).

Reply 11 of 23, by ih8registrations

Posted on 2009-07-28, 13:42

If you're interested in threading, this guy is a candidate, though a little fine grained:

function _ZN15CodePageHandler14FindCacheBlockEj called 408253959 returned 100% blocks executed 100%
408253959:  313:	CacheBlock * FindCacheBlock(Bitu start) {
408253959:  314:		CacheBlock * block=hash_map[1+(start>>DYN_HASH_SHIFT)];
965482041:  315:		while (block) {
964297390:  316:			if (block->page.start==start) return block;
557228082:  317:			block=block->hash.next;
        -:  318:		}
  1184651:  319:		return 0;
        -:  320:	}

Reply 12 of 23, by ih8registrations

Posted on 2009-07-28, 15:54

Here's a size-optimized FindDynReg. Besides being 10 bytes smaller overall, the big win is that the for loop is no longer duplicated, which should reduce cache thrashing (gcov shows both loops being used liberally). I haven't measured the difference in cycle usage yet, though.

//194 KB (198,811 bytes)
//194 KB (198,801 bytes)
static GenReg * FindDynReg(DynReg * dynreg,bool stale=false) {
	x86gen.last_used++;
	if (dynreg->genreg) {
		dynreg->genreg->last_used=x86gen.last_used;
		return dynreg->genreg;
	}
	/* Find best match for selected global reg */
	Bits i;
	Bits first_used,first_index;
	first_used=-1;
	Bits terminate;
	Bits increment;
	if (dynreg->flags & DYNFLG_HAS8) {
		i=first_index=0;
		terminate=X86_REG_EBX+1;
		increment=1;
	} else {
		i=first_index=X86_REGS-1;
		terminate=-1;
		increment=-1;
	}
	while (i!=terminate) {
		GenReg * genreg=x86gen.regs[i];
		if (genreg->notusable) { i+=increment; continue;}
		if (!(genreg->dynreg)) {
			genreg->Load(dynreg,stale);
			return genreg;
		}
		if (genreg->last_used<(Bitu)first_used) {
			first_used=genreg->last_used;
			first_index=i;
		}
		i+=increment;
	}
	/*
	if (dynreg->flags & DYNFLG_HAS8) {
		// Has to be eax,ebx,ecx,edx 
		for (i=first_index=0;i<=X86_REG_EBX;i++) {
			GenReg * genreg=x86gen.regs[i];
			if (genreg->notusable) continue;
			if (!(genreg->dynreg)) {
				genreg->Load(dynreg,stale);
				return genreg;
			}
			if (genreg->last_used<(Bitu)first_used) {
				first_used=genreg->last_used;
				first_index=i;
			}
		}
	} else {
		for (i=first_index=X86_REGS-1;i>=0;i--) {
			GenReg * genreg=x86gen.regs[i];
			if (genreg->notusable) continue;
			if (!(genreg->dynreg)) {
				genreg->Load(dynreg,stale);
				return genreg;
			}
			if (genreg->last_used<(Bitu)first_used) {
				first_used=genreg->last_used;
				first_index=i;
			}
		}
	}
	*/
	/* No free register found use earliest assigned one */
	GenReg * newreg=x86gen.regs[first_index];
	newreg->Load(dynreg,stale);
	return newreg;
}
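
The core of the change is replacing two direction-specific loops with one loop driven by a start, terminate, and increment triple. A stripped-down sketch of that pattern on a plain array follows; the function and parameter names are hypothetical and it leaves out the least-recently-used bookkeeping of the real FindDynReg.

```cpp
#include <cassert>
#include <vector>

// Returns the index of the first `true` entry, scanning forward over
// [0, last_forward] or backward over [size-1, 0], with one loop body
// shared by both directions. Returns -1 if no entry is set.
// Precondition for the forward case: last_forward < free_slot.size().
int find_first_free(const std::vector<bool>& free_slot, bool forward, int last_forward) {
    int i;
    int terminate;
    int increment;
    if (forward) {
        i = 0;
        terminate = last_forward + 1;
        increment = 1;
    } else {
        i = static_cast<int>(free_slot.size()) - 1;
        terminate = -1;
        increment = -1;
    }
    // Stepping in the for header means an early `continue` cannot skip the step.
    for (; i != terminate; i += increment) {
        if (free_slot[i]) return i;
    }
    return -1;
}

void self_check() {
    std::vector<bool> slots{false, true, false, true, false};
    assert(find_first_free(slots, true, 3) == 1);
    assert(find_first_free(slots, false, 0) == 3);
}
```

Using i != terminate as the condition only works because the step is exactly one in either direction; a larger step could jump past the terminator.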

Reply 13 of 23, by wd

Posted on 2009-07-28, 16:50

I'd expect gcov to not take into account the time spent in the recompiled code,
so it's missing the by far biggest part.
Nevertheless, the FindDynReg looks ok, "threading" FindCacheBlock sounds
useless though (maybe some better algorithm for the block finding, but don't
know how bad the hashing is).
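
The gcov counts from the FindCacheBlock listing give a rough idea of how the hashing behaves for that particular run. A small sketch of the arithmetic, with the three constants copied from the listing and everything else hypothetical:

```cpp
#include <cassert>

// Execution counts copied from the gcov listing for FindCacheBlock.
constexpr double kCalls       = 408253959.0;  // function entries
constexpr double kComparisons = 964297390.0;  // `block->page.start==start` tests
constexpr double kMisses      = 1184651.0;    // executions of `return 0`

// Average number of chain entries compared per lookup.
double comparisons_per_call() { return kComparisons / kCalls; }

// Fraction of lookups that walked the whole chain and found nothing.
double miss_rate() { return kMisses / kCalls; }

void self_check() {
    assert(comparisons_per_call() > 2.0 && comparisons_per_call() < 3.0);
    assert(miss_rate() < 0.01);
}
```

That works out to a little under 2.4 comparisons per lookup with well under 1% misses, for this one session only; a different game could give a different picture.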

Reply 14 of 23, by ih8registrations

Posted on 2009-07-28, 18:13

Even though I run the dynamic core all the time, the normal core still runs a lot. These small changes to the inlined fetches give a big drop, 6574 bytes.

core_normal.cpp:

501065914:  110:static INLINE Bit8u Fetchb() {
250532957:  111:	Bit8u temp=LoadMb(core.cseip);
250532957:  112:	core.cseip+=1;
        -:  113:	return temp;


//306 KB (313,408 bytes)
//299 KB (306,834 bytes)
static INLINE Bit8u Fetchb() {
	core.cseip++;
//	Bit8u temp=
	return LoadMb(core.cseip-1);
//	core.cseip+=1;
//	return temp;
}

static INLINE Bit16u Fetchw() {
	core.cseip+=2;
//	Bit16u temp=
	return LoadMw(core.cseip-2);
//	core.cseip+=2;
//	return temp;
}
static INLINE Bit32u Fetchd() {
	core.cseip+=4;
//	Bit32u temp=
	return LoadMd(core.cseip-4);
//	core.cseip+=4;
//	return temp;
}

Reply 15 of 23, by wd

Posted on 2009-07-28, 20:12

Well you should check then *why* the normal core is being called. Usually this
happens for example with code that has a high degree of self-modification
of instructions (the "opcode" part, not the data of the instruction). So there
may be improvement on that part.

Reply 16 of 23, by Qbix

Posted on 2009-09-05, 11:19

For those fetchb/w/d and such, it might be interesting to look at the generated asm as well.

Reply 17 of 23, by kekko

Posted on 2009-09-09, 20:01

kekko Offline

Rank Oldbie

Rank: Oldbie
Posts: 501
Joined: 2004-03-24, 18:56

Qbix wrote:
For those fetchb/w/d and such, it might be interesting to look at the generated asm as well.

I'm quite interested too, for fetch functions as well as other read/write memory functions... 😜

Reply 18 of 23, by Qbix

Posted on 2009-09-09, 20:24

Qbix Offline

Rank DOSBox Author

Rank: DOSBox Author
Posts: 11324
Joined: 2002-11-27, 14:50
Location: Fryslan

Maybe this would be even shorter

return LoadMb(core.cseip++);

Reply 19 of 23, by ih8registrations

Posted on 2009-09-10, 01:52

It is, very much so.

//940 KB (963,360 bytes)
//908 KB (930,468 bytes)

even for fetchw & fetchd

//908 KB (930,468 bytes)
//897 KB (918,944 bytes)
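static INLINE Bit16u Fetchw() {
//	core.cseip+=2;
//	Bit16u temp=
	// Caution: += yields the updated value, so this reads at the new cseip,
	// not at the old one as cseip++ does in Fetchb.
	return LoadMw(core.cseip+=2);
//	core.cseip+=2;
//	return temp;
}
//897 KB (918,944 bytes)
//890 KB (912,060 bytes)
static INLINE Bit32u Fetchd() {
//	core.cseip+=4;
//	Bit32u temp=
	// Caution: same as Fetchw, this reads at the new cseip.
	return LoadMd(core.cseip+=4);
//	core.cseip+=4;
//	return temp;
}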

Need to bench and make sure everything still works.
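
One thing worth checking while testing: cseip++ and cseip+=2 do not hand the same kind of value to the loader. Post-increment yields the old value, while compound assignment yields the updated one. A minimal sketch with hypothetical stand-ins (ip plays the role of core.cseip, and load_addr just reports the address it was asked to read):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for core.cseip and the LoadMx functions.
std::uint32_t ip = 0;
std::uint32_t load_addr(std::uint32_t addr) { return addr; }

// Post-increment: reads at start, then ip becomes start + 1.
std::uint32_t fetch_post_increment(std::uint32_t start) {
    ip = start;
    return load_addr(ip++);
}

// Compound assignment: ip becomes start + 2 first, then that value is read.
std::uint32_t fetch_compound(std::uint32_t start) {
    ip = start;
    return load_addr(ip += 2);
}

// Compensated form: reads at start and leaves ip at start + 2.
std::uint32_t fetch_compound_adjusted(std::uint32_t start) {
    ip = start;
    return load_addr((ip += 2) - 2);
}

void self_check() {
    assert(fetch_post_increment(100) == 100);
    assert(fetch_compound(100) == 102);
    assert(fetch_compound_adjusted(100) == 100);
}
```

If the word and dword fetches are meant to read at the old address like the byte fetch, the compensated form (or the earlier cseip-2 and cseip-4 variants) keeps that behavior; the size figures above were measured with the plain += form.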
