VOGONS


Got gcov to work in msys.

Topic actions

First post, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

Browsing the output, noticed in the while below, enabled doesn't appear to be changed unless it's possible by handler(todo)?, so the check could be pulled out of the loop with "if (enabled) while (needed>done) {.."

function _ZN12MixerChannel3MixEj called 782296 returned 100% blocks executed 100%^M
782296: 146:void MixerChannel::Mix(Bitu _needed) {^M
782296: 147: needed=_needed;^M
1126140: 148: while (enabled && needed>done) {^M
343844: 149: Bitu todo=needed-done;^M
343844: 150: todo*=freq_add;^M
343844: 151: if (todo & MIXER_REMAIN) {^M
141971: 152: todo=(todo >> MIXER_SHIFT) + 1;^M
-: 153: } else {^M
201873: 154: todo=(todo >> MIXER_SHIFT);^M
-: 155: }^M
343844: 156: handler(todo);^M
-: 157: }^M

To use gcov, export CFLAGS and CXXFLAGS with "-fprofile-arcs -ftest-coverage" with no optimization, then configure. Try to compile, it will bork on INLINEs, remove them(about 5-6 of them,) until it successfully compiles. Run dosbox and whatever game, then run "gcov mixer.cpp." The output will be in mixer.cpp.gcov, etc.

Reply 1 of 23, by HunterZ

User metadata
Rank l33t++
Rank
l33t++

On the other hand, if enabled does need to be checked and if (needed>done) changes more than enabled does, then their order within the while() statement should possibly be flipped around anyways for extra efficiency via short-circuit evaluation.

Reply 2 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

Enabled is checked, and it's an &&, no short circuit involved.

Can get rid of the inside if else too, which gcov shows oscillates between both conditions, and also save half a cache line in the process.

// 784 KB (803,055 bytes) before
// 784 KB (803,023 bytes) after
void MixerChannel::Mix(Bitu _needed) {
needed=_needed;
if (enabled)
while (needed>done) {
Bitu todo=needed-done;
todo*=freq_add;
todo=(todo >> MIXER_SHIFT) + ((todo & MIXER_REMAIN)!=0);
/*
if (todo & MIXER_REMAIN) {
todo=(todo >> MIXER_SHIFT) + 1;
} else {
todo=(todo >> MIXER_SHIFT);
}
*/
handler(todo);
}
}

gcov output:

function _ZN12MixerChannel3MixEj called 585050 returned 100% blocks executed 100%^M
585050: 148:void MixerChannel::Mix(Bitu _needed) {^M
585050: 149: needed=_needed;^M
585050: 150: if (enabled)^M
574725: 151: while (needed>done) {^M
287068: 152: Bitu todo=needed-done;^M
287068: 153: todo*=freq_add;^M
287068: 154: todo=(todo >> MIXER_SHIFT) + ((todo & MIXER_REMAIN)!=0);^M
-: 155:/* ^M
-: 156: if (todo & MIXER_REMAIN) {^M
-: 157: todo=(todo >> MIXER_SHIFT) + 1;^M
-: 158: } else {^M
-: 159: todo=(todo >> MIXER_SHIFT);^M
-: 160: }^M
-: 161:*/ ^M
287068: 162: handler(todo);^M
-: 163: }^M
-: 164:}^M

Reply 3 of 23, by HunterZ

User metadata
Rank l33t++
Rank
l33t++
ih8registrations wrote:

Enabled is checked, and it's an &&, no short circuit involved.

Short-circuit evaluation is most certainly used for the && operator in C++. Specifically, if the first condition is false then the second is not evaluated.

Sounds like it doesn't matter here though after all.

Reply 4 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

I misspoke. Yes, it doesn't matter here.

331812562:  451:bool PIC_RunQueue(void) {
-: 452: /* Check to see if a new milisecond needs to be started */
331812562: 453: CPU_CycleLeft+=CPU_Cycles;
331812562: 454: CPU_Cycles=0;
331812562: 455: if (CPU_CycleLeft<=0) {
333454: 456: return false;
-: 457: }
-: 458: /* Check the queue for an entry */
331479108: 459: Bits index_nd=PIC_TickIndexND();
331693760: 460: while (pic_queue.next_entry && (pic_queue.next_entry->index*CPU_CycleMax<=index_nd)) {
214652: 461: PICEntry * entry=pic_queue.next_entry;
214652: 462: pic_queue.next_entry=entry->next;
214652: 463: (entry->pic_event)(entry->value);
-: 464: /* Put the entry in the free list */
214652: 465: entry->next=pic_queue.free_entry;
214652: 466: pic_queue.free_entry=entry;
-: 467: }
-: 468: /* Check when to set the new cycle end */
331479108: 469: if (pic_queue.next_entry) {
331479108: 470: Bits cycles=(Bits)(pic_queue.next_entry->index*CPU_CycleMax-index_nd);
331479108: 471: if (!cycles) cycles=1;
331479108: 472: if (cycles<CPU_CycleLeft) {
93038444: 473: CPU_Cycles=cycles;
-: 474: } else {
238440664: 475: CPU_Cycles=CPU_CycleLeft;
-: 476: }
#####: 477: } else CPU_Cycles=CPU_CycleLeft;
331479108: 478: CPU_CycleLeft-=CPU_Cycles;
331479108: 479: if (PIC_IRQCheck) PIC_runIRQs();
331479108: 480: return true;
-: 481:}

This:

331812562:  455:	if (CPU_CycleLeft<=0) {
333454: 456: return false;

and this:

331479108:  472:		if (cycles<CPU_CycleLeft) {
93038444: 473: CPU_Cycles=cycles;
-: 474: } else {
238440664: 475: CPU_Cycles=CPU_CycleLeft;
-: 476: }

could use GCC_UNLIKELY, though the second case could use more testing to see if the scenario flips.

Reply 5 of 23, by wd

User metadata
Rank DOSBox Author
Rank
DOSBox Author

This code is entered rarely compared to the cpu emulation.

Reply 6 of 23, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

It are relatively safe optimizations. I have no problems with a few extra GCC_UNLIKELY somewhere. and mixer mix optimizations don't hurt either. It's not where the big time is spend, but it is called quite often.

Water flows down the stream
How to ask questions the smart way!

Reply 7 of 23, by wd

User metadata
Rank DOSBox Author
Rank
DOSBox Author

Then add it, it won't do any harm.

Reply 8 of 23, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

i think the pcspeaker callback could turn off a channel technically speaking and maybe adlib as well.
so that change with enabled might not be entirely correct. However at least the pcspeaker should work fine being a called a few times while disabled

Water flows down the stream
How to ask questions the smart way!

Reply 9 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

Meh, I'd leave (enabled && needed>done) if it's going to muck up behavior bad enough, it's just at first glance it doesn't look like that would be.

@wd yeah, mix is pretty far down the list, it was just the first thing I ran into and my penchant for optimizing just because, though gprof shows there's often not a localized bottleneck but load being distributed such that the only thing left to get improvement without rewriting things with better methods(like doing 32/64 bit r/w for byte and word, or optimizations to the generated code rather than straight translations, like converting movsb to movsd) is to shore up everything.

Reply 10 of 23, by wd

User metadata
Rank DOSBox Author
Rank
DOSBox Author

Some work on the threading ideas would (imo) be more interesting even
if the results may not be useful (but hopefully insightful).

Reply 11 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

If you're interested in threading, this guy is a candidate, though a little fine grained:

function _ZN15CodePageHandler14FindCacheBlockEj called 408253959 returned 100% blocks executed 100%
408253959: 313: CacheBlock * FindCacheBlock(Bitu start) {
408253959: 314: CacheBlock * block=hash_map[1+(start>>DYN_HASH_SHIFT)];
965482041: 315: while (block) {
964297390: 316: if (block->page.start==start) return block;
557228082: 317: block=block->hash.next;
-: 318: }
1184651: 319: return 0;
-: 320: }

Reply 12 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

Here's a size optimized finddynreg. Besides being 10 bytes smaller overall, the big nut is no replication of the for loop, which should improve cache thrashing(gcov shows both loops being liberally used.) I haven't measured the cycles usage difference yet though.

//194 KB (198,811 bytes)
//194 KB (198,801 bytes)
static GenReg * FindDynReg(DynReg * dynreg,bool stale=false) {
x86gen.last_used++;
if (dynreg->genreg) {
dynreg->genreg->last_used=x86gen.last_used;
return dynreg->genreg;
}
/* Find best match for selected global reg */
Bits i;
Bits first_used,first_index;
first_used=-1;
Bits terminate;
Bits increment;
if (dynreg->flags & DYNFLG_HAS8) {
i=first_index=0;
terminate=X86_REG_EBX+1;
increment=1;
} else {
i=first_index=X86_REGS-1;
terminate=-1;
increment=-1;
}
while (i!=terminate) {
GenReg * genreg=x86gen.regs[i];
if (genreg->notusable) { i+=increment; continue;}
if (!(genreg->dynreg)) {
genreg->Load(dynreg,stale);
return genreg;
}
if (genreg->last_used<(Bitu)first_used) {
first_used=genreg->last_used;
first_index=i;
}
i+=increment;
}
/*
if (dynreg->flags & DYNFLG_HAS8) {
// Has to be eax,ebx,ecx,edx
for (i=first_index=0;i<=X86_REG_EBX;i++) {
GenReg * genreg=x86gen.regs[i];
if (genreg->notusable) continue;
if (!(genreg->dynreg)) {
genreg->Load(dynreg,stale);
return genreg;
}
if (genreg->last_used<(Bitu)first_used) {
first_used=genreg->last_used;
first_index=i;
}
}
} else {
for (i=first_index=X86_REGS-1;i>=0;i--) {
GenReg * genreg=x86gen.regs[i];
if (genreg->notusable) continue;
if (!(genreg->dynreg)) {
genreg->Load(dynreg,stale);
return genreg;
}
if (genreg->last_used<(Bitu)first_used) {
Show last 12 lines
				first_used=genreg->last_used;
first_index=i;
}
}
}
*/
/* No free register found use earliest assigned one */
GenReg * newreg=x86gen.regs[first_index];
newreg->Load(dynreg,stale);
return newreg;
}

Reply 13 of 23, by wd

User metadata
Rank DOSBox Author
Rank
DOSBox Author

I'd expect gcov to not take into account the time spent in the recompiled code,
so it's missing the by far biggest part.
Nevertheless, the FindDynReg looks ok, "threading" FindCacheBlock sounds
useless though (maybe some better algorithm for the block finding, but don't
know how bad the hashing is).

Reply 14 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

Even though I run dynamic core all the time, lots of normal core running. These small changes to the inlined fetches give a big drop, 6574 bytes.

core_normal.cpp:

501065914: 110:static INLINE Bit8u Fetchb() {
250532957: 111: Bit8u temp=LoadMb(core.cseip);
250532957: 112: core.cseip+=1;
-: 113: return temp;


//306 KB (313,408 bytes)
//299 KB (306,834 bytes)
static INLINE Bit8u Fetchb() {
core.cseip++;
// Bit8u temp=
return LoadMb(core.cseip-1);
// core.cseip+=1;
// return temp;
}

static INLINE Bit16u Fetchw() {
core.cseip+=2;
// Bit16u temp=
return LoadMw(core.cseip-2);
// core.cseip+=2;
// return temp;
}
static INLINE Bit32u Fetchd() {
core.cseip+=4;
// Bit32u temp=
return LoadMd(core.cseip-4);
// core.cseip+=4;
// return temp;
}

Reply 15 of 23, by wd

User metadata
Rank DOSBox Author
Rank
DOSBox Author

Well you should check then *why* the normal core is being called. Usually this
happens for example with code that has a high degree of self-modification
of instructions (the "opcode" part, not the data of the instruction). So there
may be improvement on that part.

Reply 16 of 23, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

for those fetchb/w/d and such it might interresting to look at the generated asm as well

Water flows down the stream
How to ask questions the smart way!

Reply 17 of 23, by kekko

User metadata
Rank Oldbie
Rank
Oldbie
Qbix wrote:

for those fetchb/w/d and such it might interresting to look at the generated asm as well

I'm quite interested too, for fetch functions as well as other read/write memory functions... 😜

Reply 18 of 23, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

Maybe this would be even shorter

return LoadMb(core.cseip++);

Water flows down the stream
How to ask questions the smart way!

Reply 19 of 23, by ih8registrations

User metadata
Rank Oldbie
Rank
Oldbie

It is, very much so.

//940 KB (963,360 bytes)
//908 KB (930,468 bytes)

even for fetchw & fetchd

//908 KB (930,468 bytes)
//897 KB (918,944 bytes)
static INLINE Bit16u Fetchw() {
// core.cseip+=2;
// Bit16u temp=
return LoadMw(core.cseip+=2);
// core.cseip+=2;
// return temp;
}
//897 KB (918,944 bytes)
//890 KB (912,060 bytes)
static INLINE Bit32u Fetchd() {
// core.cseip+=4;
// Bit32u temp=
return LoadMd(core.cseip+=4);
// core.cseip+=4;
// return temp;
}

Need to bench and make sure everything still works.