I swear, if I had real employment there wouldn't be nearly as many updates as there have been recently. Anyway, after a serious bout of insomnia, I've added quite a few changes to this version, including an all-new envelope management system and some optimizations courtesy of ih8registrations. I'm not quite there yet, but I think everyone will agree it's another good step closer. Strangely, in this version I've noticed instruments occasionally disappearing (particularly in the Monkey Island 1 theme). I'm trying to track it down; I'm sure it's some careless bug in the new envelope code.
Two extra
'pStat->prevlevel[PITCHENV] = tc;
return tc; '
are still shorter than one extra
'tc = pStat->envbase[PITCHENV];
tc = (tc + ((pStat->envdist[PITCHENV] * pStat->envpos[PITCHENV]) / pStat->envsize[PITCHENV]));'
Size opt: default false assignment for pStat->pitchsustain, since there's only one case where it's true. The cost of the default assignment is offset by the jump saved from the immediate return of tc, and because that happens more than once, the updated function is still faster overall.
At least in Visual C, immediate returns are no more optimized than just letting the code run to the end of the routine. The second divide calculation handles the decaying side of the envelope. Once decayed, the code should not be allowed into the standard block, because once the envelope position (envpos) extends past the envelope size (envsize) the code moves on to the next part of the envelope. The final decay, of course, is the end of the line. This is still needed because even though the pitch envelope could be complete in its decay, the other two envelopes (amplitude and filter) could still be far from complete. For informational purposes, here's Visual C's generated assembly code for this routine:
Fair enough. As for the second calculation, it sounds like you think I only applied the divide calc to the first case? Barring a bug, the rewrite is functionally equivalent. The handling of both cases was moved outside of the if/else structure and became the general case. The other cases should return without hitting it.
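A minimal sketch of that restructuring, assuming a cut-down, hypothetical EnvStat struct that only borrows the field names used in this thread (envbase, envdist, envpos, envsize, prevlevel). The finished flag and the function name pitch_env_level are made up for illustration and are not from the emulator source. The special case returns early, and both the attack and decay segments fall through to a single copy of the interpolation:

```c
#include <assert.h>

enum { PITCHENV = 0, NUM_ENV = 3 };

/* Hypothetical, simplified stand-in for the emulator's per-partial state. */
typedef struct {
    int envbase[NUM_ENV];   /* level at the start of the current segment */
    int envdist[NUM_ENV];   /* signed distance to the segment's target   */
    int envpos[NUM_ENV];    /* samples elapsed in the segment            */
    int envsize[NUM_ENV];   /* segment length in samples                 */
    int prevlevel[NUM_ENV]; /* last level handed back to the caller      */
    int finished;           /* assumed flag: final decay has completed   */
} EnvStat;

int pitch_env_level(EnvStat *s)
{
    int tc;

    /* Special case: nothing left to interpolate, return the held level. */
    if (s->finished)
        return s->prevlevel[PITCHENV];

    /* General case, written once and shared by attack and decay. */
    assert(s->envsize[PITCHENV] > 0);
    tc = s->envbase[PITCHENV]
       + (s->envdist[PITCHENV] * s->envpos[PITCHENV]) / s->envsize[PITCHENV];
    s->prevlevel[PITCHENV] = tc;
    return tc;
}
```

An attack segment has a positive envdist and a decay segment a negative one, so the same expression covers both without a second copy of the divide.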
Is the asm readout of my rewrite? It looks like a mix of old & new; the duplicate div calc of the old version is in there.
It looks like it's not setting pStat for this case: 1296's jump to L68826. To match what I wrote it would need to jump to L68830. 1329 is part of the problem, as 1326 should not be inside the else.
If that's Visual C's interpretation of my code I'm not impressed; it goes against what I told it to do by reversing my size optimization & introducing a bug :P
C does give you the power to strictly tell the compiler what to do by way of the goto statement. It's frowned upon in polite society, but as you can see it's what the compiler's doing anyway, and it's the most direct way to specify forward jumps in C.
How to tell Visual C to do the ending tc calc just once, and that I really mean it this time, I'm unsure. It should be doing what I tell it to as is.
PS: you may have noticed all the indexed refs of pStat->envXXX[idx] are four-byte addressing + four-byte base + one-byte immediate, and when stored are put into a four-byte register. For size optimization, if there are three or more references to XXX without modifying it, copying to a temp variable before using will save space. If modified, six or more will save. stat qualifies, but just so; it would save a whole two bytes :)
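A rough sketch of the temp-variable pattern at the source level. The EnvStat struct here is a hypothetical cut-down version using field names from this thread, and segment_progress is an invented helper. Whether it actually saves bytes depends on the compiler and target, so treat it as an illustration of the idea, not a measured result:

```c
enum { NUM_ENV = 3 };

/* Hypothetical, cut-down state struct using field names from the thread. */
typedef struct {
    int envpos[NUM_ENV];
    int envsize[NUM_ENV];
} EnvStat;

/* Reads envpos[idx] once into a local, then reuses the local three times
 * instead of repeating the indexed reference. */
int segment_progress(const EnvStat *s, int idx)
{
    int pos  = s->envpos[idx];   /* one indexed load */
    int size = s->envsize[idx];  /* one indexed load */

    if (pos <= 0)    return 0;
    if (pos >= size) return 256;
    return (pos * 256) / size;   /* progress as a fraction scaled by 256 */
}
```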
Last edited by ih8registrations on 2003-07-27, 11:22. Edited 1 time in total.
Merging the outside loop saves 100 cmp, inc, & jmps, as well as probably a mov, since there's probably enough going on to need reloading the counter. That's the 1k. The 30k comes from merging the two inner loops for 100 iterations, again saving a cmp, inc & jmp, probably not a mov, * 100 inner * 100 outer. If we lowball & say they all take only one cycle (possible, ignoring potential stalls and other effects), then it's 3*100*100 + outer 1k: 31k. There are several other outside loops in InitTables that can be merged and some other tweaks, for about another 5k or so that I saw, but this is the biggest savings to be had. The cost is the four lines of duplicated code, but for 31k cycles I can live with that :)
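To make the arithmetic concrete, here is a rough sketch of the loop merge, assuming two hypothetical 100x100 tables (table_a, table_b) and made-up fill formulas rather than the real InitTables code. Each function counts its inner-loop passes, so the difference times three (cmp, inc, jmp at one cycle each) gives the 30k estimate:

```c
enum { N = 100 };

/* Hypothetical tables; the real InitTables fills different data. */
int table_a[N][N];
int table_b[N][N];

/* Unmerged: each table gets its own pair of loops. */
long fill_separate(void)
{
    long passes = 0;   /* counts inner-loop passes */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { table_a[i][j] = i + j; passes++; }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { table_b[i][j] = i * j; passes++; }
    return passes;
}

/* Merged: one pair of loops runs both bodies. */
long fill_merged(void)
{
    long passes = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            table_a[i][j] = i + j;
            table_b[i][j] = i * j;
            passes++;
        }
    return passes;
}
```

With N = 100, fill_separate makes 20,000 inner passes and fill_merged makes 10,000, so the merge drops 10,000 passes, or about 30,000 loop-control instructions under the one-cycle assumption.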
following the initial one should be 'else if's, to avoid needlessly doing all the following checks once the matching range has already been found & executed.
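A small sketch of the difference, using invented range boundaries and helper names (in_range, band_independent, band_else_if are all hypothetical). The counter records how many range tests each version performs:

```c
int range_checks;  /* number of range tests performed so far */

/* Hypothetical helper: tests lo <= v < hi and counts the test. */
static int in_range(int v, int lo, int hi)
{
    range_checks++;
    return v >= lo && v < hi;
}

/* Independent ifs: every test runs even after a match. */
int band_independent(int v)
{
    int band = -1;
    if (in_range(v, 0, 54))   band = 0;
    if (in_range(v, 54, 74))  band = 1;
    if (in_range(v, 74, 128)) band = 2;
    return band;
}

/* else-if chain: testing stops at the first matching range. */
int band_else_if(int v)
{
    int band = -1;
    if (in_range(v, 0, 54))        band = 0;
    else if (in_range(v, 54, 74))  band = 1;
    else if (in_range(v, 74, 128)) band = 2;
    return band;
}
```

For a value in the first range, the independent version performs three tests and the chained version performs one; both return the same band.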
Wow... thanks for all the updates. I'm having trouble keeping up. As for the pitch envelope, it needs those duplicate divs because one manages the attack form of the envelope while the other manages the decay form. Thanks again for your changes. I'm not too worried about the table generation. On my 1333 MHz Celeron it takes about half a second to generate all the tables, with most of that consumed by the table generation for the lowpass filter. The real area of concern is the main processing area, the getSample routine. It's in that subroutine where optimizations will be most valuable.
                        // PCM partial
                        if(tcache->rawPCM>53) {
                            if(tcache->rawPCM>=74) {
                                if (partCache->PCMDone) {
                                    pOff->pcmabs =0;
                                    partCache->PCMDone = false;
                                }
                                pcm = PCMReassign[tcache->rawPCM - 74];
                            } else pcm = PCMReassign[tcache->rawPCM - 54];
                        } else pcm = tcache->convPCM;

                        delta = wavtabler[pcm][noteval];

                        if (!partCache->PCMDone) {
                            int ra, rb, addr = PCM[pcm].addr;
                            if(delta<0x10000) {
                                // Linear sound interpolation
                                ra = romfile[addr + pOff->pcmoffs.pcmplace];
                                rb = romfile[addr + pOff->pcmoffs.pcmplace+1];
                                ptemp[t] = (ra + (((rb-ra) * pOff->pcmoffs.pcmoffset) >>16));
                            } else
                                ptemp[t] = romfile[addr + pOff->pcmoffs.pcmplace];

                            if ((pOff->pcmoffs.pcmplace) >=PCM[pcm].len) {
                                if(PCM[pcm].loop)
                                    pOff->pcmabs = 0;
                                else partCache->PCMDone = true;
                            }
                        }
                    } else {
                        // Synthesis partial
                        int divis, ofs3, toff, wf;

                        toff = pOff->pcmoffs.pcmplace;
                        divis = divtable[noteval]>>15;

                        if(pOff->pcmoffs.pcmplace>=divis) pOff->pcmabs = (pOff->pcmoffs.pcmoffset % divis);

                        if(tcache->waveform == 0) {
                            // Square waveform. Made by combining two pregenerated bandlimited
                            // sawtooth waveforms
                            int divmark = divtable[noteval]>>8;

                            ofs3 = (toff + ((divmark*pulsetable[tcache->pulsewidth])>>16)) % (divis >> 1);

                            ptemp[t] = waveforms[0][noteval][toff % (divis >> 1)] + waveforms[1][noteval][ofs3];
                        } else {
                            // Sawtooth. Made by combining the full cosine and half cosine according
                            // to how the MT-32 does it. This is identical to the MT-32's operation
                            wf = 2;
                            if(toff >= sawtable[noteval][tcache->pulsewidth]) wf++;
                            ptemp[t] = waveforms[wf][noteval][toff];
                        }
                        ptemp[t] = getFiltEnvelope(ptemp[t],partCache,tmppoly,partCache->decaying[FILTENV]);
                    }
                    // Build delta for position of next sample
                    delta = (delta * finetable[tcache->fineshift])>>8;
                    delta = (delta * pdep)>>8;
                    delta = (delta * lfoat)>>8;

                    // Add calculated delta to our waveform offset
                    pOff->pcmabs+=delta;

                    // Put volume envelope over generated sample
                    ptemp[t] = (ptemp[t] * (int)ampval * (int)v) >> 14;

                    for(int envnum=0;envnum<3;envnum++) partCache->envpos[envnum]++;
                }
            }
            if(isDone) {
                tmppoly->isPlaying = false;
                tmppoly->isDecay = false;
            }
            // Post process partials and bring them together
            int temps, s1, s2, i = 0;
            *lspecial = *rspecial = 0;
            for(int z=0;z<2;z++) {
                if(z==0) {
                    temps = mt32ram.params.patch[patch].common.pstruct12;
                    s1=0;
                    s2=1;
                } else {
                    temps = mt32ram.params.patch[patch].common.pstruct34;
                    s1=2;
                    s2=3;
                }
                if(!pcache[s1].playPartial) s1=4;
                if(!pcache[s2].playPartial) s2=4;
                //LOG_MSG("z %d ps %d, s1 %d s2 %d", z, temps, s1, s2);

                temps = PartMixStruct[temps];

                switch(temps) {
                case 0:
                    // Standard sound mix
                    i+=ptemp[s1] + ptemp[s2];
                    break;
                case 1:
                    // Ring modulation with sound mix
                    i+=(((ptemp[s1] * ptemp[s2])>>WGAMP) + ptemp[s1]);
                    break;
                case 2:
                    // Ring modulation alone
                    i+=((ptemp[s1] * ptemp[s2])>>WGAMP);
                    break;
                case 3:
                    // Stereo mixing. One partial to one channel, one to another.
                    *lspecial += ptemp[s1];
                    *rspecial += ptemp[s2];
                default:
                    i+=ptemp[s1] + ptemp[s2];
                    break;
                }
            }
            if (!isRy) {
                // Mix standard timbre
                c += i;
            } else {
                c = 0;
                // Drums have their special, built in panpot locations
                *lspecial += ((i * drumPan[tmppoly->pcmnum][0]) >> 8);
                *rspecial += ((i * drumPan[tmppoly->pcmnum][1]) >> 8);
            }
            //tmppoly->pcmoff.pcmabs +=tmppoly->pcmdelta;
        }
    }
    return c;
}
/*
got rid of linefeeds for whitespace, indentions suffice; easier to trace with more on a page
partplay = DPOLY made part of the if/else rather than being set and then overridden if isRy
moved int i, r init to where they are used
moved *lspecial = *rspecial = 0 to where they are used, same place as int i;
removed int x, shitguard, unused
removed Bit32u tmpoff, unused
removed bool playwav = true, unused
removed int v & v = volume, unused
init c to 0 moved to bottom of function into conditional
cleaned up calculate lfo position
removed unnecessary temp var pd
cleaned up pcm partial
cleaned up synthesis partial
*/
Again, I think you're misunderstanding my code change, or I'm not understanding what you're saying. My change still does the div for both cases; it just doesn't have two copies of it. It's a space-saving optimization.
Next up to optimize for getSample are the functions it calls.
patchCache is a big structure, and the code does a default assignment of it. If the instrument is a drum, it does another load of this big structure. Ugh.
This is an optimization for when the if playpartial && isdecayed check doesn't fall through, at the cost (?) of doing the isDecayed check by referencing tmppoly=pStatus[t].
There are no memory moves here. tcache and partCache are pointer variables, not the actual structures in memory, so no memory is copied. The structures could be 1 byte in size or 256MB in size, and this code would execute equally fast. If I used actual structure variables rather than pointer variables, such a consideration would be an optimization. But again, these are pointers.
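A minimal sketch of the distinction, assuming a made-up PatchCache layout (the padding size and the helper names are hypothetical, not the emulator's). Taking a pointer stores only an address, while assigning the structure by value copies every byte of it:

```c
#include <stddef.h>

/* Hypothetical large cache structure; the size is illustrative only. */
typedef struct {
    int waveform;
    int pulsewidth;
    unsigned char padding[4096];
} PatchCache;

PatchCache caches[4];

/* Pointer form: only an address is handed back, whatever the struct size. */
const PatchCache *select_by_pointer(int n) { return &caches[n]; }

/* Value form: the whole structure is copied into the caller's object. */
PatchCache select_by_value(int n) { return caches[n]; }

/* Nominal number of bytes each form has to move. */
size_t bytes_moved_by_pointer(void) { return sizeof(const PatchCache *); }
size_t bytes_moved_by_value(void)   { return sizeof(PatchCache); }
```

The pointer form also keeps seeing later changes to the original, while the by-value copy is a snapshot taken at the time of the assignment.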
Have you read Michael Abrash's Zen of Code Optimization? In it, he goes through the ways one could "count cycles" and so forth. His ultimate conclusion, though, is that counting cycles can only go so far. The best form of optimization that Abrash suggests is complete reinnovation and rethinking of the algorithm. A good example was the change from the envelope caches to the envelope timer in my code. Not only was it more precise, it's also a good deal faster. This is the kind of code optimization I'm looking for. If there is a faster, more precise way of lowpass filtering that matches the MT-32's output, that's what I need. I need an efficient reverb algorithm. I feel that I could better generate pulse width modified squarewaves without combining two bandlimited square waves. These are the places where the greatest speed benefit will be seen.
I have profiled the code, but I've found that getting reliable, clear results is very difficult. This is because the music varies in its demand on certain parts of the emulator. PCM samples are easier to play than the analogue synthesis. Likewise, sawtooths are easier to synthesize than square waves. As such, music that's biased in one of these areas will skew results.
Sounds like individual test cases are needed for each code path. One way to do that which comes to mind is to use a MIDI sequencer. The MIDI sequencers I've played with all let you turn off channels. Combine that with one or a few MIDI files that use the various types (PCM, synth, drums, xyz effects) and you'll have playback that isolates them.
Last edited by ih8registrations on 2003-07-30, 22:24. Edited 1 time in total.