This is nearly done. I must say that it was quite fun, though it looked too scary to me before, so thanks for the request 😉
However, I must also point out the performance of the float mode rendering. It involves accurate trigonometric calculations instead of LUTs, hence it is a lot slower. For example, smf2wav-ing Dune's Water takes only 7.2 sec with the default renderer but no less than 48.8 sec with floats...
Well, this difference actually depends on the compiler and the runtime it uses. That weird figure is for GCC 5.3; with VS2015 it is substantially faster, and the smf2wav conversion completes in only 10.3 sec.
Thanks for implementing the floating-point rendering mode. But there seems to be an FP denormals problem related to this mode.
I have written a VSTi plugin based on Munt, and it works perfectly in INT16 mode. I have compiled Munt as a shared library with the C API. Note that I always use mt32emu_render_float(), even in INT16 mode, since in the VST world floating point is mandatory. The problem is that after some idle period in FP mode, Munt starts to eat CPU. I have isolated the problem to Munt itself, since this phenomenon also happens when I mitigate denormals in my plugin (by adding 1.0e-24). Moreover, it also happens when I do not even send sample data to the VST host and just use mt32emu_render_float() to get data from Munt.
The situation is more problematic in the case of my plugin, since it uses 2 instances/contexts of Munt to simulate 16-channel GM mode. So the problem is much more noticeable when both synths start to eat CPU. Also, the CPU overhead is higher when a higher sample rate is used (e.g. 96000). Here is a video about the problem:
According to my profiling results (AMD CodeAnalyst), the methods mostly responsible for the high CPU usage during idle periods are the following:
In a sampling session of 61240 CPU clocks:
1. MT32Emu::AccurateLowPassFilter::process (60% of samples, IPC: 0.15)
2. MT32Emu::BReverbModelImpl<float>::produceOutput<float> (10% of samples, IPC: 0.06)
3. MT32Emu::CombFilter<float>::process (5% of samples, IPC: 0.09)
4. MT32Emu::AnalogImpl<float>::produceOutput<float> (5% of samples, IPC: 0.15)
5. MT32Emu::weirdMul (5% of samples, IPC: 0.14)
They are about an order of magnitude slower during idle periods, thanks to denormals.
By contrast, here is the sampling result of a normal run (while playing):
In a sampling session of 21189 CPU clocks:
1. MT32Emu::AccurateLowPassFilter::process (31% of samples, IPC: 1.38)
2. MT32Emu::LA32FloatWaveGenerator::generateNextSample (9.8% of samples, IPC: 0.98)
3. _exp_pentium4 (6% of samples, IPC: 1.13)
4. MT32Emu::weirdMul (3.3% of samples, IPC: 0.65)
5. MT32Emu::CombFilter<float>::process (2.5% of samples, IPC: 0.75)
8. MT32Emu::BReverbModelImpl<float>::produceOutput<float> (1.5% of samples, IPC: 0.45)
I have successfully reduced the problem to around 1-2% of CPU overhead during idle periods by modifying Analog.cpp and BReverbModel.cpp:
Thanks for reporting it. I also suspected something like that in the float reverb implementation, but I see no such drastic effects on my system. Certainly, the reverb needs a measure to fight denormals; I'll look into it closer on the weekend. Though, I think there is no need for it in Analog (since the filter is effectively a finite impulse response).
Thanks for your response.
I do not want to argue, since I'm far from fully knowing the inner world of Munt, but I would like to mention that on my system the biggest problem is with MT32Emu::AccurateLowPassFilter::process, and it's in Analog.cpp. As you can see in the profiling results, Instructions Per Clock (the best efficiency indicator) is an order of magnitude worse in MT32Emu::AccurateLowPassFilter::process during idle periods. So at least on my system, Analog is definitely affected.
I do think it's related to denormals, since it responded well to the changes. But of course it could also be a side effect.
Indeed, AccurateLowPassFilter consumes more CPU than the reverb stuff simply because it does so in any case 😉
But note that a FIR filter produces denormals in most cases only if it receives denormals on its input. The reverb processing code is no doubt subject to the denormal issue, as it relies on recursive algorithms. Hence, I suppose, you can safely nuke the biasing change in Analog. Though, you may need to adjust the bias a bit.
> Indeed, AccurateLowPassFilter consumes more of CPU than the reverb stuff just because it does so in any case
Yes, I know. That's why I mentioned IPC, since it has nothing to do with the absolute number of CPU cycles consumed, only with how efficiently those cycles are used.
So it's true that AccurateLowPassFilter is the biggest consumer in both normal and idle periods, but in the latter case it is tenfold slower (it requires ten times the CPU cycles to do the same amount of work: IPC 1.38 vs. IPC 0.15).
> you can safely nuke the biasing change in the Analog. Though, you may need to adjust the bias a bit.
Do you have a recommendation? I used such a relatively high value because I read in the past that the Pentium 4 had problems even with small values that are unproblematic on other CPUs.
More precisely, I found an example where it was mentioned that even 1e-25 can be problematic on the P4, so 1e-24 is recommended.
I have no idea regarding this kind of P4 weirdness, sorry. Anyway, considering the capabilities of real audio applications and hardware, I'd say that even 1e-10 (i.e. -200dB) can be used as a bias. As an alternative to biasing, a minimum sample value threshold could be implemented, though that way is seemingly more complicated and slower.
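For reference, the threshold variant mentioned above might look something like this (the function name is made up, and the 1.0e-10 cutoff simply follows the -200dB figure):

```cpp
#include <cmath>

// Flush-by-threshold: samples below roughly -200 dB are replaced with
// exact zero, so downstream feedback stages never see denormals.
// Name and cutoff are illustrative, not from Munt.
inline float flushTiny(float x) {
    return (std::fabs(x) < 1.0e-10f) ? 0.0f : x;
}
```

The extra compare-and-branch per sample is why this tends to be slower than simply biasing the signal.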
And this is from a PDF titled 'Tobybear's VST Plugin Template and Tutorial' (I cannot find it online anymore):
> What is this Pentium 4 denormal bug?
>
> A lot of audio programs and plugins currently have problems with the Pentium 4: occasionally high CPU spikes (sometimes over 500%) occur, even when there is no audio processing taking place, sometimes crashing the whole system because of system overload! This affects many shareware plugins, but also the big companies, Steinberg Grand and Waldorf Attack for example. The reason behind it is an Intel "bug" (or probably "feature" in their terms): the P4 goes into a special "super-precise" mode called the "denormalisation" mode once floating point numbers get very small. This is normal, and almost every other CPU dealing with floating point numbers does this, too. Unfortunately, calculations in this mode take up much more of the CPU processing power and are much slower, so for realtime applications a CPU should not spend too much time in this "denormal mode".
>
> There is no sure way to switch off this behaviour, at least none as far as I know. The problem existed on the Pentium 3 too, and there were workarounds like adding some "quiet" noise to the signal to keep the floating point numbers always above the threshold. On the Pentium 4 this threshold is much higher though: from my experiments it changed from about 1.0e-25 to 1.0e-24, that is a factor of 10! Which does of course mean that the program stays 10 times longer in this "denormal" mode, causing some programs which run smoothly on the P3 to stutter or even crash on a P4.
>
> There seems to be a new generation of P4s that allegedly do not have this pronounced denormal effect (still worse than the performance of an AMD though), but don't forget: your algorithm should *always* be checked for denormalisation problems, as it is a general problem with floating point numbers on almost any processor. There are excellent resources about this topic; it would probably be a good idea to go to http://www.musicdsp.org and read the various papers there.
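For what it's worth, the PDF's claim that there is "no sure way to switch off this behaviour" no longer holds for SSE code paths: on SSE2-capable x86 CPUs, the MXCSR flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits disable denormal handling per thread. A sketch (x86-only, hence the guards; on other architectures it is a no-op):

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#include <pmmintrin.h>  // MXCSR control macros (includes xmmintrin.h)
#endif

// Enable flush-to-zero / denormals-are-zero on the calling thread.
// Affects only SSE floating-point code, which is the default float
// path on x86-64.
void enableFlushToZero() {
#if defined(__SSE2__) || defined(_M_X64)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results become 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs read as 0
#endif
}
```

Note this is a global, thread-wide setting and slightly violates IEEE 754, so a host application or another plugin sharing the thread may not expect it; the per-filter biasing/threshold approaches above remain the safer fix inside a library like Munt.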