VOGONS


First post, by superfury

Rank l33t++

I'm currently using the following code for low-pass and high-pass filtering all my generated signals:
Header with all typedefs:

#ifndef FILTER_H
#define FILTER_H

#include "headers/types.h" //Basic types!

typedef struct
{
    byte isInit; //Initialized filter?
    byte isFirstSample; //First sample?
    float sound_last_result; //Last result!
    float sound_last_sample; //Last sample!

    float alpha; //Solid value that doesn't change for the filter, until the filter is updated!

    //General filter information and settings set for the filter!
    byte isHighPass;
    float cutoff_freq;
    float samplerate;
} HIGHLOWPASSFILTER; //High or low pass filter!

//Global high and low pass filters support!
void initSoundFilter(HIGHLOWPASSFILTER *filter, byte ishighpass, float cutoff_freq, float samplerate); //Initialize the filter!
void updateSoundFilter(HIGHLOWPASSFILTER *filter, byte ishighpass, float cutoff_freq, float samplerate); //Update the filter information/type!
void applySoundFilter(HIGHLOWPASSFILTER *filter, float *currentsample); //Apply the filter to a sample stream!

#endif

Code for processing the selected filter:

#include "headers/support/filters.h" //Our filter definitions!

void updateSoundFilter(HIGHLOWPASSFILTER *filter, byte ishighpass, float cutoff_freq, float samplerate)
{
    //Compare against the stored settings first: overwriting isHighPass before this check would hide a filter type change!
    if (filter->isInit || (filter->cutoff_freq!=cutoff_freq) || (filter->samplerate!=samplerate) || (ishighpass!=filter->isHighPass)) //We're to update?
    {
        if (ishighpass) //High-pass filter?
        {
            float RC = (1.0f / (cutoff_freq * (2.0f * (float)PI))); //RC is used multiple times, calculate once!
            filter->alpha = (RC / (RC + (1.0f / samplerate))); //Alpha value to use!
        }
        else //Low-pass filter?
        {
            float dt = (1.0f / samplerate); //DT is used multiple times, calculate once!
            filter->alpha = (dt / ((1.0f / (cutoff_freq * (2.0f * (float)PI))) + dt)); //Alpha value to use!
        }
    }
    filter->isHighPass = ishighpass; //High-pass filter?
    filter->cutoff_freq = cutoff_freq; //New cutoff frequency!
    filter->samplerate = samplerate; //New samplerate!
}

void initSoundFilter(HIGHLOWPASSFILTER *filter, byte ishighpass, float cutoff_freq, float samplerate)
{
    filter->isInit = 1; //We're an Init!
    filter->isFirstSample = 1; //We're the first sample!
    updateSoundFilter(filter,ishighpass,cutoff_freq,samplerate); //Init our filter!
}

void applySoundFilter(HIGHLOWPASSFILTER *filter, float *currentsample)
{
    INLINEREGISTER float last_result;
    if (filter->isFirstSample) //No last? Only executed once when starting playback!
    {
        filter->sound_last_result = filter->sound_last_sample = *currentsample; //Save the current sample!
        filter->isFirstSample = 0; //Not the first sample anymore!
        return; //Abort: don't filter the first sample!
    }
    last_result = filter->sound_last_result; //Load the last result to process!
    if (filter->isHighPass) //High-pass filter?
    {
        last_result = filter->alpha * (last_result + *currentsample - filter->sound_last_sample);
    }
    else //Low-pass filter?
    {
        last_result += (filter->alpha*(*currentsample-last_result));
    }
    filter->sound_last_sample = *currentsample; //The last sample that was processed!
    *currentsample = filter->sound_last_result = last_result; //Give the new result!
}

But it's running very slowly, even on fast systems (an Intel i7 at 4.0GHz, low-pass filtering a 3.57MHz Game Blaster PWM signal). Is there a way to optimize this to be fast? Or is this already as fast as it can be?
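
For reference, the two branches in the code above correspond to the usual first-order RC difference equations. The symbols are chosen here for illustration and do not appear in the code: $x[n]$ is the input sample, $y[n]$ the filtered output, $f_c$ the cutoff frequency and $f_s$ the sample rate.

```latex
RC = \frac{1}{2\pi f_c}, \qquad \Delta t = \frac{1}{f_s}

\text{low-pass:}\quad \alpha = \frac{\Delta t}{RC + \Delta t}, \qquad y[n] = y[n-1] + \alpha\,\bigl(x[n] - y[n-1]\bigr)

\text{high-pass:}\quad \alpha = \frac{RC}{RC + \Delta t}, \qquad y[n] = \alpha\,\bigl(y[n-1] + x[n] - x[n-1]\bigr)
```

Note that only the high-pass form needs the previous input $x[n-1]$; the low-pass form only needs the previous output.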

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 15, by vladstamate

Rank Oldbie

I assume the costly function is applySoundFilter. There are some things you can do.

1) You are paying the price for "if (filter->isFirstSample)" 3.57 million times a second. Do something so you do not have to check against the first sample every. Single. Time.

2) You are also checking isHighPass on every call. Just have two functions, one for low-pass and one for high-pass. Either that or make this function templated.

Ideally your high frequency call should go to a function like this:

void applyHighPassSoundFilter(HIGHLOWPASSFILTER *filter, float *currentsample)
{
    float last_result = filter->sound_last_result; //Load the last result to process!
    last_result = filter->alpha * (last_result + *currentsample - filter->sound_last_sample);
    filter->sound_last_sample = *currentsample; //The last sample that was processed!
    *currentsample = filter->sound_last_result = last_result; //Give the new result!
}

That is basically an FMUL, an FADD and an FSUB, plus a few MOVs. This should be able to perform pretty well.

Look at the generated code in the Visual Studio disassembler.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 2 of 15, by superfury

Rank l33t++

Well, that seems to exactly be the speed problem:
- The first sample needing to pass without filtering is a hard one to remove: every sample except the first is allowed to be filtered. The first sample must always be checked for during filtering and cannot be handled any other way (or would that break the entire filter?).
- Would splitting the functionality up actually improve performance, apart from removing the high-pass vs low-pass check? In 32-bit mode the profiler says it takes some time (a quarter of the time), while the main chunk of the time is taken by the main calculations and storage: the alpha calculation of the low-pass filter, as well as the two final rows in the function, each taking about the same time to execute (~4-5%).

The first-sample check barely takes any time in either the x86 (0.2%) or the x64 (not seen in profiling) profiling run, probably because the branch is only taken once: when the first sample is filtered after initialization of the filter (after a new MIDI channel filter initialization or after emulator initialization/reset). The time taken by the if-else and its contents (the alpha filtering in the entire "if(highpass) highpassfilter; else lowpassfilter;") disappears entirely from profiling in the x64 version of UniPCemu, reducing the entire function to a mere 5-6% CPU usage.

The whole function takes about 25% of executed emulator time for all used filters in UniPCemu x86.


Reply 3 of 15, by vladstamate

Rank Oldbie

How many filters do you have active? Also, can you move them to a separate thread? That will improve your performance, as their execution will no longer hold up the main emulation thread.

Also rather than call the filter for each sample you might want to call the function for a whole buffer at a time.

You can also make the function inline (and put it into a header file) to avoid all the call/ret and stack operations.
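
A rough sketch of the buffer-at-a-time idea, assuming a reduced filter state. The `LowPassState` struct and the `lowPassBlock` name are hypothetical and not taken from UniPCemu. Keeping the running result in a local variable lets the compiler hold it in a register across the loop, and the call overhead is paid once per buffer instead of once per sample:

```c
#include <stddef.h>

/* Hypothetical, reduced filter state: just what a one-pole low-pass needs. */
typedef struct
{
    float alpha;       /* dt / (RC + dt), computed once when the filter is set up */
    float last_result; /* previous output sample */
} LowPassState;

/* Low-pass filter a whole buffer in place. */
static void lowPassBlock(LowPassState *state, float *samples, size_t count)
{
    const float alpha = state->alpha;
    float last = state->last_result; /* local copy for the duration of the loop */
    size_t i;
    for (i = 0; i < count; ++i)
    {
        last += alpha * (samples[i] - last); /* same formula as the per-sample version */
        samples[i] = last;
    }
    state->last_result = last; /* write the state back once per buffer */
}
```

The emulator would collect the raw samples for one output channel into an array and call `lowPassBlock` once per chunk; how large a chunk is practical depends on how soon the mixer needs the output.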


Reply 4 of 15, by superfury

Rank l33t++

I've moved the filters into separate low-pass and high-pass filter functions, without the low/high-pass checks. That improves performance: filtering now takes only 14.7%. The filter is used 4 times each tick (14MHz/4 periods): once for each output channel (chip 0 left, chip 0 right, chip 1 left, chip 1 right).

In the current commit:
Line 69 takes < 0.1%
Line 73 takes 0.3%
Line 78 takes 2.3%
Line 79 takes 3.1%
Line 80 takes 5.6%
Line 81 takes 3.4%

I'm trying to keep emulation mostly single-threaded, as is advised (the only other threads are the Settings menu and the Debugger, which both interrupt the main thread's emulation core and require the main thread to keep running in parallel for input). Also, the code a bit further ahead (the mixing code) might depend on the output (44.1kHz output, currently), which would slow down the code a lot, because the threads would need to wait for synchronization 44100 times each second.

Edit: Also, I seem to vaguely remember MinGW or the pspsdk (PSP linker) having trouble with inlined code in headers causing multiple definitions. So, other than copying it under a custom name (defeating the point of the filter module and its portability), it can no longer be made to work reliably on that system when 'optimized' that way.


Reply 6 of 15, by superfury

Rank l33t++

The algorithm I've based it on took the first sample from the buffer as direct (unfiltered) output. Will the filter still work properly without that step, with the previous sample and result simply filled with zeroes? What are the previous sample and result supposed to be when a filter starts up? All zeroed out?


Reply 7 of 15, by superfury

Rank l33t++

I've modified the filter to initialize the previous sample and result to start out at zero(just like any device that has power applied to it). I've also removed the checks for the first sample. Is that correct?

https://bitbucket.org/superfury/unipcemu/src/ … ers.c?at=master

Edit: I've just found out something: the low-pass filter was also saving the last sample, but doesn't use it during filtering (only the high-pass filter does). After fixing this, the whole balance shifted again in the low-pass filtering process.

https://bitbucket.org/superfury/unipcemu/src/ … ers.c?at=master
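
On the start-up question, here is a small sketch of how the low-pass filter behaves when the previous result starts at zero and there is no first-sample check. The function name is hypothetical. With a constant input the output simply ramps from 0 towards that input, like a discharged capacitor charging up, so compared to seeding the state with the first sample the only difference is a short start-up transient:

```c
#include <stddef.h>

/* Feed `count` copies of a constant input through a one-pole low-pass
   that starts from a zeroed state, and return the final output. */
static float lowPassSettle(float alpha, float input, size_t count)
{
    float last_result = 0.0f; /* zeroed state, as after power-on */
    size_t i;
    for (i = 0; i < count; ++i)
    {
        last_result += alpha * (input - last_result);
    }
    return last_result;
}
```

For example, with alpha 0.5 and an input of 1.0 the output is 0.5 after one sample and 0.875 after three, and it keeps closing the remaining gap by the factor (1 - alpha) on every further sample.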

...

Last edited by superfury on 2017-02-05, 01:56. Edited 1 time in total.


Reply 8 of 15, by vladstamate

Rank Oldbie

superfury wrote:

I've moved the filters to a separate low- and high-pass filter function copy, without the low/high pass checks. That improves the performance to it only taking 14.7%. The filter is used 4 times each tick (14MHz/4 periods): once for each output channel (chip 0 left, chip 0 right, chip 1 left, chip 1 right).

See! Glad I could help 😀 Don't knock optimizations until you've tried them.

superfury wrote:

Edit: Also, I seem to vaguely remember MinGW or the pspsdk(psp linker) having troubles with inlined code in headers causing multiple definitions, so except by copying it with a custom name(defeating the filter module and portability of it) it cannot be made to work on that system reliably anymore when 'optimized' that way.

I strongly suggest you try to inline that function. Inline it first, and only deal with problems if they actually happen. That function is a perfect candidate for inlining and would definitely benefit, especially because you are calling it so many times per second.
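
One common way around the multiple-definition worry is to declare the header function `static inline`: each source file that includes the header then gets its own private copy, so the linker never sees a duplicate symbol. A minimal sketch, assuming a compiler that accepts the C99 `inline` keyword; the header guard, struct and function names are hypothetical and not the real UniPCemu ones:

```c
#ifndef FILTER_INLINE_H
#define FILTER_INLINE_H

/* Hypothetical, reduced state for a one-pole high-pass filter. */
typedef struct
{
    float alpha;       /* RC / (RC + dt) */
    float last_result; /* previous output sample */
    float last_sample; /* previous input sample */
} HighPassState;

/* static: internal linkage, so including this header in several
   source files cannot cause a multiple-definition link error. */
static inline float highPassSample(HighPassState *state, float sample)
{
    float result = state->alpha * (state->last_result + sample - state->last_sample);
    state->last_sample = sample;
    state->last_result = result;
    return result;
}

#endif
```

Whether the call is actually inlined is still up to the compiler and its optimization settings, so it is worth checking the disassembly on each toolchain.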


Reply 9 of 15, by superfury

Rank l33t++

Ok. I'll see if I can do the inlining later today, and test that it compiles across the development kits (Visual C++, MinGW (gcc), PSPSDK, Android).

The balance of the LPF is now as follows:
Line 57: <0.1%(Function opening bracket)
Line 60: 3.2%(Calculating result)
Line 61: 12.3%(Writing result to variables and container)
Line 62: 4.9%(Function closing bracket)

Could it help to make the function a one-liner(formula-wise)? Or maybe a simple define as a one-line assignment?


Reply 10 of 15, by vladstamate

Rank Oldbie

superfury wrote:

Could it help to make the function a one-liner(formula-wise)? Or maybe a simple define as a one-line assignment?

No, the compiler will already do that for you.


Reply 11 of 15, by superfury

Rank l33t++

I've converted the filters into defines: one big assignment (low-pass) and multiple one-liner instructions (high-pass). They still take up about 18.6% of CPU time:
4.5%
7.6%
3.6%
2.9%

Using the object-based define in the Gameblaster.c.

Edit: The used 'inline' code:

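//Arguments are parenthesized so that expressions can safely be passed as parameters!
#define applySoundLowPassFilterObj(filter,currentsample) ((currentsample) = (filter).sound_last_result = (filter).sound_last_result+((filter).alpha*((currentsample)-(filter).sound_last_result)))

//Multiple statements are wrapped in do { } while (0) so the define acts as a single statement, e.g. inside an if without braces!
#define applySoundHighPassFilterObj(filter,currentsample,last_resulttmp) do { (last_resulttmp) = (filter).sound_last_result; (last_resulttmp) = (filter).alpha * ((last_resulttmp) + (currentsample) - (filter).sound_last_sample); (filter).sound_last_sample = (currentsample); (currentsample) = (filter).sound_last_result = (last_resulttmp); } while (0)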


Reply 12 of 15, by superfury

Rank l33t++

I've managed to optimize the code a bit:
https://bitbucket.org/superfury/unipcemu/src/ … ter.c?at=master

However, lines 656-659 (the low-pass filter) still take up quite a lot of time when emulating, according to the profiler (34.27% total Game Blaster time; filtering: 7.2%(656), 11.2%(656), 6.1%(657) and 8.1%(658)).


Reply 13 of 15, by Scali

Rank l33t

superfury wrote:

I'm trying to keep emulation mostly single-threaded, as is advised

This makes sense for the emulation of the actual hardware, since you could see the entire system as a 'single threaded' state machine, stepping one cycle at a time. Making separate threads for separate components will just make you have to resync all threads at each step. Depending on the complexity of the steps, the amount of overhead for multithreading may not be worth it. Besides, it makes debugging considerably more difficult.

However, one can argue that filtering the audio signal is no longer part of the actual emulation of the hardware. At least, it is not part of the state machine you're implementing.
The state machine will just provide a cycle-accurate stream of raw audio data. Once you have that raw data, there's no reason why you can't pass further processing off to another thread.
The same argument can be made for video data. The raw framebuffer data needs to be cycle-accurate, but if you want to add extra filtering/stretching, NTSC emulation or whatever else, this can be done in a separate thread.
Since these post-processing steps tend to be rather CPU-intensive (and don't work on a per-cycle basis, but process data in bigger chunks), I think multithreading is a good approach here.
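
To make the hand-off concrete, here is a minimal sketch of a single-producer, single-consumer sample queue: the emulation thread pushes raw samples and a post-processing thread drains them in chunks, without taking a lock per sample. It assumes a C11 compiler that provides `<stdatomic.h>`, which may not hold for every toolchain mentioned in this thread. All names (`SampleRing`, `ringPush`, `ringPopBlock`) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_CAPACITY 1024u /* must be a power of two */

typedef struct
{
    float data[RING_CAPACITY];
    atomic_size_t head; /* total samples written; only the producer advances it */
    atomic_size_t tail; /* total samples read; only the consumer advances it */
} SampleRing;

static void ringInit(SampleRing *ring)
{
    atomic_init(&ring->head, 0);
    atomic_init(&ring->tail, 0);
}

/* Producer side (emulation thread). Returns 0 when the ring is full. */
static int ringPush(SampleRing *ring, float sample)
{
    size_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);
    if (head - tail == RING_CAPACITY) return 0; /* no free slot */
    ring->data[head & (RING_CAPACITY - 1u)] = sample;
    /* release: the sample is written before the new head becomes visible */
    atomic_store_explicit(&ring->head, head + 1u, memory_order_release);
    return 1;
}

/* Consumer side (post-processing thread). Copies up to `max` samples
   into `out` and returns how many were copied. */
static size_t ringPopBlock(SampleRing *ring, float *out, size_t max)
{
    size_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&ring->head, memory_order_acquire);
    size_t available = head - tail;
    size_t count = (available < max) ? available : max;
    size_t i;
    for (i = 0; i < count; ++i)
    {
        out[i] = ring->data[(tail + i) & (RING_CAPACITY - 1u)];
    }
    /* release: the slots are free for reuse only after they have been copied */
    atomic_store_explicit(&ring->tail, tail + count, memory_order_release);
    return count;
}
```

The design only works with exactly one pushing thread and one popping thread; what the producer should do when the ring is full (wait, drop, or grow the buffer) is left open here.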

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 14 of 15, by superfury

Rank l33t++

The only things that are currently multithreaded are the MIDI active sense timer and the PSP keyboard swap handler (handling the LTRIGGER input to swap keyboard sets (LTRIGGER response) and modes (Select/Start responses)), which run in the debugger thread, the Settings menu thread or, in the remaining cases, the timer thread, and, in their own threads, the debugger (to be able to handle input easily while the CPU is paused) and the Settings menu (same case as the debugger thread). The video rendering cannot be in a separate thread, because the VGA constantly accesses display memory and affects output by changing resolution each frame (it became very slow when multithreaded, due to semaphore locks).


Reply 15 of 15, by superfury

Rank l33t++

One thing I'm wondering about: the Game Blaster uses PWM to provide different volumes (16 levels of volume). Won't mixing the raw PWM streams together before low-pass filtering destroy the intended result? Will the output still be correct? What happens if you mix multiple PWM streams and then low-pass filter the mix, instead of low-pass filtering each stream to get the correct output levels and then mixing (adding) them together?
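
The one-pole low-pass used in this thread is a linear operation, so filtering the sum of several streams gives the same result as summing the individually filtered streams. That holds as long as every filter uses the same alpha and starts from the same zeroed state, and nothing clips or saturates in between. A small sketch to check this numerically; the function names and the `MAX_SAMPLES` limit are made up for illustration:

```c
#include <stddef.h>

#define MAX_SAMPLES 64

/* One-pole low-pass from a zeroed state, input to output. */
static void lowPass(float alpha, const float *in, float *out, size_t count)
{
    float last = 0.0f;
    size_t i;
    for (i = 0; i < count; ++i)
    {
        last += alpha * (in[i] - last);
        out[i] = last;
    }
}

/* Largest absolute difference between "mix, then filter" and
   "filter, then mix" for two streams (at most MAX_SAMPLES samples). */
static float mixOrderError(float alpha, const float *a, const float *b, size_t count)
{
    float mixed[MAX_SAMPLES], mixedThenFiltered[MAX_SAMPLES];
    float filteredA[MAX_SAMPLES], filteredB[MAX_SAMPLES];
    float worst = 0.0f;
    size_t i;
    if (count > MAX_SAMPLES) count = MAX_SAMPLES;
    for (i = 0; i < count; ++i) mixed[i] = a[i] + b[i];
    lowPass(alpha, mixed, mixedThenFiltered, count);
    lowPass(alpha, a, filteredA, count);
    lowPass(alpha, b, filteredB, count);
    for (i = 0; i < count; ++i)
    {
        float diff = mixedThenFiltered[i] - (filteredA[i] + filteredB[i]);
        if (diff < 0.0f) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Any difference reported by `mixOrderError` should be down to floating-point rounding only. Mixing first also means one filter runs per output channel instead of one per stream.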
