How to render audio (main thread) efficiently with SDL(audio thread)? \ VOGONS

First post, by superfury

Posted on 2016-02-04, 12:25

Rank: l33t++
Posts: 5785
Joined: 2014-03-08, 11:25
Location: Netherlands

I currently have a main thread (CPU and hardware) that renders audio after each CPU instruction. The audio is double buffered: there is one buffer for the CPU thread and, once it has filled enough, its contents are moved to the buffer of the rendering thread (which is locked with a semaphore while copying, using the FIFOBuffer functionality I've written). The rendering thread reads this buffer (locking it on every sample) and passes the samples to the mixer (also written specifically for my emulator), which in turn passes them back to SDL (the audio thread routines of SDL).

It works correctly, but I constantly hear plops in the sound: many plops each second, like a low-frequency signal (I think at least once every 4096 samples of the 44.1kHz signal), at least 8 plops each second. It sounds crackly, but the audio is still recognizable (I'm testing with 8088 MPH at the moment, hearing only drums (with varying tone frequency) at the credits, running on a 2GHz processor).

Running Bill & Ted's Excellent Adventure gives me correct sound, but with about 8Hz stutter (like an audience shouting).

All sound is buffered by the CPU (main) thread before being sent to its respective rendering thread (PC speaker, Adlib, Disney Sound Source/Covox Speech Thing). All these virtual devices render directly in the CPU (main) thread to their respective rendering buffers. The only one that renders on the rendering thread itself is the MIDI SF2 synthesizer (it renders all its audio directly in the rendering thread).

The PIT emulation (PC speaker) first calculates its 1.19MHz signal for all 3 channels. Then it processes channel 0 into up to 1 IRQ0 interrupt. It discards channel 1 data (not connected to anything we need).
After that channel 2, which drives the PC speaker, is sampled 44100 times a second, using 60us PIT output samples at the PIT rate. This generates the proper 44.1kHz audio stream of the PC speaker.
This 44.1kHz signal is written to the first buffer. Once the first buffer passes a threshold, threshold samples are moved from the first buffer to the rendering buffer (which is locked with a semaphore) in one go (lock, move, unlock).
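
A minimal sketch of turning the 1.19MHz PIT speaker output into a 44.1kHz stream, for illustration only. It is not taken from the emulator's source: the `Downsampler` type, the `downsampler_tick` function and the averaging of all ticks within one output period are assumptions, as is the exact PIT clock of 1193182 Hz.

```c
#include <stdint.h>

#define PIT_RATE 1193182u  /* PIT input clock in Hz (the "1.19MHz" above) */
#define OUT_RATE 44100u    /* output sample rate in Hz */

/* Hypothetical helper state, not the emulator's own structure. */
typedef struct {
    uint32_t phase;  /* accumulated OUT_RATE steps, kept below PIT_RATE */
    uint32_t high;   /* ticks with the speaker output high in this period */
    uint32_t ticks;  /* ticks seen since the last output sample */
} Downsampler;

/* Feed one PIT tick. Returns 1 and stores a sample in *out when an
   output sample is complete, otherwise returns 0. */
int downsampler_tick(Downsampler *d, int level, int16_t *out)
{
    d->high += (level != 0);
    d->ticks++;
    d->phase += OUT_RATE;
    if (d->phase < PIT_RATE)
        return 0;
    d->phase -= PIT_RATE;
    /* Average of the ticks in this period, scaled to 0..32767. */
    *out = (int16_t)((d->high * 32767u) / d->ticks);
    d->high = 0;
    d->ticks = 0;
    return 1;
}
```

The integer phase accumulator avoids floating-point drift: over one second of PIT ticks it produces exactly 44100 output samples, each covering 27 or 28 ticks.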

The Adlib emulation renders its sound in about the same way, but at a ~49kHz frequency, clocked by the CPU like the PIT counters and output. The rendering first processes the 80us/320us/CSM timers for the CPU time passed. Then it renders its Adlib output to its first buffer. When the first buffer is filled up to the threshold, its contents are moved to the rendering buffer, as the other channels also do in their final rendering step.

The Sound Source emulation renders its Sound Source input to its primary FIFO buffer (containing up to 16 samples, for accuracy and detection). Covox output by the CPU emulation simply sets a Covox left and right channel value (a byte value each). The Sound Source and Covox rendering routine handles Sound Source output first (for the CPU time elapsed), then the Covox output.
The Sound Source rendering (which happens at a 7kHz rate) first moves the Sound Source input (from the 16-sample buffer) to its secondary buffer (this also enables accurate detection by the CPU, which checks for an empty/full primary FIFO buffer). When the secondary buffer threshold is exceeded, it moves threshold samples from the secondary buffer to the renderer buffer (which is locked, copied to, then unlocked).
The Covox rendering writes the current values of the left and right channels (two 8-bit values, set by the CPU emulation, combined into one 16-bit stereo value) to its primary FIFO buffer. Then it moves the buffered content in blocks, like the Sound Source, to its rendering buffer (which is locked, copied to, then unlocked), with the block size set by the threshold like the other channels.
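
Combining the two 8-bit Covox channel values into one 16-bit FIFO entry could look roughly like the sketch below. The byte order (left in the high byte, right in the low byte) and the names `covox_pack` and `covox_unpack` are assumptions; the post does not say which layout the emulator uses.

```c
#include <stdint.h>

/* Assumed layout: left channel in the high byte, right in the low byte. */
uint16_t covox_pack(uint8_t left, uint8_t right)
{
    return (uint16_t)(((uint16_t)left << 8) | right);
}

/* Split a packed FIFO entry back into its two 8-bit channel values. */
void covox_unpack(uint16_t packed, uint8_t *left, uint8_t *right)
{
    *left = (uint8_t)(packed >> 8);
    *right = (uint8_t)(packed & 0xFFu);
}
```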

Information about buffer sizes, thresholds before rendering and rendering frequencies:
- PC speaker:
  Rendering frequency: 44.1kHz
  Double buffer threshold: 256 samples
  Rendering buffer size: 512 samples
- Adlib:
  Rendering frequency: ~49kHz
  Double buffer threshold: 16 samples
  Rendering buffer size: 4096 samples
- Disney Sound Source:
  Rendering frequency: 7kHz
  Primary buffer size: 16 samples
  Double buffer threshold: 4 samples
  Rendering buffer size: 1024 samples
- Covox (Speech Thing, based on information from DOSBox's Sound Source/Covox emulation):
  Rendering frequency: 44.1kHz
  Primary buffer size: same as the double buffer threshold
  Double buffer threshold: 409 samples
  Rendering buffer size: 4096 samples
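
To put the numbers above into time units, the small sketch below converts a block of samples at a given rate into microseconds of audio and into block transfers per second. The function names are made up for this illustration. For example, the PC speaker threshold of 256 samples at 44.1kHz is about 5.8 ms of audio, and a 4096-sample block at 44.1kHz lasts about 92.9 ms, so such blocks recur roughly 10.8 times per second.

```c
#include <stdint.h>

/* Duration of a block of samples in microseconds (rounded down). */
uint32_t block_duration_us(uint32_t samples, uint32_t rate_hz)
{
    return (uint32_t)(((uint64_t)samples * 1000000u) / rate_hz);
}

/* Block transfers needed per second to keep up (rounded up). */
uint32_t blocks_per_second(uint32_t samples, uint32_t rate_hz)
{
    return (rate_hz + samples - 1u) / samples;
}
```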

Is this efficient enough? All moving from the running buffers (the last buffer before moving to the renderer, depending on the device, as described above) to the rendering buffer happens using a locked FIFO on the renderer's end (which is read by the rendering callback one sample at a time: lock, read sample, unlock).

The rendering still gives plops (at about 8Hz) all the time for all audio output (except the MIDI, since it doesn't use the FIFO buffers and renders directly instead), even with the double buffering before passing it to the renderer (which makes the overall quality better, with fewer plops, but they still happen).
To hear the current result, simply run the release of my x86EMU emulator, running software or a BIOS which actually produces sound on that device, of course: the BIOS boot beep for the PC speaker, or games playing sound using the PC speaker/Adlib/Sound Source/Covox/MIDI (the latter requiring a soundfont (.sf2) selected in the BIOS).

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 7, by realnc

Posted on 2016-02-04, 15:31

Rank: Oldbie
Posts: 579
Joined: 2010-10-13, 11:02

If you really do locking on every single sample, then this might be your issue. Locking is a slow operation which can result in the audio buffer starving, and thus you can get audio dropouts. And perhaps worse, it's very difficult to tell which parts of the whole are waiting for locks and stalling if you do so much excessive locking. Too much locking can easily turn multithreaded code into something that performs even worse than single-threaded code.

You should be able to do mixing in the SDL audio thread (that is your SDL callback implementation) to simplify matters. So in total, you have at least two threads. SDL's audio thread (which invokes your callback), and your rendering thread (where you generate the audio samples). The rendering thread fills the output buffers, and the audio thread mixes them into SDL's audio buffer. You then have one mutex that you lock for all the output buffers.
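
As a rough illustration of mixing inside the callback, the sketch below sums several device buffers into one output buffer with clamping. `mix_devices` is a hypothetical helper, not part of SDL or of the emulator. In a real SDL callback the output buffer and its length in bytes are passed in by SDL, and the single mutex would be taken once around the whole call rather than per sample.

```c
#include <stdint.h>
#include <stddef.h>

/* Mix `count` device buffers of `n` mono 16-bit samples each into `out`,
   clamping the sum to the signed 16-bit range. */
void mix_devices(int16_t *out, const int16_t *const *devices,
                 size_t count, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t sum = 0;
        for (size_t d = 0; d < count; d++)
            sum += devices[d][i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        out[i] = (int16_t)sum;
    }
}
```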

There should be no need for the rendering thread to copy any data to the output buffers. The rendering thread can simply fill each output buffer *outside* the critical section, and, once they're filled, acquire the output buffers lock and simply swap ("flip") the pointers so that the output buffers point to the new samples. The rendering thread can then go on producing new samples while the audio thread mixes and outputs the old ones.

So the double buffer setup here means that every emulated sound device writes to its own output buffer without any locking, and once the required amount of samples has been written to each buffer, a lock is acquired and the buffers are flipped without any copying. If for example you have 5 buffers you need to mix, you use 10 (in reality they can still be 5, but twice the size.) Render stuff into the first 5. When they're filled, get the lock, flip those 5 buffers with the other ones, release the lock and continue. Every time they're filled again, lock, flip, unlock, continue. You constantly fill and flip.

There is no need for ring buffers in this setup. Each buffer is a simple array, twice the size of the SDL output buffer size. Flipping means swapping the output buffer pointers to point to the first half and second half of the arrays. The audio thread mixes and outputs one half, the rendering thread produces and writes samples in the other half.

(Obviously you must not flip if the old buffers have not been sent to SDL yet. You need to wait before flipping. If your setup is correct though, the waiting should be very short.)
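
The fill-and-flip scheme described in this post could be sketched as follows. All names (`FlipBuffer`, `flip_try`, `flip_consume`) and the half size of 512 samples are assumptions made for the example. Locking is left out: `flip_try` and `flip_consume` are meant to be called with the shared mutex held, while writing through `flip_write_ptr` needs no lock.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define HALF 512  /* samples per half; assumed equal to the SDL buffer size */

typedef struct {
    int16_t storage[2 * HALF];
    int write_half;  /* 0 or 1: the half the rendering thread fills */
    int read_ready;  /* 1 while the other half still waits to be output */
} FlipBuffer;

void flip_init(FlipBuffer *b) { b->write_half = 0; b->read_ready = 0; }

/* Rendering thread: where to write new samples (no lock needed). */
int16_t *flip_write_ptr(FlipBuffer *b)
{
    return b->storage + b->write_half * HALF;
}

/* Rendering thread, lock held: flip once the write half is full.
   Returns 0 if the old half has not been output yet (caller must wait). */
int flip_try(FlipBuffer *b)
{
    if (b->read_ready)
        return 0;
    b->write_half ^= 1;
    b->read_ready = 1;
    return 1;
}

/* Audio thread, lock held: copy the finished half to dst.
   Writes silence and returns 0 if no finished half is available. */
int flip_consume(FlipBuffer *b, int16_t *dst)
{
    if (!b->read_ready) {
        memset(dst, 0, HALF * sizeof *dst);
        return 0;
    }
    memcpy(dst, b->storage + (b->write_half ^ 1) * HALF, HALF * sizeof *dst);
    b->read_ready = 0;
    return 1;
}
```

The audio thread copies while holding the lock so that the rendering thread cannot flip back onto a half that is still being read. The lock is taken once per block on each side, never per sample.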

Reply 2 of 7, by superfury

Posted on 2016-02-04, 15:51

Essentially the same happens in my code, though not with pointers to a (to be) flipped buffer but with the actual data to render. The threshold determines how many samples are actually 'flipped' (not flipping in this case, but moving from the output buffer containing some samples to the renderer's buffer). So if I understand correctly, I just need to make sure the double buffer threshold size is equal to half the renderer's buffer size?

The only difference is that instead of fully replacing the buffers (by swapping pointers), in my case samples are simply added while the renderer is locked. So I just need to adjust the threshold size to half the renderer's size?

Though I must admit it's rendering at about 30% speed, since the CPU emulation runs at 30% speed (to which all the emulated video and sound rendering is tied for software timing accuracy). So could it be because the CPU (and thus the audio) runs too slowly/heavily?

Reply 3 of 7, by realnc

Posted on 2016-02-04, 16:06

You said in the original post that you acquire a lock on each sample:

superfury wrote:
The rendering thread reads this buffer (locking it on every sample) and passes it to the mixer

What I'm proposing is never locking except for the flipping (or copying.)

Reply 4 of 7, by superfury

Posted on 2016-02-04, 20:46

That locking on every sample happens only in the rendering thread (a function called by my mixer, which is in turn called by the SDL audio thread (started by SDL_OpenAudio)). The CPU thread (main thread) renders samples one device (PC speaker, Adlib and Sound Source+Covox) at a time, accumulating the data in a final FIFO buffer. When the threshold is reached for one of those final buffers (enough samples accumulated), the FIFO submodule will lock the renderer's FIFO to check if there's enough space to store all accumulated samples (the amount being the threshold constant). Then, while still holding the lock on the renderer's buffer, it will move those samples to the renderer's buffer in a fast loop without further locking. After that, the renderer's buffer is unlocked and the function returns to the caller to render more samples or continue execution.
https://bitbucket.org/superfury/x86emu/src/60 … fer.c?at=master

readfifobuffer (8-bit Sound Source) and readfifobuffer16 (all other channels (mono 16-bit), including the stereo Covox (2x8 bits)) are called by the renderer.
movefifobuffer8 and movefifobuffer16 are called by the main (CPU) thread with the threshold set to the device threshold. The source buffer contains the currently rendered samples (not locked); the destination is the rendering thread's buffer. The renderer reads one sample at a time (lock, read sample, unlock) so that the thread moving data can deliver data 'just in time', e.g. move the last block the renderer needs (the second half in this case) to the rendering thread while it's still processing the first half; with a lock held for the full rendering it would be too late. This is done to prevent missing blocks (second halves or first halves unavailable).
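
The block move described here (lock, check free space, copy the whole block, unlock) might be sketched like this. It is not the actual movefifobuffer16 code, whose signature is not shown in the thread: `SampleFifo`, `fifo_move_block` and the counting lock stand-ins are hypothetical and only show that one block costs one lock acquisition.

```c
#include <stdint.h>
#include <stddef.h>

#define FIFO_CAP 1024  /* renderer buffer size in samples (example value) */

typedef struct {
    int16_t data[FIFO_CAP];
    size_t head;     /* next index to read */
    size_t used;     /* samples currently stored */
    unsigned locks;  /* counts lock acquisitions, for illustration only */
} SampleFifo;

/* Stand-ins for a real mutex or semaphore. */
void fifo_lock(SampleFifo *f)   { f->locks++; }
void fifo_unlock(SampleFifo *f) { (void)f; }

/* Move `n` samples from an unlocked staging array into the FIFO using a
   single lock. All or nothing: returns 1 on success, 0 if there is not
   enough free space for the whole block. */
int fifo_move_block(SampleFifo *f, const int16_t *src, size_t n)
{
    int ok = 0;
    fifo_lock(f);
    if (n <= FIFO_CAP - f->used) {
        size_t tail = (f->head + f->used) % FIFO_CAP;
        for (size_t i = 0; i < n; i++)
            f->data[(tail + i) % FIFO_CAP] = src[i];
        f->used += n;
        ok = 1;
    }
    fifo_unlock(f);
    return ok;
}
```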

Reply 5 of 7, by realnc

Posted on 2016-02-04, 21:51

Your terminology confused me 😜 Anything that is called from the SDL callback is the audio thread. The rendering thread is the thread that renders samples (like your AdLib code.)

Anyway. Have you checked how many samples are moved each time to the buffer? If you lock on each sample, you might be running into the situation where you move only one sample at a time. (The audio thread locks on a single sample, copies it to SDL's buffer and releases the lock. The other thread acquires the lock, sees that there's room for one sample in the buffer, copies it, releases the lock. The audio thread locks again, etc, etc.)

Even though there's no guarantee that the locking is causing the issue, I believe it's important to get rid of locking on every sample. That's just insanely inefficient. For 48kHz audio, you're locking 48 thousand times per second. Locking is *expensive*. I suspect this is the issue, but I can't be sure, of course.
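
One way to get rid of the per-sample lock on the reading side is to let the callback pull a whole block with a single lock, along the lines of the sketch below. `Ring`, `ring_read_block` and the counting lock stand-ins are hypothetical names for this illustration. With 16-bit mono output, the block size would be the callback's byte length divided by `sizeof(int16_t)`.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_CAP 4096  /* renderer buffer size in samples (example value) */

typedef struct {
    int16_t data[RING_CAP];
    size_t head;     /* next index to read */
    size_t used;     /* samples currently stored */
    unsigned locks;  /* counts lock acquisitions, for illustration only */
} Ring;

/* Stand-ins for a real mutex or semaphore. */
void ring_lock(Ring *r)   { r->locks++; }
void ring_unlock(Ring *r) { (void)r; }

/* Read up to `n` samples with a single lock. If fewer are available, the
   rest of dst is padded with silence. Returns the number of real samples. */
size_t ring_read_block(Ring *r, int16_t *dst, size_t n)
{
    ring_lock(r);
    size_t got = (n < r->used) ? n : r->used;
    for (size_t i = 0; i < got; i++)
        dst[i] = r->data[(r->head + i) % RING_CAP];
    r->head = (r->head + got) % RING_CAP;
    r->used -= got;
    ring_unlock(r);
    for (size_t i = got; i < n; i++)
        dst[i] = 0;  /* underrun: output silence instead of stale data */
    return got;
}
```

Reading a 64-sample block this way takes one lock acquisition instead of 64, and the return value makes underruns easy to count while debugging.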

Reply 6 of 7, by superfury

Posted on 2016-02-05, 04:40

The only thing that's locking 48000 times per second is the SDL callback that reads the samples from the buffer. The main thread (running the CPU and other hardware too) gives the SDL callback data in blocks (the device's double buffer threshold is the size that gets moved from the main thread's rendering to the SDL callback with one lock).

So if the double buffer size is 4096 samples, the main (CPU) thread renders 4096 samples at the CPU-synchronized speed to the intermediate buffers. Then, when those 4096 samples are ready and the rendering buffer has 4096 samples free, it locks the rendering buffer (making the SDL callback stall until done) and moves the 4096 samples to the rendering buffer. Then it releases the lock and continues. Once the lock is released, the callback can continue where it left off, rendering the 4096 added samples on top of the samples it was still rendering (if it was already running when interrupted).

So in short, the rendering process (main thread) renders whole blocks into the renderer buffer (one lock per block of samples). The renderer callback (SDL callback) reads those samples one at a time, with one lock per sample.

Reply 7 of 7, by superfury

Posted on 2016-02-06, 11:14

I've updated all double buffer sizes to their rendering buffer size. The PC speaker (44.1kHz) and Adlib (49kHz) give sound with 50% silence, then a continuation of the sound, etc. The Covox gives audio without stuttering at all? The Adlib does the same as the PC speaker, but with slightly longer silence (same buffer size as the PC speaker, but the higher sampling rate leads to more loss or delay?).
