VOGONS


First post, by awgamer

User metadata
Rank Oldbie
Rank
Oldbie

Thought others might find this experiment of interest. Some games are currently not happy with it (MSFS 5.x, Stunt Island, X-Wing), but it works without a hitch for many. Synthesized audio (FM etc.), DAC processing, and the mixer's mixing should all be offloaded to the thread. Threading overhead can be dropped to nil, but the initial methods were heavy. Read the comments in the diff. If it weren't for X-Wing, which starts mute but occasionally plays fine, FM playback would be error-free as far as my testing goes; the other two have an issue with DAC playback.

Audio interfacing is done through the mixer: a callback handler for an audio device is installed to the mixer, the PIC runs mixer_mix, and mixer_mix runs the handlers after peeling away a couple of layers. The callbacks all conform to the same pattern: generate synthesized data, then mixer addsamples. The exception to the rule is mpu401, which does things differently; ignore that guy.

adlib
  mixerchan.install
    opl_callback
      generate
        opl2
          adlib_getsample
          addsamples_m16
        opl3
          adlib_getsample
          addsamples_s16
        mameopl2
          ym3812_update_one
          addsamples_m16
        mameopl3
          ymf262_update_one
          addsamples_s16
dbopl DosBox opl?
  generate
    opl3
      generateblock2
      addsamples_m32
    else
      generateblock3
      addsamples_s32
disney
  mixerchan.install
    disney_callback
      stereo
        disney_playstereo
        buffer=
        addsamples_s8
      mono
        buffer=
        addsamples_m8
gameblaster
  mixerchan.install
    cms_callback
      sound_stream_update
      addsamples_s32
gus
  GUS_DMA_Callback
    r/w between system and gus ram
  mixerchan.install
    gus_callback
      generatesamples
      addsamples_s32
mpu401
  operates diff from the rest
opl
  adlib_getsample
pcspeaker
  mixerchan.install
    pcspeaker_callback
      stream(mixtemp buffer)=
      addsamples_m16
sblaster
  mixerchan.install
    sblaster_callback
      none/pause/masked
        mixer addsilence
      dac
        mixer addstretched
      dma
        generatedmasound
          dsp_dma_2
            decode_adpcm_2_sample
            addsamples_m8
          dsp_dma_3
            decode_adpcm_3_sample
            addsamples_m8
          dsp_dma_4
            decode_adpcm_4_sample
            addsamples_m8
          dsp_dma_8
            stereo
              dma.chan->read
              !signed
                addsamples_s8
              signed
                addsamples_s8s
            mono
              dma.chan->read
              !signed
                addsamples_m8
              signed
                addsamples_m8s
          dsp_dma_16
          dsp_dma_16_aliased
            stereo
              dma.chan->read
              signed
                addsamples_s16
              !signed
                addsamples_s16u
            mono
              dma.chan->read
              signed
                addsamples_m16
              !signed
                addsamples_m16u
tandy
  mixerchan.install
    SN76496Update
      sound_stream_update
      addsamples_m16
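
The pattern in the list above (install a handler on a mixer channel, let the mixer invoke it, have the handler generate data and add it to the channel) can be sketched roughly as follows. This is only an illustration, not DOSBox code: `MixerChannel`, `install`, `mix`, `addSamplesM16`, and `squareWaveCallback` are hypothetical names that mirror the shape of the list, and the real classes differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mixer channel; not the real DOSBox class.
struct MixerChannel {
    using Handler = std::function<void(MixerChannel&, std::size_t)>;

    Handler handler;                  // device callback
    std::vector<std::int16_t> mixed;  // samples collected so far

    // "mixerchan.install": remember the device's callback.
    void install(Handler cb) { handler = std::move(cb); }

    // "mixer_mix": ask the device for `needed` samples.
    void mix(std::size_t needed) {
        if (handler) handler(*this, needed);
    }

    // "addsamples_m16": append mono 16-bit samples to the channel.
    void addSamplesM16(const std::int16_t* data, std::size_t count) {
        mixed.insert(mixed.end(), data, data + count);
    }
};

// Hypothetical device callback: generate synthesized data, then add it.
inline void squareWaveCallback(MixerChannel& chan, std::size_t needed) {
    std::vector<std::int16_t> buf(needed);
    for (std::size_t i = 0; i < needed; ++i)
        buf[i] = ((i / 4) % 2) ? 8000 : -8000;  // toy square wave
    chan.addSamplesM16(buf.data(), buf.size());
}
```

Usage would look like `MixerChannel chan; chan.install(squareWaveCallback); chan.mix(16);`, after which `chan.mixed` holds 16 samples. Every device in the list follows this shape except mpu401.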

Attachments

Reply 1 of 30, by awgamer

No comments? Not as much interest after all. FM overhead used to be a topic of discussion, but I guess not so much anymore. One solution to X-Wing being mute was to reduce cycles, as if it were an issue with sound card detection; perhaps the PIC is cycling too fast with audio in a separate thread. As for DAC playback, I'm thinking DMA might need some mutexes: when I added a LOG_MSG to the addsamples call MSFS 5.1 was using, it added enough delay to keep MSFS from wigging out, so I just need to figure out where to put another hold so things don't stomp over each other in that state. Edit: another way to prevent the X-Wing mute issue without reducing cycles is the dosidle utility: https://archive.org/details/dosidle

Reply 3 of 30, by awgamer

Yes, that's what I note in the diff comments. I need to work out why: whether I did the threading wrong, missed something linked that keeps things from processing simultaneously, am somehow still processing in main, or whether processing audio just isn't significant anymore.

Reply 6 of 30, by awgamer

Checking mixer_mix (main thread), addsamples, DMA read, pic_runqueue, normal_loop, and the callbacks (e.g. LOG_MSG("mythreadID sblaster GenerateDMASound %d",SDL_GetThreadID(NULL));) shows the audio handling is running in mixer_mix_thread as desired, while pic_runqueue and normal_loop run in main, so the two are running asynchronously in separate threads.

Reply 7 of 30, by krcroft

awgamer,

I think your approach is the right direction given actual hardware behavior:

[  Hardware CPU -> IO BUS Event =-= HW CPU continues ... ]
                                 `-[ .. ISA hardware processes audio in parallel ... ]

vs.

[  DOSBox CPU -> IO BUS Event | | DOSBox CPU unblocked ... ]
                               ^
                               `- Emulating the ISA hardware (OPL/GUS/SB/...)
                                  blocks the DOSBox core to generate 1 ms of audio. This
                                  won't interrupt the frame rate or cause audio hiccups
                                  provided the host CPU has enough spare headroom available.

If you're running DOSBox on a CPU that doesn't have enough headroom, then the serial time spent generating the audio introduces a gap in the stream, stuttering the audio.

All of us seasoned users know that means, "I've given this game too many cycles... time to back it off a bit." Eventually you find the sweet spot where there's enough headroom to absorb the audio-generation bursts without breaking the audio stream. This is a pretty rare situation on run-of-the-mill x86 hardware, but becomes very common on the Pi 3 and 4 running some of the more demanding 1995+ era games where frame rates can be borderline.

If that audio generation could be performed asynchronously on another core, then you could run your cycle count much closer to the CPU's maximum, while the second core would be more than sufficient to handle the audio generation.

(Sorry for repeating what you already intuitively understand; just wanted to drop my basis for adding my thumbs up to the effort!)

Reply 8 of 30, by krcroft

awgamer,

The existing loop spins pretty hard; here are a couple of changes that try to mitigate that.
I also changed the domix bool to a counter, mix_queue, so mix requests can pile up for when we need to run the mixer back-to-back without another wait cycle.

This can be the case with PC Speaker PIT-mode audio such as the intro music to "Space Racer"; unfortunately even the queue doesn't fix it. It still sounds muddy with micro-dropouts.

MIXER: mixing queue 1
MIXER: mixing queue 1
MIXER: mixing queue 2
MIXER: mixing queue 1
MIXER: mixing queue 1
MIXER: mixing queue 6
MIXER: mixing queue 5
MIXER: mixing queue 4
MIXER: mixing queue 3
MIXER: mixing queue 2
MIXER: mixing queue 1

#include <chrono> // std::chrono::microseconds
#include <thread> // std::this_thread::sleep_for

static SDL_mutex *queue_lock = SDL_CreateMutex();
static uint16_t mix_queue = 0; // number of pending mix requests

static int MIXER_Mix_Thread(void *)
{
	while (1) {
		SDL_LockMutex(queue_lock);
		if (!mix_queue) {
			// Nothing queued: release the lock and nap briefly instead of spinning
			SDL_UnlockMutex(queue_lock);
			std::this_thread::sleep_for(std::chrono::microseconds(30));
		}
		else {
			// LOG_MSG("MIXER: mixing queue %d", mix_queue);
			--mix_queue;
			SDL_UnlockMutex(queue_lock);
			// Mix outside the lock so the main thread can keep queueing
			MIXER_MixData(mixer.needed);
			mixer.tick_counter += mixer.tick_add;
			mixer.needed += (mixer.tick_counter >> TICK_SHIFT);
			mixer.tick_counter &= TICK_MASK;
		}
	}
	return 0;
}

SDL_Thread *threadID = SDL_CreateThread(MIXER_Mix_Thread, "Mixer", (void *)NULL);

// Called from the main thread: only queues a request for the mix thread
static void MIXER_Mix()
{
	SDL_LockMutex(queue_lock);
	mix_queue++;
	SDL_UnlockMutex(queue_lock);
}

Feel free to take and remix as desired 😀
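
As a side note on the three `mixer.tick_*` lines in the loop above: they form a fixed-point accumulator, where each tick adds a fractional number of samples and only the whole part moves into `needed`. The sketch below assumes a `TICK_SHIFT` of 14, a 1 ms tick, and `tick_add = (freq << TICK_SHIFT) / 1000`; those values, along with `TickState`, `makeTickState`, and `tick`, are assumptions for illustration and should be checked against the real mixer.cpp.

```cpp
#include <cstdint>

// Assumed constants for illustration; verify against the real mixer.cpp.
constexpr std::uint32_t TICK_SHIFT = 14;
constexpr std::uint32_t TICK_NEXT  = 1u << TICK_SHIFT;
constexpr std::uint32_t TICK_MASK  = TICK_NEXT - 1;

// Hypothetical container for the three fields used in the loop.
struct TickState {
    std::uint32_t tick_add;      // samples per tick, in 1/TICK_NEXT units
    std::uint32_t tick_counter;  // fractional remainder carried between ticks
    std::uint32_t needed;        // whole samples requested so far
};

// Assumes one tick per millisecond, hence the division by 1000.
inline TickState makeTickState(std::uint32_t freq_hz) {
    const std::uint64_t scaled = static_cast<std::uint64_t>(freq_hz) << TICK_SHIFT;
    return TickState{static_cast<std::uint32_t>(scaled / 1000), 0, 0};
}

// One tick: the same three steps as in the thread loop.
inline void tick(TickState& s) {
    s.tick_counter += s.tick_add;                 // add the fractional sample count
    s.needed += s.tick_counter >> TICK_SHIFT;     // move whole samples to `needed`
    s.tick_counter &= TICK_MASK;                  // keep only the fraction
}
```

Under these assumptions, at 44100 Hz the first tick asks for 44 samples and the leftover fraction is carried into the next tick rather than dropped.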

Reply 9 of 30, by awgamer

Polling on domix was intended as a temporary test and workaround to remove thread overhead as much as possible, but it's not really nice to blast a core at full speed, hilariously so for a light load, just to get around SDL's poor running routine. Ideally I would come up with something that does the same as condwait, just not implemented as badly, or put more nicely, something more sensitive and quicker to react that doesn't cause the same performance drop seen in the frame rate. Maybe SDL's solution is as good as it gets, but I doubt it.
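
For reference, the general shape of a condition-variable wait combined with a request counter is sketched below. It uses C++11 `std::condition_variable` rather than SDL's condwait, and `MixQueue`, `request`, `stop`, and `run` are hypothetical names, not anything from the diff. It only shows the blocking pattern; it says nothing about whether the frame-rate drop seen with SDL's version would also appear here.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper, not part of DOSBox: counts pending mix requests and
// wakes a worker through a condition variable instead of polling.
class MixQueue {
public:
    // Main thread: request one more mixer pass.
    void request() {
        { std::lock_guard<std::mutex> lock(m_); ++pending_; }
        cv_.notify_one();
    }

    // Main thread: ask the worker to exit once the backlog is drained.
    void stop() {
        { std::lock_guard<std::mutex> lock(m_); stopping_ = true; }
        cv_.notify_one();
    }

    // Worker thread: blocks without spinning; returns the passes executed.
    template <typename Fn>
    int run(Fn mixOnePass) {
        int done = 0;
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return pending_ > 0 || stopping_; });
            if (pending_ == 0) return done;  // stopping and nothing left to do
            --pending_;
            lock.unlock();                   // do the heavy work outside the lock
            mixOnePass();
            ++done;
            lock.lock();
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int pending_ = 0;
    bool stopping_ = false;
};
```

The worker would run `queue.run(...)` on its own thread while the main thread calls `queue.request()` once per tick.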

What I've been testing with doesn't seem to be affected one way or another by your additions, so no harm, no foul, and the thought of buffering the mixing for such a pile-up did cross my mind. I may not be hearing what you do since I have tinnitus, so I may have to take your word for it, though others may be able to confirm. Are there any other examples that highlight what it solves?

Any particular reason to change the mutex to a pointer? Way I wrote it I just pulled from sdl examples and tutorials.

The low end would more likely benefit, and threading devices matches real-world system behavior, but DOSBox's audio handling is looking pretty light. Rather than just threading the mixer, threading the PIC would be a closer match, though there is more to wrap the head around and more dependencies to track down, and the other events the PIC handles are lighter still: keyboard, joystick, RTC, timer, and so on. Running pic_runqueue and everything that hangs off it in one thread, like a southbridge, might give an improvement, but the recompiler, drawing, rendering, and compiled code are where it's at. (Rendering may already be taken care of by one of the various OpenGL/DirectX offloading implementations; I'm not sure, my eyes glaze over from so many solutions and I don't have my bearings on what's what.)

FYI, I noticed dosidle keeps speaker playback from being warbly and stuttery in Test Drive (it can be reproduced by cycling through the car selection with the up/down arrows) with cycles at max. Setting cycles to a fixed amount also resolves it, but it's something to look at for improving auto cycle handling, since the dosidle source is included.

Reply 10 of 30, by jmarsh

You don't hear the glitches? I assume they're there because SDL_LockAudio/UnlockAudio got tossed out and the mixer is writing to the audio buffers at the same time as SDL calls MIXER_CallBack, although they could also be caused by the mixing thread and the main thread touching PIC stuff at the same time.

Reply 11 of 30, by awgamer

If you look at the SDL source, SDL_LockAudio/UnlockAudio are just mutex locks, redundant as far as I understand. I had removed the audio locks when adding the locks for condwait handling and just kept those with polling. I had played around with removing the locks altogether, and when they were removed and the threads trampled each other it wasn't subtle. As for hearing glitches, I'm having a hard time distinguishing between actual glitching and just the low-quality playback and samples that I'm hearing.

P.S. Going by the system monitor, max cycles is currently configured rather conservatively, "max" being far from the true max by a large margin; it could be updated to take threaded mixing into account.

Reply 13 of 30, by krcroft

... ideally would come up with something that does the same as condwait, just not implemented as badly, or put more nicely, more sensitive, quicker to react that doesn't cause the same performance drop seen in the frame rate.

Yup; that's what I attempted at the top of the loop: read the condition, and if it's false, release the lock on the condition and wait a tiny bit of time (at which point the condition is unlocked, so the main thread is free to adjust it). I'm using a relatively fine-grained timer (admittedly C++11, which won't fly) to keep it quick and reactive but still almost entirely idle CPU-wise.

SDL_LockMutex(queue_lock);
if (!mix_queue) {
	SDL_UnlockMutex(queue_lock);
	std::this_thread::sleep_for(std::chrono::microseconds(30));
}

Regarding the queue vs. bool:

The queue prevents under-generating audio: without it, N - 1 milliseconds of audio would be lost whenever N mix calls arrive during a single pass.

In Space Racer, the main thread can rapidly make 6 mix calls practically back-to-back, all while the mix thread is in the middle of a single pass. When I use a bool (instead of a queue), the mix thread finishes up its pass, toggles the bool back to false, and then thinks its job is done for this round (meanwhile it has actually lost 5 ms of audio). So I feel some form of queue or backlog is needed to prevent this (perhaps there are smarter ways to catch/manage this!).
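
The difference between the two can be shown with a small single-threaded model of the Space Racer case: several requests arrive while the worker is busy, and then the worker drains whatever was recorded. The function names here are hypothetical, and the model ignores real thread timing entirely.

```cpp
// Hypothetical model: `requests` mix calls arrive while the worker is still
// busy with one pass, then the worker processes whatever was recorded.

// With a bool, repeated sets collapse into a single pending pass.
inline int passesWithBool(int requests) {
    bool domix = false;
    for (int i = 0; i < requests; ++i) domix = true;
    int passes = 0;
    while (domix) {
        domix = false;  // worker thinks its job is done after one pass
        ++passes;
    }
    return passes;
}

// With a counter, every request is remembered and later worked off.
inline int passesWithCounter(int requests) {
    unsigned mix_queue = 0;
    for (int i = 0; i < requests; ++i) ++mix_queue;
    int passes = 0;
    while (mix_queue) {
        --mix_queue;
        ++passes;
    }
    return passes;
}
```

For 6 requests, `passesWithBool(6)` yields 1 pass (5 lost) while `passesWithCounter(6)` yields 6, which matches the "lost 5 ms" observation.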

I think the critical work lies in the areas mentioned by jmarsh, as this is no doubt where chunks of audio are falling off the truck.

I understand you may not hear the degradation that jmarsh and I mentioned (I've also been told I'm unable to hear some dynamic range differences between recordings.. mid-40s is no fun; close up vision is also starting to change! argh).

I highly suggest using headphones or in-ear buds and doing side-by-side testing with "RealSound" (PIT-timer music/effects) PC-speaker games: Mean Streets, Count Down, Digger (music), Karateka, and Space Racer. At least to me, it is extremely apparent, and I hope you can reproduce it.

Reply 14 of 30, by awgamer

To me it looked like SDL_LockAudio/UnlockAudio were being used as a generic mutex on the internal mixer struct rather than on SDL audio's obtained struct. In a sense they were, but internal mixer struct variables are also used in the SDL audio callback handler. It's not a problem adding them back; it gives the mix thread something to do. But now I want to thread the SDL audio callback handler to see what that does, since at the moment it should be processed in the main thread. Would it notably lighten the main thread's load? One way to find out (well, I could profile to find out that way, but I digress).

Reply 15 of 30, by jmarsh

It's already run on a separate thread by SDL, that's the reason why SDL_LockAudio/UnlockAudio exist (the callback function always runs with the audio lock owned). If you hand it off to yet another thread the audio hardware will play garbage because SDL expects the samples to be ready for playback when the callback returns.
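
A rough model of that constraint, using a plain `std::mutex` in place of SDL's audio lock: the mixer side writes under the lock, and the callback side must completely fill the output buffer before it returns. `SharedAudio`, `mixerWrite`, and `audioCallback` are hypothetical names. In real SDL the callback is entered with the audio lock already held, whereas this sketch takes the lock itself because there is no SDL around it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical shared state; audioLock plays the role of SDL's audio lock.
struct SharedAudio {
    std::mutex audioLock;
    std::vector<std::int16_t> pending;  // samples produced by the mixer
};

// Mixer side: append samples while holding the lock.
inline void mixerWrite(SharedAudio& a, const std::int16_t* src, std::size_t n) {
    std::lock_guard<std::mutex> guard(a.audioLock);
    a.pending.insert(a.pending.end(), src, src + n);
}

// Callback side: `out` must be fully written before returning, so any
// shortfall is padded with silence. Returns the number of real samples copied.
inline std::size_t audioCallback(SharedAudio& a, std::int16_t* out, std::size_t n) {
    std::lock_guard<std::mutex> guard(a.audioLock);
    const std::size_t have = std::min(n, a.pending.size());
    const auto last = a.pending.begin() + static_cast<std::ptrdiff_t>(have);
    std::copy(a.pending.begin(), last, out);
    std::fill(out + have, out + n, std::int16_t{0});
    a.pending.erase(a.pending.begin(), last);
    return have;
}
```

Handing the fill off to another thread and returning early would leave `out` with whatever happened to be in it, which is the garbage playback described above.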

Reply 16 of 30, by awgamer

For whatever reason this was giving me improved fps in quake. It's not threaded, basically stock. margin of error?

Attachments

Reply 17 of 30, by latalante

User metadata
Rank Newbie
Rank
Newbie

I did a few tests and didn't notice the difference.
Then I repeated them using linux perf.

220 0.05% dosbox [.] MIXER_CallBack
261 0.04% dosbox [.] MIXER_CallBack

Yes: with this fix, MIXER_CallBack is the 261st function in the performance ranking; without it, the 220th.

The change from 0.05% to 0.04% cannot have a noticeable effect on the benchmark. Absolutely.

Edit:
Maybe on fast equipment with low resolution and high FPS, this difference becomes more visible. For me, everything oscillates around statistical error.

Edit2:
More accurate measurement.
perf record -e cycles:pp --call-graph dwarf src/dosbox -c 'quake.exe +timedemo demo1 +exec mode13.cfg -noipx -nolan -nocdaudio' -c 'exit' #800x600

dosbox-r4356

perf report --no-inline
Children|Self |Command|Shared Object|Symbol
0,17% 0,00% dosbox dosbox [.] MIXER_Mix
0,15% 0,02% dosbox dosbox [.] MIXER_MixData
0,05% 0,05% dosbox dosbox [.] MIXER_CallBack
0,00% 0,00% dosbox dosbox [.] MIXER_Init

dosbox-r4356 + mixer2.diff

Children|Self |Command|Shared Object|Symbol
0,14% 0,00% dosbox dosbox [.] MIXER_Mix
0,12% 0,02% dosbox dosbox [.] MIXER_MixData
0,04% 0,04% dosbox dosbox [.] MIXER_CallBack
0,00% 0,00% dosbox dosbox [.] MIXER_Init
0,00% 0,00% dosbox dosbox [.] MIXER_CallBack

Reply 18 of 30, by awgamer

krcroft, I've found that setting microseconds to 1000 is the magic number with your alternative to condwait. The system monitor shows the core running the mix thread going from full tilt to using only what it needs, like a sane process, with no fps hit in benchmarks like there was with condwait, and no lag or dropped samples that I could perceive. You know the drill on that: for you guys the difference may make your ears bleed, but come on, we're talking 0.001 of a second here. Please check to confirm it's not just me seeing this, but otherwise, congrats, you've built a better mousetrap.

Reply 19 of 30, by krcroft

Good stuff all round awgamer; is there a combined patch you can post that includes your performance improvement that latalante confirmed?
Regarding 1000 microseconds, in theory you could then swap in SDL_Delay(1) and avoid the C++11'isms, as both delay one millisecond.