I've given some thought to related problems.
The trouble with using the PC speaker for output is that there is no DMA, so you have to send the samples to the PIT yourself at regular intervals. One way to run some code at regular intervals is to use the timer interrupt. That works, but an interrupt carries a lot of overhead - at a decent sample rate on a 4.77MHz machine, there isn't really time left to do anything else.
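To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The CPU clock is the standard 14.31818MHz crystal divided by 3; everything else is an assumption for illustration only. In particular, the 150-cycle interrupt overhead and the sample rates are made-up example values, not measurements.

```python
# CPU clock of the original IBM PC: 14.31818 MHz crystal divided by 3.
CPU_HZ = 14_318_180 / 3  # about 4.77 MHz


def cycles_per_sample(sample_rate_hz):
    """CPU cycles available between two consecutive audio samples."""
    return CPU_HZ / sample_rate_hz


def overhead_fraction(sample_rate_hz, overhead_cycles):
    """Fraction of each sample period eaten by a fixed per-sample overhead."""
    return overhead_cycles / cycles_per_sample(sample_rate_hz)


# ASSUMPTION: illustrative cost of interrupt entry, register save/restore,
# end-of-interrupt and return. Not a measured figure.
ASSUMED_IRQ_OVERHEAD_CYCLES = 150

for rate in (8000, 16000, 22050):  # example sample rates, also assumptions
    budget = cycles_per_sample(rate)
    used = overhead_fraction(rate, ASSUMED_IRQ_OVERHEAD_CYCLES)
    print(f"{rate} Hz: {budget:.0f} cycles/sample, {used:.0%} spent on overhead")
```

Whatever the real overhead turns out to be, the budget per sample is only a few hundred cycles, so any fixed per-sample cost takes a large share of it.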
The other way to run code at regular intervals is to count cycles (as the 4 channel player in 8088 MPH does). This is much lower overhead and leaves enough time to do a significant amount of other work (like mixing 4 channels of audio). However, it's really difficult to get it working in a way that is portable to faster machines. And you have to write your code in such a way that the audio and video parts can be statically interleaved. In other words, the work done to output the video to the screen must be broken up into tiny chunks of known execution length (maybe writing around 16 bytes to VRAM at most). So this technique lends itself best to effects that involve repeating a small section of code, where each iteration takes the same amount of time.
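As a rough illustration of what "statically interleaved" means here, the Python sketch below builds a fixed schedule where every sample slot has the same total cycle count: one audio output, one chunk of video work, and padding to fill the rest. The function name, chunk names and all cycle counts are hypothetical; this models the idea only and is not the actual 8088 MPH player.

```python
def interleave(chunks, slot_cycles, audio_cycles):
    """Build a schedule with one video chunk per fixed-length sample slot.

    chunks: list of (name, cycles) pairs with known execution lengths.
    Every slot is padded so it takes exactly slot_cycles in total.
    """
    budget = slot_cycles - audio_cycles  # cycles left for video work per slot
    schedule = []
    for name, cycles in chunks:
        if cycles > budget:
            # A chunk that overruns the slot would delay the next sample.
            raise ValueError(f"{name} needs {cycles} cycles, budget is {budget}")
        schedule.append(("audio", audio_cycles))
        schedule.append((name, cycles))
        if budget > cycles:
            schedule.append(("pad", budget - cycles))  # idle until next sample
    return schedule


# Hypothetical numbers, chosen only to show the shape of the result.
example = interleave(
    [("blit_a", 120), ("blit_b", 200), ("blit_c", 64)],
    slot_cycles=288,
    audio_cycles=40,
)
for step in example:
    print(step)
```

The padding entries are wasted time, which is why this approach suits effects where every iteration costs the same number of cycles.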
The video rendering code in 8088 Domination is very different - it's mostly doing runs of "rep movsw" and "rep stosw" with a smattering of other instructions. The lengths of these runs aren't fixed - they're optimized to make the video update as smoothly as possible and avoid using too much disk bandwidth. So interleaving the PC speaker code with them wouldn't work well.
Perhaps if the encoder were substantially reworked to generate its output as sample-sized "chunks" then a reasonable Bad Apple could be done. The quality would suffer a lot compared to Domination, obviously (though it's hard to tell exactly how much without trying it), but it might be an interesting project.
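One small piece of such a rework might look like the Python sketch below: cutting a variable-length run into pieces no larger than a fixed limit, so each piece could fit between two samples. The function name is hypothetical, the 8-word limit is just the "around 16 bytes" figure from above, and none of this reflects how the Domination encoder actually works.

```python
def split_run(length_words, max_words):
    """Split one run of length_words into chunks of at most max_words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    chunks = []
    while length_words > 0:
        n = min(length_words, max_words)
        chunks.append(n)
        length_words -= n
    return chunks


# A 37-word run with an assumed limit of 8 words (16 bytes) per chunk.
print(split_run(37, 8))
```

The final chunk is usually shorter than the limit, so part of its slot goes unused.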