VOGONS


8088 MPH: We Break All Your Emulators

Reply 60 of 136, by Scali

pietja wrote:

Does MONOTONE also support CMS on old soundblasters ?

Not yet. Currently it supports PC Speaker, OPL2 and SN76489 (PCjr/Tandy).
Adding CMS should not be very difficult though, since MONOTONE has a modular object-oriented design.
You just need to implement a few procedures to add support for a new device, see here: https://github.com/MobyGamer/MONOTONE/blob/ma … SRC/MT_OUTP.PAS
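
As a rough illustration of that kind of modular design (sketched in Python rather than Turbo Pascal, and not the actual interface in MT_OUTP.PAS; every class and method name below is hypothetical), the player only talks to an abstract output device, and a new sound chip just fills in a few methods:

```python
from abc import ABC, abstractmethod


class OutputDevice(ABC):
    """Hypothetical base class: what the player expects from any sound device."""

    num_channels = 1  # how many voices the device can play at once

    @abstractmethod
    def set_frequency(self, channel: int, hertz: float) -> None:
        """Start (or retune) a note on one channel."""

    @abstractmethod
    def silence(self, channel: int) -> None:
        """Stop the note on one channel."""


class LoggingCMS(OutputDevice):
    """Stand-in for a new device: records calls instead of touching hardware."""

    num_channels = 12  # the CMS / Game Blaster carries two 6-voice chips

    def __init__(self) -> None:
        self.log = []

    def set_frequency(self, channel: int, hertz: float) -> None:
        self.log.append(("freq", channel, hertz))

    def silence(self, channel: int) -> None:
        self.log.append(("off", channel))


def play_chord(device: OutputDevice, notes) -> None:
    """Player code only sees the base class, never the concrete chip."""
    for channel, hertz in enumerate(notes[: device.num_channels]):
        device.set_frequency(channel, hertz)
```

A real CMS backend would replace the logging calls with register writes to the card; the point is only that the rest of the tracker would not need to change.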

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 61 of 136, by Jepael


I see from the docs that it compiles with TP7, but is that object pascal?

Are there any instructions on how to develop for it?

I am just thinking if I should dig out my SB with CMS chips and put it in my retro machine..

Reply 62 of 136, by Scali

Jepael wrote:

I see from the docs that it compiles with TP7, but is that object pascal?

Well, it's an object-oriented form of Pascal. It's slightly different from Delphi though.
Trixter (MobyGamer) or I can probably answer any questions you have regarding development. Trixter designed and wrote MONOTONE, and I did some updates to the code.

Reply 63 of 136, by MicroCoreLabs


Hi,

I am a big fan of 8088MPH and I have developed an x86 processor core, so naturally I had to give this demo a try!

The core is called the MCL86 and it is a microsequencer based, cycle compatible, x86 soft IP core. Here is the YouTube video of it running 8088MPH on an XT:

https://www.youtube.com/watch?v=b3GkPGZR4BU

The core is cycle "compatible" but not cycle "exact" which shows up particularly during the Kefrens Bars effect. I think this is due to the core's consistent instruction timing, which may make it a little faster than the 8088. When I run the core faster, the Kefrens bars start even lower on the screen, so that's why I believe it is cycle speed related.

Anyway, I hope you like the demo. It has been a goal of mine (as it probably is of many other emulator developers) to get our cores running 8088MPH which indicates just how cycle compatible our work is! 😀

I look forward to the next demo!

Thanks,
Edward

Reply 64 of 136, by Scali

MicroCoreLabs wrote:

The core is cycle "compatible" but not cycle "exact" which shows up particularly during the Kefrens Bars effect.

You also notice it in the moire effect (first effect after the title screen), where the music slows down.
Reenigne is studying the 8088 at cycle-level with a bus-sniffer and is working on a cycle-exact CPU emulator: http://www.reenigne.org/blog/isa-bus-sniffer-update/
Perhaps you could use that to fine-tune your core as well?

By the way, what video card did you use? The artifact colours seem wrong, so I don't think it's a real IBM card?

Reply 65 of 136, by MicroCoreLabs


Yes, I was very interested to learn from Reenigne that the 8088 appears to add a clock for some instructions such as lea dx,[bx+di], yet does not for lea dx,[bx+si], while both cmp [bx+di],dx and cmp [bx+si],dx take the same number of cycles. My core will always add the same number of clocks for each addressing mode no matter what instruction it is for. It was not a goal to make my microsequencer totally cycle exact; just cycle compatible so it can be used for a legacy embedded design. Most people however will probably disable this cycle compatibility mode and run the core as fast as possible...

The video card actually is a genuine IBM card, but the LCD monitor I am using is an exceptionally low-quality one that cuts in and out after almost every effect!

Reply 66 of 136, by MicroCoreLabs


"My core will always add the same number of clocks for each addressing mode no matter what instruction it is for."

I should clarify that the core does not add the same number of clocks across all addressing modes. Each EA mode has a different number of clocks which the core will add no matter which x86 instruction is using it.
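
A minimal sketch of that fixed-cost-per-EA model, in Python for readability. The EA values are the ones from Intel's published 8086 timing table; the base clocks and all names are assumptions for illustration and are not taken from the MCL86 microcode.

```python
# Effective-address (EA) clock counts from Intel's published 8086 timing table.
EA_CLOCKS = {
    "[bx]": 5,
    "[si]": 5,
    "[di]": 5,
    "[disp16]": 6,
    "[bx+si]": 7,
    "[bp+di]": 7,
    "[bx+di]": 8,
    "[bp+si]": 8,
}

# Base clocks excluding the EA term. Assumed from the same published 8086
# table; the 8088 adds extra clocks for 16-bit memory transfers, and the
# real chip can deviate from the documentation in any case.
BASE_CLOCKS = {
    "lea reg,mem": 2,
    "cmp mem,reg": 9,
}


def model_clocks(instruction: str, ea_mode: str) -> int:
    """Fixed-cost model: the EA penalty is the same for every instruction."""
    return BASE_CLOCKS[instruction] + EA_CLOCKS[ea_mode]
```

In this model both LEA and CMP cost one clock more with [bx+di] than with [bx+si]. The measurements quoted above show that extra clock for LEA but not for CMP, so a table like this cannot be cycle exact, only cycle compatible.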

Reply 67 of 136, by vladstamate

MicroCoreLabs wrote:

Yes, I was very interested to learn from Reenigne that the 8088 appears to add a clock for some instructions such as lea dx,[bx+di], yet does not for lea dx,[bx+si], while both cmp [bx+di],dx and cmp [bx+si],dx take the same number of cycles.

That is interesting. According to this [bx+di] is supposed to be 1 cycle more than [bx+si]. So the question is more why LEA does not behave the same as CMP, rather than why there is a difference between DI and SI when added to BX.

But then again, the information on that link could be wrong. Does Reenigne have the information about cycles public anywhere? Looking here (his website) I cannot see that information, but maybe I am blind.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 68 of 136, by reenigne


I haven't written up the information about the cycle counts yet - I have a lot more experiments to do first.

I think the reason DI takes 1 cycle more is that it needs to set an internal flag in the CPU to use ES instead of DS. As for why this doesn't happen with CMP - I think it's masked by the fact that that instruction accesses memory (LEA doesn't, it just calculates the address). The memory accesses are on the critical path, so it sort of "smoothes out" the timing. That's just a guess, though, and there's still a lot of other things I don't understand. I imagine it makes much more sense if you put yourself in the shoes of the chip designer and actually try to write the microcode, so I might end up doing that.

Reply 69 of 136, by Scali

MicroCoreLabs wrote:

When I run the core faster, the Kefrens bars start even lower on the screen, so that's why I believe it is cycle speed related.

Yes, the Kefrens part is cycle-counted. It is interesting that you say that the faster the CPU is, the lower it starts on the screen.
I have run the demo on two clone-cards, and they both showed the same thing: the Kefrens bars started at about 25%-30% down the screen, but other than that they appeared the same.
It could be that they are just slightly out-of-sync with a real CGA card, which results in the 6845 reprogramming starting too fast... The background colour works, but reprogramming the start address does not.
The code appears to have somewhat of a self-calibrating nature, since it writes to the CGA memory at every scanline. The CGA card will insert wait-states, which will eventually push the effect into sync.

I wonder if that means that adding a tiny bit of delay at the start of a frame will make it work properly on a clone as well.

Reply 70 of 136, by reenigne

Scali wrote:

I wonder if that means that adding a tiny bit of delay at the start of a frame will make it work properly on a clone as well.

I don't think it would. The effect may appear to be "nearly" correct on clones but the timing is still off and in this effect, being off by a small amount has knock-on effects for the entire frame. Take a close look at https://youtu.be/b3GkPGZR4BU?t=294 and you'll see a glitch in the effect every 12 scanlines or so. That means the per-scanline code is taking 5-10% longer (or shorter) than it should. So after 200 scanlines the CPU and CRTC will be out of sync by 10-20 scanlines.
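
The arithmetic behind those numbers can be sketched like this. It assumes that one visible glitch corresponds to the code having slipped by one full scanline relative to the CRTC, which is a simplification; the function names are made up for the example.

```python
def drift_scanlines(active_scanlines: int, timing_error: float) -> float:
    """Scanlines of CPU/CRTC drift accumulated by the end of the active area."""
    return active_scanlines * timing_error


def error_from_glitch_period(scanlines_between_glitches: float) -> float:
    """Assumption: one glitch every N scanlines means one scanline slipped per N."""
    return 1.0 / scanlines_between_glitches


if __name__ == "__main__":
    error = error_from_glitch_period(12)          # about 8.3%, inside the 5-10% range
    print(f"timing error: {error:.1%}")
    print(f"drift after 200 scanlines: {drift_scanlines(200, error):.1f}")
```

With the 5% and 10% bounds the same function gives 10 and 20 scanlines of drift over a 200-scanline active area.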

Next the effect reprograms the CRTC for the vertical overscan and vsync, which is supposed to happen exactly once per frame. However, if the effect misses the critical window for reprogramming back to the "short" (2-scanline) CRTC frames used for the active area then you'll get two lots of vertical overscan before the effect proper starts again. That's why the top part of the effect isn't working - the CPU is happily changing the border colour and partying on CGA RAM but the CRTC isn't displaying the latter as it thinks it still has some overscan to finish. I think that's what's going on, anyway - it's hard to be 100% sure.

Perhaps there are changes that could be made to make the effect work better on clones (perhaps involving waiting for the vsync pulse instead of using IRQ0 for timing), but ultimately there's no room for doing any kind of timing correction in the per-scanline code (other than the waitstates caused by accessing CGA RAM) so it'll never be pixel-perfect if the timing is off in that unrolled inner loop.

Reply 71 of 136, by Scali

reenigne wrote:

However, if the effect misses the critical window for reprogramming back to the "short" (2-scanline) CRTC frames used for the active area then you'll get two lots of vertical overscan before the effect proper starts again. That's why the top part of the effect isn't working - the CPU is happily changing the border colour and partying on CGA RAM but the CRTC isn't displaying the latter as it thinks it still has some overscan to finish. I think that's what's going on, anyway - it's hard to be 100% sure.

Yes it seems like it.
But my clones are very close to cycle-exact. I know the ATi Small Wonder inserts a few extra waitstates (barely measurable with raw read/write rates). Which may be why it *just* misses that window. I think it is close enough that the per-scanline code does not drift (which you can tell by looking at the rasterbars... you shouldn't be able to see any 'stepping' anywhere on a scanline).
Before, I just assumed that the CRTC may have been incompatible. But now that we see the same behaviour on an IBM card, we can be reasonably sure that it's just a case of the CPU and the CRTC being somewhat out-of-sync, and the missing bars at the top are likely because the CRTC still thinks it's in overscan there, and doesn't enable the display yet.

Makes me wonder, is there a way to make the 'critical window' more robust? As in, you give it 1 or 2 extra scanlines, and move to a polling strategy to re-sync to the hsync, or something to that effect?

Reply 72 of 136, by reenigne

Scali wrote:

Makes me wonder, is there a way to make the 'critical window' more robust? As in, you give it 1 or 2 extra scanlines, and move to a polling strategy to re-sync to the hsync, or something to that effect?

Maybe... I don't have any way to test such a thing, though. You know where the source code is - if you're interested in making it a less good emulator testcase you're welcome to try!

Reply 73 of 136, by Scali

reenigne wrote:

Maybe... I don't have any way to test such a thing, though. You know where the source code is - if you're interested in making it a less good emulator testcase you're welcome to try!

Yes, I might. 8088 MPH was an interesting experiment in finding the limits of PC 'compatible' hardware. Now that we know that 'compatible' doesn't mean what they think it means... I do like to make all effects as compatible as possible, within reason. The 1024c trick simply isn't going to work, but making code less dependent on cycle-exact speeds where possible is not a bad idea.
I think the demo where we demonstrate our clone-compatible Kefrens bars should be called "Revenge of the clones", if we want to stay with movie references as a theme 😉

Reply 74 of 136, by superfury


I'm curious: Is there a recording of 8088 MPH somewhere running on an actual IBM VGA graphics card instead of CGA card (so I can compare it with output by my x86EMU emulator to verify if my VGA emulation is giving the 'correct' output)?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 77 of 136, by mr_bigmouth_502


I wonder if anyone's attempted to do something like 8088 Corruption/8088 Domination... but with digitized sound over the PC speaker. 🤣 Would it even be possible, if you like used lower resolution video at a lower framerate?

BTW, I rewatched 8088 Domination the other day, and now I'm completely hooked on "Bad Apple". I was already somewhat hooked on the OPL3 version of that song, but hearing the original song and seeing the video for it again just blew my mind. (well, technically the video version is NOT the original; the original was a PC 98 FM tune, but with a significantly different arrangement 😉) It's also made me interested in trying the Touhou games despite my utter inability to succeed at difficult SHMUPS, let alone bullet hells.

Reply 78 of 136, by reenigne


I've given some thought to related problems.

The trouble with using the PC speaker for output is that there is no DMA, so you have to send the samples to the PIT yourself at regular intervals. One way to run some code at regular intervals is to use the timer interrupt. That works great, except for the fact that there is a lot of overhead to using an interrupt - at a decent sample rate, there isn't really time to do anything else at the same time on a 4.77MHz machine.
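
To put rough numbers on that, here is a small budget calculation. The clock relationships are the standard PC ones (CPU and PIT are both divided down from the 14.31818 MHz crystal); the interrupt overhead is deliberately left as a parameter, and the 150-cycle figure in the example is only a placeholder assumption, not a measurement.

```python
from fractions import Fraction

CRYSTAL_HZ = Fraction(315_000_000, 22)  # 14.31818 MHz crystal
CPU_HZ = CRYSTAL_HZ / 3                 # 4.77 MHz 8088 clock
PIT_HZ = CRYSTAL_HZ / 12                # 1.193182 MHz timer input clock


def cpu_cycles_per_sample(pit_divisor: int) -> int:
    """CPU cycles between two timer interrupts for a given PIT reload value."""
    return int(pit_divisor * CPU_HZ / PIT_HZ)  # exactly 4 CPU cycles per PIT tick


def sample_rate_hz(pit_divisor: int) -> float:
    """Interrupt (and therefore sample) rate for a given PIT reload value."""
    return float(PIT_HZ / pit_divisor)


def cycles_left(pit_divisor: int, isr_overhead_cycles: int) -> int:
    """Cycles per sample that remain for everything other than the interrupt."""
    return cpu_cycles_per_sample(pit_divisor) - isr_overhead_cycles


if __name__ == "__main__":
    divisor = 72
    print(f"{sample_rate_hz(divisor):.0f} Hz, "
          f"{cpu_cycles_per_sample(divisor)} cycles per sample, "
          f"{cycles_left(divisor, 150)} left after a hypothetical 150-cycle ISR")
```

A reload value of 72 gives about 16.6 kHz and only 288 CPU cycles per sample, so whatever the real per-interrupt cost is, it comes out of a very small budget.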

The other way to run code at regular intervals is to count cycles (as the 4 channel player in 8088 MPH does). This is much lower overhead and leaves enough time to do a significant amount of other work (like mixing 4 channels of audio). However, it's really difficult to get it working in a way that is portable to faster machines. And you have to write your code in such a way that the audio and video parts can be statically interleaved. In other words, the work done to output the video to the screen must be broken up into tiny chunks of known execution length (maybe writing around 16 bytes to VRAM at most). So this technique lends itself best to effects that involve repeating a small section of code, where each iteration takes the same amount of time.
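
A build-time scheduler for that kind of static interleaving could look roughly like the sketch below. This is not the tool used for 8088 MPH; the function name, the chunk format and the cycle costs are all made up for the example. The idea is only that every slot starts with one speaker write and is padded so that all slots take exactly the same number of cycles.

```python
def interleave(chunks, period, sample_cost):
    """Pack fixed-cost work chunks into slots of exactly `period` cycles.

    chunks: list of (name, cycles) pairs in the order they must run.
    Each slot starts with one speaker write costing `sample_cost` cycles and
    ends with a "pad" entry that burns whatever cycles are left over.
    """
    slots = []
    current, used = [("sample", sample_cost)], sample_cost
    for name, cycles in chunks:
        if sample_cost + cycles > period:
            raise ValueError(f"chunk {name!r} can never fit in one slot")
        if used + cycles > period:
            current.append(("pad", period - used))  # fill the slot to full length
            slots.append(current)
            current, used = [("sample", sample_cost)], sample_cost
        current.append((name, cycles))
        used += cycles
    current.append(("pad", period - used))
    slots.append(current)
    return slots


if __name__ == "__main__":
    work = [("a", 100), ("b", 120), ("c", 90)]  # hypothetical video chunks
    for slot in interleave(work, period=288, sample_cost=40):
        print(slot)
```

This only works because each chunk has a known, fixed cost, which is why variable-length work is so awkward to fit into the scheme.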

The video rendering code in 8088 Domination is very different - it's mostly doing runs of "rep movsw" and "rep stosw" with a smattering of a few other instructions. The lengths of these runs aren't fixed - they're optimized to make the video update as smoothly as possible and avoid using too much disk bandwidth. So interleaving the PC speaker code with those wouldn't really work too well.

Perhaps if the encoder were substantially reworked to generate its output as sample-sized "chunks" then a reasonable Bad Apple could be done. The quality would suffer a lot compared to Domination, obviously (though it's hard to tell exactly how much without trying it), but it might be an interesting project.

Reply 79 of 136, by HunterZ

reenigne wrote:

One way to run some code at regular intervals is to use the timer interrupt. That works great, except for the fact that there is a lot of overhead to using an interrupt - at a decent sample rate, there isn't really time to do anything else at the same time on a 4.77MHz machine.

Yes, this is why Dungeon Master and Axe of Rage (aka Barbarian II) only had cool PC speaker music on their title screens.