VOGONS


First post, by superfury

Rank: l33t++

I want to improve the rendering speed of my VGA-compatible renderer, but so far I can't find any way to make it faster:
https://bitbucket.org/superfury/unipcemu/src/ … /vga/?at=master

The VGA_ActiveDisplay_timing function in vga_renderer.c seems to be the part that is heaviest on a 2.0GHz dual-core CPU.

This is the calling module, which processes all the basic timings (it calls the rendering modules separately and combines them into the emulator):
https://bitbucket.org/superfury/unipcemu/src/ … le-view-default

It all starts at the updateVGA(timepassed) function, which is called after each emulated x86 instruction with the time the CPU took to execute it, in nanoseconds. The time is needed because the VGA and compatible cards use their own clocks (25MHz VGA, 28MHz VGA, 14MHz CGA, the MDA clock and the Tseng clock extensions).

Can anyone see a way to improve speed on slow(er) processors? (Relatively slow, that is: you currently need a 4.0GHz Intel i7-4790K CPU to even run at 100% speed with the current optimization.)

Eventually I want it to be fast enough to run at decent speeds on a 333MHz PSP CPU. Is that even possible with such a slow CPU while keeping the cycle-accurate emulation model?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 5, by reenigne

Rank: Oldbie

superfury wrote:

It all starts at the updateVGA(timepassed) function, which is called after each emulated x86 instruction

I think that's likely to be your problem right there. Does it get faster if you call it only when absolutely necessary (i.e. at the end of the frame or when the CPU executes an instruction that accesses video RAM or registers)?
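A minimal sketch of that "only when necessary" idea, assuming a much simpler video device than the real one. All names here (LazyVideo, video_add_time, video_catch_up, video_access) are hypothetical and not taken from UniPCemu. The per-instruction call only adds to a time counter, and the expensive per-clock work is deferred until something can actually observe the video state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state; none of these names come from UniPCemu. */
typedef struct {
    double   pending_ns;   /* emulated time that has not been rendered yet */
    uint64_t clocks_run;   /* stand-in for "pixel clocks rendered so far" */
    double   ns_per_clock; /* pixel clock period in nanoseconds */
    unsigned catchups;     /* how often the expensive path actually ran */
} LazyVideo;

/* Called after every CPU instruction: bookkeeping only, no rendering. */
static void video_add_time(LazyVideo *v, double ns)
{
    assert(ns >= 0.0);
    v->pending_ns += ns;
}

/* Runs all pixel clocks that are due and keeps the leftover time. */
static void video_catch_up(LazyVideo *v)
{
    uint64_t clocks = (uint64_t)(v->pending_ns / v->ns_per_clock);
    if (clocks == 0)
        return;
    v->pending_ns -= (double)clocks * v->ns_per_clock;
    v->clocks_run += clocks; /* a real renderer would draw `clocks` pixels here */
    v->catchups++;
}

/* Called before any VRAM or video register access, and at the end of a frame. */
static void video_access(LazyVideo *v)
{
    video_catch_up(v);
    /* ...the actual read or write would follow here... */
}
```

With a 40 ns pixel clock, three instructions of 100 ns each followed by one video_access() run the expensive path once for 7 clocks and leave 20 ns pending, instead of entering the renderer three times. Whether this is safe depends on catching every access that can observe the video state (status register polls included), which is an assumption this sketch does not check.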

For video stuff specifically, there are also likely to be big gains from moving stuff from the host's CPU to its GPU - not sure if that is an option in your case.

Of course, you should never do any performance optimization without looking at the program with a profiler first, otherwise you don't know where to concentrate your efforts.

Reply 2 of 5, by superfury

It currently updates its state whenever a timeout occurs, i.e. when the emulated time (kept as a simple double-precision floating-point number of nanoseconds) exceeds one pixel clock period (e.g. 1/28,000,000 of a second at a 28MHz pixel clock, so about every 35.7ns of emulated time). It takes the time the CPU emulated (the total CPU clocks executed for the current instruction, converted into nanoseconds) and adds it to the time left over from earlier updates. Once the accumulated time equals or exceeds one pixel clock period (about 35.7ns at 28MHz), it divides the accumulated time by that period to get the number of pixel clocks it needs to run, and stores the remainder of the division back into the accumulator for future timing. It then executes that number of pixel clocks. Each pixel clock draws a pixel on the screen when applicable (that depends on the current state of the CRT: active display, overscan, retrace, etc.).

Although this is relatively CPU-heavy, it isn't the part that takes the most time according to the Visual Studio profiler. Most of the time is spent updating RAM locations etc., i.e. loading the next data from VRAM (4 bytes loaded every character clock, depending on the (S)VGA settings).
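The accumulate, divide and keep-the-remainder scheme described above could look roughly like this. It is only a sketch: PixelClockTimer and timer_advance are made-up names, and the usage below assumes a nominal 25MHz clock because its 40ns period keeps the arithmetic exact.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timer; the names are illustrative, not UniPCemu's. */
typedef struct {
    double accumulated_ns; /* time left over from earlier calls */
    double ns_per_clock;   /* pixel clock period, e.g. 40.0 for 25MHz */
} PixelClockTimer;

/* Adds the time of one CPU instruction and returns how many whole
 * pixel clocks are now due. The remainder stays in the accumulator. */
static uint32_t timer_advance(PixelClockTimer *t, double elapsed_ns)
{
    assert(t->ns_per_clock > 0.0 && elapsed_ns >= 0.0);
    t->accumulated_ns += elapsed_ns;
    if (t->accumulated_ns < t->ns_per_clock)
        return 0; /* cheap early exit: no pixel clock has elapsed yet */

    /* One division per call; the cast truncates toward zero. */
    uint32_t clocks = (uint32_t)(t->accumulated_ns / t->ns_per_clock);
    t->accumulated_ns -= (double)clocks * t->ns_per_clock;
    return clocks;
}
```

With a 40ns period, advancing by 30ns returns 0, a second 30ns returns 1 with 20ns left over, and a further 100ns returns 3 with nothing left over. For periods that are not exactly representable as a double, an integer time base is one possible way to avoid accumulated rounding, but that is a design choice outside what the post describes.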

Is there any way to speed it up? Or is that practically impossible without breaking accuracy?

Btw, I've noticed there's also a discussion here that says there are no CRT emulators out there. But isn't a CRT just a simple beam tracing from left to right, top to bottom across the screen, drawing pixels at a specified speed? So essentially any emulator that creates its display this way is a 'CRT emulator'? The way separate pixels are handled is kind of simplified, though: on a real monitor, sets of 3 red/green/blue 'pixels' (at different angles, depending on the monitor) light up at different strengths to form a pixel, while most emulators (including mine) only show the result of those 3 R/G/B pixels becoming one pixel. Still, I've yet to see any emulator going as far as mine in trying to accurately plot pixels by the clock and pixel. Most emulators only draw entire screens or lines directly from VRAM, which may or may not be cycle-accurately timed.

Btw, the only places where my emulator handles whole screens and lines at a time are converting entire data lines to RGB display (only with CGA/MDA) and scaling screens for display every frame. The scaling is handled by UniPCemu's GPU core, which converts the rendered screens, an entire screen at a time, into the correct display resolution (on real hardware the monitor itself does this by stretching the rendered display across the screen, depending on the retrace signals). So essentially the VGA (and its CRTC emulation) contains the RAM logic and the rendering logic (rendering the (S)VGA/CGA/MDA RAM into pixels at a specified dot clock rate), while the GPU core processes those frames at every vertical retrace and converts them into the display the user sees, at its proper aspect ratio and resolution. That resolution depends on the emulated monitor and settings: in this case a CGA display (a special custom resolution found in one of reenigne's blog articles), or an 800x600 (VGA), 1024x768 (SVGA) or 1920x1080 (SVGA) display. The virtual display (the one the VGA renders to) is currently limited to 2048x2048 pixels, which is more than enough to contain frames up to Full HD as rendered by the current (S)VGA rendering.

The rendering process runs at a constant rate (up to 28M pixels per second when using the VGA; higher rates can be selected on the Tseng video cards). It essentially draws blocks of pixels (or moves the beam) at that rate, although it will move in small blocks when the CPU is running slower than the (S)VGA. For example, if the CPU executes an instruction whose time is a multiple of the VGA clock, it will spend that many VGA clocks in a loop before executing the next instruction, dividing only once. Each clock is still handled separately, though: each clock does all the work the VGA needs to do in that clock (updating states etc.) before moving on to the next clock.

Reply 3 of 5, by reenigne

superfury wrote:

the most time is spent with updating RAM locations etc.(loading the next data from VRAM(4 bytes of data loaded every character clock(depending on (S)VGA settings))).

That doesn't seem like it should be a bottleneck to me. What is it doing that makes this slow? Is there a complicated calculation to get VRAM address from current raster beam position or something like that? If so, perhaps this could be done incrementally instead. Otherwise, take a look at the generated code for the hot part and see if you can figure out how to make it faster.
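To illustrate what "incrementally" could mean here, the sketch below compares computing the VRAM address from the beam position on every character clock with stepping a counter. It is heavily simplified (linear addressing only, no character row scan counter, no word/doubleword modes, no wrap-around), and every name in it is hypothetical rather than taken from UniPCemu.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified address state; not UniPCemu's actual layout. */
typedef struct {
    uint32_t start_address; /* address of the first character of the frame */
    uint32_t row_pitch;     /* address step from one scanline to the next */
    uint32_t row_base;      /* address at the start of the current scanline */
    uint32_t current;       /* address used for the current character clock */
} AddressCounter;

/* Direct form: a multiply and two adds on every character clock. */
static uint32_t address_from_position(const AddressCounter *a,
                                      uint32_t line, uint32_t column)
{
    return a->start_address + line * a->row_pitch + column;
}

/* Incremental form: reload once per frame... */
static void counter_begin_frame(AddressCounter *a)
{
    assert(a->row_pitch > 0);
    a->row_base = a->start_address;
    a->current = a->row_base;
}

/* ...one increment per character clock... */
static void counter_next_character(AddressCounter *a)
{
    a->current++;
}

/* ...and one add per scanline. */
static void counter_next_line(AddressCounter *a)
{
    a->row_base += a->row_pitch;
    a->current = a->row_base;
}
```

Walking a few scanlines with both forms yields the same addresses, but the incremental form does only an increment in the hot path. Whether the real bottleneck is address calculation at all is something only the profiler output and the generated code can confirm.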

superfury wrote:

Btw, I've noticed there's a discussion out here also that says there's no CRT emulators out there,

Yeah, there are lots of CRT emulators (I've written a couple myself) - it's just that some are more accurate than others in terms of what effects they emulate.

superfury wrote:

but isn't a CRT only a simple beam tracing from left to right, top to bottom on the screen drawing pixels at a specified speed?

As always, the devil is in the details. Depending on the fidelity you want to achieve, you might want to take into account such effects as (roughly in order of difficulty) horizontal and vertical phase-locked loops, scanlines/focus, bloom, aperture/slot mask, curvature and phosphor persistence.

superfury wrote:

the actual way separate pixels are handled is kind of simplified (sets of 3 red/green/blue 'pixels' (at different angles, depending on the monitor)

The beam angle is (I think) one that doesn't actually matter - there really isn't any visible correlation between the angle at which the electron beam hits the phosphors and the light that results. CRTs have very wide viewing angles and no colour variation with viewing angle to speak of.

superfury wrote:

Although I've yet to see any emulator going as far as mine, trying to accurately plot pixels by the clock&pixel

I think there are quite a few. Off the top of my head, I think MAME and PCem/86Box try to do this for PCs. Most emulators for BBC Micro and C64 (and probably quite a few other micros) will do this since it's necessary for important games and demos. And of course it's absolutely essential for any Atari 2600 emulator.

Reply 4 of 5, by superfury

PCem you said? Looking through the (S)VGA sources, I've found:
https://bitbucket.org/pcem_emulator/pcem/src/ … le-view-default

When I look at those rendering functions, they don't seem accurate at all. They draw entire scanlines in one go, instead of timing each pixel separately. My emulator does draw them one at a time (timed by CPU clock cycles or by time, depending on the setting in the Settings menu).

Reply 5 of 5, by reenigne

superfury wrote:

PCem you said? Looking through the (S)VGA sources, I've found:
https://bitbucket.org/pcem_emulator/pcem/src/ … le-view-default

When I look at those rendering functions, they don't seem accurate at all. They draw entire scanlines in one go, instead of timing each pixel separately. My emulator does draw them one at a time (timed by CPU clock cycles or by time, depending on the setting in the Settings menu).

Ah, I guess it's just cycle-accurate (modulo CPU cycle-accuracy) for CGA. Unlike your emulator, PCem has completely separate CGA and SVGA code. There's not much point in making SVGA cycle-accurate as raster tricks are much less useful there.