HOWTO: Possible Lagless VSYNC for Emulator Devs (implemented in WinUAE/etc), via beam-raced tearingless VSYNC OFF \ VOGONS


First post, by mdrejhon

Posted on 2023-07-02, 23:28

Rank: Newbie
Posts: 32
Joined: 2023-07-01, 10:18

(I'm the founder of Blur Busters and TestUFO. I was making a reply to author of MartyPC emulator, but realized this was worth its very own thread topic)

PURPOSE OF POST: Since there isn't an ultra-lowlag DOS emulator yet, this HOWTO is for PC/DOS emulator developers who are currently implementing beamraced graphics-adaptor engines (e.g. MartyPC) and are considering synchronizing the emulator raster to the real display's raster. Such algorithms can reduce the input lag of an emulator to become nearly identical to the original machine's.

Reply to MartyPC emulator developer:

GloriousCow wrote on 2023-07-01, 18:14:
mdrejhon wrote on 2023-07-01, 10:24:

Some emulators (Toni's WinUAE and Tom Harte's CLK) now use my findings to create a new lagless VSYNC mode with sub-refresh latencies. They Present() frameslices during VSYNC OFF mode, sending later blocks of scanlines to the GPU even while the GPU is already video-outputting earlier scanlines. This comes from an invention of mine called "frameslice beam racing".

This would be extremely cool to try to implement. Right now I'm double-buffering so I know there's a bit of latency, although at 60fps with most old PC titles it's probably not much of a concern(?) But since I'm already drawing the CGA in a clock by clock fashion there's nothing technically stopping me from sending as many screen updates as I want whenever, but I have to see if it would be possible with wgpu/Vulkan. I'm sort of at the mercy of that library, unless I port to another front-end. Can it be done through SDL?

If you can access at least these:

VSYNC OFF tearing (tearlines = raster)
Vertical Blanking timings OR raster poll
Microsecond counter (RDTSC, getMicros(), curTime.tv_usec)
Ability to time reasonably accurately (e.g. CPU busywait loop before frame present+flush).

Then yes. Any API -- Vulkan, DirectX, OpenGL, SDL, etc. Yes, you can "Atari TIA" those APIs!
You get a bit of fun, unsolvable raster jitter (due to the timing imperfections of modern computing), but beam racing feats like Kefrens Bars are still recognizable.

A raster poll (e.g. D3DKMTGetScanLine() on Windows systems) is useful -- But you can skip it by using time offsets between vertical blankings. Your raceahead safety margin can handle the error margin of not knowing exact scanline number. There are VSYNC listeners available on most platforms (Linux, Android, PC, Mac), which can be used to create the timestamps necessary to create the "estimated raster" to allow you to control the exact position of a VSYNC OFF tearline (YouTube video example) to be near the real-world raster.
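
As a rough illustration of the "estimated raster" idea above, here is a minimal Python sketch. It assumes you already have a timestamp of the most recent VSYNC and a measured refresh period; `estimate_scanline` and its parameters are hypothetical names, not part of any real API, and `total_scanlines` is assumed to include the blanking lines (e.g. 1125 for a common 1080p 60Hz mode).

```python
def estimate_scanline(now_us, last_vsync_us, refresh_period_us, total_scanlines):
    """Guess how many scanlines have been output since the last VSYNC timestamp.

    Assumption: the scanout advances at a constant rate through all
    total_scanlines (visible + blanking) during one refresh period.
    """
    # Wrap around so the estimate still works if a VSYNC timestamp was missed.
    elapsed_us = (now_us - last_vsync_us) % refresh_period_us
    return int(elapsed_us * total_scanlines / refresh_period_us)


if __name__ == "__main__":
    # Half a refresh cycle after VSYNC on a 60Hz, 1125-line signal.
    print(estimate_scanline(8333, 0, 16667, 1125))
```

The estimate is only as good as the VSYNC timestamps behind it, which is why the raceahead safety margin has to absorb the error.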

It doesn't matter which GPU graphics API you use, as long as you have VSYNC OFF and can quickly flush your simple frame out at a precise CPU-busywaited moment, during high/realtime priority mode. You MUST flush, to avoid the asynchronous behaviors of a GPU. The RDTSC timestamp of the return from a flush lines up with a tearline that is almost perfectly time-aligned (within a mere 4 pixel rows) to the real world raster! My GeForce RTX 3080 is more heavily pipelined, and Windows 11 is a little less realtime than Windows 7 was (when Tearline Jedi first ran), so I'm weirdly getting fewer frameslices per second on my RTX 3080 than my GTX 1080 Ti -- there is a high overhead cost of trying to do 5000-8000fps of 1-pixel rows (vertically stretched between two tearlines).
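
The busywait-then-present-then-flush pattern described above could be sketched roughly like this in Python. This only shows the shape of the timing logic: `present_and_flush` is a hypothetical callback standing in for whatever present + flush calls your graphics API provides, and a real emulator would do this in native code on a high-priority thread.

```python
import time


def busywait_until(target_ns):
    """Spin on the high-resolution counter (no sleeping) until target_ns."""
    while True:
        now = time.perf_counter_ns()
        if now >= target_ns:
            return now


def present_frameslice_at(target_ns, present_and_flush):
    """Busywait to the chosen moment, present + flush, and return the
    timestamp of the return from the flush (roughly the tearline time)."""
    busywait_until(target_ns)
    present_and_flush()  # hypothetical: Present() followed by a blocking flush
    return time.perf_counter_ns()


if __name__ == "__main__":
    start = time.perf_counter_ns()
    done = present_frameslice_at(start + 2_000_000, lambda: None)
    print("returned after", (done - start) / 1e6, "ms")
```

Logging the returned timestamps over many frameslices gives a cheap way to see how jittery the platform is.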

In treating a GeForce like an Atari TIA, I successfully got Kefrens Bars working on both OpenGL and DirectX, in the MonoGame engine, in C# programming. Here's a video:

Every single "pixel row" of the Kefrens bars is a VSYNC OFF tearline! This screen is impossible to screenshot: if you hit PrintScreen, you only get one pixel row of the Kefrens bars, vertically stretched to full screen height.

(Nomenclature here, "VSYNC OFF" doesn't remove the VSYNC from signal. It simply means to tell the graphics API to ignore VSYNC when timing the output to video port.)

Watch the crazy fun raster jitter when I move the window.

For Kefrens, I had to intentionally show every single tearline (8000 tearlines per second) to do it.

But for emulators, you intentionally hide the tearing. You simply keep the tearing AHEAD of the realworld raster, so you have a perfect tearingless VSYNC OFF mode that looks like VSYNC ON.
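
To make the "keep the tearing ahead of the raster" idea concrete, here is a small Python sketch of a frameslice schedule. It is an illustration under assumptions, not WinUAE's actual code: `frameslice_schedule`, `tear_is_hidden`, and the raceahead margin expressed in scanlines are all hypothetical. Each frameslice is presented while the real raster is still above the rows that changed, so the tearline lands in a region where old and new frames are identical.

```python
def frameslice_schedule(visible_lines, slices, margin_lines):
    """Return (top_row, bottom_row_exclusive, present_at_raster_row) per slice.

    present_at_raster_row is the real-raster row at which the slice should be
    presented; a negative value means "during blanking, before visible row 0".
    """
    height = visible_lines // slices
    schedule = []
    for i in range(slices):
        top = i * height
        bottom = visible_lines if i == slices - 1 else top + height
        schedule.append((top, bottom, top - margin_lines))
    return schedule


def tear_is_hidden(raster_row_at_present, slice_top):
    """The tear is invisible if the raster has not yet reached the new rows."""
    return raster_row_at_present < slice_top


if __name__ == "__main__":
    for entry in frameslice_schedule(1080, 4, 100):
        print(entry)
```

A bigger margin tolerates more timing jitter at the cost of slightly more latency.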

Slow systems (e.g. GPU on a Raspberry Pi) may only be able to do low frameslice counts (e.g. 240fps -- 4 frameslices for 1/4 refresh cycle lag). But subrefresh latency on a Raspberry Pi is amazing to achieve.

Fast systems (e.g. NVIDIA or Radeon GPU) can easily do dozens of frameslices per refresh cycle, sometimes hundreds! My GTX 1080Ti could do about 100-125 frameslices per refresh cycle, enough for a true-raster Kefrens Bars demo. It's crazy to treat a 2023 GPU like an Atari TIA in C#!
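
The arithmetic behind those frameslice counts can be sketched as below. The function names are hypothetical, and the lag figure is only the frameslice-height contribution described above (one frameslice's worth of scanout time); any raceahead margin adds on top of it.

```python
def frameslice_lag_ms(refresh_hz, frameslices):
    """Scanout time of one frameslice, in milliseconds."""
    return 1000.0 / refresh_hz / frameslices


def presents_per_second(refresh_hz, frameslices):
    """How many Present() calls per second the GPU has to keep up with."""
    return refresh_hz * frameslices


if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, "slices:", round(frameslice_lag_ms(60, n), 3), "ms,",
              presents_per_second(60, n), "presents/sec")
```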

This algorithm is already implemented in WinUAE Amiga Emulator, as the "lagless vsync" option.

The algorithm is very highly scalable, depending on how performance-jittery your platform is.

Slow systems, or systems running lots of background software, will need larger raceahead margins / taller frameslices.

Fast systems with precise behaviors (clean Windows 10 installs with certain GPUs) can do amazing sub-millisecond latency Present()-to-GPU-signal!

It'd be cool to someday add to RetroPie for low-spec machines, even if it can only do 2-4 frameslices per refresh cycle -- that's still subrefresh latency without RunAhead (which is very CPU-hungry).

I have a ~$500 BountySource to add this to RetroArch: https://github.com/libretro/RetroArch/issues/6984

Emulator developers don't usually know how to beam-race modern GPUs, due to the massive variances of end-user systems. But you can make it a toggleable ON/OFF option, so people who get too many glitches, can simply keep it turned off.

Best Practices for Emulator Developers (new findings as of 2023)

Only works if you can get (A) intentional tearing, (B) raster poll or estimate the raster as offsets between vsync, and (C) precise timing via microsecond counter.
For Windows, you typically need full screen exclusive + VSYNC OFF, in order to get intentional tearing.
For estimating a raster poll as offsets between vsync's, you need to find a reliable way to get fairly accurate vsync timestamps, whether via a CPU thread, a listener, a VSYNC estimator that averages over several refresh cycles (and ignores missed vsyncs), a startup VSYNC ON-listener that goes into deadreckoning mode when switching to VSYNC OFF, etc. Many ways to estimate a raster poll in a cross-platform manner (platform-specific wrappers).
Make it configurable (on/off, frameslice count, raceahead margin, etc), perhaps conservatively autoconfigured by an initial self-benchmarker.
Use CPU busywaits, not timers. Timers aren't always precise enough.
Always flush after frame presentation, or make flush configurable. GPUs are pipelined, so you must flush to get deterministic-enough behaviors for beam racing. The return from the flush will be approximately raster-time-aligned to the tearline, and jitter in the return timestamps can be used to guesstimate raster jitter (and maybe automatically warn/disable beam racing if it jitters too much).
Detect your screen refresh rate and compensate. If your VSYNC dejitterer has a refresh rate estimator, use that! (Examples of existing VSYNC-dejittering estimators include https://github.com/blurbusters/RefreshRateCalculator and https://github.com/ad8e/vsync_blurbusters/blo … /main/vsync.cpp ) Usually it's best that the refresh rate is the same as the refresh rate you want to emulate. If there's a refresh rate mismatch, you can fast-beamrace specific refresh cycles (by running your beamraced emulator engine faster in sync with the faster refresh cycles). WinUAE does this on 120Hz, 180Hz and 240Hz (NTSC) and 100Hz, 150Hz, 200Hz (PAL). At 120Hz, emulator-beamracing every other refresh cycle (so emulator is idling/paused every other refresh cycle).
(OPTIONAL) Detect your current screen rotation and make sure realworld scanout direction is same as emulator scanout direction. Signal is only top-to-bottom in default rotation. So if you are an arcade emulator, and you want to frameslice beamrace Galaga, you will have to rotate your LCD computer monitor for subrefresh latency. There's display rotation APIs on all platforms that you can call to check. For legacy PC emulator developers, this will be largely unnecessary as it's always top-to-bottom scanout direction until the first version of Windows that supported screen rotation. So simply disable "lagless vsync" whenever you're not in default rotation.
(OPTIONAL) Detect whether VRR (FreeSync/G-SYNC) is enabled, and warn/turn off/compensate beamracing if VRR is enabled. If VRR is enabled, you can disable beamracing, or simply interrupt your VRR refresh cycle via VSYNC OFF. You can still do multiple emulated refresh rates flawlessly via 60Hz and 70Hz via VRR and still beamrace-sync VRR refresh cycles (with some caveats). WinUAE is currently the only emulator successfully beamracing GSYNC/FreeSync. For developers not familiar with how a VRR display refreshes -- it is not initially intuitive to developers who don't fully understand VRR yet. If you choose to beam-race-sync (lagless vsync) your VRR refresh cycles, you need "GSYNC + VSYNC OFF" in NVIDIA Control Panel. What happens is that the first Present() (while the display is idling waiting for new refresh cycles) triggers the display to start the refresh cycle, and subsequent sub-refresh Present() will tearline-in new frameslices if the current VRR refresh cycle is scanning-out. The OS raster polls (e.g. D3DKMTGetScanLine) will also still work on VRR refresh cycles. I do, however, suggest not trying to make your "lagless vsync" algorithm VRR compatible initially -- start with the easy stuff (emuHz=realHz) and iterate-in the new capabilities later.
Turn off power management (Performance Mode), or warn user if battery-saving power modes are enabled.
GPU drivers will automatically go into power management if the GPU idles for a millisecond. This adds godawful timing jitter that kills the raster effects. Don't use too-low frameslice counts on high-performance GPUs; you only want hundreds of microseconds (or less) to elapse from a return from Flush() before the next Present(). If you must idle the GPU a lot, then prod/thrash the GPU with changed frames (change a few dummy pixels in an invisible area, e.g. ahead of the beam) about 1-2 milliseconds before your timing-accurate frameslice. So raster effects with only a few splits per screen (e.g. splitscreens) will perform timing-worse than raster effects that are continuous (e.g. Kefrens), due to the mid-refresh-cycle power management a GPU is doing.
CPU priority, application priority, and thread affinity helps A LOT. "REALTIME" priority is ideal, but be careful not to starve the rest of the system (e.g. mouse driver = sluggish cursor). "HIGH PRIORITY" for both process and for the beamrace present-flush thread, is key.
It's best to overwrite a copy of the previous refresh-cycle emulator framebuffer with new scanlines (and frameslice-beamrace a fragment of that framebuffer), rather than blank/new frame buffer. That way, raster glitches from momentary performance issues (like a virus scanner suddenly running, and beam racing too late) will show as simple intermittent tearing artifacts instead of black flickers.
The first few scanlines of a refresh cycle are pretty tricky because of background processing done by desktop compositor systems like Windows dwm.exe. I can't ever get a tearline to appear in the first ~30 scanlines of my GeForce RTX on Windows 11. Factor this in, possibly by using a larger raceahead margin, or taller frameslices, or variable-height frameslices (taller at the top of the screen).
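
Two of the practices above (a VSYNC estimator that averages over several refresh cycles while tolerating missed vsyncs, and beamracing only every Nth refresh cycle on a faster display) could be sketched like this in Python. This is a simplified illustration with hypothetical function names; it is not the algorithm used by the RefreshRateCalculator or vsync_blurbusters projects linked above.

```python
def estimate_refresh_period_us(vsync_timestamps_us, nominal_period_us):
    """Average the refresh period over several jittery VSYNC timestamps.

    A gap of roughly N nominal periods is counted as N refresh cycles, so a
    missed VSYNC does not skew the average.
    """
    total_time = 0.0
    total_cycles = 0
    for prev, cur in zip(vsync_timestamps_us, vsync_timestamps_us[1:]):
        delta = cur - prev
        cycles = round(delta / nominal_period_us)
        if cycles < 1:
            continue  # duplicate or bogus timestamp, skip it
        total_time += delta
        total_cycles += cycles
    if total_cycles == 0:
        return float(nominal_period_us)
    return total_time / total_cycles


def should_beamrace_cycle(refresh_index, display_hz, emulated_hz):
    """On a 120Hz display emulating 60Hz, beamrace every other refresh cycle."""
    ratio = max(1, round(display_hz / emulated_hz))
    return refresh_index % ratio == 0


if __name__ == "__main__":
    stamps = [0, 16670, 33330, 66670, 83335]  # one VSYNC missed near 50000
    print(estimate_refresh_period_us(stamps, 16667))
    print([should_beamrace_cycle(i, 120, 60) for i in range(4)])
```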

(P.S. On a related "beam racing topic" - if anyone wants to create the world's first crossplatform rasterdemo for a future democomp (superior to my incomplete WIP Tearline Jedi), I'm happy to help out, as long as I'm in the credits. Though, it's tough if you can't predict the demo platform, and you don't have much time to do impressive graphics when turning a GPU into a TIA, especially at only 25-50-100-200 "pixel rows" in Kefrens, depending on how performant the platform is. But this is crossplatform Kefrens!)

Hope my tips help your emulator development!

I must admit though, input lag on an IBM 5150, especially in a demoplayer emulator, is probably not a high consideration. However, if your emulator veers into a wide variety of emulation (e.g. Keen, Wolfenstein 3D and DOOM), then you should consider adding this lagless vsync mode. A lot of fast-twitch games (FPS, fighting games, pinball, etc) really benefit from these latency reductions.

Then you get to toot your horn as the "world's first subrefresh-latency DOS emulator!" and "Emulation-to-photons within 1ms of original machine latency!"

Last edited by mdrejhon on 2023-07-14, 01:09. Edited 6 times in total.

Founder of www.blurbusters.com and www.testufo.com
- Research Portal
- Beam Racing Modern GPUs
- Lagless VSYNC for Emulators

Reply 1 of 20, by Rincewind42

Posted on 2023-07-03, 23:09

Rank: Member
Posts: 137
Joined: 2022-03-29, 01:42

Good stuff; I poked the VICE guys about this already to add it to their Commodore home computer emulation suite.

Yes, in WinUAE this works like a dream. With a frameslice of 8, I'm getting hardware-like latencies which makes a huge difference in fast-paced action games. A latency more than about 5ms is especially noticeable in pinball and break-out/Arkanoid style games, making these games next to unplayable for me before the introduction of lagless vsync. This was the last piece of the puzzle; now with a good CRT shader setup I'm getting an authentic, hardware-like Amiga experience (so much that I wouldn't even bother with hardware anymore).

Not trying to shoot this down or anything, just noting that the effective latency in fast-paced games in DOSBox is already quite low. I'm guessing that's because the emulation progresses in 1ms chunks, but I haven't messed around in that area of the code to understand the exact reason. Nevertheless, Arkanoid-style games are *very* responsive and playable in DOSBox even with bog-standard vsync enabled (in DOSBox Staging), which is not the case in WinUAE and VICE with the standard vsync.

That's a very nice collection of implementation tips; maybe I'll give a go at introducing this to DOSBox Staging at some point. It's just that the payoff for me is rather low because the latency seems good enough already, plus I don't care much for non-C64/Amiga action games; I almost exclusively play RPGs and adventure games on DOS.

DOS: Soyo SY-5TF, MMX 200, 128MB, S3 Virge DX, ESS 1868F, AWE32, QWave, S2, McFly, SC-55, MU80, MP32L
Win98: Gigabyte K8VM800M, Athlon64 3200+, 512MB, Matrox G400, SB Live
WinXP: Gigabyte P31-DS3L, C2D 2.33 GHz, 2GB, GT 430, Audigy 4

Reply 2 of 20, by mdrejhon

Posted on 2023-07-04, 11:15


Rincewind42 wrote on 2023-07-03, 23:09:
That's a very nice collection of implementation tips; maybe I'll give a go at introducing this to DOSBox Staging at some point. It's just that the payoff for me is rather low because the latency seems good enough already, plus I don't care much for non-C64/Amiga action games; I almost exclusively play RPGs and adventure games on DOS.

Admittedly, very few DOS games use any form of beam racing so it is quite possible benefits may be more marginal than for other 8bit and 16bit computers/consoles that frequently use beam racing.

However, MartyPC did ask, and this post is a useful future knowledge base for posterity for whomever wishes to partake in this feat.

Surge-executing emulation right before a refresh cycle is another way to reduce latency. This is called the "input delay" technique, even though it does not use raster-accurate timings. This may be more effective for most DOSBox apps, and it is simpler: you simply idle after a frame presentation. As one example, if 1/70sec of 4.77MHz DOS emulation can be emulated in 3ms, you delay for approx 10ms-11ms after the last 70Hz refresh cycle (if the display is configured to a 70Hz custom refresh rate mode, where the refresh period is about 14.3ms), then execute the emulation slice of time (in 3ms), and now you have a frame ready to present to a very imminent fixed-Hz refresh cycle. Very low lag!
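
The delay in that example is just the refresh period minus the emulation time minus a safety margin. A minimal Python sketch, with a hypothetical function name and a safety margin parameter that is an assumption of this example:

```python
def input_delay_ms(refresh_hz, emulation_ms, safety_margin_ms=1.0):
    """How long to idle after a refresh before surge-executing the next frame."""
    period_ms = 1000.0 / refresh_hz
    return max(0.0, period_ms - emulation_ms - safety_margin_ms)


if __name__ == "__main__":
    # 70Hz display, 3ms to emulate one frame, 1ms of safety margin.
    print(round(input_delay_ms(70, 3.0), 2), "ms of idle before emulating")
```

If the emulation time is close to or above the refresh period, the delay clamps to zero and the technique gives no benefit.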

Some emulators do this as a configurable number for input delay. I am not sure if DOSBox already has this. That being said, if you have a VRR display, you are already effectively doing this, as frame presentation immediately refreshes the display. In VRR (FreeSync/G-SYNC), the display syncs to the frame presentation, rather than the other way around. If you play DOSBox in full screen via VRR, latency will likely already be deliciously low.

Nonetheless, lagless VSYNC techniques (frameslice beamracing) could benefit any games that utilize any form of beam racing (like Lemmings split screen in some modes?), especially if the system is not fast enough to surge-execute emulation frames much faster than the original machine (on a per refresh cycle basis).


Reply 3 of 20, by Rincewind42

Posted on 2023-07-04, 13:40


mdrejhon wrote on 2023-07-04, 11:15:

Some emulators do this as a configurable number for input delay. I am not sure if DOSBox already has this. That being said, if you have a VRR display, you are already effectively doing this, as frame presentation immediately refreshes the display. In VRR (FreeSync/G-SYNC), the display syncs to the frame presentation, rather than the other way around. If you play DOSBox in full screen via VRR, latency will likely already be deliciously low.

Well, one of my colleagues has implemented various frame presentation modes and VRR support in DOSBox Staging, but I admit the whole thing does my head in a bit, I don't fully grasp the subtler details even after repeated attempts... It all started out as a feature to ensure more even frame-pacing in FPS games on non-VRR monitors that cannot even do fixed ~70Hz but only 60. All I know is that with a custom ~70.086 Hz refresh rate with simple vsync enabled I'm getting WinUAE lagless vsync level ultra-low latencies in fast Arkanoid-style games and Pinball Dreams in DOSBox Staging already (in addition to buttery smooth scrolling). Whereas Arkanoid or any action game in VICE is next to unplayable for me, it's so laggy... One of the reasons I still need my hardware C64 😎

DOSBox Staging has another interesting feature where you can force the DOS refresh rate to any rate. It does not work with everything, but it does with Build engine games AFAIK, so you can effectively brute-force a 1ms refresh rate by setting it to 1000Hz, then you can combine this with the VFR presentation mode for fixed-rate displays, or just use it with a VRR monitor. One of our users who has a nice VRR monitor is super into this and has written a very detailed guide, it might be relevant to your interests:
https://github.com/dosbox-staging/dosbox-stag … esh-rate-gaming

Reply 4 of 20, by mdrejhon

Posted on 2023-07-05, 02:44


Rincewind42 wrote on 2023-07-04, 13:40:
Well, one of my colleagues has implemented various frame presentation modes and VRR support in DOSBox Staging, but I admit the whole thing does my head in a bit, I don't fully grasp the subtler details even after repeated attempts...

As the resident expert here on "present-to-photons"...

...Let me attempt my method of explaining Variable Refresh Rate (VRR, FreeSync, G-SYNC).

1. Understanding The Scanout (non-VRR too) -- Sequential Refreshing From Top-to-Bottom

Firstly, let's remember all displays scan out sequentially (as a serialization of 2D image over a 1D cable or broadcast). Most also refresh sequentially "off the wire". This is still true for digital, even after analog era. Not all pixels refresh at the same time, see high speed videos at www.blurbusters.com/scanout

This has been true since the 1920s Farnsworth and Baird prototype TVs, but this is still true for 2020s high-Hz displays and DisplayPort cables too! In fact, it's still true for VRR too, as it is for fixed-Hz! It's amazing that raster scanout for displays is practically a century old -- we've been doing things the same way for ages. This Is The Way. Logical serialization of 2D data into 1D -- just like the reading order of a book, pixels (metaphorical letters) are delivered left-to-right, top-to-bottom. Pixels are delivered over a video cable like reading a book page (one refresh cycle) -- over a video cable (analog or digital) -- and typically refreshed onto the screen in the same sequence.

It's also simpler for electronics to refresh serially instead of in parallel. Good low-latency displays do an additional optimization to refresh synchronously -- video cable scanout is typically matched to display panel scanout. Good gaming LCDs can use rolling-window processing (a few scanlines for processing like scaling/etc) to refresh without buffering a full frame, to keep latency low between signal and panel.

Here's a diagram of how a typical non-VRR 60Hz display refreshes (whether a NTSC TV or a 1080p 60Hz or a 4K 60Hz signal and TV), 60Hz 60fps VSYNC ON via non-VRR

On a CRT and ultra-low-lag LCDs/OLEDs, this is both signal scanout AND screen scanout. (The scanouts are not necessarily synchronized, if the display buffers the signal to process it fully first, before doing its own custom refreshing pattern)

However, it is important to understand the old fashioned way of doing things before we talk about VRR. Make sure you understand the above, and the high speed videos at www.blurbusters.com/scanout -- this is a pre-requisite before I can successfully explain VRR behavior.

Now, mapping this out in a signal structure, this is what a signal looks like (both analog and digital):

Regardless of whether analog or digital,
Horizontal Sync (in the signal) tells the display to start a new pixel row (scanline).
Vertical Sync (in the signal) tells the display to start a new refresh cycle.

The porches are just padding (overscan). They formerly helped the beam re-accelerate before showing picture data, but in a digital signal they are essentially just padding (like comma separators). This was useful in older digital displays to give more time for the scaler chip in the display to prepare to accept a new pixel row, or to finish displaying the previous pixel row onto its own panel. While they're mostly vestigial now (a waste of bandwidth), they're still there, with a 1:1 temporal symmetry between analog and digital. This signal structure has been used for almost 100 years -- and is true in both analog and digital domains. This is why a bufferless 1:1 VGA-to-HDMI adaptor works: it's temporally the same pixel timing to the microsecond. Left-to-right, top-to-bottom, like a book: NTSC, PAL, SECAM, RGB, Component, VGA, DVI, HDMI, DisplayPort. It's been the same way for a century -- a natural serialization of 2D picture data onto a 1D wire (or broadcast).

Why is there lots of frame presentation lag on fixed-Hz displays? That's because of buffering at the software/GPU side. (e.g. "double buffering" found in VSYNC ON). Frame presentation in software won't always be aligned to the exact millisecond of the display/GPU's autonomous fixed-Hz refresh cycles, so the drivers (3D API) are forced to buffer the frames (input lag! delay!) until the display is refreshing. So a frame Present() will be lagged to at least the next refresh cycle (or two), depending on how many frames are buffered. Now you understand why there's lag on fixed-Hz displays!

(From a driver perspective, "VSYNC OFF" is simply nomenclature to ignore the signal VSYNC when presenting the frame. There's always a VSYNC pulse in the display signal regardless).

Make sure you understand this section, in order to understand how VRR is such a minor modification to the Old Way of Doing Things. In particular, see an example LCD rotating between 4 images (one per refresh cycle), at https://www.youtube.com/watch?v=rOyoOm4o82M ... LCDs still refresh sequentially top-to-bottom like a CRT at regular intervals (except without flicker). Some complex-refresh or multi-refresh displays (plasma, DLP) will buffer the sequential scanout from the cable and do their "refreshing special sauce". But most displays (LCD, OLED, MicroLED) refresh sequentially, usually top-to-bottom similarly to the order of pixels coming in via the cable. This is called the "scanout".

2. Now Understanding How VRR Refreshes

VRR displays allow you to have variable-time intervals between refresh cycles. Instead of the GPU/display autonomously refreshing on a traditional fixed schedule, the display scanout actually idles and WAITS for the software!

The exact millisecond that the software presents a frame (e.g. DirectX Present() API to indicate a completely finished / fully rendered frame), the GPU instantly begins transmitting the pixels out of the video output, without waiting at all. No fixed schedule. As long as the time intervals between frame presentations are within the refresh rate range of the display, the display changes the refresh rate to match the frame rate perfectly, at a low latency between presentation and actual scanout.

This diagram indicates what happens to the GPU output, after the application Present()'s a finished frame:

Another bonus is that all frames are delivered very FAST over the video cable (always at max Hz velocity from first to last pixel). So a 240Hz G-SYNC display means those "60fps 60Hz" frames (refresh cycles) are transmitted over the video cable in 1/240sec each. Low-lag 240Hz displays may (sometimes) be terrible at 60Hz fixed-Hz, but if you configure them to VRR, something magical happens -- whatever frame rate you spew from the software -- becomes the refresh rate on the display. And each individual frame is transmitted faster over cable, and refreshed faster onto the panel (unlike old 60 Hz fixed-Hz modes and old 60Hz displays where it took 1/60sec to display the first thru last pixel).

This is why 60fps emulator lag is lower on a 240Hz VRR display than on a 144Hz VRR display, because those "60Hz" frames are scanned out over the video cable in 1/240sec, and scanned out onto the screen in 1/240sec. Yes, this means the display is idling more between refresh cycles (a vertical blanking interval 3x+ bigger than visible image)

The emulator latency reductions of a high-refresh-rate VRR display are caused by two things at the same time.
(A) Display refreshes immediately upon frame presentation (so you don't need input delay techniques); and
(B) Scanout is very fast (maximum Hz). Those "70Hz" refresh cycles refresh in just 1/165 sec (on a 165Hz VRR) or 1/240 sec (on a 240 Hz VRR).

The two reasons (A) and (B) are why emulator latency is so fantastically absurdly low on high-Hz VRR displays.

Emulation on VRR can sometimes beat original-machine latency! This is despite emulator overheads, because a 1/240sec frame transmission to the display is over 12ms faster than a 1/60sec one. The turtle (emulation lag and digital lag) beats the rabbit (CRT and original machine) simply because of the sheer faster frame transmission (over cable) possible on the latest high-Hz displays, despite a slightly delayed frame transmission start. Two things combine: the lack of delay between frame presentation and refresh, AND the faster frame transport over the video cable / faster scanout. You need simple brute refresh rate (240Hz or higher) and all the VRR optimizations (immediate refresh cycle behavior) to get the massive lag decrease necessary to allow an emulator to actually beat the original machine's slow 60Hz scanout.
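
The numbers above are easy to check with a few lines of Python. The function names are hypothetical, and this only models scanout time (first to last pixel), not any display processing lag.

```python
def scanout_ms(hz):
    """Time to transmit one refresh cycle from first to last pixel at this Hz."""
    return 1000.0 / hz


def scanout_saving_ms(original_hz, vrr_max_hz):
    """How much sooner the last pixel arrives at the VRR display's max Hz."""
    return scanout_ms(original_hz) - scanout_ms(vrr_max_hz)


def idle_to_scanout_ratio(frame_hz, vrr_max_hz):
    """How long the display idles in blanking, relative to its scanout time."""
    return vrr_max_hz / frame_hz - 1


if __name__ == "__main__":
    print(scanout_saving_ms(60, 240), "ms saved for 60fps on a 240Hz VRR display")
    print(idle_to_scanout_ratio(60, 240), "x as much idle time as scanout time")
```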

Now you're finally getting it (I hope).

3. If you understand blanking intervals and display scanout

Raster variable-refresh displays are a clever minor modification to scanout. It's just a variable count of blanking scanlines spacing apart the refresh cycles. The horizontal scanrate is unchanged, but the display is idling (the GPU is usually transmitting an endless loop of blanking scanlines above the top edge of the screen). This keeps the display's "beam" idle (paused) above the top edge of the screen.

If you lived through the 1980s or earlier, you remember the black bar on a rolling picture on old TVs. It's like a variable-height VHOLD black bar (as a variable temporal spacer between refresh cycles):

So for those Gen-X-ers and older who understand this (analog TVs that rolled until you adjusted VHOLD) .... VRR is just merely a variable-height VHOLD black bar -- as the method of temporal-spacing apart refresh cycles -- in order to have permanent framerate=Hz during varying frame rates.

The magic is if you do 70fps in your software -- the display is automatically 70Hz, because the display is refreshing whenever the GPU is finished with a frame!

Did You Know? This is not exclusive to digital. In fact, certain MultiSync CRTs are compatible with FreeSync (via AMD GPUs, with ToastyX CRU to force a tight 800x600 56Hz-75Hz FreeSync range, and then piped through an HDMI-to-VGA adaptor). Not many MultiSync CRTs tolerate raster VRR, since it's literally 100 refresh rate changes per second (for varying-frame-rate content). But FreeSync does refresh rate changes so amazingly "gently" (unchanged horizontal scanrate, otherwise unchanged refresh cycle timings, just a variable-line-count vertical blanking interval, and only in the Back Porch -- essentially the top overscan) that it sometimes does not trigger a mode-change blackout on certain models.

VRR works by dynamically adding/removing scanlines from the Back Porch in real time (the GPU of most VRR technologies just transmits an endless loop of dummy blanking-interval scanlines -- a dynamic, realtime, variable-count Vertical Back Porch). If you have, for example, a 67.5 KHz horizontal scan rate (67500 pixel rows per second including blanking), a new VRR refresh cycle can start anytime at a 1/67500sec granularity. The GPU is simply endless-looping blanking scanlines, until the software/drivers deliver a frame, and then the first scanline begins transmitting right away!
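
A quick Python sketch of that arithmetic, using the 67.5 KHz example above. The function names are hypothetical, and the 1125-line base total (a common 1080p 60Hz timing) is an assumption of this example rather than something stated in the post.

```python
def vrr_start_granularity_us(h_scan_rate_hz):
    """A new refresh cycle can start on any scanline boundary."""
    return 1_000_000 / h_scan_rate_hz


def extra_blanking_lines(h_scan_rate_hz, frame_hz, base_total_lines):
    """Dummy blanking scanlines inserted to stretch a refresh cycle to frame_hz."""
    total_lines = round(h_scan_rate_hz / frame_hz)
    return max(0, total_lines - base_total_lines)


if __name__ == "__main__":
    print(round(vrr_start_granularity_us(67500), 2), "microseconds per scanline")
    print(extra_blanking_lines(67500, 50, 1125), "extra blanking lines at 50fps")
```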

This is how most VRR technologies work (FreeSync, "G-SYNC Compatible", VESA Adaptive-Sync, HDMI VRR, etc), since it's such a minor modification to existing displays: The magic of varying the intervals between refresh cycles.

4. Why does VRR reduce stutters?

While VRR is used for low-lag for emulators (it's an accidental benefit), I also wanted to touch upon why VRR reduces (or even eliminates) stutters for modern varying-framerate PC video gaming content.

VRR can also eliminate erratic stuttering caused by varying frame rates (see demonstration animation at www.testufo.com/vrr simulated via an interpolation-like technique in Javascript -- where 30fps can smoothly change to 60fps -- without stutters). This is VRR's raison d'etre; to reduce stuttering of varying-framerate content in modern gaming. But it's also the world's lowest latency "not VSYNC OFF" sync technology, which makes it fantastic for any content requiring a VSYNC ON style mode (framerate=Hz requirement).

Traditionally, you had stutters if you didn't have framerate=Hz in VSYNC ON (non-VRR):

However, you can have a stutter-free varying frame rate with VRR:

You can clearly see this behavior in frame-rate ramping animations such as www.testufo.com/vrr

As modern 3D games fluctuate in frame rate, they render to their dynamic timestamps (of when the frame renders). So when a game is properly designed, gametime "breathes" correctly with frame timestamps, and refresh timestamps, and thus photontime (time of photons hitting eyes). The goal is that Presentation-time : Refresh-time : Photon-time stays a constant time (in milliseconds) whenever you are within VRR range, as a result (...ideally...). So there's a deterministic exact time between frame presentation and photons hitting eyes -- when it comes to VRR. Ideally. (There could be a millisecond of jitter, but that's vastly better than 1/60sec jitter = 16.7ms stutter).

Some error margins can appear, such as varying rendering-times, and/or some flawed game time-handling, and/or duress (disk access, freezes), can still inject stutters into VRR. So it's not a 100% solve-all. But it certainly greatly reduces the stutter-transition of framerate changes -- allowing display refresh rates to organically 'breathe' with frame rate -- with seamless refresh-rate changes on a per-single-frame basis.

It's crazy neat literally having over 200 completely-invisible display mode changes per second on a 240Hz monitor. Perfectly seamless and blankoutless. Each frame is its own unique refresh cycle and unique refresh rate. Even when frame rate varies.

5. Metaphor: The Fixed-Hz Scheduled Bus, versus the VRR Immediate Taxi

It even helps fixed frame rates too! Fixed frame rates can have occasional frames "1ms too late". This will cause a stutter during VSYNC ON (16.7ms delay because you "missed the bus"). VSYNC ON is like a scheduled bus. But VRR is like an infinite line-up of taxis that will wait dynamic times for you. A VRR display is a taxi waiting to carry your frame to its refresh cycle. The taxi will wait for you.

For delivering full frames (e.g. VRR or non-VRR VSYNC ON)
-- Fixed Hz over cable is like a scheduled bus. Frames have to wait for the next refresh cycle bus.
-- Variable Hz is like a taxi line up. Frames can transmit immediately to the monitor.

Fixed Hz: You have the problem of missing VSYNC, which is like missing a scheduled bus. Frames are forced to wait an extra refresh cycle. A five-frame sequence could have time intervals of [16ms,16ms,16ms,33ms,16ms] -- a stutter, because one frame missed the metaphorical "VSYNC ON" bus. A stutter is an error margin of 1/60sec = 16.7ms.

VRR Hz: You never miss a VSYNC, since the display waits for the refresh cycle to be delivered. A five-frame sequence could have time intervals of [16ms,16ms,16ms,17ms,16ms] because the taxi waited an extra millisecond for the frame to board the (HDMI, DisplayPort) taxi from GPU to display. No waiting!

Boom, 1ms stutter vs 16ms stutter. Whoo hoo, VRR fixed your stutter, since you never miss a VSYNC anymore during VRR.
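The bus-versus-taxi difference is easy to simulate. Below is a minimal sketch, assuming a very simplified model: one frame per refresh cycle, zero driver overhead, and made-up frame-ready timestamps where the fourth frame arrives about 1ms later than its usual cadence. The function names are hypothetical.

```python
import math


def fixed_hz_display_times(ready_ms, refresh_hz=60.0):
    """VSYNC ON model: each frame waits for the next scheduled refresh (the "bus")."""
    period = 1000.0 / refresh_hz
    shown, next_free_slot = [], 0
    for t in ready_ms:
        # First refresh boundary at or after the frame is ready, one frame per refresh.
        slot = max(math.ceil(t / period - 1e-9), next_free_slot)
        shown.append(slot * period)
        next_free_slot = slot + 1
    return shown


def vrr_display_times(ready_ms, max_hz=240.0):
    """VRR model: the refresh starts when the frame is ready (the "taxi"),
    but never sooner than 1/max_hz after the previous refresh started."""
    min_gap = 1000.0 / max_hz
    shown, last = [], -math.inf
    for t in ready_ms:
        start = max(t, last + min_gap)
        shown.append(start)
        last = start
    return shown


def intervals(times):
    """Time between consecutive refresh starts, rounded to 0.1 ms."""
    return [round(b - a, 1) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    ready = [16.2, 32.8, 49.5, 67.2, 82.8]  # made-up data: frame 4 is ~1 ms late
    print("fixed Hz:", intervals(fixed_hz_display_times(ready)))
    print("VRR     :", intervals(vrr_display_times(ready)))
```

In this model the late frame costs a whole extra refresh cycle at fixed Hz, but only its own ~1ms of tardiness under VRR.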

A display often has a "VRR range", e.g. 48Hz-240Hz. That means you can allow time intervals between 1/240sec and 1/48sec between frame presentations, before you go out of spec and the GPU drivers take over (e.g. repeat-refreshing, aka LFC, aka Low Frame Rate Compensation). But as long as you're presenting frames within the intervals of the VRR range, the video cable is transmitting immediately without waiting, and your monitor is refreshing immediately as soon as the frame is complete. That's why fixed-framerate content on VRR is so deliciously lower latency than VSYNC ON. The higher your max Hz, the lower the lag of your low frame rates becomes, and you are perpetually framerate=Hz (e.g. "54fps" from MAME, "50fps" from PAL, "70fps" for DOS, etc).

As long as your frame rate is within the VRR range (e.g. 48fps-240fps for 48Hz-240Hz) -- the frame rate is the refresh rate, and the refresh rate is the frame rate -- there is no difference when VRR is enabled.
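Here is a rough sketch of the VRR-range logic described above. It is a simplified model under my own assumptions: real drivers' LFC heuristics are more involved and I am not reproducing any vendor's algorithm; the names and the 48Hz-240Hz defaults are just the example range.

```python
import math


def classify_frame_interval(interval_ms, min_hz=48.0, max_hz=240.0):
    """Say whether a frame-to-frame interval falls inside the VRR range."""
    shortest_ms = 1000.0 / max_hz
    longest_ms = 1000.0 / min_hz
    if interval_ms < shortest_ms:
        return "above-range"   # frame arrived too soon: fallback sync (VSYNC ON/OFF) applies
    if interval_ms > longest_ms:
        return "below-range"   # frame arrived too late: repeat-refresh (LFC) kicks in
    return "in-range"          # refresh rate simply equals frame rate


def lfc_repeat_count(fps, min_hz=48.0, max_hz=240.0):
    """Smallest number of refreshes per frame that lifts fps back into the VRR range."""
    repeats = max(1, math.ceil(min_hz / fps))
    if fps * repeats > max_hz:
        raise ValueError("no repeat count fits inside this VRR range")
    return repeats


if __name__ == "__main__":
    print(classify_frame_interval(1000.0 / 70))   # DOS-style 70fps content
    print(lfc_repeat_count(30))                   # 30fps shown as repeated refreshes
```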

The workflow behind the scenes is roughly:

1. [SOFTWARE CONTROL]
1a. Application software renders a frame via APIs (Vulkan, DirectX).
1b. The GPU renders it as the application software instructs it to (e.g. Vulkan, DirectX).
1c. Application software presents the frame. The graphics drivers take over from there.
2. [GPU JOB]
2a. Graphics drivers command the GPU to commit the presented frame buffer into the refresh cycle buffer workflow in accordance with the currently configured synchronization technology.
(e.g. "VSYNC ON" buffering/waiting for next refresh cycle, "VSYNC OFF" tearing by interrupting/splicing into current refresh cycle at the current video output beam position, or "VRR" immediacy, etc)
2b. GPU begins transmitting the refresh cycle out of the GPU's video output
3. [DISPLAY JOB]
3a. Display receives the refresh cycle from the video input
3b. Display handles the refresh cycle in accordance with its own algorithm (simplest: synchronous scanout like a CRT)

Some software is very bad at VRR (very jittery frame presentation), and VRR can't fix erratic, tardy arrivals of frames. So games/software need to be properly tested for VRR. Fixed-framerate content like emulators needs to Present() precisely, using microsecond clocks or ultra-high-accuracy timers rather than jittery default timer events -- since VRR eliminates the refresh cycle schedule -- and any timing imperfections in certain emulators can create more visible jitter in VRR than in VSYNC ON. For example, jittery "60fps 60Hz" VRR [16ms,20ms,12ms,16ms,20ms] looks more jittery than perfect VSYNC ON [16ms,16ms,16ms,16ms,16ms], even if it's better than bad non-VRR VSYNC ON [16ms,33ms,16ms,16ms,33ms]. It's like you're doing 62.5Hz one frame, then 50Hz the next frame, then 83Hz the next frame. So "60fps 60Hz" could still jitter if the software is not framepacing. The display is just doing its job of pacing to the software's imperfections. So sometimes bad coding/libraries/timers/etc can cause VRR to stutter a fixed framerate more than a fixed-Hz display. Software-based frame pacing has to behave very accurately (like the accuracy of a fixed-Hz schedule), since in VRR your software is taking over the responsibility of timing those refresh cycles!
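One common way to get that kind of accurate frame pacing is a "sleep most of the way, then busy-wait" loop against an absolute schedule. The sketch below is only an illustration under my own assumptions: `FramePacer` and `intervals_to_hz` are hypothetical names, the 2ms spin margin is a guess, and a real emulator would call its own Present() where indicated.

```python
import time


class FramePacer:
    """Hypothetical helper: paces Present() calls on an absolute schedule."""

    def __init__(self, fps, clock_ns=time.perf_counter_ns, spin_margin_ns=2_000_000):
        self.period_ns = round(1_000_000_000 / fps)
        self.clock_ns = clock_ns
        self.spin_margin_ns = spin_margin_ns  # assumed: how early to stop trusting sleep()
        self.deadline_ns = clock_ns() + self.period_ns

    def wait_for_next_frame(self):
        """Block until the next frame deadline; returns the deadline that was hit."""
        remaining = self.deadline_ns - self.clock_ns()
        if remaining > self.spin_margin_ns:
            # Coarse sleep for most of the wait (OS sleeps can overshoot) ...
            time.sleep((remaining - self.spin_margin_ns) / 1e9)
        while self.clock_ns() < self.deadline_ns:
            pass  # ... then busy-wait the final stretch for precision.
        hit = self.deadline_ns
        # Step from the previous deadline, not from "now", so errors never accumulate.
        self.deadline_ns += self.period_ns
        return hit


def intervals_to_hz(intervals_ms):
    """Instantaneous refresh rate a VRR display runs at for each frame interval."""
    return [round(1000.0 / ms, 1) for ms in intervals_ms]


if __name__ == "__main__":
    pacer = FramePacer(fps=60)
    for _ in range(3):
        pacer.wait_for_next_frame()
        # present_frame()  # the emulator's own Present() call would go here
    print(intervals_to_hz([16, 20, 12]))
```

The key design choice is the absolute schedule: each deadline is derived from the previous deadline, so one late frame does not push every later frame late as well.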

By explaining VRR from multiple angles to people with some familiarity with refresh rates (like you, who understand 60Hz vs 70Hz), hopefully one part of my post gives you a "Eureka" or "Aha" moment.

- Did my Present()-to-Photons knowledge help you understand VRR better?
- VRR is a clever minor modification to old-fashioned fixed-Hz, isn't it?

Being the Present()-to-Photons expert here, I can help demystify what happens between (A) after software completes a frame, and (B) photons hitting eyes.
Feel free to ask me additional questions about this black box. It's Blur Busters' job to de-mystify this black box.

____________

Relevance to thread: Indeed, VRR is usually a better lag-reducer for non-beamraced software like most DOS games.

Founder of www.blurbusters.com and www.testufo.com
- Research Portal
- Beam Racing Modern GPUs
- Lagless VSYNC for Emulators

Reply 5 of 20, by Rincewind42

Posted on 2023-07-05, 12:12

Rank: Member
Posts: 137
Joined: 2022-03-29, 01:42

mdrejhon wrote on 2023-07-05, 02:44:

By explaining VRR from multiple angles to people with some familiarity with refresh rates (like you, who understand 60Hz vs 70Hz), hopefully one part of my post gives you a "Eureka" or "Aha" moment.

- Did my Present()-to-Photons knowledge help you understand VRR better?
- VRR is a clever minor modification to old-fashioned fixed-Hz, isn't it?

Thanks for the detailed explanation, I've learned a lot from it. It all makes sense now, and it never occurred to me that on a say 240Hz display the latency would be effectively reduced to 1/4th of that of a 60Hz display... But yeah, it makes perfect sense; it couldn't work any other way.

mdrejhon wrote on 2023-07-05, 02:44:

Relevance to thread: Indeed, VRR is usually a better lag-reducer for non-beamraced software like most DOS games.

That was my conclusion too while reading your writeup. I also think now I understand the reasoning behind the "surge emulation" approach better, and I'm quite sure we're getting something like that in DOSBox "for free", so to speak because of the 1ms emulation "tick" rate, then a new frame gets presented whenever it's ready, at 1ms granularity.

Fascinating stuff!

Reply 6 of 20, by Oetker

Posted on 2023-07-05, 15:52

Rank: Oldbie
Posts: 672
Joined: 2019-10-27, 09:35
Location: Netherlands

mdrejhon wrote on 2023-07-05, 02:44:

Thanks for your detailed explanation. Two questions:
1. How did the original g-sync work, did it use a buffer in the monitor or something, as monitors were expensive...
2. What is the interaction between setting v-sync on/off in games and a vrr monitor, I've read conflicting opinions on what to pick (or if it does anything).

Reply 7 of 20, by mdrejhon

Posted on 2023-07-05, 19:17

Rank: Newbie
Posts: 32
Joined: 2023-07-01, 10:18

Rincewind42 wrote on 2023-07-05, 12:12:
Thanks for the detailed explanation, I've learned a lot from it. It all makes sense now, and it never occurred to me that on a say 240Hz display the latency would be effectively reduced to 1/4th of that of a 60Hz display... But yeah, it makes perfect sense; it couldn't work any other way.

Thank you!

It's a compliment when I succeed in explaining VRR in a way nobody else does.

Even many computer-literate people and software developers still get tripped up on what VRR /exactly/ does behind the scenes! People familiar with old-fashioned scanout don't "get it" when VRR is explained by people who aren't familiar with old-fashioned scanouts. The explanation has gaps -- and it's my actual literal job to de-mystify this black box of Present()-to-Photons. This even puts food on the table; I'm currently contracted by a 240Hz OLED manufacturer for testing.

Rincewind42 wrote on 2023-07-05, 12:12:
mdrejhon wrote on 2023-07-05, 02:44:

Relevance to thread: Indeed, VRR is usually a better lag-reducer for non-beamraced software like most DOS games.

That was my conclusion too while reading your writeup. I also think now I understand the reasoning behind the "surge emulation" approach better, and I'm quite sure we're getting something like that in DOSBox "for free", so to speak because of the 1ms emulation "tick" rate, then a new frame gets presented whenever it's ready, at 1ms granularity.

I'm not sure how DOSBox does it -- there are many workflows for surge-execution.

However...

The way surge-execution works is that an entire refresh cycle of ticks is run at full native machine performance. So multiple consecutive ticks are consecutively surge-executed, until a whole emulated refresh cycle is generated. If the machine uses 1ms ticks and the machine's refresh rate is 60Hz, it will generally run 16.7ms worth of ticks at ultra-fast speeds. Even the Real Time Clock (real world) temporarily surge-ticks along with the surge-execute, so the emulator doesn't even know it's being surge executed in a "time accelerated" manner.

So 16 or 17 ticks could execute all at once in a time-accelerated manner, in order to create the emulated refresh cycle (frame) as quickly as possible, for display on the real refresh cycle. (Minimize time between input read and real refresh cycles).

In theory, with an infinitely-fast computer, an entire 16.7ms of emulation (for 60Hz) is executed in 0ms, including all input reads. That means zero latency between input reads (0ms) and the emulated refresh cycle.

For DOSBox, if it does this specific "ideal" surge-execution technique, a single 1/70sec amount of emulation (all ticks within included) would run at warp speed to generate the emulation frame as quickly as possible.

Careful optimization needs to be done to decouple the real real-time-clock from the emulated real-time-clock, so that the emulated real-time-clock correctly ticks fast (in surges) during surge-execute and clock glitches do not appear.
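The surge-execute idea, with an emulated clock that is decoupled from the host clock, can be sketched roughly as below. This is a toy model under my own assumptions, not code from DOSBox, WinUAE, or any other emulator; `EmulatedClock`, `surge_execute_frame`, `run_frames`, `run_tick`, and `present` are all hypothetical names.

```python
import time


class EmulatedClock:
    """Emulated time advances only when the emulator ticks, never from the host clock."""

    def __init__(self):
        self.now_ms = 0.0

    def advance(self, ms):
        self.now_ms += ms


def surge_execute_frame(run_tick, clock, frame_end_ms, tick_ms=1.0):
    """Run every tick belonging to one emulated refresh cycle back-to-back at host speed."""
    ticks = 0
    while clock.now_ms + tick_ms <= frame_end_ms + 1e-9:
        run_tick(clock.now_ms)   # emulated code (and its RTC) sees only emulated time
        clock.advance(tick_ms)
        ticks += 1
    return ticks


def run_frames(frames, refresh_hz=60.0, run_tick=lambda now_ms: None, present=lambda: None):
    """Surge-execute each emulated frame, present it, then idle until real time catches up."""
    clock = EmulatedClock()
    frame_ms = 1000.0 / refresh_hz
    host_start = time.perf_counter()
    ticks_per_frame = []
    for n in range(1, frames + 1):
        ticks_per_frame.append(surge_execute_frame(run_tick, clock, n * frame_ms))
        present()
        # Idle in real time. A real implementation would delay the surge as close
        # to the real refresh cycle as its performance margin allows.
        while (time.perf_counter() - host_start) * 1000.0 < n * frame_ms:
            time.sleep(0.0005)
    return ticks_per_frame


if __name__ == "__main__":
    print(run_frames(3))  # number of 1 ms ticks surge-executed per 60Hz frame
```

Note how a 60Hz frame ends up with either 16 or 17 one-millisecond ticks, matching the description above, while the emulated clock never glitches because it only ever moves in whole ticks.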

Also, WinUAE does this (kudos to Toni, who understood that the two can be somewhat combined): the surge-execute method can be combined with beam racing, if you have enough refresh rate. At 240Hz, WinUAE has the capability to surge-execute 4x faster while using lagless vsync (240:60 Hz ratio), by scanning one 1/60sec refresh cycle in 1/240sec thanks to the fast real-world raster. This is even lower latency than using only the surge-execute method to fully construct an emulated refresh cycle, since with surge-execute alone you're still stuck with 1/240sec of latency scanning out the last scanline with no further input reads (no mid-raster input reads). This will be even lower than original-machine latency (on average).

Remember, the surge-execute method is a great way to reduce latency, but it has two disadvantages: (A) it requires sufficient performance, and (B) it will never have 1:1 symmetry with original latencies. Some games (pinball) can do beamraced mid-screen input reads, like reading input in a raster before setting pinball flippers, or adjusting the position of a paddle in a breakout-style game. Raster-accurate input reads will be only timing-accurate with the lagless vsync technique WinUAE uses. If original-machine latencies are desired regardless of when input was read, then the only way to replicate latency symmetry is by synchronizing the emulator raster with the real-world raster, at a 1/60sec scanout velocity. WinUAE achieves this if the display is configured to a 60Hz fixed-Hz refresh rate (instead of max Hz / VRR enabled), since a 1080p or 4K 60Hz refresh cycle scans out in the same time as an NTSC refresh cycle.

WinUAE is pretty highly configurable in how you reduce its latencies. WinUAE can skip lagless vsync and instead use the surge-execute technique, if you have a platform fast enough to surge-execute WinUAE refresh cycles. But it won't beat the latency of 60Hz-real-world "lagless vsync" unless you own a really high refresh rate (240-360Hz) monitor, and have a computer fast enough to surge-execute WinUAE by 4x-6x.

But in all cases, combining lagless vsync + surge execute at the same time in WinUAE (which it can do), is the best-of-all-worlds latencies. This is why enabling lagless vsync at 120Hz feels lower lag than lagless vsync at 60Hz -- it's surge-executing a beam-raced 60Hz refresh cycle in a 1/120sec fast-beamrace with every other 120Hz refresh cycle. And even lower if you own a 240Hz monitor (I recommend the new 240Hz OLEDs, by the way) and if WinUAE is capable of executing 4x faster than a real Amiga. Very compute intensive, but the lowest possible (original-Amiga-beating) latencies.
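The arithmetic behind "beam-race a 60Hz refresh cycle inside a faster real refresh cycle" can be sketched as follows. This is only my back-of-envelope model, not WinUAE's actual logic; `beamrace_plan` is a hypothetical name and the model assumes the real refresh rate is an exact integer multiple of the emulated one.

```python
def beamrace_plan(emulated_hz=60.0, real_hz=240.0):
    """Back-of-envelope numbers for beam racing one emulated frame per N real refreshes."""
    ratio = real_hz / emulated_hz
    if ratio < 1 or abs(ratio - round(ratio)) > 1e-6:
        raise ValueError("this simple model assumes real_hz is an integer multiple of emulated_hz")
    return {
        # Real-world time to scan out the beam-raced emulated frame.
        "scanout_ms": 1000.0 / real_hz,
        # How much faster than the original machine the emulator must run during the race.
        "required_speedup": ratio,
        # Only one real refresh cycle in every N carries a new beam-raced frame.
        "race_every_nth_refresh": round(ratio),
    }


if __name__ == "__main__":
    for hz in (60.0, 120.0, 240.0):
        print(hz, beamrace_plan(real_hz=hz))
```

The compute cost scales with the ratio, which is why the 240Hz case needs an emulator that can run about 4x faster than the original machine.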

Lagless vsync (at low frameslice counts, 2-4) can be less compute intensive than the surge-execute technique, if you can achieve even the very approximate timing margins required. So anyone considering reducing the lag of beamraced graphics chips (e.g. 8-bit platforms) on Arduino/microcontroller platforms should seriously consider the frameslice beam racing technique to (coarsely) synchronize the emulator raster to the real raster.

Hats off to emulators that support either in a configurable way. Standing ovation to the smart emulator developer if the emulator supports both concurrently (WinUAE).

Reply 8 of 20, by mdrejhon

Posted on 2023-07-05, 19:35

Rank: Newbie
Posts: 32
Joined: 2023-07-01, 10:18

Oetker wrote on 2023-07-05, 15:52:

Thanks for your detailed explanation. Two questions:
1. How did the original g-sync work, did it use a buffer in the monitor or something, as monitors were expensive...

Native G-SYNC historically worked mostly the same way -- variable blanking interval.

The difference is in the low frame rate handling -- in native G-SYNC the repeat-refresh cycle frame is buffered in the monitor, and in generic VRR, the repeat-refresh cycle frame is buffered by the graphics driver.

The memory on a G-SYNC native chip (128 megabytes in the earliest 1080p 144Hz G-SYNC models) is likely used for the massive-sized overdrive lookup tables.

- Without VRR, LCD overdrive is a simple A(B)=C algorithm where A is the original color (per color channel/subpixel), B is the destination color, and C is the overdriven color value used to accelerate the transition from A to B. For example, you can use intensity 220 to accelerate a pixel transition from 50 to 200 without overshoot, but this is very LCD-panel-specific. That's a 256x256 = 64KB lookup table. Most manufacturers have optimized this to a smaller 17x17 overdrive lookup table (a Google Scholar search turns up great primers on why overdrive lookup tables are used for modern LCD overdrive). Also, there is a setting called "Overdrive Gain" which simply applies a multiplier to the C value, scaling the delta between B and C in A(B)=C. So if you use a stronger overdrive setting, this will increase the difference between B and C.

- With VRR, LCD overdrive becomes horrendously complex 3D or 4D lookup tables. You need to add frametime, since LCD GtG behaves differently at different refresh rates, and a variable refresh rate display has infinite number of refresh rates. Sometimes algebra is used (e.g. regression formulas). Since a gaming monitor has to process billion(s) of overdrive calculations per second, the large memory in a G-SYNC native monitor is probably for the large performance-optimizing dynamic-VRR-compatible overdrive lookup tables. I have not fully confirmed this with NVIDIA, but I can assure you that eliminating overdrive artifacts on G-SYNC is very tough, but NVIDIA pulled off a tour de force. With ULMB2, the overdrive calculations become even more complex (timing the GtG to be perfect at the moment of strobe flash, AND also gamma-correcting the predicted overdrive artifact) -- I explain the gamma-corrected overdrive algorithm at https://www.blurbusters.com/ulmb2#overdrive
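To make the fixed-Hz A(B)=C lookup concrete, here is a minimal sketch of a sparse 17x17 overdrive table with bilinear interpolation between grid points. Everything in it is an assumption for illustration: real tables are measured per panel, the linear "gain" formula in `build_demo_lut` is made up, and I don't know how any particular scaler interpolates.

```python
def build_demo_lut(gain=0.2, size=17):
    """Made-up LUT: overdriven = target + gain * (target - start), clamped to 0..255.
    Real overdrive tables come from panel measurements, not from a formula like this."""
    step = 255 / (size - 1)
    return [
        [min(255.0, max(0.0, b * step + gain * (b * step - a * step))) for b in range(size)]
        for a in range(size)
    ]


def overdrive(lut, start, target):
    """A(B)=C lookup: bilinear interpolation of the sparse table for 8-bit start/target."""
    size = len(lut)
    step = 255 / (size - 1)
    fa, fb = start / step, target / step
    a0, b0 = min(int(fa), size - 2), min(int(fb), size - 2)
    ta, tb = fa - a0, fb - b0
    low = lut[a0][b0] * (1 - tb) + lut[a0][b0 + 1] * tb
    high = lut[a0 + 1][b0] * (1 - tb) + lut[a0 + 1][b0 + 1] * tb
    return round(low * (1 - ta) + high * ta)


if __name__ == "__main__":
    lut = build_demo_lut()
    print(len(lut) * len(lut[0]), "entries instead of", 256 * 256)
    print(overdrive(lut, 50, 200))   # rising transition: drive above the target
    print(overdrive(lut, 200, 50))   # falling transition: drive below the target
```

A VRR-aware version would need at least one more table dimension (frametime), which is where the 3D/4D complexity described above comes from.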

However, native G-SYNC and generic VRR both generally use the variable-size blanking interval technique -- that is a commonality. NVIDIA drivers do have to poll a G-SYNC native monitor to check if the monitor's repeat-refresh logic is executing (and hold back the new refresh cycle), but generic VRR just uses the signal side to handle repeat-refreshing during low frame rates. Buggy drivers and buggy VRR implementations can sometimes cause the monitor to go black if the GPU doesn't send a new refresh cycle before accidentally going below spec (e.g. 48Hz, because it took more than 1/48sec to deliver a new frame). In the best case, displays need to keep displaying something until extremely low Hz (e.g. 30-38Hz), to allow for computer performance to give the drivers time to send a new refresh cycle "on time". Generic VRR sometimes works fine on LCDs not designed for VRR -- that's why some early DVI monitors were working with FreeSync unofficially. LCD panels were sometimes designed to seamlessly switch to a lower Hz for power management (laptops, etc), so they were "accidentally" compatible with VRR, despite not being designed for VRR. Generic VRR includes all the other VRR standards, which work pretty much the same way (VESA Adaptive Sync, FreeSync, HDMI VRR); the differences are mostly in plug-and-play discovery (e.g. how to tell the computer the VRR range of the monitor). A big disadvantage of generic VRR is crappy overdrive (overdrive artifacts that appear/disappear at different frame rates).

Also, framepacing on G-SYNC native is usually more microsecond-accurate than framepacing on generic VRR, due to some of the extra optimizing that NVIDIA did. So usually fewer stutters appear on G-SYNC native screens than generic VRR screens.

Motion quality (for picky people) is why it was usually very much worth it to pay the NVIDIA G-SYNC premium. Fewer stutters, fewer overdrive artifacts. Or at least make sure it's NVIDIA certified FreeSync ("G-SYNC Compatible"), which usually means a better-than-average generic VRR (called FreeSync by AMD). VESA Adaptive Sync and AMD FreeSync are cross-compatible with each other (on the same cable with most of the same fields/parameters of the plug and play DisplayID format). HDMI VRR is just a HDMI-ized version of FreeSync, with HDMI Forum creating some additional specifications for improved HDMI compatibility.

But you can generically adapt VRR to any signal format (RGB, VGA, DVI) since VRR is a surprisingly minor modification of a 100-year-old scanout methodology. Monitors can certify generic VRR with both AMD and NVIDIA, so you can get both "FreeSync" and "G-SYNC Compatible" logos, even if it's not G-SYNC Ultimate / G-SYNC Native. Many G-SYNC Native monitors now finally support FreeSync signals, so you can use an AMD card with a newer NVIDIA G-SYNC native monitor (unlike before)! Few CRTs successfully sync to a varying-Hz signal, but if done, "blind VRR output" (forcing it without plug-and-play) only works on an AMD card, and only on HDMI output (but you can adapt the HDMI to anything, as long as you use a passive or unbuffered 1:1 adaptor, to keep pixel delivery (scanout) timing-accurate). It requires ToastyX CRU to force a FreeSync range in the absence of a FreeSync EDID, and using tight VRR ranges (e.g. 56Hz-75Hz), piped through an HDMI-to-VGA adaptor. Large ranges and fast framerate fluctuations can be problematic for CRTs (a sudden change from 56fps to 75fps may cause a mode-change blackout, but a smooth gradual framerate change won't on certain CRT models). And it depends on how forgiving the multisync circuitry is in the CRT.

Some interesting data on FreeSync hacking:
https://forums.blurbusters.com/viewtopic.php?t=3234
https://forums.anandtech.com/threads/freesync … .2488696/page-2

Hope that covers the VRR rabbit hole. It's amazing that all of them have so many commonalities -- sufficient and generic enough to spill beyond the DisplayPort/HDMI silo (in some cases).

Oetker wrote on 2023-07-05, 15:52:
2. What is the interaction between setting v-sync on/off in games and a vrr monitor, I've read conflicting opinions on what to pick (or if it does anything).

It's a fallback sync technology for frame rates outside VRR range.

When frame rates try to exceed the VRR range...
...It can use VSYNC ON handling (sudden double buffering lag appears if framerates hit max Hz), or
...It can use VSYNC OFF handling (tearing appears if framerates exceed max Hz).

That's why it's recommended to cap the maximum frame rate a few fps below max Hz. The margin is needed because frame rates fluctuate: at 144fps, one frame may be 1/140sec and another frame may be 1/150sec. One of those frames will get the fallback sync technology treatment (lag or tearing). So the capping margin depends on the accuracy of your frame rate capping and the driver's accuracy of timing frame delivery to the monitor. At higher Hz, you have less time, so you may need to use bigger capping margins such as a 340fps cap at 360Hz, to prevent the fallback sync technology from activating.
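A rough way to reason about the capping margin is sketched below. It assumes a very simple model of my own: the frame-rate limiter delivers frames at the capped frametime plus or minus some jitter, and a frame triggers the fallback sync if it arrives sooner than one refresh period at max Hz. The 0.16ms jitter figure is made up for the example; both function names are hypothetical.

```python
def safe_fps_cap(max_hz, frametime_jitter_ms):
    """Highest cap whose fastest expected frame still takes at least 1/max_hz."""
    min_allowed_frametime_ms = 1000.0 / max_hz
    return 1000.0 / (min_allowed_frametime_ms + frametime_jitter_ms)


def frames_hitting_fallback(frametimes_ms, max_hz):
    """Frametimes shorter than one refresh period at max Hz (these get VSYNC ON/OFF treatment)."""
    return [ft for ft in frametimes_ms if ft < 1000.0 / max_hz]


if __name__ == "__main__":
    print(round(safe_fps_cap(360, 0.16), 1))      # cap for a 360Hz display
    print(round(safe_fps_cap(144, 0.16), 1))      # cap for a 144Hz display
    # A "144fps" limiter that actually delivers 1/140sec and 1/150sec frames:
    print(frames_hitting_fallback([1000 / 140, 1000 / 150], 144))
```

With the same absolute jitter, the required margin in fps grows with max Hz, which fits the advice that higher refresh rates need bigger capping margins.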

In an ideal world, if frame rates never exceed VRR range, the fallback sync technology does not matter as it never activates. However, frame rate caps are imperfect averagers. Therefore, you will have occasional too-fast frames -- before the last frame is finished scanning-out onto the cable (and thus, panel). So you still want to configure a fallback sync technology to your user preference. I prefer VSYNC ON, if it's only activated on one frame every 10 seconds -- as I am more sensitive to tearing than lag in casual gaming.

Most people in competitive gaming (esports) use VSYNC OFF, due to the latency-change problems of entering/exiting VRR range. The current esports advice is: if you want to enable VRR in esports (competitive), purchase more VRR range than you'll ever need, so your frame rates completely breathe inside the VRR range even at 300fps. Then you don't have to deal with the capping [BLEEP] or the latency complaints (of hitting max Hz). 360Hz+ monitors have a ginormous VRR range of 48Hz-360Hz, enough to let many esports games breathe framerate completely inside VRR range, to prevent the lagfeel-change of framerates entering/exiting VRR range.

On the related topic of the old myth of limiting your Hz to your framerate needs (sigh, self-sabotage), here is a PSA:

TIP: Try to purchase more refresh rate than you think you need, if there's no tradeoff for your situation

High Hz used to cost lots. But it's an increasingly free feature (at 90-120Hz) in some phones, tablets, consoles, and sometimes your existing TV. It's mainstreaming slowly, much like our progression from NTSC to 4K, and even 1000Hz display refresh rates may be mainstream later this century. Who knows? Blur Busters is the figurative Hz equivalent of 1980s Japan MUSE HD researchers, anyway. 4K used to be over $10,000 in 2001 (IBM T221), now it's a $299 Walmart special.

- You avoid the "VSYNC ON" or "VSYNC OFF" problem in VRR because higher Hz makes it easier to keep framerates inside VRR range.
- Even "60fps" lag from emulators is lower when you have a higher-Hz VRR display.
- Higher Hz has improved and more flawless software black frame insertion. 240Hz can reduce 60Hz motion blur by 75% via 3:1 black:visible ratio. RetroArch supports a command line option for improved BFI at higher Hz. OLEDs are immune to image retention from software BFI.
- 1000Hz is no longer just for esports, if it's free. Web browser scrolling is 16x clearer than 60Hz, and 8x clearer than 120Hz.
- Upgrade geometrically for human visible benefit (60 -> 144 -> 360 -> 1000). 240-vs-360 LCD is a tiny 1.5x difference throttled to 1.1x due to slow GtG and jitter
- You avoid the G-SYNC disadvantages of frame rates trying to exceed Hz, if you keep max Hz above your max framerate.
- You have less stroboscopic effects, and you can rely on less GPU blur effect as an accessibility feature (people with stroboscopic-effect eyestrain) to bridge briefer frametimes, see www.blurbusters.com/stroboscopics
- I hate refresh rate incrementalism. 240-vs-360 is worthless to me. 2x-4x geometrics is the cat's beans!
- More ergonomic motion-sickness-reducing options become available, whether you prefer lower frame rates or higher frame rates, since the extra Hz provides additional options (like extra motion blur reduction via software BFI, if you get motionblur headaches).

Even grandma can tell 240Hz-vs-1000Hz better than 144Hz-vs-240Hz. It's all about geometrics. 1ms of frametime (pixel visibility time) translates to 1 pixel of motion blur per 1000 pixels/sec, if you aren't flickering like a CRT.

If you're not flickering (CRT, BFI, ULMB, strobing) to kill display motion blur -- then on flickerless sample and hold (but still 0ms GtG), the 240Hz vs 1000Hz motion blur (during eye tracking) is the same blur difference as a camera photo of 1/240sec shutter versus 1/1000sec shutter. Displays behave differently if your eyes are moving vs stationary, as seen at www.testufo.com/eyetracking (that one is capped to ~80fps for demo's sake). This is an artifact of a finite frame rate and refresh rate on a sample and hold display. Now, if motion speeds are fast enough (screen motion or photographed motion), you WILL see the difference between 240Hz and 1000Hz clearly -- over 90% of population can in lab tests. This is an ergonomic flicker-free method of display motion blur reduction to get closer to real life's infinite frame rate.
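The "1ms of persistence = 1 pixel of blur per 1000 pixels/sec" rule is simple enough to write down. The sketch below assumes an ideal sample-and-hold display with GtG=0 and framerate=Hz; the function names are mine.

```python
def persistence_ms(refresh_hz, visible_fraction=1.0):
    """How long each frame stays visible (MPRT-style persistence) at framerate=Hz.
    visible_fraction < 1.0 models strobing/BFI; 1.0 is plain sample-and-hold."""
    return 1000.0 / refresh_hz * visible_fraction


def motion_blur_px(speed_px_per_sec, persistence):
    """Eye-tracking motion blur width in pixels for a given persistence in ms (GtG=0 assumed)."""
    return speed_px_per_sec * persistence / 1000.0


if __name__ == "__main__":
    for hz in (60, 120, 240, 1000):
        blur = motion_blur_px(1000, persistence_ms(hz))
        print(f"{hz:>4}Hz: {blur:.2f} px of blur at 1000 px/sec")
```

This is also why geometric Hz upgrades matter: going 240Hz to 360Hz only shrinks the blur trail from about 4.2 to about 2.8 pixels at this speed, while 240Hz to 1000Hz shrinks it to 1 pixel.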

display-persistence-blur-equivalence-to-camera-shutter.png

This assumes GtG=0 (no extra GtG blur on top of MPRT blur) -- don't forget GtG pixel response (pixel change) and MPRT pixel response (pixel static visibility time -- sample and hold) -- GtG versus MPRT: The Two Pixel Response Standards.

LCD 240-vs-360Hz is only 1.5x (throttled to 1.1x due to slow LCD GtG). But thanks to very fast GtG, the new 240Hz OLEDs generate motion slightly more clearly than a non-strobed 360Hz LCD (at framerate=Hz). This is because you've nearly zeroed out GtG pixel response, so GtG is no longer a blur component, even if MPRT (persistence) blur continues to exist. So 120-vs-240 is much more visible to the average consumer on an OLED than on an LCD. So most consumers and casual gamers should go VHS-vs-8K temporally (via 4x geometric upgrades) rather than mere 720p-vs-1080p (like small Hz and framerate upgrades).

That's why RetroArch software BFI (via the special option) performs better on my 240Hz OLED (pulsing 1/4 as long) than on my 120Hz OLED.
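As an illustration of why more Hz gives software BFI shorter pulses, here is a minimal sketch of a black-frame-insertion schedule. It is not RetroArch's implementation (I am not reproducing its option names or code); `bfi_pattern` and `blur_reduction_percent` are hypothetical helpers, and the model assumes the display Hz is an integer multiple of the content frame rate.

```python
def bfi_pattern(display_hz, content_fps, visible_refreshes=1):
    """Per-refresh schedule for one content frame: True = show frame, False = show black."""
    if display_hz % content_fps != 0:
        raise ValueError("this simple model needs display_hz to be a multiple of content_fps")
    refreshes_per_frame = display_hz // content_fps
    if not 1 <= visible_refreshes <= refreshes_per_frame:
        raise ValueError("visible_refreshes must fit inside one content frame")
    return [True] * visible_refreshes + [False] * (refreshes_per_frame - visible_refreshes)


def blur_reduction_percent(pattern):
    """Motion blur reduction versus showing the frame for the whole content frametime."""
    return 100.0 * pattern.count(False) / len(pattern)


if __name__ == "__main__":
    for hz in (120, 240):
        pattern = bfi_pattern(hz, 60)
        print(hz, pattern, f"{blur_reduction_percent(pattern):.0f}% less blur")
```

At 240Hz the pattern has three black refreshes per visible one, which is the 3:1 black:visible ratio and 75% blur reduction mentioned earlier, at the cost of brightness.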

A great TestUFO software demo of variable-persistence BFI: www.testufo.com/blackframes#count=4&bonusufo=1 .... but try to view this on a high-Hz monitor.

Now that 240Hz OLEDs have arrived, there are no picture-quality degradation disadvantages of high Hz like it was for early TN monitors. Don't hold back the Hz, if you care about motion quality, or if you care about lag (even for low frame rates).

P.S. 1000fps UE5 quality is now possible on an RTX 4090 via 10:1 reprojection (warping). I just put up a new article about lagless and artifactless frame generation: www.blurbusters.com/framegen -- including a small infographic about a lagless frame generation algorithm that avoids interpolation lag disadvantages -- so we can use reprojection as a display motion blur reduction technology .... Someday, brute-framerate motion blur reduction will replace flicker-based motion blur reduction (CRT technique) for modern gaming. But I prefer my retro content to stay at original unmolested frame rates, and use flicker-based motion blur reduction (CRT technique) to fix the motion blur of my retro content.

(Aside: Motion blur is good if you want it. I love 24fps Hollywood Filmmaker Mode, but understanding the mainstream human visible worthwhileness of 1000fps 1000Hz for interactive content is quite damn useful, since I would love VR headsets to stop blasting flicker into my eyes -- via brute framerate-based motion blur reduction techniques instead -- but flicker is currently the lesser evil due to motion blur headaches in VR. So there are definitely use cases of "infinite-ish" refresh rates to simulate real life infinite frame rates, to prevent a difference between real life and virtual reality -- real life doesn't strobe to fix extra display motion blur forced upon your eyes. There are use cases where you DO NOT want motion blur at all, and sometimes flicker is the lesser evil versus motion blur. But I'd like to have my cake and eat it too -- blurless and strobeless -- achievable by brute frame rates such as 1000fps 1000Hz)

This is a slight sidetrack, but it's part of the Blur Busters mission to mythbust old Hz assumptions with real scientific research, and to stop people laughing.

Reply 9 of 20, by Rincewind42

Posted on 2023-07-06, 00:04

Rank: Member
Posts: 137
Joined: 2022-03-29, 01:42

mdrejhon wrote on 2023-07-05, 19:17:
The way surge-execution works is that an entire refresh cycle of ticks is run at full native machine performance. So multiple consecutive ticks are consecutively surge-executed, until a whole emulated refresh cycle is generated. If the machine uses 1ms ticks and the machine's refresh rate is 60Hz, it will generally run 16.7ms worth of ticks at ultra-fast speeds. Even the Real Time Clock (real world) temporarily surge-ticks along with the surge-execute, so the emulator doesn't even know it's being surge executed in a "time accelerated" manner.

So 16 or 17 ticks could execute all at once in a time-accelerated manner, in order to create the emulated refresh cycle (frame) as quickly as possible, for display on the real refresh cycle. (Minimize time between input read and real refresh cycles).

In theory, with an infinitely-fast computer, an entire 16.7ms of emulation (for 60Hz) is executed in 0ms, including all input reads. That means zero latency between input reads (0ms) and the emulated refresh cycle.

For DOSBox, if it does this specific "ideal" surge-execution technique, a single 1/70sec amount of emulation (all ticks within included) would run at warp speed to generate the emulation frame as quickly as possible.

Yeah, so we do the "time accelerated" thing on a per-tick basis only, so in 1ms increments. Then I'm a bit unsure again how I'm getting low latencies in Pinball/Arkanoid games with fixed 70Hz + vsync... but clearly it works *somehow*. Admittedly, I'm more of an audio guy and haven't delved much into the rendering/host integration code yet. I'm actually wondering how this "surge execute 16.7ms worth of emulated computer time" approach would work with audio. We need to feed those audio buffers constantly in a hard-real-time fashion to avoid dropouts and crackles. A 512-sample buffer at 48kHz is doable, although not exactly great as it results in a 10.67 ms latency at the minimum already. 256 samples would be a lot better; the magic starts to happen at or below 5ms, where you perceive it as "no latency".
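For reference, the buffer-size arithmetic above works out as in this small sketch. It only counts one buffer's worth of samples (real audio stacks add driver and hardware latency on top), and the helper names are hypothetical.

```python
def buffer_latency_ms(samples, sample_rate_hz=48_000):
    """Minimum latency contributed by one audio buffer of the given size."""
    return 1000.0 * samples / sample_rate_hz


def largest_power_of_two_buffer(budget_ms, sample_rate_hz=48_000):
    """Biggest power-of-two buffer size (in samples) that fits inside a latency budget."""
    limit = int(budget_ms * sample_rate_hz / 1000.0)
    if limit < 1:
        raise ValueError("latency budget is shorter than one sample")
    size = 1
    while size * 2 <= limit:
        size *= 2
    return size


if __name__ == "__main__":
    for samples in (512, 256, 128, 64):
        print(f"{samples:>3} samples -> {buffer_latency_ms(samples):.2f} ms")
    print(largest_power_of_two_buffer(5.0), "samples fit a 5 ms budget at 48 kHz")
```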

In a way, we're doing something similar to the "surge emulation" approach in audio land. Most programs update the registers of the sound synthesiser chips (e.g. OPL) at relatively slow rates (well above 1ms intervals), but some more advanced ones do so at higher than 1000Hz. So what we do is when we receive a register write to the sound chip, we render the audio up to that point in a buffer capable of holding 1ms worth of samples, then when the 1ms tick interval hits, we simply continue rendering to fill up the buffer using the current state of the chip, then send the results downstream. So this should work (at least in theory) with ultra low 1-3ms audio buffers (128 and 64 sample buffers are very much doable on pro audio cards).

I'm *super annoyed* when I hear *any* sort of audio crackling in an emulator. My C64 and Amiga *never* made a single audio glitch ever, and neither my GUS in my 486, so I don't accept anything less in an emulator. Keypress-to-sound latency should be in the very low single-digit millisecond range like on real hardware (I've spent many thousands of hours in trackers writing MOD & XM music so I expect the virtually instantaneous response times I took for granted in the 80s and 90s... imagine my utter shock when I started to use music software on Windows for the first time... next thing I had to do was save up money for a pro interface that had low-latency ASIO drivers...)

mdrejhon wrote on 2023-07-05, 19:17:

Also, WinUAE does this (kudos to Toni who understood that it can be somewhat combined): Surge-execute method can be combined with beam-racing though, if you have enough refresh rate.
...
But in all cases, combining lagless vsync + surge execute at the same time in WinUAE (which it can do), is the best-of-all-worlds latencies. This is why enabling lagless vsync at 120Hz feels lower lag than lagless vsync at 60Hz -- it's surge-executing a beam-raced 60Hz refresh cycle in a 1/120sec fast-beamrace with every other 120Hz refresh cycle. And even lower if you own a 240Hz monitor (I recommend the new 240hz OLEDs, by the way) and if WinUAE is capable of executing 4x faster than a real Amiga. Very compute intensive, but the lowest possible (original-Amiga-beating) latencies.
...
WinUAE is pretty highly configurable in how you reduce its latencies. WinUAE can skip lagless vsync and instead use the surge-execute technique, if you have a platform fast enough to surge-execute WinUAE refresh cycles. But it won't beat the latency of 60Hz-real-world "lagless vsync" unless you own a really high refresh rate (240-360Hz) monitor, and have a computer fast enough to surge-execute WinUAE by 4x-6x.
...
Standing ovation to the smart emulator developer if the emulator supports both concurrently (WinUAE).

I love WinUAE, and Toni deserves a medal, no doubt. That emulator is a piece of art.

Reply 10 of 20, by mdrejhon

Posted on 2023-07-06, 00:44

Rincewind42 wrote on 2023-07-06, 00:04:

mdrejhon wrote on 2023-07-05, 19:17:

The way surge-execution works is that an entire refresh cycle of ticks is run at full native machine performance. So multiple consecutive ticks are consecutively surge-executed, until a whole emulated refresh cycle is generated. If the machine uses 1ms ticks and the machine's refresh rate is 60Hz, it will generally run 16.7ms worth of ticks at ultra-fast speeds. Even the Real Time Clock (real world) temporarily surge-ticks along with the surge-execute, so the emulator doesn't even know it's being surge-executed in a "time accelerated" manner.

So 16 or 17 ticks could execute all at once in a time-accelerated manner, in order to create the emulated refresh cycle (frame) as quickly as possible, for display on the real refresh cycle. (Minimize time between input read and real refresh cycles).

In theory, with an infinitely-fast computer, an entire 16.7ms of emulation (for 60Hz) is executed in 0ms, including all input reads. That means zero latency between input reads (0ms) and the emulated refresh cycle.

For Dosbox, if it does this specific "ideal" surge-execution technique, a single 1/70sec amount of emulation (all ticks within included) would run at warp speed to generate the emulation frame as quickly as possible.

Yeah, so we do the "time accelerated" thing on a per-tick basis only, so in 1ms increments. Then I'm a bit unsure again how am I getting low latencies in Pinball/Arkanoid games with fixed 70Hz + vsync... but clearly it works *somehow*.

That can happen even without true full-frame input delay (via surge-executing full refresh cycles).

You said you're using VRR, right? VRR is typically more common on higher-Hz monitors (e.g. 165Hz). So most of the lag reduction is caused by VRR's fast scanout. Those "70Hz" refresh cycles are scanned out onto your screen at max-Hz speed (e.g. 1/165sec for 165Hz monitor). 1/165sec is ~8.2ms less than 1/70sec.

Even if you emulate real-time 1/70sec into a refresh cycle, the frame is still delivered promptly to the actual display -- simply due to VRR capable of delivering low-Hz frames at max-Hz velocity over the cable.

Rincewind42 wrote on 2023-07-06, 00:04:
Admittedly, I'm more of an audio guy and haven't delved much into the rendering/host integration code yet. I'm actually wondering how this "surge execute 16.7ms" worth of emulated computer time would work with audio. We need to feed those audio buffers constantly in a hard-real-time fashion to avoid dropouts and crackles. A 512-sample buffer at 48k is doable, although not exactly great as it results in a 10.67 ms latency at the minimum already. 256 samples would be a lot better; the magic starts to happen at or below 5ms to perceive it as "no latency".

Positive latency can happen, but sometimes it's negative latency too (surge-execute methods and certain modern esports equipment still creating less audio latency than original machine)

Depending on whether it's button-to-audio perceptuals or visual-to-audio perceptuals, sometimes it's a zero-basising on the wrong frame of reference. As Einstein says, "it's all relative".

...Confirming "tick" terminology: I'm assuming one tick = 1ms, not the old 18.2Hz timer interrupt. 1ms surge executes would only reduce latencies by 1ms, not by a full refresh cycle. Most of your lag reduction is from the "quick frame transport" effect of VRR explained earlier, since (1/70sec)-(1/165sec) = lag reduction of 8.2ms for completing a refresh delivery (over cable) and display scanout (onto panel) -- so most of your lag reduction is from the higher Hz (and lack of traditional VSYNC ON double buffering) even when surge-execute is turned off.

If it's visual-to-audio perceptuals, buffered audio doesn't feel lagged because the destination display is lagged by slow scanout. Surge-execute + slightly buffered audio simply still kinda plays out in sync with the delayed real-world scanout (because you didn't do lagless vsync, aka frame-slice beam racing). You did a surge-execute of an emulator scanout, but the real-world display will still likely be scanning it out slower than your surge-execute. This brings the buffered audio back in sync. Also, even keyboards have bounce behaviors (e.g. 10ms antibounce), and the original keyboards sometimes had lag from those things. You can simply use a 1000Hz gaming keyboard with fast key actuators, so you can undo some of the "button-to-audio" perceptuals, without sacrificing the "visual-to-audio" perceptuals (caused by the lagged scanning out of a surge-executed frame, in perfect sync with lagged buffered audio). So the earlier delivery of inputread (from faster key actuation) may also cause you to no longer feel an audio delay from surge-execute buffers.

But there is an additional technique -- make sure your audio buffer is capable of almost becoming empty, but make sure you do precise frame presentation (that jitters less than your audio buffer error margin).

Metaphorically, it's like a nearly-empty glass being suddenly refilled (during surge-execute). The first molecules (e.g. bits of audio) of the new glass pour will be low latency, even if it will take a full refresh cycle to "pour" the audio glass. And it's often in sync with a display scanout which is still a latency -- the final pixels of the bottom edge of a display will be refreshing later than the first pixels of the top edge of a display. This can have an effect on "audio-to-photons"-sensitive people.

It all depends on what perceptuals you are framing your lagfeel on (e.g. inputread-to-audio, or photons-to-audio). Surge-execute can "prepare-ahead" audio waveforms faster than the original machine, and the first few bits of the buffer will still play at ultra-low-latency. It's simply that some audio 16ms "into the future" was already rendered in a mere 4ms (at 4x surge-execute), but the first bits of the audio buffer is already playing, if you've delivered the frame fast and timed the playback of the first few bits immediately.
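The "16ms of audio rendered in 4ms" idea can be put into numbers with a minimal sketch. It assumes (my simplification, not something measured) that playback of the buffer starts the instant the surge starts, and that the surge runs a fixed `speedup` times faster than real time. The function name `surge_audio_lead_ms` is made up for illustration.

```python
def surge_audio_lead_ms(emulated_ms: float, speedup: float) -> float:
    """How far ahead of its playback time an audio sample gets rendered.

    Assumption: playback starts when the surge starts, so a sample that
    belongs at `emulated_ms` into the refresh cycle is heard at that same
    wall-clock offset, but was rendered `speedup` times sooner.
    """
    rendered_at_ms = emulated_ms / speedup  # wall-clock time the sample is produced
    played_at_ms = emulated_ms              # wall-clock time the sample is heard
    return played_at_ms - rendered_at_ms

# At 4x surge-execute, sample the start, middle and end of a ~16ms cycle.
for t_ms in (0.0, 8.0, 16.0):
    print(f"sample at {t_ms:4.1f} ms was rendered {surge_audio_lead_ms(t_ms, 4.0):4.1f} ms early")
```

Under these assumptions the first sample has no lead and no added lag, while the last sample of the cycle was rendered 12ms before it is heard. That is the "prepare-ahead" effect: the audio is not delayed, it just exists early.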

Thus, no audio lag (in specific situations), even without a gaming keyboard.

But there can be audio lag from the buffering. It depends on all the variables though.

How nice that an Einsteinian style "Frame of reference" works out, eh? (surge-executes is metaphorically like the time-dilation effects!). We can't predict the latency of the end users' display. The display can refresh sooner or later than the audio, and surge-execute can help synchronize audiotime:photontime better, due to the "delayed scanout" effect, and you're just playing the later-buffered audio bits in sync with the later parts of display scanout.

But the first photons of something beginning to move near the speed of light near you, will still be hitting you quite instantly -- because the object will still be near you. Likewise, the first bits of the audio in the surge-execute audio buffer are not lagged. It's just that audio 16ms into the future was all surge-executed ahead of time. Now, if you have a late-refresh inputread (e.g. an inputread that occurs very late in a refresh cycle, e.g. near the end of 70Hz), the surge-execution may increase latency accidentally (from the frame of reference of button-to-audio), but that is dependent on how fast the final refresh cycle is being scanned out.

There's a lot of interactions, but surge-execute can be lagless depending on when the inputread occurred in a refresh cycle. Many games read input only in the blanking interval, so such surge-execution can decide to execute blanking interval scanlines first. Or the "phase offset" of surge-execute algorithms can be made adjustable by the end user.

Some are button-to-audio sensitive, and others are visuals-to-audio sensitive. And sometimes you can (or can't) have cake and eat it too. Even some developers accidentally assume incorrect frames of reference (like "off by 1" errors) and this may be one of these times.

Single-tick surge executes probably only reduce latency by 1 tick time. But they can keep audio sync better in certain perceptual situations, especially if the game does mid-refresh-cycle input reads.

For VRR, most of your lag reduction is that frame presentation (e.g. Present()) means the bottom edge of screen will hit eyes in 1/(max Hz) time, not 1/70sec.

So most of your lag reduction is coming from the VRR-scanout benefit only. Since on 165Hz, you have a (1/70sec) - (1/165sec) = 8.2ms latency reduction of the bottom scanline (relative to original scanout velocity). This is with surge execution turned off. This lag reduction is gigantic enough to compensate for lots of emulator overhead (like the giant lag problem of executing a full emulator scanout long before beginning a display scanout).
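The 8.2ms figure is just the difference between two scanout times. A small sketch of that arithmetic, ignoring blanking intervals (an assumption that makes the numbers slightly optimistic); the function names are made up for illustration:

```python
def scanout_ms(hz: float) -> float:
    """Time to deliver one refresh cycle top-to-bottom at the given rate, in ms."""
    return 1000.0 / hz

def bottom_edge_saving_ms(emulated_hz: float, max_hz: float) -> float:
    """How much sooner the bottom scanline arrives when a low-Hz frame
    is scanned out at the display's max-Hz velocity (VRR)."""
    return scanout_ms(emulated_hz) - scanout_ms(max_hz)

saving = bottom_edge_saving_ms(70.0, 165.0)
print(f"{saving:.1f} ms")  # 8.2 ms
```

The same function gives the saving for other panels, e.g. `bottom_edge_saving_ms(70.0, 240.0)` for a 240Hz VRR display.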

Do you understand why VRR is the major latency reducer, even without surge-execute (even no tick-based surge executes)?

Right Settings for the Right Job.

Anyway, let this all be configurable to the user.

Now...

It is true that the only way to get perfect original-reproducing sync between latencies/audio/visuals is beam-raced sync methods (like WinUAE).

This is because surge-execute methods can cause audio to sound 'early' or 'late' (negative latency) relative to buttons in some situations -- depending on when in an emulated refresh cycle the inputread occurred. The scanout velocity of the emulator is out of sync with the scanout velocity of the real world, so you've got different latencies for different-raster input reads, even if the average latency is lower. So that's why I love lagless vsync methods as a perfect "original latency reproducer" (for all of inputreads, audio, display).

Now, audio lag increase/decrease can go either way, depending on when the input read was done, when surge execute starts and finishes, and how fast/slow/lagged the destination display scans out. Digital displays DO have more latency than analog displays, even if the fastest esports displays are only low single-digits laggier (1 or 2 or 3 ms extra relative to analog, mostly port transceiver lag and pixel response lag).

Do you get the (rough, general, approximate) "frame of reference" concept I'm trying to explain?

Rincewind42 wrote on 2023-07-06, 00:04:
In a way, we're doing something similar to the "surge emulation" approach in audio land. Most programs update the registers of the sound synthesiser chips (e.g. OPL) at relatively slow rates (well above 1ms intervals), but some more advanced ones do so at higher than 1000Hz. So what we do is when we receive a register write to the sound chip, we render the audio up to that point in a buffer capable of holding 1ms worth of samples, then when the 1ms tick interval hits, we simply continue rendering to fill up the buffer using the current state of the chip, then send the results downstream. So this should work (at least in theory) with ultra low 1-3ms audio buffers (128 and 64 sample buffers are very much doable on pro audio cards).

You need to consider the frame of reference (button-to-audio) vs (visuals-to-audio), as well as the fact that you're simply executed-ahead some audio faster.

The first bits of the audio buffer aren't lagged since the buffer is already empty or almost empty right before your next surge-execute, so the first part of the next surge-executed audio doesn't have lag relative to the emulation, even for button and key presses, depending on when the input reads were executed in a refresh cycle (VBI-based inputreads versus mid-raster inputreads). There can be unfixable audio lag effects, but there can also be negative audio lag too (audio less lagged than original machine) -- it's just a matter of all the settings/variables/display/etc. Like how fast the refresh cycle transport is on the DisplayPort cable -- your surge execute is compensating for other latencies in the chain and they can essentially (approximately) undo each other at various settings.

There are more variables that are not being considered by emulator developers who don't understand Present()-to-Photons black box as well as I do.

At the end of the day, end-user configurability is key. Go with conservative defaults though (like existing tick-granularity surge-executes).

Rincewind42 wrote on 2023-07-06, 00:04:
I'm *super annoyed* when I hear *any* sort of audio crackling in an emulator. My C64 and Amiga *never* made a single audio glitch ever, and neither my GUS in my 486, so I don't accept anything less in an emulator. Keypress-to-sound latency should be in the very low single-digit millisecond range like on real hardware (I've spent many thousands of hours in trackers writing MOD & XM music so I expect the virtually instantaneous response times I took for granted in the 80s and 90s... imagine my utter shock when I started to use music software on Windows for the first time... next thing I had to do was save up money for a pro interface that had low-latency ASIO drivers...)

Yes, audio driver lag is a big problem in reproducing original-machine latency.

Especially if the driver automatically decides to use a bigger buffer because of those buffer-nearly-empty moments. So use the appropriate APIs where possible to avoid that (fixed buffers etc. that emulators typically try to use). If you use tight audio buffer margins, make your Present() on VRR as sub-millisecond or even microsecond accurate as you can -- spaced as exactly one emulated refresh interval apart as possible -- with minimal time jitter.

Imperfect timer-based emulator frame presentation on VRR will produce audio crackles when you try to tighten margins, as many timers have a 1ms jitter. Some high-performance timers may do, although my fave in VRR framepacing precision is busywait-based frame presentation (RDTSC or QueryPerformanceCounter or other high-precision clocks), followed by an intentional Flush() after presentation. This produces amazing deliciously smooth frame pacing in an emulator. (Flush is expensive and wastes 50% of GPU, but it makes refresh cycle timing even more deterministic on VRR, which is useful for perfectly emulating refresh rates without timing jitter between consecutive emulated refresh cycles.)

You can have a perfect 70.086Hz average, but individual refresh cycles are not necessarily spaced exactly 1/70.086sec apart. If you use timer-based frame presentation, you can be VRR-jittered via [69Hz,70Hz,71Hz,69Hz,72Hz,68Hz,70Hz] which can sometimes cause crackles during those too-slow refresh cycles. RTSS uses variants of a microsecond-accurate frame rate capping technique. You'd be averaging 70.086 Hz, but you're not presenting every frame exactly 1/70.086sec apart if you're using a timer-based present. Those slow intervals are what cause an audio buffer to empty, since the next refresh cycle didn't refill the buffer in time.
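To see why the slow intervals hurt, here is a toy model (my simplification, not how any particular emulator mixes audio): each presented frame adds exactly one nominal refresh period of audio to the buffer, and real time then drains it for however long the actual interval lasted. The names `count_underruns` and `margin_ms` are hypothetical.

```python
NOMINAL_HZ = 70.086  # emulated refresh rate from the example above

def count_underruns(present_hz_sequence, margin_ms: float) -> int:
    """Count audio underruns for a sequence of actual present rates.

    Toy model: every emulated frame produces one nominal period of audio;
    the buffer starts with `margin_ms` of headroom and drains in real time.
    """
    nominal_ms = 1000.0 / NOMINAL_HZ
    buffered_ms = margin_ms
    underruns = 0
    for hz in present_hz_sequence:
        buffered_ms += nominal_ms   # audio produced by this emulated frame
        buffered_ms -= 1000.0 / hz  # audio consumed until the next present
        if buffered_ms < 0.0:       # buffer ran dry: audible crackle
            underruns += 1
            buffered_ms = 0.0
    return underruns

jittered = [69, 70, 71, 69, 72, 68, 70]  # timer-based presents
steady = [NOMINAL_HZ] * 7                # busy-wait paced presents

print("jittered, 0.3 ms margin:", count_underruns(jittered, 0.3))
print("steady,   0.3 ms margin:", count_underruns(steady, 0.3))
print("jittered, 1.0 ms margin:", count_underruns(jittered, 1.0))
```

In this model the jittered sequence underruns with a 0.3ms margin, while the steady sequence survives the same margin. Jitter forces a larger margin, and a larger margin is audio latency.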

So an optional configurable option to burn a single thread of a multicore with a raised-priority-thread "busyloop+present+flush" (to time your next Present() exactly 1/70.086sec after the previous Present() as microsecond accurate as you can) = creates shockingly deterministic intervals like [70.086Hz,70.086Hz,70.086Hz,70.086Hz,70.086Hz,70.086Hz] for consecutive emulated refresh cycles = you can use tighter audio buffer margins! Burns more laptop battery, but reduces audio latency. So pro/con, should be configurable.
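A minimal sketch of the busy-wait pacing idea, using only Python's standard `time.perf_counter_ns()` clock. `paced_present_loop` and the `present` callback are hypothetical stand-ins: in a real emulator the callback would do the Present() plus Flush(), and the loop would run on its own raised-priority thread, which this sketch does not attempt. Python itself will not reach microsecond determinism; the point is the scheduling pattern.

```python
import time

def busy_wait_until(deadline_ns: int) -> None:
    """Spin on the high-resolution clock instead of sleeping on a timer."""
    while time.perf_counter_ns() < deadline_ns:
        pass

def paced_present_loop(present, frames: int, hz: float = 70.086):
    """Call `present()` once per emulated refresh period and return timestamps."""
    period_ns = round(1_000_000_000 / hz)
    next_deadline_ns = time.perf_counter_ns() + period_ns
    timestamps_ns = []
    for _ in range(frames):
        busy_wait_until(next_deadline_ns)
        present()  # hypothetical: Present() followed by Flush()
        timestamps_ns.append(time.perf_counter_ns())
        # Advance along the ideal grid rather than from "now",
        # so one late frame does not shift every later frame.
        next_deadline_ns += period_ns
    return timestamps_ns

stamps = paced_present_loop(lambda: None, frames=5)
intervals_ms = [(b - a) / 1e6 for a, b in zip(stamps, stamps[1:])]
print(intervals_ms)
```

Scheduling each deadline from the previous deadline (not from the time the previous present actually returned) is the design choice that keeps the long-run average at exactly the emulated Hz.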

While I'm deaf, I still have an understanding of waveform physics, and time-differentials (e.g. time delta between button and audio, time delta between audio and photons), and also how different humans key on different things. Maybe that's part of why I am a visual expert in Present-to-Photons.

Rincewind42 wrote on 2023-07-06, 00:04:
mdrejhon wrote on 2023-07-05, 19:17:

Also, WinUAE does this (kudos to Toni who understood that it can be somewhat combined): Surge-execute method can be combined with beam-racing though, if you have enough refresh rate.
...
But in all cases, combining lagless vsync + surge execute at the same time in WinUAE (which it can do), is the best-of-all-worlds latencies. This is why enabling lagless vsync at 120Hz feels lower lag than lagless vsync at 60Hz -- it's surge-executing a beam-raced 60Hz refresh cycle in a 1/120sec fast-beamrace with every other 120Hz refresh cycle. And even lower if you own a 240Hz monitor (I recommend the new 240hz OLEDs, by the way) and if WinUAE is capable of executing 4x faster than a real Amiga. Very compute intensive, but the lowest possible (original-Amiga-beating) latencies.
...
WinUAE is pretty highly configurable in how you reduce its latencies. WinUAE can skip lagless vsync and instead use the surge-execute technique, if you have a platform fast enough to surge-execute WinUAE refresh cycles. But it won't beat the latency of 60Hz-real-world "lagless vsync" unless you own a really high refresh rate (240-360Hz) monitor, and have a computer fast enough to surge-execute WinUAE by 4x-6x.
...
Standing ovation to the smart emulator developer if the emulator supports both concurrently (WinUAE).

I love WinUAE, and Toni deserves a medal, no doubt. That emulator is a piece of art.

Agreed!

Reply 11 of 20, by Rincewind42

Posted on 2023-07-06, 03:31

mdrejhon wrote on 2023-07-06, 00:44:

You said you're using VRR, right? VRR is typically more common on higher-Hz monitors (e.g. 165Hz). So most of the lag reduction is caused by VRR's fast scanout. Those "70Hz" refresh cycles are scanned out onto your screen at max-Hz speed (e.g. 1/165sec for 165Hz monitor). 1/165sec is ~8.2ms less than 1/70sec.

Even if you emulate real-time 1/70sec into a refresh cycle, the frame is still delivered promptly to the actual display -- simply due to VRR capable of delivering low-Hz frames at max-Hz velocity over the cable.

Nope, I'm using a Dell U2414H which can only go up to 60Hz on paper, but with a custom resolution created in CRU with "CVT reduced blank" enabled I can push it up to just a bit below 71Hz. Perfect for DOS gaming, plus it can do 50Hz too (no wonder because it has HDMI input), so that covers WinUAE with PAL games.

I actually think the reduced lag in DOSBox is just a byproduct of emulating the DOS machine for 1ms "emulated time" intervals (but as fast as we can), and we can respond to inputs any time, not just at the end of the frame (in theory, there's surely limits imposed by the emulated device drivers or the program itself that polls the I/O ports directly at hardcoded intervals).

So DOSBox literally does this: surge emulate the machine for 1ms of "emulated real time" (this is what we call "tick") but as fast as we can, idle, surge emulate for another 1ms, idle, and so on. Inputs can be *potentially* polled at the start of every 1ms tick, I'm pretty sure.

mdrejhon wrote on 2023-07-06, 00:44:

...Confirming "Tick" terminology clarification: I'm assuming one tick = 1ms, not the old timer interrupt that are 18.2Hz. 1ms surge executes would only reduce latencies by 1ms, not by a full refresh cycle.

As per above, we're emulating in "1ms surges".

mdrejhon wrote on 2023-07-06, 00:44:

Do you understand why VRR is the major latency reducer, even without surge-execute (even no tick-based surge executes)?

Yep, that part is clear.

mdrejhon wrote on 2023-07-06, 00:44:

Do you get the (rough, general, approximate) "frame of reference" concept I'm trying to explain?

I think yes, but I'll read it a few more times when I'm a bit more rested 😀

About the differences between various audio latencies (latency-to-video vs latency-to-keypress), that's for the gaming aspects and I think the requirements are a bit more relaxed for games than for musical applications. I've been a hobby musician for over three decades now, and the Amiga felt like a real instrument when using ProTracker. The keyboard acts as a piano keyboard, and you can play samples by pressing the keys, just like on a piano. Yeah, there must have been some non-zero delay there, but it felt instantaneous. From memory, I think the minimum perceptible timing difference is about 5ms for good musicians (not saying I'm necessarily good, but that's what the research says 😁).

More importantly, as I've discovered myself in actual practice when writing some simple music programs, *constant* latency is *way* more important than low but variable latency, and I'm pretty sure that's universal, there's no individual preference or variation on that. To put that in context, I'd take a fixed 20ms latency from button press to hearing the sound any day than a *variable* random latency that can go from 0 to up to 20ms! While 20ms latency is not ideal, one can quite well adjust to constant delays when playing an instrument, while you absolutely cannot compensate for variable latencies... it's actually super frustrating to play like that! So yeah, I'm quite certain that virtually everybody optimises for potentially higher but stable, constant latencies in musical applications.

If you think about it, in a band context it takes a few milliseconds for the sound from a guitar amp on stage to arrive to your ears. People can and do adjust to that. But if you watch video recordings of great bands with tight timings, you'll often see the bass player and the drummer (the rhythm section), or even the guitar player and the drummer watching each other and synchronising by visual cues. That's quite understandable, as you can get much more accurate syncing by doing it visually than by relying on the sound alone because of the speed of light. It's just something that basically all good musicians figure out instinctually over time, they don't even think about it much. Also, watching the drummer while you play guitar is an interesting thing as you literally get a "look-ahead" into the future, meaning that the drummer has to raise his hands before hitting the drums, and his movements have very predictable latencies, so you can literally sync to that! This is all just instinctual, but it's interesting that it's scientifically explainable (well, everything is 😀).

Anyway, just some interesting side observations about audio syncing 😀

Reply 12 of 20, by mdrejhon

Posted on 2023-07-08, 00:20

Rincewind42 wrote on 2023-07-06, 03:31:

mdrejhon wrote on 2023-07-06, 00:44:

You said you're using VRR, right? VRR is typically more common on higher-Hz monitors (e.g. 165Hz). So most of the lag reduction is caused by VRR's fast scanout. Those "70Hz" refresh cycles are scanned out onto your screen at max-Hz speed (e.g. 1/165sec for 165Hz monitor). 1/165sec is ~8.2ms less than 1/70sec.

Even if you emulate real-time 1/70sec into a refresh cycle, the frame is still delivered promptly to the actual display -- simply due to VRR capable of delivering low-Hz frames at max-Hz velocity over the cable.

Nope, I'm using a Dell U2414H which can only go up to 60Hz on paper, but with a custom resolution created in CRU with "CVT reduced blank" enabled I can push it up to just a bit below 71Hz. Perfect for DOS gaming, plus it can do 50Hz too (no wonder because it has HDMI input), so that covers WinUAE with PAL games.

Interesting.

You might be getting the framebuffer-backpressure relief effect on VSYNC ON (where you are delayed by 1 less buffer in double buffering), from your emulator better synchronizing to the VSYNC ON refresh cycles, similar to what is done in the Low-Lag VSYNC HOWTO.

Rincewind42 wrote on 2023-07-06, 03:31:
I actually think the reduced lag in DOSBox is just a byproduct of emulating the DOS machine for 1ms "emulated time" intervals (but as fast as we can), and we can respond to inputs any time, not just at the end of the frame (in theory, there's surely limits imposed by the emulated device drivers or the program itself that polls the I/O ports directly at hardcoded intervals).

Everything else equal, you'd only get 1ms less latency. But you're keeping timing accuracy for some software (music software), at the cost of adding lag for other software.

Your lag reduction comes from the change to your refreshing algorithm removing a VSYNC backpressure effect. Basically an accidental effect from the change.

It's like how a frame rate cap 0.1fps above VSYNC ON refresh rate will usually lag 1 refresh cycle more than a frame rate cap 0.1fps below VSYNC ON refresh rate. (Get the exact decimal refresh rate at https://www.testufo.com/refreshrate, as CPU clocks are not in lockstep with GPU clocks.) It's a flagship part of the old Low-Lag VSYNC HOWTO on Blur Busters for gamers who want to force their existing VSYNC ON video games to have 1 frame less latency.

This is because if the game is allowed to blindly have a frame rate exceeding VSYNC ON refresh rate (without a precision waitable swapchain in Full Screen Exclusive mode), it will stuff up the double buffering, and you have maximum buffer latency between the game and the screen. So 2 refresh cycles latency can build up, for a fully backpressured double buffer system where Present() begins blocking your execution because the double buffer is full, and is waiting almost a full 16.7ms before double buffering frees up and accepts the frame into the queue, and returns from Present() ... But if you run at frame rates below refresh rate, you have only 1 refresh cycle latency (approx) from an emptied-out double buffering system.

That's why Present() is non-blocking during VSYNC ON frame rates below refresh rates, and why Present() blocks during VSYNC ON frame rates above refresh rate.
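The backpressure effect can be reproduced in a toy simulation. This is a sketch under simplifying assumptions of mine: rendering is instantaneous, input is sampled at the moment each frame starts, there is exactly one back buffer, and a submitted frame is shown at the next vsync after submission. `vsync_on_latencies_ms` is a made-up name, and integer nanoseconds are used to avoid floating-point drift at vsync boundaries.

```python
def vsync_on_latencies_ms(frame_cap_hz: float, refresh_hz: float, frames: int = 2000):
    """Input-sample-to-flip latency per frame for a capped game on VSYNC ON."""
    refresh_ns = round(1e9 / refresh_hz)
    cap_ns = round(1e9 / frame_cap_hz)
    sample_ns = 0     # when the frame starts and reads input
    slot_free_ns = 0  # when the single back buffer becomes free again
    latencies = []
    for _ in range(frames):
        # Present() blocks while the back buffer still holds an unflipped frame.
        submit_ns = max(sample_ns, slot_free_ns)
        # The frame becomes visible at the next vsync after it was accepted.
        flip_ns = (submit_ns // refresh_ns + 1) * refresh_ns
        latencies.append((flip_ns - sample_ns) / 1e6)
        slot_free_ns = flip_ns
        # Next frame starts when the cap allows it, or when Present() returned.
        sample_ns = max(sample_ns + cap_ns, submit_ns)
    return latencies

above = vsync_on_latencies_ms(60.1, 60.0)  # cap 0.1fps above refresh rate
below = vsync_on_latencies_ms(59.9, 60.0)  # cap 0.1fps below refresh rate
print(f"cap above refresh: settles at {above[-1]:.2f} ms")
print(f"cap below refresh: never exceeds {max(below):.2f} ms")
```

In this model the cap above the refresh rate slowly fills the buffer and settles at two refresh cycles of latency, while the cap below never exceeds one refresh cycle. Real drivers and compositors add their own queues, so treat the absolute numbers as illustrative only.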

Your algorithm just accidentally prevents the VSYNC backpressure.

What you think is reducing the latency, is actually from something else. A cause-and-effect, with an assumed cause that's actually a different cause...

I guarantee you will get further latency reductions if you do the 16.7ms surge-execute all at once, in a properly input-delayed surge-executed manner just before a VSYNC, and I think you will be surprised. Just make sure you don't let VSYNC ON backpressure itself (e.g. Present() returns virtually instantly). Because of surge-execution, all inputreads are occurring close to the VSYNC, and you can get less button-to-photons latency. Obviously, if code does millisecond-accurate audio executions upon button presses, those surge-executes may produce an audio desync effect -- but not always -- many DOS games sync the audio to the frame instead, and thus you will get less audio latency.

This is simply because in non-beamraced vsync workflows, you're having to scanout to an emulated framebuffer, before you even begin to scanout to the real framebuffer.

Maybe try this, and please make it a configurable setting. Give the end user a choice. It's not particularly difficult programming, as long as you've got the 1ms surge-executes working properly (even proper time-dilation of the system clocks) at smaller intervals. But you'll also need to combine it with an inputdelay technique (either autodetect or user-configurable), for it to beat the 1ms method. Autodetect would be self-measuring how long a surge-execute of a dosbox emulated refresh cycle took, and then making sure you /begin/ the surge-execute of emulation that much time (plus safety margin) before the next real-world blanking -- some of which can be detected from a release of a waitable swapchain.
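The autodetect idea reduces to one subtraction. A sketch, with two assumptions of mine that go beyond the description above: it budgets for the worst of the recently measured surge durations (rather than an average), and it clamps to "now" if the budget no longer fits. `surge_start_ms` and its parameters are hypothetical names, and all times are in milliseconds on a shared clock.

```python
def surge_start_ms(next_vblank_ms: float,
                   recent_surge_ms: list,
                   safety_margin_ms: float = 1.0,
                   now_ms: float = 0.0) -> float:
    """When to begin surge-executing the next emulated refresh cycle.

    The later the surge starts, the fresher the input reads, but it must
    still finish (with margin) before the next real-world blanking.
    """
    budget_ms = max(recent_surge_ms) + safety_margin_ms  # worst case plus margin
    return max(now_ms, next_vblank_ms - budget_ms)       # never schedule in the past

# Next blanking is one 70Hz period away; recent surges took 3.0-3.4 ms.
start = surge_start_ms(14.286, [3.0, 3.4, 3.1])
print(f"begin surge-execute at {start:.3f} ms")
```

With these example numbers the surge begins roughly 9.9ms into the refresh period instead of at 0ms, which is the input delay that lets a full-frame surge beat the 1ms method. The safety margin would be the user-configurable part.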

Rincewind42 wrote on 2023-07-06, 03:31:
So DOSBox literally does this: surge emulate the machine for 1ms of "emulated real time" (this is what we call "tick") but as fast as we can, idle, surge emulate for another 1ms, idle, and so on. Inputs can be *potentially* polled at the start of every 1ms tick, I'm pretty sure.

That definitely isn't the cause of your major latency reduction. Your switch to the 1ms surge execute algorithm simply tamed the 16.7ms cycle to prevent the VSYNC backpressure behaviors.

You can also test VSYNC backpressure behaviors by creating a 69.5 Hz refresh rate versus a 70.5 Hz refresh rate. The 69.5Hz refresh rate will suddenly lag 1/70sec more than the 70.5Hz refresh rate, due to the backpressure behavior: at 69.5Hz the double buffering gets stuffed to the point where Present() blocks, while at 70.5Hz the double buffering stays nearly empty so that Present() returns nearly immediately. That's massively more than a 1ms difference. Try it, and you might be shocked at the lag differentials in fast-button-press games -- and then start to realize I'm right. Assuming you're not using an auto-frame-dropping algorithm to prevent blocking Present()'s, that is (another workaround some emulators have done before).

For 90% of DOS gaming, I bet you will have less latency with the full 1/70sec worth of surge-execute, rather than 1ms surge-executes. So I think it should be a configurable setting.

Rincewind42 wrote on 2023-07-06, 03:31:
I think yes, but I'll read it a few more times when I'm a bit more rested 😀

I'd love more of your commentary.

Rincewind42 wrote on 2023-07-06, 03:31:
More importantly, as I've discovered myself in actual practice when writing some simple music programs, *constant* latency is *way* more important than low but variable latency, and I'm pretty sure that's universal, there's no individual preference or variation on that.

This is true when there's no other retimes involved -- e.g. music without a game/visual/interactive stimuli. Basically a piano program WILL work better with the "1ms surge execute" method. I agree.

Now, if a game is involved -- they will often tick their audio to the frame instead of the button press because they have to keep the button-press audio in sync with other background audio (enemies, competitors, music) that are tick-tocking to the frame, instead of tick-tocking to button press -- so the button press, sometimes, is registered to only execute actions (audio, movement) synchronized to the new video game frame. It's a common gaming workflow in many games. And thus, audio will not lag in a full-frame surge-execute.

And you still get to eat cake too: less lag between button-to-visuals. So full-frame surge executes = no downgrades in lag to audio and visuals in many DOS games, provided you don't screw up the VSYNC ON frame buffer backpressure behaviors when changing the algorithm (The "something else improved/worsened" mistaken lag assumption effect that many developers do). So PLEASE make it a configuration option.

You may be surprised. Wink.

Your algorithm is good (not useless):
- It refactored the sync workflow and improved timers massively to avoid VSYNC backpressure; and
- It allows 1:1 sync between input and audio in many critical situations such as music software
But:
- Avoiding full frame surge-execute leaves further lag improvements on the table (and sometimes further audio lag improvements too!) in DOS games that tick audio events to their frames / their refresh cycles.

But remember, full-frame surge-executes MUST be combined with accurate inputdelay techniques (pausing emulation until just a few milliseconds before the displays' real VSYNC -- in order to execute ALL inputreads of the ENTIRE emulated refresh cycles as low-lag-as-possible before the real displays' refresh cycle)

So, I suspect some latency assumptions are being made (by accident, understandably): you may be thinking this optimization is more universal than it really is. I'm just here to educate on the Present()-to-Photons behavior, and to inform you about the variable VSYNC ON backpressure behaviors, which may explain your full 1/70sec reduction in latency. But ordinary non-beamraced VSYNC ON still has the mandatory "finish emulated scanout before you begin true real world scanout" latency, which you're definitely (bet-the-mortgage) leaving on the table. Now, be very careful you don't accidentally resurrect VSYNC ON backpressure latency (the 1/70sec increase in latency) when implementing an optional full-refresh surge-execute algorithm.

Also, you may want to allow 0.1ms surge-executes too, for improved piano software behaviors, especially with 4000Hz+ gaming keyboards. You will have further lag improvements in all emulators with higher-hz gaming keyboards with MartyPC, even if the underlying emulator doesn't support that Hz precision of keyboard events, and even if the emulated keyboard Hz in the underlying emulator is much lower than the real keyboard Hz. So, less-than-1ms can have less input lag for certain apps such as piano apps, possibly. Maybe make surge execute intervals configurable from [0.1ms ... full Hz] as a continuum, at least for experimentation?

Games that tick audio-sequence starts to the frame clock will behave differently than music software that ticks audio-sequence starts immediately to the input. So they will have different lag behaviors. You're still having to slow-scanout to an emulated framebuffer, long before Scan Line #1 begins scanning out on the real screen.

Anything that is not lagless vsync (beamraced sync like WinUAE) is a pro/con compromise. So configurable, please? You cannot have your cake and eat it too universally in all software, due to the forced latency caused by the "must scanout to emulated framebuffer fully, even before scan line #1 scans out to the real display" effect of modern graphics APIs.

This common (understandable) assumption, found amongst many emulator developers, does create improvements for a lot of software simultaneously (because it solves a lot of latency problems) -- but it still leaves a lot of further latency reductions on the table, unless you implement beamraced vsync like WinUAE for universal "zeroing out to original latency" and "zeroing out inputread/music/button behavioral differences" in all software. That won't be universal with 1ms-surge-execute workflows: you improve most software, but are still leaving further easy latency optimizations on the table. There is no way to achieve original-machine lag symmetry universally in all your software -- your optimization won't equally help all software -- without switching to a lagless vsync workflow. It's the law of physics of emulator-must-scanout-fully-before-real-scanout-begins, which forces you to make latency algorithm compromises that affect some software more than others.

Is it easy to make the surge-execute-amount configurable for experimentation? (0.5ms, 1ms, 2ms, 4ms, 8ms, full frame), while adding optional input-delay configurability? (Make sure your dosbox statistics display, or console output, lets you know how many milliseconds per emulated refresh cycle dosbox is using -- to make it easier for the user to configure an inputdelay). And yes, full frame surge execute will perform lag-worse on non-VRR displays if fully backpressured without intentional configurable inputdelay, while you can mostly skip inputdelay when it comes to VRR displays (unless you're using it to improve VRR framepacing in variable-overhead dosbox execution).

(Big rabbit hole indeed!)

Right Tool for Right Job, y'know?

Founder of www.blurbusters.com and www.testufo.com
- Research Portal
- Beam Racing Modern GPUs
- Lagless VSYNC for Emulators

Reply 13 of 20, by Rincewind42

Posted on 2023-07-09, 06:31

Thanks @mdrejhon; a rabbit hole, indeed 😀 Like I said, I'm the audio guy; I just know enough about graphics to be dangerous. I'll pass this on to the guy who did the frame-presentation work for VRR monitors. Your valuable insights will definitely help if/when we get around to implementing these ideas. Much appreciated!

Reply 14 of 20, by mdrejhon

Posted on 2023-07-10, 02:20

mdrejhon wrote on 2023-07-08, 00:20:
This is because if the game is allowed to blindly have a frame rate exceeding VSYNC ON refresh rate (without a precision waitable swapchain in Full Screen Exclusive mode)

For those curious about waitable swapchains, here's the Microsoft page:
https://learn.microsoft.com/en-us/windows/uwp … 1-3-swap-chains

Basically, traditional flip-model swap chains just try to buffer the frame. If you're running a frame rate above the refresh rate, even by only 0.1fps, the frame queue fills up. That's why many emulators get 1 frame more latency when the real world refresh rate is even just 0.1Hz - 0.5Hz below the emulator's emulated refresh rate (especially if you're using VSYNC ON + using a CPU clock such as RDTSC/QueryPerformanceCounter for computing emulated refresh rate). Present() blocks because the double buffer / frame queue is full and can't accept the frame. That delays the input-to-photons even more.

So if emulator Hz is 70Hz, but your real fixed-Hz is 69.9Hz (custom resolution mode), you can get 1 refresh cycle more lag (1/70sec) than if your real Hz is 70.1Hz (custom resolution mode) -- due to the double buffer backpressure effect. Buffer eventually clogs up to maximum VSYNC ON latency, if emulator frame cap (Hz) is higher than real fixed-Hz display during VSYNC ON. This can persist even if you have a framedropping algorithm, due to that extra frame already in the queue (double buffering in graphics driver) before you realize you needed to drop a frame (without checking swapchain status, ala waitable swapchain logic or similar)
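
A toy simulation of that backpressure effect, as a sketch under stated assumptions rather than a model of any real driver: the emulator is CPU-clocked at `emu_hz`, the display takes one queued frame per vblank at `real_hz`, and the queue holds `queue_slots` frames (one by default, an assumption). Latency here is measured from the moment a frame enters the queue to the vblank that takes it. All names are hypothetical.

```python
from collections import deque


def average_latency_ms(emu_hz, real_hz, queue_slots=1, seconds=60.0):
    """Average queue-to-vblank latency for a CPU-clocked emulator (toy model)."""
    emu_period = 1.0 / emu_hz
    real_period = 1.0 / real_hz
    queue = deque()        # times at which frames entered the queue
    next_ready = 0.0       # when the emulator can submit its next frame
    latencies = []
    vblank = real_period
    while vblank < seconds:
        # Submit every frame that is ready before this vblank, unless the
        # queue is full (that is where a real Present() would block).
        while next_ready <= vblank and len(queue) < queue_slots:
            queue.append(next_ready)
            next_ready += emu_period
        if queue:
            # The display takes the oldest queued frame at vblank.
            latencies.append(vblank - queue.popleft())
            # A blocked submit gets through now; the emulator resumes from here.
            if next_ready < vblank:
                next_ready = vblank
        vblank += real_period
    tail = latencies[len(latencies) // 2:]   # skip the warm-up half
    return 1000.0 * sum(tail) / len(tail)


print(f"70 Hz emulator on 69.9 Hz display: {average_latency_ms(70.0, 69.9):.1f} ms")
print(f"70 Hz emulator on 70.1 Hz display: {average_latency_ms(70.0, 70.1):.1f} ms")
```

In this model the slightly-slow display keeps the queue permanently full, so every frame waits a whole refresh cycle, while the slightly-fast display keeps the queue nearly empty and the wait sawtooths between zero and one cycle.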

Worse, even if you create exactly the same refresh rate as DOS, the CPU and GPU won't agree on it -- and you may have an extra frame of lag on some systems and less lag on others, depending on how fast/slow their GPU ticks relative to the CPU (and they will both tick differently than an atomic clock) -- since creating VGA 70.0000000 Hz on one GPU may be actually 70.038853453Hz or 69.9953135Hz by an atomic clock. Clock precision on GPUs can vary a lot.

Even when a DOS program measures a CGA adaptor and reads 59.928Hz exactly perfectly every time (because both the 8088 and CGA ticks lockstep, unlike modern CPU-vs-GPU), it may actually still calculate as 59.9100734Hz or 59.93367Hz if stopwatched by an atomic clock. (The accuracy margin varies widely depending on what kind of clock chip or crystal is used, and its error margins, and if there's been some unnoticed degradation or not).

Ideally, dosbox should already be using a waitable swapchain too (and synchronizing to it, e.g. intentionally running dosbox slightly slow at 69.8Hz and running dosbox slightly fast at 70.2Hz -- to avoid lag, to avoid audio skips/pops and to avoid frame lag). I remember many years ago when some emulators (I think MAME?) switched to the waitable swapchain model, and got (about) a frame less latency as a result.

There is always a latency-sawtooth effect on fixed-Hz displays when running emulated refresh rate off any non-GPU clock such as RDTSC or QueryPerformanceCounter or std::chrono::high_resolution_clock::now

- CPU and GPU have different ideas about how fast an atomic clock would tick as
- CPU and GPU will have different clock drift on different systems/configs (and change over the same hour, same day, or over long time, whether current thermals, or from aging, etc)

Clock between CPU and GPU is a problem

The Low-Lag VSYNC HOWTO for video games says to cap the frame rate fractionally below the real refresh rate (e.g. 0.1 below what is seen at https://www.testufo.com/refreshrate#digits=8), because "60.000Hz" in NVIDIA Control Panel may actually be clocking at 60.034 Hz or 59.983 Hz or whatever. The CPU clock and GPU clock tick out-of-phase with each other, and one clock may run slightly slow / slightly fast -- and the clockspeed slew can change with thermals too! So the best workaround is to make dosbox intentionally "tick" to the fixed-Hz display via the waitable swapchain model, intentionally running slightly slow / slightly fast.

The browser Hz-measuring webpage is VERY accurate when run for a long time, because it is a snap-to-grid (one-dimensional grid, consisting of a VSYNC heartbeat) algorithm that ignores missed VSYNC's by recognizing when future VSYNCs "aligns" with previously successfully listened VSYNCs. So it is very accurate when run for 30 minutes, even when the page sometimes stutters (drops frames). It is based on the very great HZ.js algorithm from https://www.vsynctester.com that I got permission to use for that specific TestUFO, so it's a very good VSYNC listener for ultra-accurate refresh rate measurements accurate enough to detect clock drift between CPU and GPU! I even validated the TestUFO refresh rate measurement system to a photodiode oscilloscope -- and it's shockingly accurate "CPU representation of what it thinks the GPU refresh rate is". (Neither CPU or GPU will match the atomic clock - so that's three out of sync disagreements)

On an original IBM, the IBM CGA was able to tick lockstep to the 8088 CPU, thanks to a bus clock in sync with both. But this isn't the case anymore with modern GPUs. External GPUs (AMD, NVIDIA) tick on their own independent onboard clocks, and there is always clock drift between the motherboard CPU and the add-on GPU. Microscopic as it may be, it is a consideration for synchronization. Even if "CPU emulated Hz"-and-"GPU real Hz" beat-frequencies at 0.01Hz, it's still a drift. And the drift can change with thermals too.

Now if you attach ultra accurate clocks -- even original IBM PC 5150's running side-by-side actually drifted apart too. Even though they are the same refresh rate, one can tick slower than the other, whether it's the 2nd digit, 3rd digit, 4th digit or 5th digit of the refresh rate. Even if onscreen Hz measurements say "70.028", an atomic clock may say they're running "69.998" or "70.034" Hz for the two side-by-side 5150's -- especially if there's more degradation. The CRT tubes don't care because fixed Hz tubes have an allowance for a few percent of drift away from the real Hz. And most humans can't tell if all music is globally changed less than 0.1% in pitch, since all the sounds are pitch-shifted (no beat frequency or tuning fork style reference!) to the tick-fast behavior or tick-slow behavior. Computers will run faster/slower like a real clock. It's as if each of the two separate IBM PC 5150's is in its own separate time-dilation bubble, even if it takes one week for the two systems to drift apart by 1 second (usually drifts apart way faster than that - I've seen seconds per hour drift between two identically configured old computers).

The considerations are the same for emulator too, but even WORSE, because of a drift between CPU clock and GPU clock. D'oh! Same-system clock drift, that creates additional emulation considerations.

I don't know if dosbox has a setting for syncing to real world refresh rate -- but dosbox could have the option to tick fast / tick slow if using a waitable swapchain + a refresh rate that's reasonably close -- e.g. someone configures 70.000 Hz, and dosbox should ideally tick to that if possible, even if fractionally slow (far less than 0.1% slower). There are times when dosbox should tick to the CPU-clocked refresh rates (e.g. for variable refresh rates, or for emulated Hz very different from the real display Hz).

Latency Sawtooth Effect (can be every few seconds, or minutes, or hours)

Clock drift between CPU and GPU can create Hz-vs-Hz phase drift problem (sawtooth latency effect), if you're not intentionally microscopically speeding up / slowing down your emulator to sync to a microscopically-drifting real-world GPU Hz (swapchain signal, VSYNC signal, or Present() blockage monitoring algorithms, etc):

In other words, a ferocious pick-your-poison:
- You may have improved audio latency-consistency behaviors in piano software (easier to debug).
- But you may have worsened button-to-visuals latency consistency (harder to debug) and audio-to-visuals latency consistency in game-frame-triggered audio events (harder to debug too).

It's possible to fix audio issues when you decide to sync to a GPU clock, though -- but it takes some good programming to smooth over audio issues.

This will never show up in audio because the audio correctly ticks fast or ticks slow. Humans can't tell pitch changes when a CD is played less than 0.1% fast or less than 0.1% slow -- the 44.1KHz may actually read on an atomic clock as 44.0973 KHz or 44.113199 KHz, even if the computer reads it as a perfect 44.1 -- but it's so microscopic, and it still ticktocks correctly (if no audio component is re-timing the slightly-off sample rate). But it will show up in an ultra-precise latency tester (e.g. ultra sensitive photodiode oscilloscope + data logger + data processing).

So yet another rabbit hole, just freshly opened for you. Most emulator developers don't bother to use photodiode oscilloscopes, but there is now a boom of OSRTT and LDAT etc (easier to use than an oscilloscope), plus the possible release of the Blur Busters ultra-precise latency tester in the future. Measuring with tools like these is how I have noticed a lot of side effects like these.

An emulated refresh rate that's not perfectly sync'd to the real refresh rate will have a sawtooth latency effect (even if lower than VSYNC ON), as the two refresh rates "slew against each other". This is an acceptable side effect (average latency reduction) for some use cases -- but sometimes unacceptable side effect for certain applications, even if the slewing (Beat-frequencying) occurs only once every 30 seconds, or only once every 5 minutes. It may never ever show up in a piano app (button-to-audio lag), and only show with visuals (button-to-photons lag).
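
As a rough worked example of how often that slewing repeats: two free-running refresh rates drift through one full cycle of phase every 1 / |emulated Hz - real Hz| seconds. This is just beat-frequency arithmetic under the assumption that both rates stay constant (they won't exactly, because of thermals); the function name is made up for illustration.

```python
def sawtooth_period_s(emulated_hz, real_hz):
    """Seconds for the latency sawtooth to complete one full cycle."""
    diff = abs(emulated_hz - real_hz)
    if diff == 0:
        return float("inf")   # perfectly locked: no slew at all
    return 1.0 / diff


# Example figures taken from the refresh rates mentioned earlier in this post.
for emu, real in [(70.0, 69.9), (70.0, 69.9953135), (60.0, 60.01)]:
    period = sawtooth_period_s(emu, real)
    print(f"{emu} Hz emulated vs {real} Hz real: one sawtooth every {period:.0f} s")
```

So a 0.1Hz mismatch slews every 10 seconds, while a mismatch in the 3rd decimal digit takes a few minutes per cycle, which is why the effect is easy to miss without a latency tester.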

The ONLY way to avoid this "lower lag but inconsistent lag" behavior is to let the emulator sync to the display VSYNC (with no framedropping algorithm), good to solve the problem of GPU and CPU drifting microscopically (example of clock drift between CPU and GPU -- https://www.testufo.com/refreshrate#digits=8 and run for 30 minutes -- still drifts usually at the 4th digit, often due to thermals! Some systems are less accurate and still drifts on 3rd digit, and other systems more precise at 6th digit -- but run it again tomorrow and it may be different)

So these would be two (Generically) good fixed-Hz frame presentation algorithms.
These are independent of whatever inputdelay algorithm or whatever surge-execute algorithm you do (different topic altogether in the behind-the-scenes technical emulation-timing orchestra):

(A) Waitable swapchain algorithm + optional mode to let emulator run slightly fast/slightly slow for fractional (say, sub-0.5Hz) refresh rate difference between emulated and real;
- Great for fixed-Hz mode such as 70Hz mode. ("Sync to monitor's refresh rate"). Even if not beam raced, emulator refresh cycle is in relative time-sync with real world refresh cycle

(B) Waitable swapchain algorithm + let emulator refresh rate sync to microsecond CPU-derived clock instead (e.g. RDTSC or QueryPerformanceCounter) + framedropping algorithm that intentionally drops frames that will otherwise create a blocking Present(). Basically don't even call Present() if the flag tells you that it will block. If you framedrop that way, you've got lower-lag framedropping. Emulator refresh cycle is out of sync with real world refresh cycle, but at least at a very low latency compared to traditional page flip.
- Great for fixed-Hz mode that is very different from the emulator refresh rate
- Great for VRR mode
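
The framedropping rule in (B) can be sketched like this. `FakeSwapchain` is a hypothetical stand-in written only for this example; a real implementation would ask the graphics API whether presenting would block (for instance by polling a waitable swapchain's wait handle with a zero timeout) instead of counting slots.

```python
class FakeSwapchain:
    """Hypothetical stand-in for a swapchain that exposes a 'would block' signal."""

    def __init__(self, slots=1):
        self.slots = slots
        self.free_slots = slots
        self.presented = []

    def would_block(self):
        # Assumption: the real API can report this without blocking.
        return self.free_slots == 0

    def present(self, frame):
        self.free_slots -= 1
        self.presented.append(frame)

    def on_vblank(self):
        # The display consumed one queued frame, freeing a slot.
        self.free_slots = min(self.slots, self.free_slots + 1)


def present_or_drop(swapchain, frame):
    """Present only when it will not block; otherwise drop the frame."""
    if swapchain.would_block():
        return False          # dropped: never enters the queue, so adds no lag
    swapchain.present(frame)
    return True


chain = FakeSwapchain(slots=1)
results = [present_or_drop(chain, name) for name in ("A", "B")]  # B arrives too early
chain.on_vblank()                                                # display takes A
results.append(present_or_drop(chain, "C"))
print(results, chain.presented)
```

The point of the design is that the emulator thread never stalls inside the present call, so its CPU-derived timing stays intact and the newest frame is the one that reaches the display.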

Simplest Potential Compromise

It could be a configurable option such as

Sync Emulated Refresh Rate To: [Internally Clocked (Best for VRR)] | [Display VSYNC] | [Automatic]

- "Internally Clocked" will prioritize realworld-accuracy of emulated refresh cycle, at cost of potential latency-slew (on non-VRR displays), and will automatically drop emulated refresh cycles if display Hz is lower than emulated Hz (like dosbox already does today). There is always a latency-slew effect in this mode (on non-VRR display), even if it "beat frequencies" every 10 minutes or every 30 minutes.
Possible CPU clock source for emulator Hz: RDTSC, QueryPerformanceCounter(), std::chrono::high_resolution_clock::now(), etc.

- "Display VSYNC" will make GPU the master clock, prioritizing accuracy of emulated refresh cycles to real display refresh cycles, even if display Hz is slightly off. It is the user's responsibility to create a custom Hz that is as accurate as possible (it just will never be exact CPU-vs-GPU lockstep). There is never a latency-slew effect in this mode, as the emulator will successfully speedup/slowdown with GPU spec error margins and clock drifts (thermals too), since this setting considers the GPU the master clock. This will mean 70Hz DOS content runs slow if run on a fixed-Hz 60Hz display with no 70Hz modes available, but that is preferable to many people who prioritize visuals (remember, I'm born deaf -- so I'm all about game stutters).
Possible GPU clock source for emulator Hz: Any VSYNC-detecting API (refresh cycle interval detection) such as waitable swapchain, or Present() blockage detection, D3DKMTGetScanLine's "InVBlank" flag, etc. Equivalent APIs also in Linux and in Mac, if diving deep enough. Your existing graphics framework may already have a method

- "Automatic" is the user friendly setting that will use Display VSYNC if current emulated Hz is within 0.5Hz of an available real-world display mode (e.g. custom 70Hz), otherwise internally clocked (which is what will happen automatically on a high-Hz VRR display, due to high max Hz far beyond DOS Hz).

You could have a configurable Automatic margin (1Hz, 0.5Hz, 0.1Hz, 0.001Hz). You will NEVER get perfect lockstep between CPU clock and GPU clock, and you will NEVER avoid the latency-slew effect (on visuals, not audio) if you tick only to the CPU clock (not display VSYNC) . So you can't have cake and eat it too, for this specific lineitem either (unless using VRR+Automatic). And if you decide to beamrace your refresh cycles, this could be a fourth setting called "Lagless Vsync" , which would be a beamraced version of "Display VSYNC". Only if that setting is ever added later in the future.
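
A minimal sketch of how the "Automatic" decision could be made, assuming the emulator can obtain a list of available display refresh rates. The function name, the return values and the example mode lists are all hypothetical.

```python
def choose_clock_source(emulated_hz, display_modes_hz, margin_hz=0.5):
    """Pick the master clock for emulated refresh cycles.

    Returns ("display_vsync", mode_hz) when an available display mode is
    within margin_hz of the emulated refresh rate, else ("internal", None).
    """
    if display_modes_hz:
        closest = min(display_modes_hz, key=lambda hz: abs(hz - emulated_hz))
        if abs(closest - emulated_hz) <= margin_hz:
            return ("display_vsync", closest)
    return ("internal", None)


# A custom 70 Hz mode exists, so the GPU becomes the master clock.
print(choose_clock_source(70.086, [60.0, 70.0]))
# Only 60 Hz and a high-Hz mode exist, so stay on the CPU clock.
print(choose_clock_source(70.086, [60.0, 240.0]))
```

Exposing `margin_hz` as the configurable Automatic margin lets users decide how much emulator speed error they will accept in exchange for constant latency.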

Some emulator developers may prioritize sound behaviors by monitoring audio buffer state (and thus, defacto sync emulator to audio chip clock, not CPU clock or GPU clock), so this could be another setting, if necessary. However, if you do things algorithmically correct and precisely enough, you should have good pop-free audio for all modes (at reasonably close refresh rates). The annoying thing is that a 0.1% drift is more noticeable in audio than in visuals. But imagine walking up to a dosbox game and noticing it is a bit more laggy one hour than the next -- you may be witnessing the cyclic latency sawtooth effect. If display Hz and emulated Hz are very close, the latency sawtoothing will even go hourly rather than every few minutes. But it's still an annoyance if your muscle memory works differently one hour than the next.

It may be that some versions of dosbox already have equivalents of these modes (I haven't tried the PC version of dosbox recently enough, alas -- to know if they finally added this!)
But if not added yet, I think these 3-settings are fairly easy to do (without modifying the 1ms surge-execute workflow)

This isn't mutually exclusive to the Hz autoswitching (e.g. switching to 60Hz vs 70Hz). But this gives the user the option to which "master clock" (or "main clock") for the emulator to run with, with its attendant latency pros/cons.

This post has nothing to do with surge-execute and how much you do (which can still be done with or without any of the above approaches).

I probably dumped only the 101's (basics)... I have WAY more help I can offer about the Present()-to-Photons rabbit hole. Ask away!

Last edited by mdrejhon on 2023-07-11, 08:21. Edited 10 times in total.

Reply 15 of 20, by mdrejhon

Posted on 2023-07-11, 02:13

Updated my post with a few edits (potential CPU-vs-GPU clock sources).

In short, stuff like RDTSC/QueryPerformanceCounter is a CPU clock source.
- Good for best & consistently low latency performance on VRR (since VRR = software-initiated refresh cycles)

And fixed-Hz VSYNC-monitoring (also swapchain-monitoring during VSYNC ON) is a GPU clock source.
- Good for avoiding button-to-visual latency issues on fixed-Hz - best 'simple algorithm' in performance/volatility/slew

Many emulators that I run are able to switch between the two clock sources for timing their emulator refresh cycles, even if it's just a simple "Sync to Refresh Rate ON/OFF" setting (defacto switch between CPU-vs-GPU clocking).

Various Possible APIs Whenever Using GPU As Master Clock For Emulated Hz

This goes by various setting names in various terminologies ("sync to refresh rate" vs "60fps"), and it is widely implemented in many emulators. If they haven't already, I recommend Dosbox / MartyPC / other emulators do so too -- it is probably over 100x easier to implement this configurability than "lagless vsync" -- probably just a few hours' work, as long as you're familiar with a way of monitoring the GPU VSYNC. If you're unfamiliar, it may take longer, but hopefully it's simple for your workflow. It can be reasonably simple to clock to the GPU clock in some graphics frameworks.

Your graphics framework may have a (roundabout) way of determining the VSYNC intervals, and thus you can clock to GPU to prevent a phase drift between your emulated Hz and real Hz (even if it takes an hour to drift in-and-out of phase)

The most cross-platform way would be blocking-behavior monitoring of your graphics API's respective frame-presentation event -- generally the timestamp of the unblock will be the real world VSYNC event -- and thus you've found a GPU clock source that makes it possible to guarantee constant latency (button to visuals) with only a minor update to existing frame presentation workflows -- but it may be laggier than waitable swapchain approaches. The workflow of "Sync To Refresh Rate" is simply to try to go flat-out at maximum frame rate, and it self-throttles to max refresh rate (because of the blocking behavior) during VSYNC ON (even in VRR mode, which will self-throttle at max Hz). Now you've found a GPU clock source!
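
A minimal sketch of that timestamping loop, assuming you can wrap your framework's blocking present call in a plain function. `present_blocking` and the injectable `clock` are hypothetical names; the stand-in values at the bottom only replay ideal 60Hz timestamps so the example runs without a GPU.

```python
import time


def collect_vsync_timestamps(present_blocking, frames, clock=time.perf_counter):
    """Timestamp the exit from each blocking present (VSYNC ON, max frame rate)."""
    stamps = []
    for _ in range(frames):
        present_blocking()       # assumed to return around the real VSYNC
        stamps.append(clock())   # exit from blocking = approximate VSYNC time
    return stamps


def mean_refresh_hz(stamps):
    """Average refresh rate implied by a run of VSYNC timestamps."""
    return (len(stamps) - 1) / (stamps[-1] - stamps[0])


# Stand-in for a real display: a fake clock that replays ideal 60 Hz VSYNC times.
fake_times = iter(k / 60.0 for k in range(10))
stamps = collect_vsync_timestamps(lambda: None, 10, clock=lambda: next(fake_times))
print(f"measured refresh rate: {mean_refresh_hz(stamps):.3f} Hz")
```

Real timestamps will be jittery and will occasionally miss a VSYNC, so in practice the raw values need filtering before being used as a clock.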

Here are possible options:
- I also linked to Microsoft documentation on waitable swapchains.
- C# XBox/MonoGame: Update()+Draw() generally tries to tick to VSYNC
- JavaScript: requestAnimationFrame() generally tries to tick to VSYNC
- DirectX: Timestamping the exit from max-framerate Present() (exit from blocking behavior = timestamp of VSYNC). But try to use waitable swapchains instead to reduce lag!
- OpenGL: Timestamping the exit from max-framerate glFinish/glSwapBuffers for blocking behaviors (exit from blocking behavior = timestamp of VSYNC). (See above, though)
- Alternatively, there's sidechannel APIs to clock to the GPU:
.....Windows: D3DKMTWaitForVerticalBlankEvent() (...still works independently during OpenGL or Vulkan...)
.....Android**: typedef void (*HWC2_PFN_VSYNC)(hwc2_callback_data_t callbackData, hwc2_display_t display, int64_t timestamp);
.....MacOS/iOS: CADisplayLink or CVDisplayLink
.....Linux: Varies (not all X window managers are vsync'd, but kwin-lowlatency fork did a fantastic job, and I think was committed upstream to kwin)

**Android documentation talks about a "vsync offset". Terminologically, this is identical to what I call "input delay" -- a time-offset for rendering (compositing) a frame versus the real GPU VSYNC interval. Always a recommended companion setting (and set to a conservative default), especially if you're using blocking behaviors to monitor VSYNC

Millisecond Changes Cause Problems In Gaming

Remember, even a continual 2ms lag change (over time) can mean 2 millimeter "muscle memory" misaim for objects moving 1000 millimeters per second (e.g. archery arrows, moving targets, pinball balls, FPS enemies in esports, other fast moving objects etc). The latency sawtooth can be a full refresh cycle (60Hz = 1/60sec = 16.7ms) so a game may lag 16ms less or more the next minute (or hour) than the previous minute (or hour), depending on how fast the latency slew effect occurs between a CPU-clocked emulator refresh cycle and the GPU refresh cycle. It's as if the slew effect is wreaking havoc on your average human reaction time (aiming) - 16ms is over 10% of a 150ms human reaction time (well-attuned can be 100ms for button press reactions when ignoring the software and Present-to-Photons latencies, which is why you should subtract 2-3 refresh cycles from a web browser benchmark such as humanbenchmark). This can mean the difference between feeling you're playing expertly, or feeling you're not playing as well as you did on original machine.
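
The arithmetic behind those numbers, as a small worked example: misaim distance is just target speed multiplied by the change in lag. The function name is made up for illustration; the figures are the ones used in the paragraph above.

```python
def misaim_mm(target_speed_mm_per_s, lag_change_ms):
    """Distance a moving target travels during a change in input lag."""
    return target_speed_mm_per_s * (lag_change_ms / 1000.0)


speed = 1000.0                          # target moving 1000 mm per second
full_sawtooth_ms = 1000.0 / 60.0        # one 60 Hz refresh cycle
print(f"2 ms lag change:   {misaim_mm(speed, 2.0):.1f} mm misaim")
print(f"one refresh cycle: {misaim_mm(speed, full_sawtooth_ms):.1f} mm misaim")
print(f"share of a 150 ms reaction time: {full_sawtooth_ms / 150.0:.0%}")
```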

Last edited by mdrejhon on 2023-07-14, 01:20. Edited 1 time in total.

Reply 16 of 20, by GloriousCow

Posted on 2023-07-11, 12:50

I've been looking into how to implement some of this stuff in MartyPC using the library stack that I have, and I've been scratching my head a bit.

Rust's standard timer is Instant: https://doc.rust-lang.org/std/time/struct.Instant.html
Its back-end implementations are listed there. I am not sure if all of them are sufficiently precise.

But I have two primary challenges.

Using Rust, I have a somewhat unique ecosystem to contend with. I can't just poke at the hardware or rendering system at my leisure.

I'm using the winit cross-platform windowing library: https://docs.rs/winit/latest/winit/
Since it creates an abstraction over several different windowing environments, access to some of the implementation specific details I might want for this is limited. In general, a program using winit is event driven; it performs updates and optionally rendering on Event::MainEventsCleared, which is emitted when all other events have been processed - which is not ideal for low jitter, I suppose. One starts to consider having a separate thread to smooth things out, but unfortunately on MacOS we are limited to the main thread for reasons I don't entirely comprehend...

My graphics library is wgpu; again an abstraction over several possible graphics backends (Vulkan, DX12, Metal, GLES or WebGPU) so access to specific backend implementation details is also again a bit limited. Trying to wait precisely for vblank in an event-driven method using wgpu might not be feasible(?)

So again I am thinking about an SDL frontend as a 'high-performance' option for MartyPC. As you know, a lot of emulators use SDL for cross-platform compatibility, so if this sort of thing is possible to pull off using the timers and rendering functions SDL provides, documenting that might be extremely useful. I confess the last time I used SDL it was SDL version 1; so I am not really sure what their current API looks like.

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc

Reply 17 of 20, by mdrejhon

Posted on 2023-07-11, 21:37

GloriousCow wrote on 2023-07-11, 12:50:
I've been looking into how to implement some of this stuff in MartyPC using the library stack that I have, and I've been scratching my head a bit.

Rust's standard timer is Instant: https://doc.rust-lang.org/std/time/struct.Instant.html
Its back-end implementations are listed there. I am not sure if all of them are sufficiently precise.

But I have two primary challenges.

Using Rust, I have a somewhat unique ecosystem to contend with. I can't just poke at the hardware or rendering system at my leisure.

I'm using the winit cross-platform windowing library: https://docs.rs/winit/latest/winit/

Yes, this is a catch-22 but fortunately, most crossplatform graphics libraries will give you smooth frame rates for VSYNC ON if configured right --

Find any graphics API that does smooth framerate=Hz during VSYNC ON

Your litmus test is a tiny Hello World app (scrolling text, scrolling vertical line etc). Make sure it smoothscrolls at two highly-nondivisible refresh rates, e.g. when running at 50Hz and when running again at 60Hz (all 1080p desktop computer monitors support both 50Hz and 60Hz).

If frame rate (approximately) matches refresh rate, yay! If you can do that, you've found a way to monitor the GPU clock (in a very roundabout way)! Any API capable of perfect VSYNC ON fluidity with simple graphics can be commandeered as a de facto GPU clock monitor. VSYNC is a tick-tock generated by a GPU in fixed-Hz mode (non-VRR), so anything that aligns to it is a monitorable GPU clock source.

This holds even if you have to add a de-jitter filter and a dropped-VSYNC filter.
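The litmus test above can be automated. Here is a minimal sketch, assuming you log a timestamp (in milliseconds) right after each VSYNC ON frame presentation returns. The function names `measureFrameRate` and `passesLitmusTest` are made up for illustration and do not come from any library.

```javascript
// Hypothetical helper: estimate frames per second from present-exit timestamps (ms).
function measureFrameRate(timestampsMs) {
  if (timestampsMs.length < 2) return NaN;
  const intervals = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    intervals.push(timestampsMs[i] - timestampsMs[i - 1]);
  }
  // The median is robust against an occasional dropped frame (a 2x interval).
  intervals.sort((a, b) => a - b);
  const mid = Math.floor(intervals.length / 2);
  const median = intervals.length % 2
    ? intervals[mid]
    : (intervals[mid - 1] + intervals[mid]) / 2;
  return 1000 / median;
}

// Hypothetical helper: true if the measured frame rate is within `tolerance`
// (as a fraction) of the refresh rate the display is set to.
function passesLitmusTest(timestampsMs, expectedHz, tolerance = 0.01) {
  const fps = measureFrameRate(timestampsMs);
  return Math.abs(fps - expectedHz) / expectedHz <= tolerance;
}
```

Run it once with the display at 50Hz and once at 60Hz. If both pass, the presentation API is tracking the real refresh rate rather than a fixed timer.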

There's a hybrid approach available: The VSYNC flywheel algorithm

TL;DR: Keep using CPU as the clock source, but nudge it imperceptibly faster/slower based on how out-of-phase it is with the GPU clock source.

The emulator VSYNC is still timed by microsecond CPU counters (RDTSC style), but its phase is "nudgeable" by the real-world VSYNC, and over a few seconds it stabilizes into alignment with the real VSYNC. You keep your software-generated emulator VSYNC, while a separate thread monitors the hardware VSYNC and builds a 'reference' clock. You then gradually slew your software-generated VSYNC toward that reference. Only do imperceptible Hz-slewing (e.g. change emulator speed in 0.01% or 0.001% steps per frame) until emuHz and realHz are synced up. Over a period of 10 seconds, they sync up.

Plenty of precedents already.

- Some software (I think including Tom Harte's CLK) has a software flywheel behavior
- Also, some code released by ad8e on github also did a flywheel-style Hz-slewing.
- https://www.vsynctester.com JavaScript "HZ.js" (also used by https://www.testufo.com/refreshrate#digits=8 ...) has a Hz-listener that dejitters/averages and ignores dropped VSYNC events
- In my experiments, Tearline Jedi's crossplatform VSYNC listener did that sort of thing, using a snap-to-grid algorithm to ignore missed VSYNCs relative to the grid of successfully-listened VSYNC events.

One can configure the coupling strength with a setting (e.g. a changeable "0.1%" vs "0.01%" vs "0.001%" emulator speed step). You want it small enough to be imperceptible and to avoid audio distortion (as long as your audio algorithm compensates for slightly too fast / slightly too slow / minor speed changes), while still letting emuVSYNC and realVSYNC phase-align with each other "eventually".
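To make the flywheel idea concrete, here is a rough sketch of a single slewing step. It assumes you already have a dejittered real VSYNC timestamp and interval in milliseconds, and that the emulator can run its frame period at the measured real interval. `phaseError`, `flywheelPeriod` and the `coupling` parameter are illustrative names, not taken from CLK, ad8e's code or any other project.

```javascript
// Signed phase error of the emulator VSYNC relative to the nearest real VSYNC,
// wrapped into [-interval/2, +interval/2). Positive means the emulator is late.
function phaseError(emuVsyncMs, realVsyncMs, realIntervalMs) {
  const raw = (emuVsyncMs - realVsyncMs) % realIntervalMs;
  const wrapped = (raw + realIntervalMs) % realIntervalMs; // now 0 .. interval
  return wrapped >= realIntervalMs / 2 ? wrapped - realIntervalMs : wrapped;
}

// One flywheel step: returns the emulator frame period to use for the next frame.
// `coupling` is the maximum fractional speed change per frame (0.0001 = 0.01%).
function flywheelPeriod(realIntervalMs, errorMs, coupling = 0.0001) {
  const maxNudge = realIntervalMs * coupling;
  // Late emulator: shorten the period. Early emulator: lengthen it.
  // The nudge is clamped so the speed change stays within the coupling limit.
  const nudge = Math.max(-maxNudge, Math.min(maxNudge, errorMs));
  return realIntervalMs - nudge;
}
```

Each emulated frame, compute the phase error and schedule the next emulator VSYNC one `flywheelPeriod()` later. A smaller `coupling` is gentler on audio but takes longer to phase-align.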

VSYNC timestamp dejitterer algorithm function/method

The concept: you feed it frame-present-exit timestamps (same thread), or the output of an (imperfect, jittery, dropout-prone) VSYNC listener (different thread), and it gives you a very accurate refresh cycle timestamp + interval.

BONUS (futureproofing): Even though you're only using this VSYNC dejitterer to time your emulator refresh cycles more "in phase" with real refresh cycles, a side effect is that (in the future, if you implement lagless vsync algorithms) a refined VSYNC timestamp de-jitterer is accurate enough to guess the VSYNC location to within ~1% or ~0.1ms. That is accurate enough to guess the approximate current real-world raster scan line number of the modern GPU output, which is why I used this approach for cross-platform guessing of the current real-world raster (emulation of raster interrupts that works on modern APIs capable of VSYNC OFF tearing, for precision tearline-steering).

This allowed the same C# MonoGame program to generate the unscreenshottable crossplatform real-raster Kefrens Bars on high-framerate-capable GPUs on both PC and Mac when recompiled for either (OpenGL + VSYNC OFF). Sometimes the image was shifted down or up by about 1% of screen height due to incorrect raster scanline number guesses. Sometimes the vertical position drifted slowly up and down by 1% over a period of seconds or minutes, as it tried to flywheel-momentum the guessed timestamp to the real VSYNC. Or it had a permanent slight vertical offset (due to unknown/undetected VBI size). You have unfixable raster jitter of course, but the Kefrens Bars (as you see in the YouTube video of Kefrens Bars on GeForce) are definitely recognizable, even when they raster-jitter up/down, with raster scan line numbers guessed as a time offset between two VSYNC timestamps!

It's just plain old simple heuristics, but your math skill will have a direct effect on how well it dejitters / ignores dropped VSYNCs, especially when occasionally fed bad data (e.g. occasional very-out-of-phase timestamps that can pollute the VSYNC-averaging data). You can't completely ignore those outliers, and most hard-cutoff algorithms had side effects (including some out-of-control resonances when jitter slewed to the hard-cutoff threshold).

Plot these timestamps against real VSYNC on a scatter plot, and you'll usually see a gradient cloud with strong density near the real VSYNC. To get a "real VSYNC" timestamp reference for your debugging scatter plots, you can temporarily use D3DKMTGetScanLine() as a test-VSYNC reference for abuse-testing your VSYNC timestamp dejitterer/averager/estimator algorithm. You also need to know your monitor's horizontal scan rate (check the ModeLine or ToastyX or NVIDIA for it), so you can do the math needed to plot VSYNC timestamp jitter (guessed-vs-real error margins).

This made a lot of us realize we wanted a math weighting effect based on how far out-of-phase a new VSYNC timestamp is, relative to past timestamps (in their snap-to-one-dimensional-grid centers): timestamps are effectively ignored if they're practically between two refresh cycles, but still partially weighted in if they're a 10% outlier (10%-offset from real). Also remember that on some platforms you have more scatterplot density on one side of the line than the other, so you have to account for the timing-inaccuracy gradient not being symmetrical on both sides of the true-VSYNC timestamp reference line.
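One way to picture the soft weighting described above is a smooth falloff instead of a hard cutoff. The sketch below uses a Gaussian falloff, which is my own choice for illustration and not necessarily what any of the existing dejitterers do. `gridPhase`, `phaseWeight`, `updateReference` and the sigma/gain values are all hypothetical.

```javascript
// Distance of a timestamp from the nearest point on the VSYNC grid,
// as a signed fraction of one refresh interval in [-0.5, 0.5).
function gridPhase(timestampMs, referenceMs, intervalMs) {
  const cycles = (timestampMs - referenceMs) / intervalMs;
  return cycles - Math.round(cycles);
}

// Soft weight in (0, 1]: 1 on the grid, falling smoothly toward 0 between cycles.
// Separate early/late sigmas allow for an asymmetric scatter cloud.
function phaseWeight(phase, sigmaEarly = 0.1, sigmaLate = 0.1) {
  const sigma = phase < 0 ? sigmaEarly : sigmaLate;
  return Math.exp(-0.5 * (phase / sigma) ** 2);
}

// Blend a new observation into the reference timestamp, scaled by its weight.
// Timestamps halfway between two refresh cycles barely move the reference.
function updateReference(referenceMs, timestampMs, intervalMs, gain = 0.1) {
  const phase = gridPhase(timestampMs, referenceMs, intervalMs);
  return referenceMs + gain * phaseWeight(phase) * phase * intervalMs;
}
```

With the default sigma of 0.1, a timestamp that is 10% out of phase still counts at about 0.6 weight, while one sitting between two refresh cycles has practically no effect. Because there is no threshold, there is nothing for the jitter to "bounce" against.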

VSYNC timestamp-estimating functions (with dejittering and dropped-VSYNC ignoring) can be created in just 100-200 lines, but math needs to be refined correctly so it doesn't have bad resonance-frequency effects (bounces aggressively).

For some of the authors of the VSYNC timestamp-estimating functions, 90% of the work was debugging destructive mathematical effects by intentionally skipping VSYNC events randomly + adding intentional random jitter, to see how high-quality the resulting VSYNC timestamps were (averages, volatility, standard deviation, etc). So while these are deceptively small 100-200 liners, they are sometimes hard to mathematically optimize for those who haven't graduated university.

Since there are multiple precedents, I could try to dig up an open source (MIT, Apache) software Hz-dejitterer / Hz-slewer algorithm, which can give you a flywheel-style coupling between emulator Hz and real Hz. That lets you keep using a CPU-based VSYNC timer, which is simply nudged slightly faster/slower depending on how out-of-phase it is with the real VSYNC.
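The abuse-testing approach above (random dropped VSYNCs + random jitter) is easy to reproduce with synthetic data. Here is a minimal sketch of such a harness. Everything in it is hypothetical scaffolding: the seeded PRNG is a small mulberry32-style generator so runs are repeatable, and the jitter is one-sided (late only), which is an assumption about how real listeners tend to misbehave.

```javascript
// Small seeded PRNG (mulberry32 style) so abuse-test runs are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generate VSYNC timestamps (ms) with uniform late-only jitter and random dropouts.
function syntheticVsync(hz, count, jitterMs, dropProbability, rand) {
  const interval = 1000 / hz;
  const out = [];
  for (let i = 0; i < count; i++) {
    if (rand() < dropProbability) continue; // simulate a missed VSYNC event
    out.push(i * interval + rand() * jitterMs);
  }
  return out;
}

// Error of each timestamp against the true VSYNC grid: mean, std deviation, max.
function jitterStats(timestampsMs, hz) {
  const interval = 1000 / hz;
  const errors = timestampsMs.map((t) => t - Math.round(t / interval) * interval);
  const mean = errors.reduce((a, b) => a + b, 0) / errors.length;
  const variance = errors.reduce((a, e) => a + (e - mean) ** 2, 0) / errors.length;
  const maxAbs = errors.reduce((a, e) => Math.max(a, Math.abs(e)), 0);
  return { meanMs: mean, stdDevMs: Math.sqrt(variance), maxAbsMs: maxAbs };
}
```

Feed the synthetic timestamps into your estimator, then run `jitterStats()` on its output to see whether the filtered timestamps are actually better than the raw ones.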

Last edited by mdrejhon on 2023-07-11, 22:53. Edited 4 times in total.

Founder of www.blurbusters.com and www.testufo.com
- Research Portal
- Beam Racing Modern GPUs
- Lagless VSYNC for Emulators

Reply 18 of 20, by mdrejhon

Posted on 2023-07-11, 22:13

Candidate Source Code For De-Jittered VSYNC Estimators

To save y'all time hunting for existing VSYNC dejitterer code, here are existing works of art. Not all are flywheel-style algorithms, and they approach vsync dejittering differently:

Open source VSYNC dejitterer used by TestUFO and VSyncTester
https://github.com/blurbusters/RefreshRateCalculator
LICENSE: Apache-2.0 (relicensed as of July 12, 2023!)
*** MOST ACCURATE: Dejitters and filters dropped vsyncs ***
---
Tom Harte CLK's cross platform VSYNC predictor/averager
github.com/TomHarte/CLK/blob/3e09afbb59bff64910b8fbcb83ef5e54d5604d94/ClockReceiver/VSyncPredictor.hpp#L24
LICENSE: MIT
---
ad8e's algorithm (compatible with both Windows and Linux), used to demo raster-accurate tearline steering
https://github.com/ad8e/vsync_blurbusters/blo … /main/vsync.cpp
LICENSE: BSD Zero Clause License (effectively public domain)

EDIT: Today, we released an open source VSYNC dejitterer -- https://github.com/blurbusters/RefreshRateCalculator
It's now the version I recommend for cross-platform emuHz-vs-realHz syncing (beamraced or not).
Having virtually no dependencies, it is easily ported to any language.

Old text:
These will have varying accuracies depending on how powerful your GPU is, and if you have power management enabled/disabled.

Ideally you would need neither, and you can simply time-align your emulator execution to the presentation-exit events.

But that may be untenable if you need a CPU clock for low jitter (to avoid audio problems), which means you need to slew very slowly and gently to prevent audio dropouts. That's why "flywheel coupling" algorithms are good alternatives: you clock the emu refresh to the CPU clock, but keep it loosely and gently nudgeable by a GPU clock source.

I inspected the code. Tom Harte's CLK looks (by a wide margin) the easiest to recycle/port, while ad8e's includes templates for a cross platform raster-poll guess (that works on Linux and Windows). I think the best of both could be combined later, but I'd advise you to start with Tom Harte's code, since it's very simple/rudimentary. It does not seem to ignore missed vsyncs; it uses a very simplistic average over the last ~128 vsyncs. But it does have a frame duration calculation, which can be useful for input delay algorithms (vsync phase offsetting).

If you want accuracy, the JavaScript version is the most accurate, because the jitter-duress of JavaScript forced it to do more to get accurate estimated VSYNC timestamps. If you run this algorithm in a lower-level language like C, C++ or Rust, you will get the most accurate VSYNC timestamps. ~~But it's in JavaScript + commercial and you will have to reach out to get a license.~~ - Relicensed as Apache 2.0 in collaboration with Blur Busters
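For a feel of what "very simplistic averaging over the last ~128 vsyncs" means, here is a sketch of that general technique in JavaScript. To be clear, this is not CLK's actual code or API, just my own illustration of a plain windowed average. The class name `MovingAverageVsync` and its methods are invented for this example.

```javascript
// Windowed average of VSYNC-to-VSYNC intervals. No jitter filtering and no
// missed-vsync rejection: a dropped VSYNC shows up as one 2x interval in the window.
class MovingAverageVsync {
  constructor(windowSize = 128) {
    this.windowSize = windowSize;
    this.intervals = [];
    this.lastTimestamp = null;
  }

  // Call once per observed VSYNC with a timestamp in milliseconds.
  addVsync(timestampMs) {
    if (this.lastTimestamp !== null) {
      this.intervals.push(timestampMs - this.lastTimestamp);
      if (this.intervals.length > this.windowSize) this.intervals.shift();
    }
    this.lastTimestamp = timestampMs;
  }

  // Average frame duration over the window, or NaN before two VSYNCs were seen.
  frameDurationMs() {
    if (this.intervals.length === 0) return NaN;
    const sum = this.intervals.reduce((a, b) => a + b, 0);
    return sum / this.intervals.length;
  }

  // Predicted time of the next VSYNC, extrapolated from the last observed one.
  nextVsyncMs() {
    return this.lastTimestamp + this.frameDurationMs();
  }
}
```

This is a fine starting point, but one missed VSYNC skews the average until it leaves the window, which is the gap the filtering versions try to close.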

EDIT: The Apache 2.0 version released by Blur Busters (in collab with Jerry Jongerius of vsynctester.com) is now the recommended version.

Last edited by mdrejhon on 2023-07-13, 03:06. Edited 7 times in total.

Reply 19 of 20, by mdrejhon

Posted on 2023-07-12, 22:14

Update! I got one of the modules relicensed to Apache 2.0.

Source code now released in this thread:
CODERS: New cross-platform VSYNC timestamp estimator useful to emulator authors (sync emuHz to realHz)

Crosspost here, hidden in quotes:

Excellent news: Relicensing vsync estimator to open source

- One "RefreshRateCalculator()" class object, self contained.
- About 200 lines of code (+ ~100 lines of comments)
- No external dependencies
- Easy to port to almost any language on almost any platform

Purposes for emulators

- Non-beamraced:
- .....This can simply be used for flywheeling emuHz slowly towards realHz, and for flywheeling raster estimates based on time-offsets between vsyncs.
- .....This can be used for VSYNC phase offsets (input delay algorithms between emuHz and realHz).
- Beamraced:
- .....It is also accurate enough for beam racing applications, such as cross platform Lagless VSYNC.

It's up -- https://github.com/blurbusters/RefreshRateCalculator

________________

RefreshRateCalculator CLASS

PURPOSE: Accurate cross-platform display refresh rate estimator / dejittered VSYNC timestamp estimator.

Input: Series of frame timestamps during framerate=Hz (Jittery/lossy)

Output: Accurate filtered and dejittered floating-point Hz estimate & refresh cycle timestamps.

Algorithm: Combination of frame counting, jitter filtering, ignoring missed frames, and averaging.

This is also a way to measure a GPU clock source indirectly, since the GPU generates the refresh rate during fixed Hz.

IMPORTANT VRR NOTE: This algorithm does not generate a GPU clock source when running this on a variable refresh rate display (e.g. GSYNC/FreeSync), but can still measure the foreground software application's fixed-framerate operation during windowed-VRR-enabled operation, such as desktop compositor (e.g. DWM). This can allow a background application to match the frame rate of the desktop compositor or foreground application (e.g. 60fps capped app on VRR display). This algorithm currently degrades severely during varying-framerate operation on a VRR display.

LICENSE - Apache-2.0

Copyright 2014-2023 by Jerry Jongerius of DuckWare (https://www.duckware.com) - original code and algorithm
Copyright 2017-2023 by Mark Rejhon of Blur Busters / TestUFO (https://www.testufo.com) - refactoring and improvements

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

*** First publicly released July 2023 under mutual agreement
*** between Rejhon Technologies Inc. (Blur Busters) and Jongerius LLC (DuckWare)
*** PLEASE DO NOT DELETE THIS COPYRIGHT NOTICE

JAVASCRIPT VSYNC API / REFRESH CYCLE TIME STAMPS

Info: https://www.vsynctester.com/howtocomputevsync.html
Used by both https://www.vsynctester.com and https://www.testufo.com/refreshrate
requestAnimationFrame() generally tries to sync to VSYNC, so that is the source of VSYNC in web browsers for deriving refresh cycle timestamps. The longer this algorithm runs, the more accurate the refresh rate estimate becomes.
JavaScript Compatibility: ES6 / ECMAScript 2015 (Chrome, Firefox, Edge, Safari, Opera)

CODE PORTING

This algorithm is very portable to most languages, on most platforms, via high level and low level graphics frameworks.
A generic VSYNC timestamp is usually taken immediately after exit of almost any frame presentation API during VSYNC ON framerate=Hz
APIs for timestamps include RDTSC / QueryPerformanceCounter() / std::chrono::high_resolution_clock::now()
APIs for low level frame presentation include DirectX Present(), OpenGL glFinish(), Vulkan vkQueuePresentKHR()
APIs for high level frame presentation include XBox/MonoGame Draw(), Unity3D Update(), etc.
APIs for zero-graphics timestamps (e.g. independent/separate thread) include Windows D3DKMTWaitForVerticalBlankEvent()
While not normally used for beam racing, this algorithm is accurate enough for cross-platform raster estimates for beam racing applications, based on a time offset between refresh cycle timestamps! (~1% error vs vertical resolution is possible on modern AMD/NVIDIA GPUs).

SIMPLE CODE EXAMPLE

var hertz = new RefreshRateCalculator();

[...]

// Call this inside your full frame rate VSYNC ON frame presentation or your VSYNC listener.
// It will automatically filter out the jitter and dropped frames.
// For JavaScript, the most accurate timestamp occurs if called at the very top of your requestAnimationFrame() callback.
hertz.countCycle(performance.now());

[...]

// This data becomes accurate after a few seconds
var accurateRefreshRate = hertz.getCurrentFrequency();
var accurateRefreshCycleTimestamp = hertz.getFilteredCycleTimestamp();

// See code for more good helper functions

OPTIONAL: If you use this for cross platform "lagless vsync"

For cross platform beam racing, you'd do your code-ported version of this JavaScript (!) code:
Remember, you need VSYNC OFF while also concurrently being able to listen to the real displays' VSYNC.
// Run this after a 10-second refresh cycle counting initialization at startup
// (but keep counting beyond that, to incrementally improve accuracy enough for beam racing apps)
var accurateRefreshRate = hertz.getCurrentFrequency();
// Refresh interval in milliseconds (assumes timestamps are in ms, as fed in via performance.now())
var accurateRefreshInterval = 1000.0 / accurateRefreshRate;
var accurateRefreshCycleTimestamp = hertz.getFilteredCycleTimestamp();

// Vertical screen resolution
var height = screen.height;

// Common VBI size for maximum raster accuracy, adjust as needed.  VGA 480p has 45, and HDTV 1080p has 45
// Or optionally use #ifdef type for plat-specific APIs like Linux modelines or Windows QueryDisplayConfig()
var blanking = 45;

var verticaltotal = height + blanking;
// Time since the start of the current refresh cycle (the modulo guards against a stale timestamp)
var elapsed = (performance.now() - accurateRefreshCycleTimestamp) % accurateRefreshInterval;
// Fraction of the refresh cycle elapsed, scaled to scan lines
var raster = Math.round(verticaltotal * (elapsed / accurateRefreshInterval));

// OPTIONAL: If your VSYNC timestamp is end-of-VBI rather than start-of-VBI, then compensate
raster += blanking;
if (raster >= verticaltotal) raster -= verticaltotal;
This will freaking actually (uselessly) work in a web browser: I got roughly ~5-10% raster scan line position accuracy guesstimated in a WEB BROWSER running on an NVIDIA GPU, fer crissakes, e.g. a raster guesstimate within 50-75 pixels vertically on a 1080p landscape desktop display (NVIDIA-type GPU, i7-type CPU).

It won't be useful for rasterdemos in a web browser, since browsers are permanently VSYNC ON and do not generate tearlines (there's no way to listen to VSYNC ON tick-tocks while running in VSYNC OFF mode). But it will work with high-framerate VSYNC OFF standalone software for precise tearline steering. Remember to Flush() before timestamping for more accurate raster guesstimates.

Don't expect good rasterdemo accuracy on Intel GPUs (but it's accurate enough for emulator frameslice beamracing).

Now, if you do want to raster-estimate a mobile LCD or OLED display, make sure you rotate it to its landscape default orientation (top-to-bottom scanout), and verify with a high speed camera on https://www.testufo.com/scanout, creating videos similar to those at https://www.blurbusters.com/scanout. Some mobile displays scan sideways, and you can't detect scanout direction in JavaScript. Boo. However, GPUs scan top-to-bottom to the GPU output, so landscape-monitor-mode will always be displaying a signal that's being scanned top-to-bottom, so you can certainly cross-platform beam race that (more or less).

If ported to C#, you can get sub-1% accuracy, much like Tearline Jedi.

If ported to C / C++ / Rust and using lower level VSYNC listeners, you can sometimes (on high performance platforms) get even better accuracy, down to as little as 1 scanline on certain less-hyperpipelined GPUs, although likely with a fixed offset that needs to be compensated for. Emulator frameslice beamracing only needs a much looser error margin: one frameslice worth of jitter (e.g. 10 frameslices per refresh cycle at 60Hz = 1/600sec of beamrace jitter is allowed before artifacts appear).

Remember, this cross platform module need not necessarily be used for beamracing:
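The jitter budget arithmetic above is simple enough to put in code. This is a small sketch with made-up helper names (`frameSliceBudgetMs`, `sliceForRaster`). It assumes frameslices divide the whole vertical total (including blanking) evenly, which is a simplification.

```javascript
// Allowed beam race jitter in milliseconds: one frameslice worth of scanout time.
function frameSliceBudgetMs(refreshHz, slicesPerRefresh) {
  return 1000 / (refreshHz * slicesPerRefresh);
}

// Which frameslice (0-based) the real raster is currently scanning out,
// given an estimated raster scan line number and the vertical total.
function sliceForRaster(raster, verticalTotal, slicesPerRefresh) {
  const slice = Math.floor((raster / verticalTotal) * slicesPerRefresh);
  // Clamp so a raster guess slightly outside the valid range stays in bounds.
  return Math.min(slicesPerRefresh - 1, Math.max(0, slice));
}
```

For 10 frameslices at 60Hz the budget is 1000/600, roughly 1.67ms. As long as the raster guess is off by less than that, the emulator's next frameslice can still land ahead of the real raster.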

- Non-beamraced:
- .....This can simply be used for crossplatform nudging/flywheeling emuHz (CPU clocked) slowly towards realHz (GPU clocked) to prevent latency phase slewing effects
- .....This can be used for crossplatform VSYNC phase offsets (input delay algorithms).
- Beamraced:
- .....It is also accurate enough for beam racing applications, such as cross platform Lagless VSYNC.
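As an example of the non-beamraced "VSYNC phase offset" use, here is a rough sketch of an input delay calculation: start emulating each frame as late as safely possible before the next real VSYNC, so input is polled closer to display time. It assumes you have a dejittered VSYNC timestamp and interval in milliseconds (e.g. from a VSYNC estimator like the one above), and that the render budget is shorter than one refresh interval. `nextVsyncMs`, `frameStartMs` and `renderBudgetMs` are hypothetical names.

```javascript
// Time of the next real VSYNC strictly after `nowMs`, extrapolated from a
// known (dejittered) VSYNC timestamp and refresh interval.
function nextVsyncMs(nowMs, vsyncTimestampMs, intervalMs) {
  const cyclesElapsed = Math.floor((nowMs - vsyncTimestampMs) / intervalMs);
  return vsyncTimestampMs + (cyclesElapsed + 1) * intervalMs;
}

// When to poll input and begin emulating the frame: `renderBudgetMs` before
// the next VSYNC. If that moment already passed, target the following refresh.
function frameStartMs(nowMs, vsyncTimestampMs, intervalMs, renderBudgetMs) {
  let start = nextVsyncMs(nowMs, vsyncTimestampMs, intervalMs) - renderBudgetMs;
  if (start < nowMs) start += intervalMs;
  return start;
}
```

A smaller render budget means lower input lag but less safety margin. Tune it against how long your emulator actually takes to produce a frame.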

Last edited by mdrejhon on 2023-07-13, 02:24. Edited 7 times in total.
