VOGONS


Reply 60 of 79, by vutt

Marco Pistella wrote on 2026-05-02, 14:04:
For comparison purposes it would be very useful to repeat
Command 4 on 800×600 and 1280×1024 with the original
unmodified BIOS, if you still have access to it. That would confirm whether the jitter is BIOS-induced or
hardware-specific.

I have lost the original BIOS and am not really looking forward to experimenting with downloaded ones.
However, I have a similar PowerColor Radeon 9600 Pro (RV350) with its original BIOS, so I put it into the same PC. Please note that I have a CRT monitor - Samsung SyncMaster 765mb - in case it plays any role.

Reply 61 of 79, by Marco Pistella

Falcosoft wrote on 2026-05-03, 08:01:

@Marco Pistella: Do 64-bit reads/writes use FPU or MMX registers?
BTW, I very much like your test program 😀 I think this is the only available DOS tool that can measure such high bandwidths reliably.

Thank you — and glad you find it useful!

To answer your question: 64-bit reads and writes use FPU
registers via x87 (fild/fistp). MMX is intentionally not
used, to maintain compatibility with systems without MMX
support. The finit instruction can introduce significant
overhead on some CPUs, which may explain the 64-bit
anomaly you observed.

To investigate further: use the cursor keys to navigate
to the overhead screens in the benchmark results. These
show the routine overhead in PIT cycles for each access
width and allow a more accurate interpretation of the
raw numbers — particularly for the 64-bit case where
finit overhead may be the dominant factor.
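
As a rough illustration of why the overhead screens matter, the sketch below converts a PIT tick count into a transfer rate with and without the routine overhead removed. It assumes the nominal 1,193,182 Hz PIT input clock. The function name, the byte count and the tick values are made up for the example and are not taken from X-VESA.

```python
PIT_HZ = 1_193_182  # nominal 8253/8254 input clock in Hz


def bandwidth_mb_s(bytes_moved, raw_ticks, overhead_ticks=0):
    """Return MB/s (1 MB = 1,000,000 bytes) after removing routine overhead.

    overhead_ticks is the per-routine overhead in PIT cycles, as read from
    an overhead screen; leaving it at 0 gives the uncorrected figure.
    """
    net_ticks = raw_ticks - overhead_ticks
    if net_ticks <= 0:
        raise ValueError("overhead is not smaller than the measurement")
    seconds = net_ticks / PIT_HZ
    return bytes_moved / seconds / 1_000_000


# Hypothetical numbers: 4096 bytes moved in 60 ticks, 20 of them overhead.
uncorrected = bandwidth_mb_s(4096, 60)
corrected = bandwidth_mb_s(4096, 60, overhead_ticks=20)
print(f"uncorrected: {uncorrected:.1f} MB/s, corrected: {corrected:.1f} MB/s")
```

The shorter the measured interval, the larger the share taken by a fixed overhead, which is why the correction matters most at the highest transfer rates.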

Reply 62 of 79, by Marco Pistella

vutt wrote on 2026-05-03, 09:13:

I have lost original bios. Not ... [CUT]

Thank you for the comparison with the original BIOS —
this is exactly the kind of controlled test that helps
isolate the cause of the anomalies seen earlier.

The pattern is irregular and the values are still high for a card
of this generation — which suggests the issue may
not be entirely BIOS-related.

To investigate further, could you press D on the
timing screen for each mode to display the horizontal
frequency sample distribution, and post the screenshots?
The shape of the distribution will help understand
whether the jitter has a specific pattern or is random
noise.

Reply 63 of 79, by vutt

Marco Pistella wrote on 2026-05-03, 10:16:

To investigate further, could you press D on the timing screen for each mode to display the horizontal
frequency sample distribution, and post the screenshots?

As you can see it's different every time.

Reply 64 of 79, by Marco Pistella

vutt wrote on 2026-05-03, 12:52:

... [CUT]

Thank you for the distribution screenshots.

The pattern is interesting: almost all samples cluster
at the expected PIT value, but a few outliers appear
significantly above — some exceeding 9,000 and one
above 12,000 PIT cycles. This is not random noise —
it is a discrete, sporadic event.

Could you press + to zoom in on the distribution chart
around the outlier area, and post a screenshot?
Specifically I would like to see whether the outliers
cluster around a specific value (for example a fixed
multiple of the main peak) or are scattered randomly.
This will help distinguish between a spurious bit 3
event on port 3DAh — where the measurement accidentally
captures two retrace intervals instead of one — and a
genuine timing instability in the video signal itself.
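
The check described above can be sketched as a small script run on values read off the distribution screen. This is only an illustration: the main-peak value of 3,000 PIT cycles and the 5% tolerance are invented for the example, and `nearest_multiple` is a hypothetical helper, not part of X-VESA.

```python
def nearest_multiple(sample, main_peak, tolerance=0.05):
    """Return k if sample lies within tolerance of k * main_peak (k >= 2).

    Returns None when the sample is not an outlier (k < 2) or does not sit
    close to an integer multiple of the main peak.
    """
    k = round(sample / main_peak)
    if k < 2:
        return None
    deviation = abs(sample - k * main_peak) / (k * main_peak)
    return k if deviation <= tolerance else None


MAIN_PEAK = 3000  # hypothetical main peak, in PIT cycles
for outlier in (9000, 12050, 10400):
    k = nearest_multiple(outlier, MAIN_PEAK)
    label = f"about {k}x the main peak" if k else "not near a multiple"
    print(outlier, label)
```

Outliers that land on integer multiples would point towards missed or merged retrace intervals, while scattered values would point towards genuine instability in the signal.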

Reply 65 of 79, by Falcosoft

Marco Pistella wrote on 2026-05-03, 10:07:
.... To answer your question: 64-bit reads and writes use FPU registers via x87 (fild/fistp). MMX is intentionally not used, t […]

Hi,
Here are the overhead results in PIT cycles. It seems they confirm your theory somewhat:

The attachment FILE0002.PNG is no longer available

Website, Youtube
Falcosoft Soundfont Midi Player + Munt VSTi + BassMidi VSTi
VST Midi Driver Midi Mapper
x86 microarchitecture benchmark (MandelX)

Reply 66 of 79, by Marco Pistella

Falcosoft wrote on 2026-05-03, 16:28:

Hi,
Here...[CUT]

Thank you for the overhead data.

The FPU overhead values are within normal range — 19-23
PIT cycles for 64-bit and 128-bit reads, 19-20 for
writes. Nothing anomalous here that would explain the
64-bit performance deviation.

The cause remains unclear. With FPU overhead ruled out,
the anomaly may reflect a low-level interaction between
x87 transfer instructions and the memory subsystem on
this specific CPU/chipset combination — something that
does not manifest the same way on other architectures
tested so far. Possibly only an Intel engineer with
access to internal microarchitecture documentation
could give a definitive answer.

This is precisely the kind of unexpected discovery that
makes real hardware testing valuable.

Reply 67 of 79, by Marco Pistella

For reference, here is the W_064L routine (64-bit linear
VRAM write via x87 FPU) that produces the anomalous
result on Ivy Bridge. The routine uses fst unrolled 8x
with fninit before the loop. If anyone has insight into
why 64-bit writes behave unexpectedly on this specific
microarchitecture, I would be very interested.

;#############################################################################

W_064L:

;#############################################################################

        test    ds: byte ptr [status_flag_0],FLAG_FPU_PRESENT
        jne     enter_fpu_w_64l         ; flag set: FPU present, run the test
        stc                             ; no FPU: return with carry set
        ret
enter_fpu_w_64l:
        pusha
        mov     cx,2h
        call    Init_Align_Data16       ; helper not shown in this post
        mov     di,4h                   ; 4 passes
loop_w_064l:
        mov     esi,ds: dword ptr [current_linear_vram_address]
        mov     cx,4096d                ; 4096 iterations x 64 bytes = 256 KB per pass
        fninit
        fld     ds: qword ptr [bx]      ; load the 64-bit source value at [bx] into ST(0)
        xor     ax,ax
        mov     ds,ax                   ; DS = 0
loopw_float_64l:
        fst     ds: qword ptr [esi]     ; fst does not pop, so ST(0) is reused
        fst     ds: qword ptr [esi + 8h]
        fst     ds: qword ptr [esi + 10h]
        fst     ds: qword ptr [esi + 18h]
        fst     ds: qword ptr [esi + 20h]
        fst     ds: qword ptr [esi + 28h]
        fst     ds: qword ptr [esi + 30h]
        fst     ds: qword ptr [esi + 38h]
        add     esi,40h                 ; advance 64 bytes
        dec     cx
        jne     loopw_float_64l
        push    gs
        pop     ds                      ; DS = GS
        mov     cx,2h
        call    Inc_Aligned_Data16      ; helper not shown in this post
        dec     di
        jne     loop_w_064l
        add     ds: dword ptr [kb_proc],1024d   ; 4 x 256 KB = 1024 KB written
        clc                             ; success: return with carry clear
        popa
        ret
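
As a sanity check on the loop counts in the routine above, the arithmetic below multiplies out the constants visible in the listing. It only restates what the code shows. What the two helper calls do is not visible in the thread, so nothing is assumed about them.

```python
PASSES = 4            # mov di,4h
ITERATIONS = 4096     # mov cx,4096d
STORES_PER_ITER = 8   # eight unrolled fst instructions
BYTES_PER_STORE = 8   # each fst writes one qword

BYTES_PER_PASS = ITERATIONS * STORES_PER_ITER * BYTES_PER_STORE
BYTES_PER_CALL = PASSES * BYTES_PER_PASS
KB_PER_CALL = BYTES_PER_CALL // 1024

print(BYTES_PER_PASS, BYTES_PER_CALL, KB_PER_CALL)
```

The total agrees with the `add ... [kb_proc],1024d` bookkeeping at the end of the routine. Each iteration also writes exactly 64 contiguous bytes (`add esi,40h`).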

Reply 68 of 79, by Falcosoft

Marco Pistella wrote on 2026-05-03, 17:23:
Thank you for the overhead data. […]

To test whether the results are Ivy Bridge-specific, I enabled the integrated GFX on my Haswell (i7-4770K 3500 MHz) and also enabled write-combining. It seems this anomaly is not Ivy Bridge-specific and affects Haswell CPUs even more.
It seems that on Haswell even SSE moves are somewhat affected:

The attachment FILE0000.PNG is no longer available
The attachment FILE0001.PNG is no longer available

The interesting thing is that this 64-bit FPU write anomaly only affects the write-combining enabled state. If write-combining is disabled the scaling is the expected one (but of course the absolute performance is much worse):

The attachment FILE0002.PNG is no longer available

@Edit:
I have a theory: since no bit depth larger than 32-bit is supported in graphics modes, the write-combining logic is probably better optimized for 8-, 16- and 32-bit writes. Writes larger than 32 bits may therefore interfere somewhat with the write-combining logic, which results in diminishing returns.

Reply 69 of 79, by Marco Pistella

Falcosoft wrote on 2026-05-03, 18:38:

I have a theory: Since there are ... [CUT]

That is a reasonable theory. The write-combining logic
in the memory controller is indeed optimized around the
native transaction sizes of the graphics subsystem —
32-bit being the natural boundary for legacy VGA and
VESA framebuffer access. Writes larger than 32-bit via
x87 or SSE may not map cleanly onto the write-combining
buffer granularity, causing partial fills or early
flushes that reduce the effectiveness of coalescing.

It would be consistent with what the data shows:
write-combining provides the expected benefit up to
32-bit, then the return diminishes at 64-bit and
partially recovers at 128-bit and 256-bit where the
transfer size aligns better with cache line boundaries.

To help confirm or rule out this theory, a few things
would be useful: the results page without overhead
subtraction to verify the raw values at these extreme
transfer rates, and multiple runs of the benchmark to
check whether the 64-bit write results are consistent
or show significant variation between runs. If the
values fluctuate widely it would suggest measurement
instability rather than a genuine microarchitecture
effect.

Confirming definitively would require access to Intel
memory controller documentation at a level of detail
that is not publicly available — possibly only an
Intel engineer could give a final answer.

Reply 70 of 79, by Falcosoft

Marco Pistella wrote on 2026-05-03, 19:18:
That is a reasonable theory. The write-combining logic in the memory controller is indeed optimized around the native transact […]

The results are very consistent: across 10 runs the difference is about 1%!
The results without overhead are somewhat higher but the proportions are the same:

The attachment FILE0000.PNG is no longer available

Reply 71 of 79, by Marco Pistella

Falcosoft wrote on 2026-05-03, 19:27:

The results are very consistent: fr ... [CUT]

Thank you for the consistency data — 1% variation across
10 runs rules out measurement instability as a cause.

The results without overhead show the same proportions,
which also rules out FPU overhead as an explanation.

At this point the most honest conclusion is that this
is a previously undocumented behavior: write-combining
efficiency degrades specifically at 64-bit granularity
on Intel integrated graphics, recovers partially at
128-bit, and reaches its peak at 256-bit. The effect
is reproducible and consistent across Ivy Bridge and
Haswell microarchitectures.

Whether this is a deliberate design choice in the
memory controller, a side effect of the write-combining
buffer granularity, or something else entirely is beyond
what can be determined from these measurements alone.
Further data from other Intel platforms would be welcome,
and if anyone has deeper knowledge of Intel memory
controller internals, I would be very interested to hear
their interpretation.

Reply 72 of 79, by zyzzle

Marco, do you have any idea why I can enable write combining on all Intel integrated graphics from Broadwell and below (i.e., 1st-5th gen), but my laptop systems always crash on 6th gen and above Intel integrated graphics when attempting to enable write combining? Is it some kind of BIOS bug and/or castration or limitation on later Intel Core integrated graphics? I get a hard system freeze when attempting to enable write combining with either Falcosoft's or RayeR's MTRRLFBE tools or even FastVid from 1996, and always must reboot on Skylake and above.

I'll post some results from i7-5600 Broadwell Intel integrated graphics and i5-8250 Kaby Lake integrated graphics to show this. I also notice the 64-bit slump, the recovery at 128-bit and the maximum at 256-bit on Broadwell Intel integrated graphics using write combining and DDR3 dual-channel system RAM as reserved graphics memory. On the Kaby Lake system, write combining wasn't possible, but I of course got a linear progression on uncombined writes, though much slower than on the write-combined Broadwell system (about 1/40th the speed).

X-VESA is by far the most useful and most comprehensive DOS VESA program I've ever seen. Thanks for releasing it and for all of your feedback.

Reply 73 of 79, by Marco Pistella

zyzzle wrote on 2026-05-04, 07:36:

Marco, do you have any idea ... [CUT]

Thank you for the kind words.

Regarding the Skylake+ write-combining crash: I suspect
it may be related to a problem I encountered with my
AVX/AVX-512 routines in X-VESA. On one test system the
AVX routines were crashing after transferring between
170 and 280 KB of VRAM — meaning they worked, at least
briefly. The crashes disappeared completely when I
replaced the USB keyboard with a PS/2 one. The discovery
was entirely accidental.

My hypothesis is that USB legacy mode SMI handling
corrupts some CPU state during the critical section —
but this is a hypothesis, not a confirmed diagnosis.
It also seems to depend on the BIOS CSM/UEFI
implementation rather than the chipset itself.

If you can, try connecting a PS/2 keyboard and if
possible disabling USB legacy support in the BIOS
before attempting to enable write-combining. I have
no expectation that it will solve the problem, but
the result — whether it changes anything or not —
would help determine whether the two effects are
related or completely independent.

Reply 74 of 79, by zyzzle

Marco Pistella wrote on 2026-05-04, 08:56:
Regarding the Skylake+ write-combining crash: I suspect it may be related to a problem I encountered with my AVX/AVX-512 routi […]

Interesting theory. Your behavior mirrors mine. Sometimes MTRRs seem to be enabled for a very brief time (i.e., the program prints that LFB write combining is enabled just before a complete system freeze). It may be BIOS/CSM specific, but I've tested about 15 laptops of different brands with Intel integrated GPUs and they all fail to enable MTRRs on Skylake and above, but succeed on Broadwell. As to the UEFI implementation, I have tried a mix of UEFI implementations on different brands of laptops for Broadwell and below, and they all work fine. Even the same brand of laptop with a Skylake vs. a Broadwell chip (and the same UEFI) freezes on the 6th gen chip but enables MTRRs on the 5th gen. The only thing changed is the Intel onboard vBIOS.

Unfortunately I can't attach a PS/2 keyboard to any of these systems, as they're all laptops using their built-in keyboards. None has a PS/2 port. I can attach a USB keyboard, which will disable the built-in keyboard, however; I'll see if that changes the behavior. I suspect some sort of "legacy mode" implementation bug. Disabling USB Legacy mode, if possible, will disable my keyboard input, however.

It would help greatly if anyone could test Skylake and above Intel Core systems with Intel integrated graphics in bare-metal DOS and report whether write combining can be successfully enabled. Please report your system details (CPU and exact brand/model of the laptop or system used).

Reply 75 of 79, by Falcosoft

zyzzle wrote on 2026-05-04, 22:50:
Interesting theory. Your behavior mirrors mine. Sometimes MTRRs seem to be enabled for a very brief time (ie, program prints LFB […]

Hi,
Just to check whether write-combining mode itself or rather the linear frame buffer is the problem on your systems try this little tool:

The attachment VGAMTRR.zip is no longer available

It only checks and sets the write-combining mode of the real mode VGA color graphics frame buffer (A0000 - AFFFF).
It does this step by step so if it fails we can see at least at what step it fails. If it runs successfully then you can see that the default 'Window - via Int 10h' writes speed up significantly in X-VESA. This means that it can speed up any, even high resolution VESA 1.2 or VESA 2+ modes when linear frame buffer is not used.
Unfortunately I have no Haswell+ system to test this, but here are my Ivy Bridge results:

VGA frame buffer (A0000 - AFFFF) Write combining disabled:

The attachment FILE0000.PNG is no longer available

VGA frame buffer (A0000 - AFFFF) Write combining enabled:

The attachment FILE0001.PNG is no longer available
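
For readers wondering what setting write-combining on A0000 - AFFFF involves at the register level, here is a minimal sketch of the value manipulation only. It relies on the architectural layout of the fixed-range MTRR IA32_MTRR_FIX16K_A0000 (MSR 259h), which covers A0000h-BFFFFh as eight 16 KB slots, one type byte each, with 01h meaning write-combining. How VGAMTRR actually does it is not shown in this thread, and the function name is hypothetical. The real tool would also have to read and write the MSR in ring 0 and confirm that fixed-range MTRRs and the WC type are supported and enabled.

```python
MSR_MTRR_FIX16K_A0000 = 0x259  # fixed-range MTRR for A0000h-BFFFFh
MTRR_TYPE_WC = 0x01            # write-combining memory type encoding


def with_wc_on_a0000_affff(fix16k_value):
    """Return the 64-bit MSR value with A0000h-AFFFFh marked write-combining.

    Bytes 0-3 describe A0000h, A4000h, A8000h and AC000h (16 KB each);
    bytes 4-7 describe B0000h-BFFFFh and are left untouched.
    """
    low_mask = 0xFFFF_FFFF
    wc_bytes = MTRR_TYPE_WC * 0x0101_0101  # the WC type repeated in 4 bytes
    keep_high = fix16k_value & ~low_mask & 0xFFFF_FFFF_FFFF_FFFF
    return keep_high | wc_bytes


# Example with an all-uncacheable (type 00h) starting value.
print(hex(with_wc_on_a0000_affff(0x0)))
```

Only the value computation is modelled here; the cache flushes and MTRR disable/enable steps around the actual MSR write are left out.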

Reply 76 of 79, by RayeR

Hi Marco, I missed this interesting thread and just accidentally read about X-VESA in the other NVIDIA vBIOS bug thread that you know well... 😀
Please upload new versions to the 1st post via edit (as this thread grows it will be hard to find the latest version across many pages).

Seems to be a cool tool. I have a bunch of various old/newer VGAs to test but much less time...
There was a VESA implementation test tool that was part of SciTech Display Doctor. It could test VESA modes and some features like scrolling... but your tool seems to go beyond that.
I had no idea why one would bother about VGA timing jitter; since I have an LCD, I think it cannot affect the image, as an LCD monitor buffers a whole frame (or even multiple frames). But it's interesting how it varies.
Maybe it could be seen on a CRT with some '90s intro effect where a gradient bar moves up/down over the screen. There I would expect that modern VGAs may show some deviation/flickering instead of smooth bar movement. Can someone confirm? My CRT is somewhere behind a ton of old HW trash, so it's not easy to dig out 😀

And how do you measure the timing? Remember that the classic timer also has some non-zero granularity and XO stability/jitter that may differ between systems. Since the Pentium you can use the TSC (Time Stamp Counter), which has much better granularity: a single CPU tick...

I also appreciate the no-bloatware, no-AI-crapcode-generation programming attitude 😀 and wonder about the AVX usage; it seems this is the 1st time I have ever seen AVX used under DOS 😀 It would be very helpful, e.g. for playing videos. Do you know DUGLplayer, based on FFMPEG 5.x for DOS? FFMPEG/x264/x265 seems to already contain some optimized MMX/SSE/AVX code, but it's disabled in DOS builds because it may cause some trouble... Did you find a way to sanitize your AVX code when run with USB legacy emulation? It would probably need a fix on the BIOS (SMI code) side. I always say that such USB keyboard emulation is crapware in many cases, just good enough to launch an OS installer but not good for more intense DOS usage. If you experience unexpected crashes/hangs under DOS, just try disabling USB legacy; many times it has proved to be the cause. Similarly, AHCI BIOS also causes various unexpected issues...
This was the reason why I did a HW mod of my MBs to wire out the 2nd PS/2 port for a mouse (most Super I/O chips have 2 PS/2 ports but MB manufacturers are too lazy to wire them out of the chip package), to have native keyboard and mouse support for DOS and other legacy OSes...
For zyzzle: you may try to run MTRRLFBE from autoexec.bat with USB legacy disabled (no keyboard interaction needed), in case it changes something by luck...

For 64-bit bandwidth measurement: I would use MMX on CPUs that support it and the FPU on those that don't, or include both approaches to keep old and new systems comparable.

For video memory testing I use the VMT (Video Memory Test) tool for DOS, which can test VRAM beyond the amount reported by VBE, but it relies on the LFB so it can't be used on old VGAs with VBE 1.x. Do you test the full VRAM beyond that used in banked mode?

Gigabyte GA-P67-DS3-B3, Core i7-2600K @4,5GHz, 8GB DDR3, 128GB SSD, GTX970(GF7900GT), SB Audigy + YMF724F + DreamBlaster combo + LPC2ISA

Reply 77 of 79, by Marco Pistella

RayeR wrote on Yesterday, 22:54:

Hi Marco, I missed this intere ... [CUT]

Welcome — glad you found the thread!

Regarding uploading new versions to the first post: unfortunately Vogons does not allow editing the first post after a certain time has passed. New versions are posted as replies with a clear changelog. It is not ideal but it is the only option available.

On jitter and LCD: you are correct that an LCD buffers the frame and the jitter will not affect the displayed image. The measurement reflects the stability of the signal generated by the card, not the monitor response. On a CRT with raster effects (a gradient bar moving vertically, as you suggest) the deviation would likely be visible on modern GPUs with high σ/μ values. The GT740 trimodal distribution in particular would be a good candidate to observe.
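
For anyone reproducing the σ/μ figure from their own samples, a minimal sketch follows. It assumes σ is the population standard deviation of the PIT samples and μ their mean. Whether X-VESA uses the population or the sample form is not stated in this thread, and the sample values are invented.

```python
from statistics import mean, pstdev


def jitter_ratio(samples):
    """Return sigma/mu (coefficient of variation) for a list of PIT samples."""
    mu = mean(samples)
    if mu == 0:
        raise ValueError("mean of samples is zero")
    return pstdev(samples) / mu


# Invented samples: a steady signal versus one with a single large outlier.
steady = [3000, 3001, 2999, 3000, 3000]
spiky = [3000, 3001, 2999, 3000, 9000]
print(f"steady: {jitter_ratio(steady):.5f}, spiky: {jitter_ratio(spiky):.5f}")
```

A single large outlier is enough to raise the ratio noticeably, so the ratio alone cannot tell sporadic events from continuous noise; the distribution chart is still needed for that.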

On timing measurement — PIT vs TSC: X-VESA uses the PIT deliberately. The goal is compatibility from 386 onwards, and the TSC is only available from Pentium. The PIT granularity is sufficient for the measurements X-VESA performs, and its stability across systems is well understood. Your point is noted for future consideration on systems where TSC is available.
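
To put the granularity question in numbers, the sketch below compares the duration of one PIT tick with one TSC tick and shows the relative quantization error of an interval measured in whole ticks. The PIT rate is the nominal 1,193,182 Hz. The 3.5 GHz figure is borrowed from the i7-4770K mentioned earlier in the thread, the 3,000-tick interval is invented, and treating the TSC rate as equal to the nominal CPU clock is a simplification.

```python
PIT_HZ = 1_193_182          # nominal PIT input clock
TSC_HZ = 3_500_000_000      # assumed: TSC counting at 3.5 GHz


def tick_ns(hz):
    """Duration of one counter tick in nanoseconds."""
    return 1e9 / hz


def relative_quantization(interval_ticks):
    """Worst-case relative error of one tick on an interval of N ticks."""
    return 1.0 / interval_ticks


print(f"PIT tick: {tick_ns(PIT_HZ):.1f} ns, TSC tick: {tick_ns(TSC_HZ):.3f} ns")
print(f"3000-tick interval: {relative_quantization(3000) * 100:.3f} % per tick")
```

For intervals spanning thousands of PIT ticks, a one-tick error stays well under a tenth of a percent.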

On AVX under DOS: yes, to my knowledge this is the first DOS tool to use AVX and AVX-512F. The implementation required discovering that VEX-encoded instructions (opcode prefix C5h) are decoded correctly in 16-bit Protected Mode; as far as I can determine, this is undocumented behavior. XCR0 is configured via xsetbv only once during initialization; after that, AVX instructions execute directly in PM/16 without requiring a transition to PM/32. This is what made it possible to use AVX inside a .COM with no external libraries. The entire sequence is hand-written in assembly.
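
As background on the XCR0 step, here is a minimal sketch of how the value passed to xsetbv can be derived. The bit positions are the architectural ones (x87, SSE and AVX in bits 0-2, the three AVX-512 state components in bits 5-7). The function name and the idea of checking against a CPUID-reported mask are illustrative assumptions, not X-VESA's actual code. On real hardware CR4.OSXSAVE must also be set before xsetbv can be executed.

```python
X87, SSE, AVX = 1 << 0, 1 << 1, 1 << 2
OPMASK, ZMM_HI256, HI16_ZMM = 1 << 5, 1 << 6, 1 << 7

AVX_MASK = X87 | SSE | AVX
AVX512_MASK = AVX_MASK | OPMASK | ZMM_HI256 | HI16_ZMM


def xcr0_for(supported, want_avx512):
    """Pick the XCR0 value to load, or None if required bits are missing.

    supported is the mask of XCR0 bits the CPU allows, as reported by
    CPUID leaf 0Dh, sub-leaf 0 (EDX:EAX).
    """
    wanted = AVX512_MASK if want_avx512 else AVX_MASK
    return wanted if (supported & wanted) == wanted else None


print(hex(AVX_MASK), hex(AVX512_MASK))
```

A CPU that reports only bits 0-2 would get the AVX value for `want_avx512=False` and `None` for `want_avx512=True`, so the AVX-512F routines can be skipped cleanly.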

On USB Legacy and SMI: your experience matches what I found empirically. On one test system the AVX routines crashed consistently until I replaced the USB keyboard with PS/2. My working hypothesis, not yet confirmed, is that the SMI handler polling USB every 1-8 ms corrupts the YMM register state during the AVX activation sequence. Disabling USB Legacy in the BIOS eliminates the problem. The long-term fix would be disabling xHCI LEGSUP via PCI config space before AVX execution; this is on the roadmap for a future version.

On 64-bit measurement — MMX vs FPU: X-VESA uses x87 FPU for 64-bit access to maintain compatibility with systems without MMX. Your suggestion to include an MMX path as an alternative is reasonable and noted for future development.

On VRAM measurement: X-VESA measures the physically accessible VRAM via direct hardware probe, independently of what the VESA interface declares in the TotalMemory field. The measurement works in both banked and linear mode, for every video mode separately, and goes beyond the declared limit on VESA 3.0+ cards if requested (with an explicit warning). The reliability test then operates on the measured VRAM with three distinct patterns, configurable start offset, size and number of passes, with a detailed error report including PIT timestamp, absolute offset, written value, read value, pass number and test type. Full details of both features are in X-VESA.TXT included in the archive.

Looking forward to your results when time allows.

Reply 78 of 79, by Falcosoft

Marco Pistella wrote on Today, 03:54:

...On AVX under DOS: yes, to my knowledge this is the first DOS tool to use AVX and AVX-512F.

Hi,
I do not want to be a party killer but my DOS based Mandelbrot benchmark also used AVX (but not AVX 512) 😀
https://falcosoft.hu/manbench.zip

The results page (together with Windows results)
https://falcosoft.hu/mandelx_benchmark_results.php

Reply 79 of 79, by Marco Pistella

Falcosoft wrote on Today, 04:38:

Hi,
I do not want to be a party killer but my DO ...[CUT]

Thanks for the link — Manbench is impressive work and the results page is a great reference.

Having looked more carefully, the approaches are quite different. Manbench activates AVX via simd95 and runs under PMODE/W, which is a PM/32 DOS extender — a well-established path where AVX support is expected once XCR0 is configured.

X-VESA takes a different route: it runs entirely in PM/16 with no DOS extender, and the AVX instructions execute directly there. This relies on the fact that VEX-encoded instructions (C5h prefix) are decoded correctly in 16-bit Protected Mode — which as far as I can determine is undocumented behavior. XCR0 is set once via xsetbv during initialization; after that, AVX and AVX-512F execute in PM/16 without any mode switch to PM/32.

So the claim stands with more precision: X-VESA is likely the first DOS tool to use AVX and AVX-512F in PM/16 without a DOS extender, from a .COM file, hand-written in assembly.