VOGONS


Reply 60 of 73, by vutt

User metadata
Rank Member
Marco Pistella wrote on 2026-05-02, 14:04:

For comparison purposes it would be very useful to repeat
Command 4 on 800×600 and 1280×1024 with the original
unmodified BIOS, if you still have access to it. That would confirm whether the jitter is BIOS-induced or
hardware-specific.

I have lost the original BIOS, and I'm not really looking forward to experimenting with downloaded ones.
However, I have a similar PowerColor Radeon 9600 Pro (RV350) with its original BIOS, so I put it into the same PC. Please note that I am using a CRT monitor (Samsung SyncMaster 765mb), in case that plays any role.

Reply 61 of 73, by Marco Pistella

User metadata
Rank Newbie
Falcosoft wrote on Yesterday, 08:01:

@Marco Pistella: Do 64-bit reads/writes use FPU or MMX registers?
BTW, I very much like your test program 😀 I think this is the only available DOS tool that can measure such high bandwidths reliably.

Thank you — and glad you find it useful!

To answer your question: 64-bit reads and writes use FPU
registers via x87 (fild/fistp). MMX is intentionally not
used, to maintain compatibility with systems without MMX
support. The finit instruction can introduce significant
overhead on some CPUs, which may explain the 64-bit
anomaly you observed.

To investigate further: use the cursor keys to navigate
to the overhead screens in the benchmark results. These
show the routine overhead in PIT cycles for each access
width and allow a more accurate interpretation of the
raw numbers — particularly for the 64-bit case where
finit overhead may be the dominant factor.

Reply 62 of 73, by Marco Pistella

User metadata
Rank Newbie
vutt wrote on Yesterday, 09:13:

I have lost original bios. Not ... [CUT]

Thank you for the comparison with the original BIOS —
this is exactly the kind of controlled test that helps
isolate the cause of the anomalies seen earlier.

The pattern is irregular and the values are still high for a card
of this generation — which suggests the issue may
not be entirely BIOS-related.

To investigate further, could you press D on the
timing screen for each mode to display the horizontal
frequency sample distribution, and post the screenshots?
The shape of the distribution will help understand
whether the jitter has a specific pattern or is random
noise.

Reply 63 of 73, by vutt

User metadata
Rank Member
Marco Pistella wrote on Yesterday, 10:16:

To investigate further, could you press D on the timing screen for each mode to display the horizontal
frequency sample distribution, and post the screenshots?

As you can see it's different every time.

Reply 64 of 73, by Marco Pistella

User metadata
Rank Newbie
vutt wrote on Yesterday, 12:52:

... [CUT]

Thank you for the distribution screenshots.

The pattern is interesting: almost all samples cluster
at the expected PIT value, but a few outliers appear
significantly above — some exceeding 9,000 and one
above 12,000 PIT cycles. This is not random noise —
it is a discrete, sporadic event.

Could you press + to zoom in on the distribution chart
around the outlier area, and post a screenshot?
Specifically I would like to see whether the outliers
cluster around a specific value (for example a fixed
multiple of the main peak) or are scattered randomly.
This will help distinguish between a spurious bit 3
event on port 3DAh — where the measurement accidentally
captures two retrace intervals instead of one — and a
genuine timing instability in the video signal itself.

Reply 65 of 73, by Falcosoft

User metadata
Rank l33t
Marco Pistella wrote on Yesterday, 10:07:

....
To answer your question: 64-bit reads and writes use FPU
registers via x87 (fild/fistp). MMX is intentionally not
used, to maintain compatibility with systems without MMX
support. The finit instruction can introduce significant
overhead on some CPUs, which may explain the 64-bit
anomaly you observed.

To investigate further: use the cursor keys to navigate
to the overhead screens in the benchmark results. These
show the routine overhead in PIT cycles for each access
width and allow a more accurate interpretation of the
raw numbers — particularly for the 64-bit case where
finit overhead may be the dominant factor.

Hi,
Here are the overhead results in PIT cycles. It seems they confirm your theory somewhat:

The attachment FILE0002.PNG is no longer available

Website, Youtube
Falcosoft Soundfont Midi Player + Munt VSTi + BassMidi VSTi
VST Midi Driver Midi Mapper
x86 microarchitecture benchmark (MandelX)

Reply 66 of 73, by Marco Pistella

User metadata
Rank Newbie
Falcosoft wrote on Yesterday, 16:28:

Hi,
Here...[CUT]

Thank you for the overhead data.

The FPU overhead values are within normal range — 19-23
PIT cycles for 64-bit and 128-bit reads, 19-20 for
writes. Nothing anomalous here that would explain the
64-bit performance deviation.

The cause remains unclear. With FPU overhead ruled out,
the anomaly may reflect a low-level interaction between
x87 transfer instructions and the memory subsystem on
this specific CPU/chipset combination — something that
does not manifest the same way on other architectures
tested so far. Possibly only an Intel engineer with
access to internal microarchitecture documentation
could give a definitive answer.

This is precisely the kind of unexpected discovery that
makes real hardware testing valuable.

Reply 67 of 73, by Marco Pistella

User metadata
Rank Newbie

For reference, here is the W_64L routine (64-bit linear
VRAM write via x87 FPU) that produces the anomalous
result on Ivy Bridge. The routine uses fst unrolled 8x
with fninit before the loop. If anyone has insight into
why 64-bit writes behave unexpectedly on this specific
microarchitecture, I would be very interested.

;#############################################################################

W_064L:

;#############################################################################

        test ds: byte ptr [status_flag_0],FLAG_FPU_PRESENT
        jne enter_fpu_w_64l
        stc                                     ; no FPU present: signal failure
        ret
enter_fpu_w_64l:
        pusha
        mov cx,2h
        call Init_Align_Data16
        mov di,4h                               ; 4 outer passes x 256 KB = 1 MB
loop_w_064l:
        mov esi,ds: dword ptr [current_linear_vram_address]
        mov cx,4096d                            ; 4096 iterations x 64 bytes = 256 KB
        fninit
        fld ds: qword ptr [bx]                  ; load test pattern into ST(0)
        xor ax,ax
        mov ds,ax                               ; DS = 0, linear framebuffer access via ESI
loopw_float_64l:
        fst ds: qword ptr [esi]                 ; 8 unrolled 64-bit stores, 64 bytes/iter
        fst ds: qword ptr [esi + 8h]
        fst ds: qword ptr [esi + 10h]
        fst ds: qword ptr [esi + 18h]
        fst ds: qword ptr [esi + 20h]
        fst ds: qword ptr [esi + 28h]
        fst ds: qword ptr [esi + 30h]
        fst ds: qword ptr [esi + 38h]
        add esi,40h                             ; advance 64 bytes
        dec cx
        jne loopw_float_64l
        push gs
        pop ds                                  ; restore data segment
        mov cx,2h
        call Inc_Aligned_Data16
        dec di
        jne loop_w_064l
        add ds: dword ptr [kb_proc],1024d       ; account 1024 KB processed
        clc
        popa
        ret

Reply 68 of 73, by Falcosoft

User metadata
Rank l33t
Marco Pistella wrote on Yesterday, 17:23:
Falcosoft wrote on Yesterday, 16:28:

Hi,
Here...[CUT]

Thank you for the overhead data.

The FPU overhead values are within normal range — 19-23
PIT cycles for 64-bit and 128-bit reads, 19-20 for
writes. Nothing anomalous here that would explain the
64-bit performance deviation.

The cause remains unclear. With FPU overhead ruled out,
the anomaly may reflect a low-level interaction between
x87 transfer instructions and the memory subsystem on
this specific CPU/chipset combination — something that
does not manifest the same way on other architectures
tested so far. Possibly only an Intel engineer with
access to internal microarchitecture documentation
could give a definitive answer.

This is precisely the kind of unexpected discovery that
makes real hardware testing valuable.

To test whether the results are Ivy Bridge-specific, I enabled the integrated GFX on my Haswell (i7-4770K 3500 MHz) and also enabled write-combining. It seems this anomaly is not Ivy Bridge-specific and affects Haswell CPUs even more.
On Haswell, even SSE moves appear to be affected somewhat:

The attachment FILE0000.PNG is no longer available
The attachment FILE0001.PNG is no longer available

The interesting thing is that this 64-bit FPU write anomaly only affects the write-combining enabled state. If write-combining is disabled the scaling is the expected one (but of course the absolute performance is much worse):

The attachment FILE0002.PNG is no longer available

@Edit:
I have a theory: since no bit depths larger than 32-bit are supported in graphics modes, the write-combining logic is presumably more optimized for 8-, 16- and 32-bit writes. So writes larger than 32 bits somewhat interfere with the write-combining logic, which results in diminishing returns.


Reply 69 of 73, by Marco Pistella

User metadata
Rank Newbie
Falcosoft wrote on Yesterday, 18:38:

I have a theory: Since there are ... [CUT]

That is a reasonable theory. The write-combining logic
in the memory controller is indeed optimized around the
native transaction sizes of the graphics subsystem —
32-bit being the natural boundary for legacy VGA and
VESA framebuffer access. Writes larger than 32-bit via
x87 or SSE may not map cleanly onto the write-combining
buffer granularity, causing partial fills or early
flushes that reduce the effectiveness of coalescing.

It would be consistent with what the data shows:
write-combining provides the expected benefit up to
32-bit, then the return diminishes at 64-bit and
partially recovers at 128-bit and 256-bit where the
transfer size aligns better with cache line boundaries.

To help confirm or rule out this theory, a few things
would be useful: the results page without overhead
subtraction to verify the raw values at these extreme
transfer rates, and multiple runs of the benchmark to
check whether the 64-bit write results are consistent
or show significant variation between runs. If the
values fluctuate widely it would suggest measurement
instability rather than a genuine microarchitecture
effect.

Confirming definitively would require access to Intel
memory controller documentation at a level of detail
that is not publicly available — possibly only an
Intel engineer could give a final answer.

Reply 70 of 73, by Falcosoft

User metadata
Rank l33t
Marco Pistella wrote on Yesterday, 19:18:
Falcosoft wrote on Yesterday, 18:38:

I have a theory: Since there are ... [CUT]

That is a reasonable theory. The write-combining logic
in the memory controller is indeed optimized around the
native transaction sizes of the graphics subsystem —
32-bit being the natural boundary for legacy VGA and
VESA framebuffer access. Writes larger than 32-bit via
x87 or SSE may not map cleanly onto the write-combining
buffer granularity, causing partial fills or early
flushes that reduce the effectiveness of coalescing.

It would be consistent with what the data shows:
write-combining provides the expected benefit up to
32-bit, then the return diminishes at 64-bit and
partially recovers at 128-bit and 256-bit where the
transfer size aligns better with cache line boundaries.

To help confirm or rule out this theory, a few things
would be useful: the results page without overhead
subtraction to verify the raw values at these extreme
transfer rates, and multiple runs of the benchmark to
check whether the 64-bit write results are consistent
or show significant variation between runs. If the
values fluctuate widely it would suggest measurement
instability rather than a genuine microarchitecture
effect.

Confirming definitively would require access to Intel
memory controller documentation at a level of detail
that is not publicly available — possibly only an
Intel engineer could give a final answer.

The results are very consistent: across 10 runs the difference is about 1%!
The results without overhead subtraction are somewhat higher, but the proportions are the same:

The attachment FILE0000.PNG is no longer available


Reply 71 of 73, by Marco Pistella

User metadata
Rank Newbie
Falcosoft wrote on Yesterday, 19:27:

The results are very consistent: fr ... [CUT]

Thank you for the consistency data — 1% variation across
10 runs rules out measurement instability as a cause.

The results without overhead show the same proportions,
which also rules out FPU overhead as an explanation.

At this point the most honest conclusion is that this
is a previously undocumented behavior: write-combining
efficiency degrades specifically at 64-bit granularity
on Intel integrated graphics, recovers partially at
128-bit, and reaches its peak at 256-bit. The effect
is reproducible and consistent across Ivy Bridge and
Haswell microarchitectures.

Whether this is a deliberate design choice in the
memory controller, a side effect of the write-combining
buffer granularity, or something else entirely is beyond
what can be determined from these measurements alone.
Further data from other Intel platforms would be welcome,
and if anyone has deeper knowledge of Intel memory
controller internals, I would be very interested to hear
their interpretation.

Reply 72 of 73, by zyzzle

User metadata
Rank Member

Marco, do you have any idea why I can enable write-combining on all Intel integrated graphics from Broadwell down (i.e., 1st-5th gen), but my laptop systems always crash on 6th-gen and later Intel integrated graphics when attempting to enable write-combining? Is it some kind of BIOS bug and/or a deliberate limitation on later Intel Core integrated graphics? I get a hard system freeze when attempting to enable write-combining with either Falcosoft's or RayerR's MTRRLFBE tools, or even FastVid from 1996, and always must reboot on Skylake and above.

I'll post some results from i7-5600 Broadwell integrated graphics and i5-8250 Kaby Lake integrated graphics to show this. I also notice the 64-bit slump, the recovery at 128-bit and the maximum at 256-bit on Broadwell integrated graphics using write-combining, with DDR3 dual-channel system RAM as reserved graphics memory. On the Kaby Lake system write-combining wasn't possible, but I of course got linear progression on uncombined writes, much slower than the write-combined Broadwell system (about 1/40th the speed).

X-VESA is by far the most useful and most comprehensive DOS VESA program I've ever seen. Thanks for releasing it and for all of your feedback.

Reply 73 of 73, by Marco Pistella

User metadata
Rank Newbie
zyzzle wrote on Today, 07:36:

Marco, do you have any idea ... [CUT]

Thank you for the kind words.

Regarding the Skylake+ write-combining crash: I suspect
it may be related to a problem I encountered with my
AVX/AVX-512 routines in X-VESA. On one test system the
AVX routines were crashing after transferring between
170 and 280 KB of VRAM — meaning they worked, at least
briefly. The crashes disappeared completely when I
replaced the USB keyboard with a PS/2 one. The discovery
was entirely accidental.

My hypothesis is that USB legacy mode SMI handling
corrupts some CPU state during the critical section —
but this is a hypothesis, not a confirmed diagnosis.
It also seems to depend on the BIOS CSM/UEFI
implementation rather than the chipset itself.

If you can, try connecting a PS/2 keyboard and if
possible disabling USB legacy support in the BIOS
before attempting to enable write-combining. I have
no expectation that it will solve the problem, but
the result — whether it changes anything or not —
would help determine whether the two effects are
related or completely independent.