@RayeR
Re: AVX with DJGPP, alignment, V86, SMI
DJGPP + AVX
CWSDPR0 is the right call. The fundamental issue is XSETBV requiring ring 0 — under standard CWSDPMI (ring 3 DPMI) you can't execute it directly. CWSDPR0 solves this cleanly. GCC supports AVX intrinsics with -mavx / -mavx2 / -mavx512f flags on any reasonably recent version, so the compiler side is straightforward once XCR0 is properly enabled.
Alignment
Correct on the requirements: 16B for SSE, 32B for AVX, 64B for AVX-512 zmm registers. However the distinction between aligned and unaligned variants matters: vmovaps/vmovapd will fault on misaligned data while vmovups/vmovupd handle any alignment. On Haswell and later the penalty for unaligned vmovups on actually-aligned data is zero at runtime — the hardware detects alignment. On older AVX (Sandy Bridge / Ivy Bridge) the penalty is real even with vmovups, so explicit 32B alignment in the data structure is worth it for portable code.
V86 / JEMM386
The core problem is that XSETBV in V86 generates #GP, which the VMM (JEMM386) would need to intercept, validate, and re-execute in ring 0. Technically feasible — Japheth would know the internal hooks required — but non-trivial. Worth asking, the JEMM386 codebase is well-maintained.
CLI/STI and SMI — you are right, I was imprecise
CLI does not block SMI. SMI has higher priority than NMI and is entirely invisible to the processor's interrupt flag — it triggers an immediate switch to SMM regardless of CPL or EFLAGS.IF. What CLI prevents in the YMM corruption scenario is a regular IRQ firing between the XCR0 write and the first YMM instruction, but the actual culprit on i5-7500/ASRock 200-series is the USB Legacy SMI from the xHCI controller, which CLI cannot stop.
The correct fix is exactly what you described: disable xHCI LEGSUP via PCI config space. The sequence is: locate the xHCI controller (class 0C/03/30), walk the extended capabilities to find USB Legacy Support (cap ID 1), set HC_OS_OWNED and clear the SMI enables in USBLEGCTLSTS. This is chipset-dependent in detail but the xHCI spec defines the capability structure so it is portable across compliant implementations. That work is planned for X-PCI, the PCI config space utility currently in the X-VESA roadmap.
@jmarsh
Both good points worth expanding.
On the Intel/AMD AVX-512 situation
The irony is complete: Intel introduced AVX-512 with Skylake-X in 2017, then dropped it from all consumer parts starting with Alder Lake (gen 12) due to the hybrid P+E core architecture — E cores don't support it, so rather than handle the asymmetry Intel disabled it globally on consumer SKUs. Raptor Lake, Meteor Lake, and Arrow Lake all follow the same pattern. Intel Xeon server parts still have it.
AMD went the opposite direction: Zen 4 (Ryzen 7000, 2022) added AVX-512 and it has been present on every AMD desktop part since. The AMD implementation covers the most practically useful subsets (F, BW, CD, DQ, VL, VNNI, VBMI, VBMI2 among others) — not the full Intel server feature set but more than enough for compute workloads.
So the current landscape is: if you want AVX-512 on a consumer CPU today, buy AMD.
On VEX encoding and the three-operand form
Exactly right. The non-destructive destination is underappreciated as an optimization opportunity. Legacy SSE: paddq xmm0,xmm1 destroys xmm0. VEX: vpaddq ymm0,ymm1,ymm2 leaves ymm1 and ymm2 intact. In tight loops this eliminates several register-to-register moves that exist purely to preserve values, reducing both code size and execution pressure on the move units.
There is a related benefit: VEX-encoded 128-bit instructions zero the upper 128 bits of the destination YMM register, which avoids the transition penalty between legacy SSE and VEX code paths that AVX-capable CPUs impose to manage dirty upper state. Mixing legacy SSE and VEX instructions in the same code path without being aware of this is a common source of unexpected performance degradation.
@zyzzle
Thanks for the detailed report. Addressing each point:
1366x768 — by design, not a bug
The horizontal resolution must produce a scanline length in bytes divisible by 4 — this is a hard requirement for all of X-VESA's internal graphic routines. 1366x768 fails this at 8 bpp and 24 bpp (1366 and 4098 bytes per scanline respectively).
The obvious workaround would be to use 4F06h to request a logical scanline width of 1368 pixels, which is divisible by 4 at every depth. However, testing showed that on several controllers this produces an undefined state — the graphics controller accepts the call but enters an inconsistent configuration. Since a clear error message is preferable to a random freeze, all resolutions that cannot produce a valid scanline length are explicitly rejected. This is unlikely to change without a substantial rewrite of the rendering routines.
Ivy Bridge VGA timings — confirmed
Ivy Bridge is the last Intel iGPU generation with complete legacy VGA timing register support. From Haswell onwards those code paths are progressively disabled or emulated. This is hardware behavior, not an X-VESA limitation.
Kaby Lake LFB freeze — confirmed
Known behavior on modern Intel iGPUs in CSM mode. Banked mode is the correct workaround for bandwidth testing on that system.
Regarding the Kaby Lake Write 64b anomaly (182,000 MiB/s)
The transfer rate measurement itself is reliable. What fails at these speeds is exclusively the overhead calculation.
X-VESA uses the PIT for all timing, including overhead measurement, in order to maintain compatibility with CPUs that predate RDTSC. At transfer rates in the GB/s range the overhead calculation becomes unreliable for a specific reason: the statistical sample used (32 iterations) produces a value that is of the same order of magnitude as the PIT measurement error itself. X-VESA includes a compensation algorithm for this — effectively measuring the overhead of the overhead measurement — but at these speeds even that cannot fully compensate for the fundamental granularity limit of the PIT (~838 ns per tick).
Increasing the number of samples for the overhead calculation would reduce the problem but not eliminate it, and would introduce an asymmetry in the measurement methodology between overhead and
transfer rate that would complicate interpretation of results.
The correct reading of the Kaby Lake data is therefore: the raw transfer rate values are valid, the overhead-subtracted values at very high bandwidths are not meaningful and should be disregarded.
This is a known limitation of PIT-based measurement at GB/s speeds and not a defect in the transfer rate benchmark itself.
Renaming X-VESA.COM
The file is not designed to be renamed. Please use it as distributed.
SuperDoubleTiny and UPX
The STD-STUB V1.0.0 you found is the SuperDoubleTiny bootstrap, a custom loader developed specifically for X-VESA. The reason SDT exists is architectural: DOS COM files are inherently limited to a single 64KB segment shared by code, data, and stack. X-VESA requires a full 64KB code segment AND a separate full 64KB data segment simultaneously — a memory model that is impossible for a standard COM file.
Here is how SDT works. Three 64KB segments are involved:
SEG1 — starting segment of X-VESA.COM (code segment)
SEG2 — used by APACK as decompression workspace
SEG3 — used by the stub and as stack
a) STUB.COM starts and copies itself into SEG3
b) STUB.COM executes a RETF from SEG3
c) STUB.COM copies compressed DATA.COM from SEG1 into SEG3
behind itself
d) STUB.COM reallocates compressed CODE.COM in SEG1 at ORG 0100h
e) STUB.COM executes RETF handing control to CODE.COM in SEG1:0100h
f) Compressed CODE.COM copies itself into SEG2
g) APACK stub executes FAR JMP into SEG2 and decompresses CODE.COM
back into SEG1
h) FAR JMP from SEG2 to SEG1:0100h — decompressed CODE.COM runs
i) CODE.COM detects stack != 0FFFEh — STUB.COM return value is on
stack in SEG3
j) CODE.COM swaps STUB.COM RETF address in SEG3 with its own
reentry point in SEG1
k) CODE.COM executes RETF jumping to STUB.COM in SEG3
l) STUB.COM copies compressed DATA.COM into SEG1 immediately after
the end of CODE.COM
m) STUB.COM executes RETF returning to CODE.COM in SEG1
(reentry point set in step j)
n) CODE.COM pushes its own reentry address onto the stack then
executes RETF jumping to compressed DATA.COM
o) DATA.COM APACK stub copies itself into the next 64KB and executes
FAR JMP which decompresses DATA.COM in place
p) Decompressed DATA.COM contains a single RETF as its first
instruction, returning control to CODE.COM in SEG1
q) X-VESA is now fully operational with 64KB code and 64KB data
in separate segments
UPX is incompatible with this model for three concrete reasons. First, during decompression UPX writes into memory areas beyond its expected output range, corrupting the stub residing in SEG3. Second,
even with --ultra-brute UPX produces files slightly larger than the APACK result — there is no compression gain to justify the effort. Third, when applied to the SDT model specifically, UPX-compressed
binaries fail to execute on an IBM 5150 — this is not a general UPX limitation on 8088 hardware, but a failure specific to the interaction between UPX's decompression behaviour and the SDT memory layout.
X-VESA requires 302240 bytes of conventional memory and is designed to run on any hardware from an IBM 5150 to a 2026 system with a legacy CSM BIOS. On an IBM 5150 it starts correctly and reports
"80386 or above required" — a clean, graceful exit. If conventional memory is insufficient it reports the shortage and exits without crashing. This compatibility range is non-negotiable and any change
to the bootstrap that breaks it, as UPX does, is not acceptable regardless of other considerations.