VOGONS


x86 microarchitecture benchmark (MandelX)

Reply 60 of 72, by Falcosoft

Rank l33t
jtchip wrote on 2025-06-27, 23:51:

Anyway, interestingly Esther (VIA C7) slightly beats Bonnell in SSE2 on this workload, 16 pixels/ms vs 14, the only "win" it has.

The rest of my results are from an NSC Geode GX1 300MHz (FPU_FAST enabled, slowest ALU result), Athlon 5350 (Kabini, slowest AVX), and Athlon 64 X2 5000+. The model names from CPUID (perhaps the DOS version should output this too), including the C7-D, are (from /proc/cpuinfo in Linux):

  • VIA Esther processor 1500MHz
  • Geode(TM) Integrated Processor by National Semi
  • AMD Athlon(tm) 5350 APU with Radeon(tm) R3
  • AMD Athlon(tm) 64 X2 Dual Core Processor 5000+

Thanks, I have uploaded the 3 result sets you attached.
Maybe I'm overlooking something, but I cannot find the VIA Esther results you mentioned.

Regarding your Athlon 64 results:
It's interesting that your desktop Athlon 64 X2 is ~5% faster in 1GHz-normalized ALU/integer calculations (and only in ALU/integer) than my Turion 64 X2.
I re-tested the mobile Turion X2 with the DOS version as well, and the difference is consistent.
Both have the same size L1 and L2 caches, so this cannot explain the difference.
Maybe it's because the mobile version shares its memory with the integrated ATI Radeon Xpress 1150, and the ALU/integer code path is the only one that also accesses memory inside the 'hot' inner loop.
That memory access is needed because there are fewer freely available registers than with the other instruction sets. There are only 6 freely available general-purpose integer registers in 32-bit code (namely EAX, EBX, ECX, EDX, EDI, ESI), while in all other code paths you have 8 freely available FPU/MMX/SSE/AVX registers plus the general-purpose integer ones.
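
To illustrate the register pressure, here is a minimal sketch of an integer-only escape-time loop. It is not MandelX's actual ALU kernel (that code is not shown in this thread); the names `mandel_fixed` and `to_fixed` and the 24-bit fractional format are assumptions chosen for illustration. Counting the values that stay live across one iteration gives more than six, which is roughly why a 32-bit x86 version of such a loop would have to keep some of them in memory.

```python
FRAC_BITS = 24  # hypothetical fixed-point format: 24 fractional bits


def to_fixed(x: float) -> int:
    """Convert a float to the illustrative fixed-point format."""
    return int(round(x * (1 << FRAC_BITS)))


def mandel_fixed(cr: int, ci: int, max_iter: int) -> int:
    """Integer-only escape-time iteration of z = z*z + c."""
    escape = 4 << FRAC_BITS  # |z|^2 > 4 means the point has escaped
    zr = zi = 0
    for i in range(max_iter):
        zr2 = (zr * zr) >> FRAC_BITS
        zi2 = (zi * zi) >> FRAC_BITS
        if zr2 + zi2 > escape:
            return i
        zi = ((zr * zi) >> (FRAC_BITS - 1)) + ci  # 2*zr*zi + ci
        zr = zr2 - zi2 + cr
    return max_iter


# Values live across one iteration (the escape limit could be an immediate,
# so it is not counted). Temporaries for the products come on top of these.
LIVE_VALUES = ["zr", "zi", "cr", "ci", "zr2", "zi2", "i"]

if __name__ == "__main__":
    print(mandel_fixed(to_fixed(0.0), to_fixed(0.0), 100))  # never escapes
    print(mandel_fixed(to_fixed(2.0), to_fixed(2.0), 100))  # escapes early
    print(len(LIVE_VALUES), "live values vs. 6 free registers")
```

With the FPU/SSE paths the same loop state fits into the 8 vector/FPU registers, with the integer registers left over for the counter and pointers.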

The attachment Turion_a64.png is no longer available

Website, Youtube
Falcosoft Soundfont Midi Player + Munt VSTi + BassMidi VSTi
VST Midi Driver Midi Mapper
x86 microarchitecture benchmark (MandelX)

Reply 61 of 72, by jtchip

Rank Member
Falcosoft wrote on 2025-06-28, 09:49:

Maybe I'm overlooking something, but I cannot find the VIA Esther results you mentioned.

Esther is the C7-D from the previous set of results; I mentioned it as a correction to the name in the table. For some reason VIA uses code names in the CPUID model name string.

Falcosoft wrote on 2025-06-28, 09:49:

Maybe it's because the mobile version shares its memory with the integrated ATI Radeon Xpress 1150, and the ALU/integer code path is the only one that also accesses memory inside the 'hot' inner loop.

Mine also uses integrated graphics, a Radeon HD 3200 (RS780). Perhaps it's memory speed: Wikipedia says the Turion uses DDR2-667 whereas I'm using DDR2-800 (in practice 743 MT/s due to the odd multiplier), but that should only matter if the workload spills out of the caches.

The Kabini (Jaguar microarchitecture) results are somewhat interesting; it seems to do well at this workload. SSE[2] results are on par with the Phenom, probably because it has 128-bit SIMD. FPU is lower, more in line with the later Zen 1/2, and ALU is a bit slower than K8, maybe because it's only 2-wide, being a "small" core. It's performing much better than the Atom Silvermont, which it was competing with at the time.

Reply 62 of 72, by Falcosoft

Rank l33t
jtchip wrote on 2025-06-29, 00:06:

...Esther is the C7-D from the previous set of results; I mentioned it as a correction to the name in the table. For some reason VIA uses code names in the CPUID model name string.

Ahh, OK. I have corrected the name in the database.

jtchip wrote on 2025-06-29, 00:06:

Mine also uses integrated graphics, a Radeon HD 3200 (RS780). Perhaps it's memory speed: Wikipedia says the Turion uses DDR2-667 whereas I'm using DDR2-800 (in practice 743 MT/s due to the odd multiplier), but that should only matter if the workload spills out of the caches.

Hmm. Then I have only one other idea:
The only remaining difference may be the manufacturing process. The Turion 64 X2 is a 90nm part. Maybe the Athlon 64 X2 5000+ is a 65nm one with different cache characteristics.
BTW, my Turion 64 X2 runs at 2100 MHz and the memory clock divider is set to 5, so the memory clock is 420 MHz (DDR2 at 840 MT/s). I changed it because the integrated Radeon Xpress 1150 graphics performs better this way.
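
As a rough sketch of that arithmetic: on K8 the memory clock is the CPU clock divided by an integer divider, and DDR2 transfers data twice per clock. The helper name `k8_memory_clock` is made up for this example, and the 2600 MHz / divider 7 pair for the 5000+ is an assumption (neither value is stated in the thread), chosen because it would reproduce the quoted 743 MT/s.

```python
from typing import Tuple


def k8_memory_clock(cpu_mhz: float, divider: int) -> Tuple[float, float]:
    """Return (memory clock in MHz, DDR2 transfer rate in MT/s)."""
    mem_mhz = cpu_mhz / divider
    return mem_mhz, 2 * mem_mhz  # DDR: two transfers per clock


if __name__ == "__main__":
    # Turion 64 X2 from this post: 2100 MHz core clock, divider 5.
    print(k8_memory_clock(2100, 5))

    # Assumption: Athlon 64 X2 5000+ at 2600 MHz with divider 7.
    mem, rate = k8_memory_clock(2600, 7)
    print(round(mem, 1), round(rate, 1))
```

Because the divider has to be an integer, a CPU clock that is not a whole multiple of the target memory clock ends up slightly below the nominal DDR2 speed.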

jtchip wrote on 2025-06-29, 00:06:

The Kabini (Jaguar microarchitecture) results are somewhat interesting; it seems to do well at this workload. SSE[2] results are on par with the Phenom, probably because it has 128-bit SIMD. FPU is lower, more in line with the later Zen 1/2, and ALU is a bit slower than K8, maybe because it's only 2-wide, being a "small" core. It's performing much better than the Atom Silvermont, which it was competing with at the time.

Yep, the Jaguar has real 128-bit SIMD registers while the Bobcat has only 64-bit 'physical' registers and uses two of them to handle a 128-bit vector. This explains why the Jaguar shines in SSE/SSE2 workloads but not so much in AVX (where it uses two real 128-bit registers to handle a 256-bit vector).
It would be interesting to see how a Bobcat performs with SSE/SSE2 (most likely the performance is similar to K8/Athlon 64).
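
A very simplified way to picture this is to count how many internal operations one vector instruction needs when the hardware is narrower than the vector. This is only a toy model under the widths described above (64-bit for Bobcat, 128-bit for Jaguar); the names `DATAPATH_BITS` and `ops_per_instruction` are invented for the example, and real throughput depends on much more than this.

```python
import math

# Assumed internal SIMD widths in bits, following the description above.
DATAPATH_BITS = {"Bobcat": 64, "Jaguar": 128}

# Only combinations that exist: Bobcat has no AVX support.
CASES = [
    ("Bobcat", "SSE/SSE2", 128),
    ("Jaguar", "SSE/SSE2", 128),
    ("Jaguar", "AVX", 256),
]


def ops_per_instruction(vector_bits: int, datapath_bits: int) -> int:
    """Internal operations needed to process one full-width vector."""
    return math.ceil(vector_bits / datapath_bits)


if __name__ == "__main__":
    for core, isa, bits in CASES:
        ops = ops_per_instruction(bits, DATAPATH_BITS[core])
        print(f"{core:7s} {isa:9s} {ops} op(s) per instruction")
```

In this model Jaguar handles SSE/SSE2 in one step but needs two for AVX, so AVX brings little per-clock gain over SSE on that core.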

PS:
An interesting remark from Agner Fog regarding the Atom's poor x87 FPU performance:

14.5 X87 floating point instructions
Instructions that use the x87-style floating point registers are handled in a very unfortunate
way by the Atom processor. Whenever there are two consecutive x87 instructions, the two
instructions fail to pair and instead cause an extra delay of one clock cycle due to problems
in the decoders. This gives a throughput of only one instruction every two clock cycles,
while a similar code using XMM registers would have a maximum throughput of two
instructions per clock cycle.

Reply 63 of 72, by jtchip

Rank Member
Falcosoft wrote on 2025-06-29, 14:21:

The only remaining difference may be the manufacturing process. The Turion 64 X2 is a 90nm part. Maybe the Athlon 64 X2 5000+ is a 65nm one with different cache characteristics.
BTW, my Turion 64 X2 runs at 2100 MHz and the memory clock divider is set to 5, so the memory clock is 420 MHz (DDR2 at 840 MT/s). I changed it because the integrated Radeon Xpress 1150 graphics performs better this way.

This is indeed a 65nm Brisbane. Perhaps we'll have to wait for more data points from other K8 results.

Falcosoft wrote on 2025-06-29, 14:21:

PS:
An interesting remark from Agner Fog regarding the Atom's poor x87 FPU performance:

14.5 X87 floating point instructions
Instructions that use the x87-style floating point registers are handled in a very unfortunate
way by the Atom processor. Whenever there are two consecutive x87 instructions, the two
instructions fail to pair and instead cause an extra delay of one clock cycle due to problems
in the decoders. This gives a throughput of only one instruction every two clock cycles,
while a similar code using XMM registers would have a maximum throughput of two
instructions per clock cycle.

OK, looks like this is the 1st-gen Atom, i.e. Bonnell.

Another result, from an "Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz", which has a turbo of 3.6GHz. Scaling it accordingly does make it a closer match to the only other Haswell result in the table (the i7-4702MQ), at least for FPU, SSE and SSE2.

Reply 64 of 72, by Falcosoft

Rank l33t
jtchip wrote on 2025-06-29, 23:28:

...OK, looks like this is the 1st-gen Atom, i.e. Bonnell.

Not necessarily. Proportionally, the gap between x87 FPU and SSE results is even bigger on 2nd-gen Atoms.
1st gen: SSE is about 4x faster than x87 FPU.
2nd gen: SSE is about 5x faster than x87 FPU.

jtchip wrote on 2025-06-29, 23:28:

Another result, from an "Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz", which has a turbo of 3.6GHz. Scaling it accordingly does make it a closer match to the only other Haswell result in the table (the i7-4702MQ), at least for FPU, SSE and SSE2.

In DOS there is no turbo boost or any other dynamic frequency adjustment. That needs a processor driver, which DOS lacks.
With utilities you can set the multiplier manually even in DOS, but without such manual adjustment the CPU runs constantly at the nominal speed shown by the TSC value.
At least this is 100% true for Sandy/Ivy Bridge (personal experience), but it seems the situation is the same for Haswell, too:
Re: CpuSpd - A Hardware Based CPU Speed Control Utility for DOS/Win9X Retro Gaming

BTW, does even this Xeon system of yours run without any version of Windows?
It would be nice to see how it performs under Windows.

Reply 65 of 72, by Falcosoft

Rank l33t
jtchip wrote on 2025-06-29, 23:28:

...
This is indeed a 65nm Brisbane. Perhaps we'll have to wait for more data points from other K8 results.

And here we have another Brisbane result from argh.
AMD Athlon(tm) Dual Core Processor 4850e:
CPU Speed: 2504
OS: Windows 2000 x86

ALU 1GHz: 37.664
FPU 1GHz: 32.348
SSE 1GHz: 84.908
SSE2 1GHz: 52.181
3DNow! 1GHz: 57.448

These results match the Turion 64 X2 results perfectly.
So your faster DOS Brisbane results are most likely due to the difference between the DOS and Windows versions of the benchmark...

Reply 66 of 72, by jtchip

Rank Member
Falcosoft wrote on 2025-06-30, 07:18:

Not necessarily. Proportionally, the gap between x87 FPU and SSE results is even bigger on 2nd-gen Atoms.
1st gen: SSE is about 4x faster than x87 FPU.
2nd gen: SSE is about 5x faster than x87 FPU.

That quote was from the Bonnell section of Agner Fog's microarchitecture document. I looked a bit further and found this:

15.12 Bottlenecks in Silvermont
...
Execution ports and execution units
...
The disastrous performance of the Atom on legacy x87 code has finally been repaired.

Perhaps it's just that Silvermont improved SSE performance even more.

Falcosoft wrote on 2025-06-30, 07:18:

In DOS there is no turbo boost or any other dynamic frequency adjustment. That needs a processor driver, which DOS lacks.
With utilities you can set the multiplier manually even in DOS, but without such manual adjustment the CPU runs constantly at the nominal speed shown by the TSC value.
At least this is 100% true for Sandy/Ivy Bridge (personal experience), but it seems the situation is the same for Haswell, too:
Re: CpuSpd - A Hardware Based CPU Speed Control Utility for DOS/Win9X Retro Gaming

BTW, does even this Xeon system of yours run without any version of Windows?
It would be nice to see how it performs under Windows.

Fair enough, I did wonder initially whether it would need a driver to reach its turbo speeds.

The Xeon system is being offloaded to someone else and has no storage devices; I almost forgot that I still have it. It had been running only Ubuntu Linux for about 5 years anyway.

Falcosoft wrote on 2025-06-30, 22:50:

So your DOS Brisbane results are most likely due to the difference between the DOS and Windows versions of the benchmark...

That would appear to be the case. I re-ran it under Windows XP and the results are slightly slower all round, with the ALU pretty much matching the other K8 results in Windows. It doesn't explain why your Turion was still the same speed in DOS, though.

I thought your Core 2 Duo results looked a bit lonely so here are 2 more, an "Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz" and "Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz". Again the ALU results are 7% higher while the others are closer.

Reply 67 of 72, by Falcosoft

Rank l33t
jtchip wrote on 2025-07-01, 00:31:

...
That would appear to be the case. I re-ran it under Windows XP and the results are slightly slower all round, with the ALU pretty much matching the other K8 results in Windows. It doesn't explain why your Turion was still the same speed in DOS, though.

Would you check whether your DOS version is the same (date: 2025-06-25 08:56:22; size: 72,239 bytes) as the one that can be downloaded from here?
https://falcosoft.hu/manbench.zip

The first time, I uploaded a version that could produce faster ALU results, but I corrected it within hours. I saw that it had been downloaded only 2 times, and I thought you would get the correct one by the time you usually log in.
So even the version attached in the forum should be the right one now, but maybe some cached copy could cause an issue. So let's just make sure.

I think there may be a version mismatch, because under DOS I cannot reproduce ALU results as high as yours with my Core 2 Duo E7500 either. The 1GHz-normalized ALU result is always about 36-37. The other 1GHz-normalized results match yours perfectly.

CPU Vendor: GenuineIntel
CPU ID: 01067A
CPU speed: 3958 MHz
System: DOS 7.10
Mode: Text

        Time(msec)  Pixels/msec  Pixels/msec(1GHz)
ALU:          5484       143.40              36.23
FPU:          6234       126.15              31.87
3DNow!:        N/A          N/A                N/A
SSE:          1805       435.70             110.08
SSE2:         3286       239.33              60.47
AVX:           N/A          N/A                N/A

BTW, it's obvious that performance does not scale perfectly linearly with frequency, although the MandelX benchmark is closer to linear scaling than other benchmarks.
But CPUs with higher frequencies are still at somewhat of a disadvantage when the 1GHz-normalized values are calculated.
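
For reference, the 1GHz column of the E7500 table above can be reproduced with a small sketch like the one below. The function names are made up for this example, and the pixel count of 1024 x 768 is an inference from multiplying the Time and Pixels/msec columns, not something stated in the thread.

```python
PIXELS = 1024 * 768  # assumption: inferred from time * rate in the table


def pixels_per_ms(pixels: int, time_ms: float) -> float:
    """Raw throughput: pixels computed per millisecond."""
    return pixels / time_ms


def normalize_to_1ghz(px_per_ms: float, cpu_mhz: float) -> float:
    """Scale a raw result linearly to a 1000 MHz clock."""
    return px_per_ms / (cpu_mhz / 1000.0)


if __name__ == "__main__":
    cpu_mhz = 3958  # measured CPU speed from the table
    times_ms = {"ALU": 5484, "FPU": 6234, "SSE": 1805, "SSE2": 3286}
    for name, time_ms in times_ms.items():
        rate = pixels_per_ms(PIXELS, time_ms)
        norm = normalize_to_1ghz(rate, cpu_mhz)
        print(f"{name:5s} {rate:8.2f} {norm:8.2f}")
```

The normalization is a plain linear division by the clock, which is why anything that does not speed up with the core clock (memory access, for example) pulls the normalized value down on faster CPUs.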

Reply 68 of 72, by jtchip

Rank Member
Falcosoft wrote on 2025-07-01, 05:06:

Would you check whether your DOS version is the same (date: 2025-06-25 08:56:22; size: 72,239 bytes) as the one that can be downloaded from here?
https://falcosoft.hu/manbench.zip

The first time, I uploaded a version that could produce faster ALU results, but I corrected it within hours. I saw that it had been downloaded only 2 times, and I thought you would get the correct one by the time you usually log in.
So even the version attached in the forum should be the right one now, but maybe some cached copy could cause an issue. So let's just make sure.

Yep, that explains it. I would have been one of the two people who downloaded the older version (2025-06-25 01:45:22, 72,059 bytes, with no text files), as I sometimes browse at different times without logging in. You should have pointed out that the archive was updated.

Anyway, I've re-run it on the systems that are convenient to access and the ALU results are more in line with the Windows version now.

Reply 69 of 72, by Falcosoft

Rank l33t
jtchip wrote on 2025-07-01, 23:17:
Falcosoft wrote on 2025-07-01, 05:06:

Would you check whether your DOS version is the same (date: 2025-06-25 08:56:22; size: 72,239 bytes) as the one that can be downloaded from here?
https://falcosoft.hu/manbench.zip

The first time, I uploaded a version that could produce faster ALU results, but I corrected it within hours. I saw that it had been downloaded only 2 times, and I thought you would get the correct one by the time you usually log in.
So even the version attached in the forum should be the right one now, but maybe some cached copy could cause an issue. So let's just make sure.

Yep, that explains it. I would have been one of the two people who downloaded the older version (2025-06-25 01:45:22, 72,059 bytes, with no text files), as I sometimes browse at different times without logging in. You should have pointed out that the archive was updated.

Anyway, I've re-run it on the systems that are convenient to access and the ALU results are more in line with the Windows version now.

Yes, I definitely should have made a remark about this change...
But I was convinced that there was no problem, since I saw you logged in and replied on 26 Jun 2025, 01:48, when the attachment had already been the right one for hours.

Thanks for the new results anyway. I have uploaded the new ones and corrected the existing ones.
I have also corrected the other 2 records affected by the problematic DOS version (Kabini APU, Geode) by giving the ALU results a 5% penalty.

Reply 70 of 72, by Falcosoft

Rank l33t

It seems the 'standard' DOS version provides reliable and comparable results for all code paths.
I do not think it's a problem that the DOS version is generally a few percent faster, since this kind of performance difference also exists between Windows versions.
There is a clear trend that earlier Windows versions are generally faster. So, based on the available results, the performance order of the OS versions is as follows:
DOS > Windows 98/Windows 2000/Windows XP 32-bit > Windows 7/8/10/11 64-bit.

Reply 71 of 72, by myne

Rank Oldbie

Not at all surprising.
DOS basically sits in the background and hands full control to the app.
Windows has added a billion background services over the years and they've become increasingly complex.
Compare the memory footprint of NT4's 300 KB services to Windows 11's 10 MB services - even the exact same services are bigger. Like it somehow needs more code to run Windows file sharing.
Some is probably caching, but I suspect it is mostly a combination of fixing every 1-in-3bn edge case ever encountered, security additions, and pure bloat.

I built:
Convert old ASUS ASC boardviews to KICAD PCB!
Re: A comprehensive guide to install and play MechWarrior 2 on new versions on Windows.
Dos+Windows 3.11+tcp+vbe_svga auto-install iso template
Script to backup Win9x\ME drivers from a working install
Re: The thing no one asked for: KICAD 440bx reference schematic

Reply 72 of 72, by Falcosoft

Rank l33t

If anyone is interested, I have added a result comparison option to the results web page:

https://falcosoft.hu/mandelx_benchmark_results.php

The attachment web_compare.png is no longer available
