VOGONS


First post, by OttoPS

Rank Newbie

Hi everyone,

I want to share p4tool, a DOS utility for Intel Pentium 4 / NetBurst systems focused on performance control beyond traditional slowdown methods.

Project page:
GitHub repository

Most slowdown tools on these systems rely on one of these approaches:

ODCM duty-cycle throttling

  • Reduces CPU performance
  • Also reduces external bus throughput proportionally
  • Affects overall system responsiveness (video, memory, I/O)

In practice, this behaves as a global throttle affecting the entire platform.
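
For readers unfamiliar with ODCM, here is a minimal sketch of how a duty-cycle setting is encoded, assuming the documented IA32_CLOCK_MODULATION layout (MSR 0x19A, enable in bit 4, duty level in bits 3:1). It only computes register values. The function names are made up for illustration, and how p4tool's own step numbers map to duty levels is not stated in this thread.

```python
IA32_CLOCK_MODULATION = 0x19A  # MSR address per the Intel SDM

ODCM_ENABLE = 1 << 4  # on-demand clock modulation enable
DUTY_SHIFT = 1        # duty-cycle field occupies bits 3:1


def odcm_value(level: int) -> int:
    """Return the register value for duty level 1..7 (level/8 of full speed)."""
    if not 1 <= level <= 7:
        raise ValueError("duty level must be 1..7 (0 is reserved)")
    return ODCM_ENABLE | (level << DUTY_SHIFT)


def duty_percent(level: int) -> float:
    """Nominal share of time the core clock is running, in percent."""
    return level * 12.5


if __name__ == "__main__":
    for level in range(1, 8):
        print(f"level {level}: value 0x{odcm_value(level):02X}, "
              f"{duty_percent(level)}% duty")
```

Because the clock is gated for the whole core, everything that depends on the CPU issuing requests slows down with it, which matches the "global throttle" behavior described above.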

Cache disabling via CR0

  • Produces inconsistent or limited effects on NetBurst
  • Does not fully disable all cache structures (trace/uop cache remains active)
  • Not suitable for fine-grained control
  • Can be overridden by software (e.g. Ultima VII re-enabling cache at runtime)

This makes CR0-based approaches unreliable.
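
The override problem can be illustrated with plain bit arithmetic. This is a sketch assuming the architectural CR0 layout (CD in bit 30, NW in bit 29). The example CR0 value and the helper names are hypothetical, and a real tool would also have to flush the caches (WBINVD) after setting CD.

```python
CR0_NW = 1 << 29  # Not Write-through
CR0_CD = 1 << 30  # Cache Disable


def cr0_cache_disabled(cr0: int) -> int:
    """Return cr0 with CD=1, NW=0 (no-fill cache mode)."""
    return (cr0 | CR0_CD) & ~CR0_NW


def cr0_cache_enabled(cr0: int) -> int:
    """Return cr0 with CD=0, NW=0 (normal caching)."""
    return cr0 & ~(CR0_CD | CR0_NW)


def is_cache_disabled(cr0: int) -> bool:
    return bool(cr0 & CR0_CD)


if __name__ == "__main__":
    cr0 = 0x00000010                      # hypothetical starting value
    slowed = cr0_cache_disabled(cr0)      # what a CR0-based slowdown tool sets
    restored = cr0_cache_enabled(slowed)  # what a game may write back later
    print(hex(slowed), is_cache_disabled(slowed))
    print(hex(restored), is_cache_disabled(restored))
```

Any program running in ring 0 or real mode can perform the second step, so the slowdown lasts only as long as nothing else touches CR0.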

What p4tool does differently

p4tool avoids CR0-based mechanisms and instead relies on:

  • MSR-based control
  • MTRR memory policy manipulation
  • Debug Store (DS) / Branch Trace Store (BTS) effects

These allow independent control over:

  • CPU execution behavior
  • Memory access characteristics
  • Overall system responsiveness

Techniques implemented

  • ODCM throttling (baseline reference)
  • MSR-based full cache disable (true global uncached mode)
  • Debug Store (DS) slowdown
  • Debug Store + BTS slowdown
  • MTRR manipulation (main RAM, base 0)
  • IA32_MTRRdefType override (global memory type)

Key observations

These techniques behave very differently:

  • ODCM -> global slowdown (CPU + memory + bus + video all degrade together)
  • MSR full uncached mode -> strong, consistent system-wide slowdown
  • DS / BTS -> CPU execution degradation without bus impact
  • MTRR (range-based) -> memory behavior changes without affecting instruction flow

Some methods can significantly reduce CPU performance while keeping video throughput relatively stable, unlike ODCM.

NetBurst-specific notes

  • CR0 does not fully disable cache effects
  • Range-based MTRRs only affect the data path
  • Trace cache (uop cache) remains active unless global policies are used

Because of this:

p4tool does not rely on CR0

Also, some DOS software (like Ultima VII) modifies CR0 at runtime, which can break traditional slowdown tools.

Using IA32_MTRRdefType instead provides a stable and consistent global uncached mode.
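
As a rough illustration of what a global override through this register looks like, here is a sketch assuming the documented IA32_MTRR_DEF_TYPE layout (MSR 0x2FF, default type in bits 7:0, FE in bit 10, E in bit 11). When E is clear, all physical memory is treated as UC. The starting value 0xC06 is a made-up example, and which bits p4tool actually changes is not stated here.

```python
IA32_MTRR_DEF_TYPE = 0x2FF  # MSR address per the Intel SDM

MTRR_ENABLE = 1 << 11        # E: MTRRs enabled
MTRR_FIXED_ENABLE = 1 << 10  # FE: fixed-range MTRRs enabled
TYPE_MASK = 0xFF             # default memory type field
MEMORY_TYPES = {"UC": 0, "WC": 1, "WT": 4, "WP": 5, "WB": 6}
TYPE_NAMES = {code: name for name, code in MEMORY_TYPES.items()}


def set_default_type(value: int, mem_type: str) -> int:
    """Return value with the default-type field replaced."""
    return (value & ~TYPE_MASK) | MEMORY_TYPES[mem_type]


def disable_mtrrs(value: int) -> int:
    """Return value with E cleared, which forces UC for all memory."""
    return value & ~MTRR_ENABLE


def effective_default(value: int) -> str:
    """Memory type applied to addresses not covered by any range MTRR."""
    if not value & MTRR_ENABLE:
        return "UC"
    return TYPE_NAMES.get(value & TYPE_MASK, "reserved")


if __name__ == "__main__":
    before = 0xC06  # hypothetical: E=1, FE=1, default type WB
    after = disable_mtrrs(before)
    print(hex(before), effective_default(before))
    print(hex(after), effective_default(after))
```

Unlike CR0.CD, this state lives in an MSR that ordinary DOS software has no reason to rewrite, which is consistent with the stability described above.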

Planned comparisons (SpeedSys)

ODCM-only slowdown

  • CPU performance reduced
  • Memory throughput reduced
  • Video bandwidth significantly degraded
  • Fully proportional slowdown across the platform
The attachment Baseline+o1.jpeg is no longer available

Full cache disable (MSR-based)

  • Strong CPU performance reduction
  • Memory access significantly slower
  • Consistent and predictable behavior
  • Affects both data cache and trace cache

This produces a true global uncached state on NetBurst.

The attachment Baseline+cd.jpeg is no longer available

Debug Store / BTS slowdown

  • CPU performance reduced
  • Memory behavior affected differently
  • Does not behave like a cache-level slowdown
  • Video throughput remains comparatively stable
The attachment Baseline+ds.jpeg is no longer available
The attachment Baseline+dsbts.jpeg is no longer available

MTRR (main RAM base 0)

  • Strong impact on memory throughput
  • CPU affected indirectly
  • Different profile compared to full uncached mode
The attachment Baseline+mtrr0uc.jpeg is no longer available
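
For the range-based case, here is a sketch of how a variable-range MTRR pair describing an uncached region at base 0 could be computed. It assumes the documented IA32_MTRR_PHYSBASE0/PHYSMASK0 layout and 36-bit physical addressing. The 512 MB size is only an example, and the helper name is hypothetical.

```python
IA32_MTRR_PHYSBASE0 = 0x200
IA32_MTRR_PHYSMASK0 = 0x201

PHYS_ADDR_BITS = 36   # assumption: 36-bit physical addressing
MASK_VALID = 1 << 11  # V bit in the PHYSMASK register
UC = 0                # uncacheable memory type


def mtrr_pair(base: int, size: int, mem_type: int,
              addr_bits: int = PHYS_ADDR_BITS) -> tuple[int, int]:
    """Return (PHYSBASE, PHYSMASK) values for one naturally aligned range."""
    if size < 0x1000 or size & (size - 1):
        raise ValueError("size must be a power of two of at least 4 KB")
    if base % size:
        raise ValueError("base must be aligned to size")
    addr_mask = ((1 << addr_bits) - 1) & ~0xFFF  # usable address bits 12 and up
    physbase = (base & addr_mask) | mem_type
    physmask = (~(size - 1) & addr_mask) | MASK_VALID
    return physbase, physmask


if __name__ == "__main__":
    base_val, mask_val = mtrr_pair(0, 512 * 1024 * 1024, UC)
    print(f"PHYSBASE0 = 0x{base_val:09X}")
    print(f"PHYSMASK0 = 0x{mask_val:09X}")
```

A range like this changes how loads and stores to that region are handled, while instruction delivery from the trace cache is a separate matter, as noted in the NetBurst-specific points earlier.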

Combined techniques

  • Fine-grained performance tuning
  • Intermediate performance levels
  • Better balance between CPU / memory / video

Goal

The goal is not just to slow down a Pentium 4 system, but to make it usable across performance ranges that are normally:

  • Too fast with standard throttling
  • Or far too slow with cache-based approaches

If there’s interest, I can also share more technical details about:

  • MSR-based cache/memory control
  • Debug Store / BTS behavior on NetBurst
  • Practical differences between slowdown techniques

Reply 1 of 13, by Shponglefan

Rank l33t

Thank you for this! I've got a Pentium 4 (Cedar Mill) retro system that I will try this tool out with.

I have been using cache disabling and ODCM throttling in the past. I did notice with Ultima VII that ODCM throttling does cause video delays for things like the earthquake effect at the beginning of the game. I'll be curious to see how your throttling tool works in comparison.

Pentium 4 Multi-OS Build
486 DX4-100 with 6 sound cards
486 DX-33 with 5 sound cards

Reply 2 of 13, by OttoPS

Rank Newbie

Hi! I hope this works for you!

Ultima VII is a special case because it's speed-sensitive and also has the infamous 5-step stuttering.

In my case, on a 2.8 GHz Northwood CPU, the command line I use is "p4tool cd dsbts".
This disables the CPU cache completely, in a way that Ultima VII cannot re-enable, and it also degrades general performance by generating Debug Store and memory transactions.

Please let me know how it goes. I have other options in mind to experiment with for throttling on Pentium 4 😀

Reply 3 of 13, by mockingbird

Rank Oldbie

Fantastic! I will have to test if King's Quest 6 works properly with this (it does not work properly on fast systems and any tool I've tried so far has failed to fix it).

mslrlv.png
(Decommissioned:)
7ivtic.png

Reply 4 of 13, by OttoPS

Rank Newbie

Hi! The KQ6 case is similar to Ultima VII's. I would first try "p4tool cd dsbts" and, if it still looks too fast, I would try adding some ODCM steps (e.g., "p4tool cd dsbts o7").

I haven't tried Cedar Mill yet, so let me know how it goes.

Reply 5 of 13, by Shponglefan

Rank l33t

Did some initial benchmarking with 3DBench 1.0c using P4Tool. I specifically tested the cache disabling, DS, and DS+BTS settings.

I'm using a P4 651 Cedar Mill (D0) processor. It also allows for different multipliers ranging from 12x to 17x. I used CPUSPD to change the multiplier settings.

One thing I did notice was that if I tried running CPUSPD after using P4TOOL, my computer would reboot. So I needed to use CPUSPD beforehand to set the multiplier, then run P4TOOL.

No memory managers (e.g. EMM386) were loaded during these benchmarks. The FPS reported by 3DBench before using P4Tool is 394.0 (17x multiplier) and 375.2 (12x multiplier).

Based on these results, it gives a nice range of scores approximating a mid-range Pentium at the high end to a 386 at the low end. Changing the multiplier allows for a lot of granularity in between.
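
A quick bit of arithmetic on the two unthrottled baseline scores quoted above. The 200 MHz bus clock is an assumption (the stock value for a P4 651); the FPS figures are the ones reported in this post.

```python
BUS_MHZ = 200  # assumption: stock bus clock of the P4 651

baseline_fps = {17: 394.0, 12: 375.2}  # multiplier -> 3DBench FPS, from this post

clock_mhz = {mult: mult * BUS_MHZ for mult in baseline_fps}
clock_ratio = clock_mhz[12] / clock_mhz[17]
fps_ratio = baseline_fps[12] / baseline_fps[17]

if __name__ == "__main__":
    print(f"core clock: {clock_mhz[17]} MHz -> {clock_mhz[12]} MHz "
          f"(ratio {clock_ratio:.3f})")
    print(f"3DBench:    {baseline_fps[17]} -> {baseline_fps[12]} FPS "
          f"(ratio {fps_ratio:.3f})")
```

The clock drops by roughly 29% while the score drops by only about 5%, which suggests that the unthrottled 3DBench result is limited by something other than core clock. The multiplier is likely to have a more visible effect once one of the P4Tool slowdown modes is active.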


Reply 6 of 13, by Shponglefan

Rank l33t

Did some further testing. In particular, I wanted to test out Descent and Dark Forces. Descent generally requires a Pentium-level system or the in-game movement is too fast. Dark Forces has an issue with General MIDI where too fast a system causes it to freeze upon playback.

Previously I was using ODCM for these on my Pentium 4. With P4Tool, I just used the Debug Store (DS) setting. I also installed JEMM386 and HIMEMX, in place of EMM386 and HIMEM.SYS.

Both Descent and Dark Forces worked great with this throttling method. Descent's in-game movement felt more like a Pentium and Dark Forces' General MIDI playback worked perfectly.

So far I'm impressed. This new tool makes the Pentium 4 even more versatile when it comes to throttling. I do strongly believe that the Pentium 4 (esp. with D0 Cedar Mill processors) is one of the single best retro platforms for this reason.


Reply 7 of 13, by Shponglefan

Rank l33t

Did some more testing and benchmarking. I wanted to compare performance between actual Pentium systems and P4Tool using the Debug Store (DS) option.

For this comparison I enabled Debug Store and tested the P4 at both 17x and 12x multipliers. I compared these results to prior Pentium 133 and Pentium 90 benchmarks I've done.

Pentium 133 specs: Gigabyte GA-586ATE/P motherboard, Diamond Stealth64 2001 (S3 Trio 64+), 32MB RAM
Pentium 90 specs: Intel Premiere/PCI II motherboard, Diamond Stealth 64 (S3 Vision964), 16MB RAM

JEMM386 and HIMEMX were enabled on the Pentium 4 for these benchmarks.

The attachment Pentium 4 651 (Cedar Mill) vs Pentium 133 vs Pentium 90.png is no longer available

Benchmark results varied somewhat. 3D Bench was fastest on the Pentium 133, whereas Terminal Velocity was faster on the Pentium 4. But this gives a good idea of how Debug Store throttling performs relative to real Pentium systems.


Reply 8 of 13, by Shponglefan

Rank l33t

More benchmarks, this time testing the low end of performance with cache disabled. The first chart uses the 17x multiplier for the Pentium 4, the second chart the 12x multiplier. JEMM386 and HIMEMX were used for these benchmarks.

For comparison, I included benchmarks from my 486 DX-33 and 386 DX-40 systems.

At the lowest level of performance, with the 12x multiplier, the DS+BTS option, and cache fully disabled, the Pentium 4 is at about the level of a 386 DX-40.


Reply 9 of 13, by OttoPS

Rank Newbie
Shponglefan wrote on 2026-04-04, 18:27:

Did some initial benchmarking with 3DBench 1.0c using P4Tool. I specifically tested the cache disabling, DS, and DS+BTS settings.

I'm using a P4 651 Cedar Mill (D0) processor. It also allows for different multipliers ranging from 12x to 17x. I used CPUSPD to change the multiplier settings.

One thing I did notice was that if I tried running CPUSPD after using P4TOOL, my computer would reboot. So I needed to use CPUSPD beforehand to set the multiplier, then run P4TOOL.

No memory managers (e.g. EMM386) were loaded during these benchmarks. The FPS reported by 3DBench before using P4Tool is 394.0 (17x multiplier) and 375.2 (12x multiplier).

Based on these results, it gives a nice range of scores approximating a mid-range Pentium at the high end to a 386 at the low end. Changing the multiplier allows for a lot of granularity in between.

Hi! These benchmarks are very useful for improving P4Tool. Currently, I only have one motherboard with socket 478, so I can't test all the Pentium 4 CPU families.

Regarding the incompatibility with CPUSPD, it might be related to MSR handling: both tools may write to the same MSRs, and CPUSPD may not be aware of, or preserve, the settings P4Tool has placed there.
I'm going to add SpeedStep multiplier configurations to P4Tool to avoid these issues by centralizing all possible performance parameters for NetBurst and eliminating the need to use both tools in parallel.
I'll let you know here when the new release is ready if you'd like to try it.
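
If two tools really do share a register, the usual way to avoid clobbering each other's settings is a read-modify-write that touches only the field being changed. The sketch below is generic: the field positions and values are invented for illustration and do not describe the actual MSRs used by CPUSPD or P4Tool.

```python
def update_field(current: int, mask: int, new_bits: int) -> int:
    """Replace only the bits selected by mask, keeping everything else."""
    if new_bits & ~mask:
        raise ValueError("new_bits has bits set outside the mask")
    return (current & ~mask) | new_bits


if __name__ == "__main__":
    # Hypothetical register: tool A owns bits 4:1, tool B owns bits 15:8.
    register = 0x1E                  # value left behind by tool A
    tool_b_mask = 0xFF00
    tool_b_bits = 0x0C00

    blind_write = tool_b_bits        # overwrites tool A's field with zeros
    careful_write = update_field(register, tool_b_mask, tool_b_bits)

    print(f"blind write:       0x{blind_write:04X}")
    print(f"read-modify-write: 0x{careful_write:04X}")
```

Folding the multiplier control into one tool, as planned above, sidesteps the problem entirely because a single program then owns every field.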

Thank you very much for your testing!

Reply 10 of 13, by mockingbird

Rank Oldbie
OttoPS wrote on 2026-04-01, 14:13:

Hi! The KQ6 case is similar to Ultima VII's. I would first try "p4tool cd dsbts" and, if it still looks too fast, I would try adding some ODCM steps (e.g., "p4tool cd dsbts o7").

Thanks, got around to testing, it's completely broken for me. My CPU is a Pentium 4-M (Northwood). The symptom is that after I run the utility (with any parameter), I just get a blinking cursor and the system seems frozen.

EDIT: Hanging solved by booting with only HIMEM and not EMM386 (my fault for not reading the documentation). I will have to add a boot option for JEMMEX and continue my testing. Will report back.
EDIT: "p4tool cd dsbts" resulted in the odd throttling you get with cache disabling schemes, so I abandoned that path. I tested with "dsbts o7", but that's not enough to prevent crashing in KQ6. So enabling Debug Store/Branch Trace Store with ODCM isn't sufficient in this case.
EDIT: Nevertheless, the slowdown is still significant with only 'dsbts'. Without it, my P4 @ 1.6 GHz scores at around the level of a 1 GHz Coppermine in SpeedSys. With only dsbts enabled, it scores a tiny bit faster than an Am5x86-133.

Last edited by mockingbird on 2026-04-05, 20:38. Edited 2 times in total.


Reply 11 of 13, by Shponglefan

Rank l33t
OttoPS wrote on 2026-04-05, 17:41:

I'm going to add SpeedStep multiplier configurations to P4Tool to avoid these issues by centralizing all possible performance parameters for NetBurst and eliminating the need to use both tools in parallel.
I'll let you know here when the new release is ready if you'd like to try it.

Thank you very much for your testing!

Adding a multiplier option would be awesome. I'm happy to test it out, thank you for creating an awesome tool! 😀


Reply 12 of 13, by OttoPS

Rank Newbie

About DS/BTS and why it causes extreme slowdown on NetBurst

Since this came up while testing p4tool, I wanted to share some findings about Debug Store (DS) and Branch Trace Store (BTS) and their impact on performance.

DS/BTS are hardware tracing mechanisms used for debugging and profiling. They allow recording the execution flow of branches at runtime.

On NetBurst systems, enabling BTS can cause very large slowdowns, far beyond what would be expected from memory overhead alone.
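
For context, here is a sketch of the control bits involved, assuming the NetBurst-specific MSR_DEBUGCTLA layout (MSR 0x1D9, TR in bit 2, BTS in bit 3, BTINT in bit 4). Later Intel cores place TR and BTS in bits 6 and 7 instead. The helper names are hypothetical, this is not taken from p4tool, and a real implementation must first point IA32_DS_AREA (MSR 0x600) at a valid buffer management area.

```python
MSR_DEBUGCTLA = 0x1D9  # NetBurst debug control MSR
IA32_DS_AREA = 0x600   # linear address of the DS buffer management area

LBR = 1 << 0    # last branch recording
BTF = 1 << 1    # single-step on branches
TR = 1 << 2     # generate branch trace messages
BTS = 1 << 3    # store branch trace messages in the DS buffer
BTINT = 1 << 4  # interrupt when the BTS buffer is nearly full


def enable_bts(value: int) -> int:
    """Return value with branch tracing routed to the BTS buffer."""
    return value | TR | BTS


def disable_bts(value: int) -> int:
    """Return value with branch trace storage switched off."""
    return value & ~(TR | BTS | BTINT)


if __name__ == "__main__":
    on = enable_bts(0)
    print(f"MSR 0x{MSR_DEBUGCTLA:X} with BTS on:  0x{on:02X}")
    print(f"MSR 0x{MSR_DEBUGCTLA:X} with BTS off: 0x{disable_bts(on):02X}")
```

With BTINT left clear the buffer is treated as circular, so no interrupt handler is needed for tracing to keep running.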

Some relevant discussions from Intel:

https://community.intel.com/t5/Software-Archi … ance/m-p/995654
https://community.intel.com/t5/Software-Tunin … tore/m-p/806100

In particular, Intel engineering mentioned:

BTS requires the processor to clear the pipeline on every taken branch and to drain the memory subsystem to ensure correct ordering of store events.

This has a major impact on NetBurst due to its architecture:

  • Very deep pipeline
  • Heavy reliance on speculative execution
  • Strong dependence on branch prediction
  • Trace cache front-end (uop cache based on predicted paths)

If the pipeline is cleared on every taken branch, the CPU is effectively forced to:

  • Continuously discard speculative work
  • Rebuild execution state
  • Lose most instruction-level parallelism

This turns what would normally be a highly speculative, high-throughput design into something much closer to serialized execution.

An important observation is that:

  • The slowdown is not primarily caused by memory bandwidth
  • Even with buffers in cache, the penalty remains very high
  • The main cost appears to be internal to the core (pipeline + execution flow disruption)

In practice, this results in extremely large slowdowns in tight loops, sometimes by two orders of magnitude.
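
A back-of-the-envelope model shows how a fixed per-branch cost can reach that magnitude. All numbers below are illustrative, not measurements: the model simply adds a constant penalty (pipeline clear plus drain) to every taken branch and compares the result with the untraced cycle count.

```python
def slowdown(instr_per_branch: float, base_ipc: float,
             flush_penalty_cycles: float) -> float:
    """Ratio of traced to untraced cycles per taken branch.

    instr_per_branch: average instructions between taken branches
    base_ipc: instructions per cycle without tracing
    flush_penalty_cycles: assumed extra cycles per taken branch with BTS on
    """
    base_cycles = instr_per_branch / base_ipc
    traced_cycles = base_cycles + flush_penalty_cycles
    return traced_cycles / base_cycles


if __name__ == "__main__":
    # Hypothetical tight loop: 5 instructions per taken branch at IPC 1.0.
    for penalty in (20, 100, 500):
        print(f"penalty {penalty:>3} cycles -> "
              f"{slowdown(5, 1.0, penalty):.1f}x slower")
```

The shorter the loop body, the larger the ratio, which is why tight loops suffer the most while code with long straight-line stretches is hit less hard.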

Interestingly, this behavior can be used as a way to reduce CPU execution speed without proportionally affecting bus or video throughput, which makes it quite different from ODCM or cache disabling approaches.

Reply 13 of 13, by Shponglefan

Rank l33t

That’s good information to know about how Debug Store works. I’m amazed it causes such a dramatic slowdown, but at the same time I'm not complaining. It seems ideal for throttling Pentium 4 systems into Pentium or 486 range. Also good to know that it doesn’t affect video or bus speeds.

Speaking of which, I did some further testing of a couple more speed-sensitive titles, Blackthorne and Dynablaster.

Blackthorne has an issue where music won’t play if the system speed is too fast. This typically affects the intro cutscene and/or in-game level music. Previously I would just disable cache to get things working. I tried Blackthorne with both the Debug Store and DS + BTS options. In both cases, music played back just fine.

Dynablaster has a speed sensitive issue where it will fail to detect the sound card or freeze up when playing digital sound. Like Blackthorne, I would usually just disable cache. But trying both the DS and DS+BTS options, Dynablaster worked perfectly.
