VOGONS


First post, by mkarcher

A recurring theme in 486 CPU identification is the inherent difficulty of distinguishing an Am486DX4 NV8T operating at 2*50MHz from one operating at 3*33MHz. To show that this problem is solvable, I wrote a tool that uses microbenchmarking to detect the clock multiplier. It uses the fact that the 486 waits for a new FSB cycle to begin if an instruction needs to fetch data from outside of the chip (i.e. an L1 cache read miss). This means that the synchronizing instruction can take some extra core clocks to execute, depending on the phase of the FSB clock at the start of the instruction. My tool tries to detect how many different phases exist - this is the clock multiplier.
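The effect described above can be illustrated with a toy model. This is a Python sketch, not code from the 486mult source. It assumes each loop iteration costs a fixed number of core clocks plus one per NOP, and that the cache-missing load at the end cannot complete before the next FSB clock edge. The names `loop_core_clocks`, `staircase` and the value of `base` are made up for the illustration.

```python
def loop_core_clocks(nops, multiplier, base=10):
    """Core clocks for one loop iteration in a toy model.

    The loop body costs `base` core clocks plus one per NOP, and the
    cache-missing load stalls until the next FSB edge, so the total is
    rounded up to a multiple of `multiplier`.
    `base` is an arbitrary illustrative value, not a measured one.
    """
    busy = base + nops
    return -(-busy // multiplier) * multiplier  # ceiling to a full FSB clock


def staircase(multiplier, samples=8, base=10):
    """Modelled timings for 0..samples-1 NOPs at the given multiplier."""
    return [loop_core_clocks(n, multiplier, base) for n in range(samples)]


if __name__ == "__main__":
    for mult in (1, 2, 3):
        print(mult, staircase(mult))
```

In this model a x1 part gives a straight ramp, while x2 and x3 give plateaus that are 2 and 3 samples wide, which is the pattern the tool looks for.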

See https://github.com/karcherm/486mult for that tool, including source code (but the source code does not include comments explaining the theory of operation). Release 1.0 includes the EXE file I used to test on various 486 CPUs.

The approach will not work if the L1 cache is disabled (and likely also if there is no L1 cache, so I don't think it could identify a clock-doubled 386, if a chip like that even exists), but I expect it to correctly pick up clock doubling on SLC2-type 486 processors. The primary assumptions are that a NOP takes 1 cycle to execute, that the L1 cache is at most 4-way associative, and that the cache size is 16K or less. This should apply to all 486-class processors. It is known to fail on a Cyrix 5x86.

Reply 1 of 43, by red-ray

I just downloaded a .ZIP and there does not seem to be a .EXE, where is it please?

I wonder if it's possible to detect that an SX954 @ 66 MHz is an SX954 rather than an SX955. I have a fake SX807 that probably is an SX954.

Reply 2 of 43, by mkarcher

red-ray wrote on 2025-04-18, 23:52:

I just downloaded a .ZIP and there does not seem to be a .EXE, where is it please?

I'm sorry, I should have been more explicit. I mentioned "Release 1.0", which you can reach from the right side-bar in the project overview. The EXE file is not part of the version-tracked source files, but is added as an additional artifact when I prepare a "Release". See this link: https://github.com/karcherm/486mult/releases/tag/v1.0

Reply 3 of 43, by red-ray

OK, are you aware Windows 10/11 is not enamoured with the .EXE? Should you, or can you, digitally sign it?

I feel you should also specify which environment it needs, I guess DOS, would a Windows NT V4.00 SP6a command window be OK?

Last edited by red-ray on 2025-04-19, 02:23. Edited 2 times in total.

Reply 4 of 43, by jakethompson1

It's almost certainly a DOS program; no way to sign it even if you wanted to.

Reply 5 of 43, by mkarcher

red-ray wrote on 2025-04-19, 02:02:

OK, are you aware Windows 10/11 is not enamoured with the .EXE? Should you, or can you, digitally sign it?

The last Windows version that can officially execute on 486 processors is Windows 2000 for x86 in 32 bits. I don't care how later Windows versions treat that EXE file, as it does not make sense to run it on a newer processor. The detection method is specifically tied to the 486 core design, and will break on anything newer. Even the Cyrix 5x86 is detected as "1x", because the more advanced execution unit doesn't have the "bottleneck" my code tries to make use of.

red-ray wrote on 2025-04-19, 02:02:

I feel you should also specify which environment it needs, I guess DOS.

Yeah, it's a DOS executable. It's possible that it works in NTVDM on an unloaded machine, but for reliable results, the tool expects "CLI" to do what it is supposed to do, and it also expects direct access to the timer hardware (Ports 42h, 43h and the "system control port" 61h, bits 0 and 1, which affect timer and speaker operation).
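For context on the timer behind those ports: the PC timer chip is clocked at the standard rate of about 1.193182 MHz, so the resolution and the wrap-around time of its 16-bit counter follow from simple arithmetic. The Python sketch below only does that arithmetic; the function names are made up and nothing here is taken from the tool's source.

```python
PIT_HZ = 1_193_182  # standard PC timer input clock (14.31818 MHz / 12)


def ticks_to_us(ticks):
    """Convert timer ticks to microseconds."""
    return ticks * 1_000_000 / PIT_HZ


def wrap_time_ms(counter_bits=16):
    """Time until a free-running counter of the given width wraps."""
    return (1 << counter_bits) * 1000 / PIT_HZ


if __name__ == "__main__":
    print(f"one tick : {ticks_to_us(1) * 1000:.1f} ns")
    print(f"wrap time: {wrap_time_ms():.2f} ms")
```

One tick is roughly 838 ns, and a measurement that runs longer than about 55 ms overflows the 16-bit counter.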

Reply 6 of 43, by jakethompson1

To the extent that the signature issue makes it a hassle to download the file on a modern Windows machine, I should point out that Windows 10+ come with curl out of the box, so downloading it from the command line (where the "mark of the web" alternate data stream, which is causing all those issues above, won't be set) is an option.

Reply 7 of 43, by mkarcher

jakethompson1 wrote on 2025-04-19, 02:10:

It's almost certainly a DOS program; no way to sign it even if you wanted to.

Well, I could put the "real tool" in the "DOS stub" of a Win32 executable, and try to add a hack in the Win32 part that runs the DOS stub in NTVDM (which should be possible in some more or less elegant way, but I currently have no idea how). I then can sign the executable, and if I understand the Authenticode specification correctly, that signature even includes the DOS stub part. Of course, this would require me to have an Authenticode key/certificate (which I currently don't).

Reply 8 of 43, by jakethompson1

mkarcher wrote on 2025-04-19, 02:34:
jakethompson1 wrote on 2025-04-19, 02:10:

It's almost certainly a DOS program; no way to sign it even if you wanted to.

Well, I could put the "real tool" in the "DOS stub" of a Win32 executable, and try to add a hack in the Win32 part that runs the DOS stub in NTVDM (which should be possible in some more or less elegant way, but I currently have no idea how). I then can sign the executable, and if I understand the Authenticode specification correctly, that signature even includes the DOS stub part. Of course, this would require me to have an Authenticode key/certificate (which I currently don't).

Heh, but then it would break under Win32s!

Yes, Authenticode includes the DOS stub... it excludes the PE checksum, the signature directory part of the PE header, and the signature itself. Which has some interesting consequences.
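To make the three excluded regions concrete, here is a Python sketch that locates them in a PE image using the documented PE/COFF header offsets. The function name `authenticode_skipped_ranges` is made up, and the sketch skips sanity checks (for example on the number of data directories) that real code would need.

```python
import struct


def authenticode_skipped_ranges(image):
    """Return (offset, length) ranges left out of the Authenticode hash.

    Offsets follow the PE/COFF layout: e_lfanew at 0x3C, a 4-byte
    signature plus 20-byte COFF header, then the optional header.
    """
    pe = struct.unpack_from("<I", image, 0x3C)[0]
    if image[pe:pe + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    opt = pe + 24                        # start of the optional header
    magic = struct.unpack_from("<H", image, opt)[0]
    if magic == 0x10B:                   # PE32
        dirs = opt + 96
    elif magic == 0x20B:                 # PE32+
        dirs = opt + 112
    else:
        raise ValueError("unknown optional header magic")
    cert_entry = dirs + 4 * 8            # data directory index 4
    cert_off, cert_len = struct.unpack_from("<II", image, cert_entry)
    ranges = [(opt + 64, 4),             # CheckSum field
              (cert_entry, 8)]           # certificate table directory entry
    if cert_len:
        ranges.append((cert_off, cert_len))  # the signature blob itself
    return ranges
```

The DOS stub sits between the 64-byte MZ header and the offset stored in `e_lfanew`, so it lies before all of these ranges and is covered by the hash.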

Reply 9 of 43, by mkarcher

jakethompson1 wrote on 2025-04-19, 02:38:
mkarcher wrote on 2025-04-19, 02:34:
jakethompson1 wrote on 2025-04-19, 02:10:

It's almost certainly a DOS program; no way to sign it even if you wanted to.

Well, I could put the "real tool" in the "DOS stub" of a Win32 executable, and try to add a hack in the Win32 part that runs the DOS stub in NTVDM (which should be possible in some more or less elegant way, but I currently have no idea how). I then can sign the executable, and if I understand the Authenticode specification correctly, that signature even includes the DOS stub part. Of course, this would require me to have an Authenticode key/certificate (which I currently don't).

Heh, but then it would break under Win32s!

Not just Win32s, but likely the whole Win9x family, which doesn't have NTVDM, but some other method to spawn DOS programs under the VMM "hypervisor". As there is no real security against malicious software in the Win9x kernel design, I'm confident I could find a way to execute the DOS stub of a valid Win32 executable under Win9x, too. And just for fun, also in Win3.11 + Win32s. But at the moment, I have more important things to do than design a way to sign DOS executables. Nevertheless, this is an interesting thought, and if anyone feels inclined, feel free to pick up the idea. Attribution would be nice, but isn't required.

You could even think of checking for DOSbox installations on Win64 to run the DOS stub there. While this wouldn't make sense for the 486 clock detection utility (which will not do anything useful on the emulated DOSbox CPU), it could still be used to create a "super universal executable" that somehow works from DOS 2.0 (or 3.0) up to the latest Windows version as long as DOSbox is installed.

Reply 10 of 43, by jakethompson1

Out of curiosity, I tried, and a .PIF pointing to a Win32 executable triggers the "This program cannot be run in DOS mode" in Win95, so long as the 'Prevent MS-DOS programs from detecting Windows' flag is set.

Reply 11 of 43, by red-ray

mkarcher wrote on 2025-04-19, 02:28:
red-ray wrote on 2025-04-19, 02:02:

OK, are you aware Windows 10/11 is not enamoured with the .EXE? Should you, or can you, digitally sign it?

The last Windows version that can officially execute on 486 processors is Windows 2000 for x86 in 32 bits. I don't care how later Windows versions treat that EXE file, as it does not make sense to run it on a newer processor. The detection method is specifically tied to the 486 core design, and will break on anything newer. Even the Cyrix 5x86 is detected as "1x", because the more advanced execution unit doesn't have the "bottleneck" my code tries to make use of.

Yes, that is not the point, it's likely that Windows 10/11 will be used to download the program and then the user will transfer it to the i486 system.

VT was happy with it, see https://www.virustotal.com/gui/file/4d11cc12c … 5e6c6?nocache=1

Last edited by red-ray on 2025-04-19, 09:05. Edited 1 time in total.

Reply 12 of 43, by red-ray

I tried to sign it, but this failed.

[Attachment: file.php?id=217276]

I also tried it on my AMD Am5x86, it ran OK, but reported x1 most of the time.

I next tried real DOS 6.22 and it reported "Measurement failed. Strange de-turbo mode ?". When I pressed Ctrl/Alt/Del PARITY ERROR was displayed on the screen.

[Attachment: file.php?id=217279]

Reply 13 of 43, by Disruptor

red-ray wrote on 2025-04-19, 09:00:

I also tried it on my AMD Am5x86, it ran OK, but reported x1 most of the time.

I next tried real DOS 6.22 and it reported "Measurement failed. Strange de-turbo mode ?". When I pressed Ctrl/Alt/Del PARITY ERROR was displayed on the screen.

Is your L1 cache on?
Which FSB do you use, and how are your L2-cache and memory timings?

Could you try running the tool on real DOS/Win98-DOS without any memory managers loaded?
During boot, press F8 and select booting to the DOS prompt, so that not even HIMEM.SYS is loaded.

Thank you in advance and Happy Easter!

Reply 14 of 43, by mkarcher

red-ray wrote on 2025-04-19, 09:00:

I also tried it on my AMD Am5x86, it ran OK, but reported x1 most of the time.

I next tried real DOS 6.22 and it reported "Measurement failed. Strange de-turbo mode ?". When I pressed Ctrl/Alt/Del PARITY ERROR was displayed on the screen.

I am very confused by the program causing a parity error. This program does nothing special to hardware that could cause a parity error; it just accesses memory like every task in DOS does and interfaces with the timer chip and general output port like any DOS application that would play PC speaker sound. The program does not touch any chipset configuration, cache configuration or other low-level stuff. I suggest you check your DOS installation for the presence of the "parity boot" virus, which is known to cause spurious "PARITY ERROR" messages. On the other hand, my tool might be buggy and inadvertently flip some bits in the general output port (I access that port for timer control) that are meant to enable/disable parity checking, so a virus is not the only explanation.

Thanks for posting screenshots that include the diagnostic numbers.

The DOS interpretation is the "correct" one: My method doesn't work on your system. The measurements in Windows are noisy (as the tool expects a single-tasking environment), and as I didn't include proper filtering and validation of the result, the noise is interpreted as "x1". The 8 numbers indicate the time (in units of 838ns) required to execute a test loop with an increasing number of NOPs inside it. On a x1 486 processor, a linear rise of the time is expected. If a higher multiplier is active, stair steps are expected. On a x4 system like yours, I expect some roughly equal values (fewer than 4), then 4 roughly equal but bigger values, and finally a third set of even bigger values. That's what the set of numbers is supposed to look like, and the length of the first complete step is taken as the multiplier.
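As an illustration of how such a list of numbers could be read, here is a simplified Python sketch. It is not the logic from GETMUL.C; the function names and the `tolerance` parameter are assumptions made for this example.

```python
def split_into_steps(ticks, tolerance):
    """Group consecutive timings that differ by at most `tolerance`."""
    steps = [[ticks[0]]]
    for value in ticks[1:]:
        if abs(value - steps[-1][0]) <= tolerance:
            steps[-1].append(value)
        else:
            steps.append([value])
    return steps


def first_complete_step(ticks, tolerance):
    """Length of the first step that has a neighbour on both sides.

    Returns None when no step is bounded on both sides (for example a
    flat line). Caveat of this simplified version: a x1 part rises on
    every sample, so `tolerance` has to stay below its per-NOP increase.
    """
    steps = split_into_steps(ticks, tolerance)
    if len(steps) < 3:
        return None
    return len(steps[1])
```

With invented timings such as `[7000, 7003, 7120, 7118, 7121, 7122, 7240, 7241]` and a tolerance of 20 ticks, the steps have lengths 2, 4 and 2, so the sketch reports 4. A flat line of eight equal values yields `None`, which corresponds to the case where the tool gives up.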

As for why it fails on your system, I observe that there is no clearly ascending pattern in the numbers, which also matches the DOS behaviour. So the loop is not bottlenecked on the load instructions as I expected it to be. If all 8 numbers are roughly the same, the "strange de-turbo mode" message is printed, as I only happened to observe this behaviour with the de-turbo mode of the UMC8881 chipset. The test is known to fail with L1 disabled, but L1 is clearly enabled, as you won't get down to around 7000 to 8000 ticks execution time with L1 disabled. There obviously are some interactions between the chipset and the processor that are not properly considered in the method design yet. I'm going to test the tool on different 486 systems later, but don't expect a fix for that soon.

The system I initially tested the tool on had the L2 timing set to 3-2-2-2 fixed in the setup, as I am oftentimes changing the FSB on that board, and I have that timing just to be safe. I should have cross-validated the method at faster L2 timings, but I didn't. Possibly that's why it fails on your system, as slower L2 timings might indeed affect what the loop is bottlenecked on, and faster L2 cache might remove the FSB bottleneck my code relies on.

Reply 15 of 43, by red-ray

mkarcher wrote on 2025-04-19, 11:02:

I am very confused by the program causing a parity error.

I booted DOS and typed Ctrl/Alt/Del right away and this time it went back to the NT boot selection menu so I suspect there isn't a virus.

I then ran 486mult this time on my Intel 486 DX2 SX911 @ 80 MHz, got as attached and Ctrl/Alt/Del again went back to the NT boot selection menu.

On NT it reported x1.

BTW does your program check if the L1 cache is enabled via CR0? If not I feel it would be wise to do this, report its state and also the 486MULT version number. It's also easy enough to work out the size.

Reply 16 of 43, by mkarcher

red-ray wrote on 2025-04-19, 12:04:

I then ran 486mult this time on my Intel 486 DX2 SX911 @ 80 MHz, got as attached and Ctrl/Alt/Del again went back to the NT boot selection menu.

Plain DOS was my target environment, and I'm happy the program works as intended in that environment on that system at least. You see "stairsteps" of around 122 after every other number. That's how the utility is supposed to work.

red-ray wrote on 2025-04-19, 12:04:

On NT it reported x1.

This means I clearly need to implement denoising for the tool to report valid numbers in NT. As NT likely ignores my CLI instruction, there is interrupt-related noise in the timings. I am thinking about making the loop short enough that there is a good chance of it being executed without interruptions, and then repeating the measurements multiple times and taking the fastest result as the valid measurement. I am quite confident this approach can make it work in NT.

red-ray wrote on 2025-04-19, 12:04:

BTW does your program check if the L1 cache is enabled via CR0? If not I feel it would be wise to do this, report its state and also the 486MULT version number. It's also easy enough to work out the size.

At the moment, my idea was to have that program as a minimal proof of concept. Unless I set Borland C++ to use 286 opcodes, it would likely perform perfectly even on an 8088, although the timing values are likely to overflow there, and you would get x1 all the time. The assembler file contains a .286 or .386 directive, but I am quite sure I removed all the experiments with 286-type IMUL instructions, so it likely is 8088-clean as well.

Adding a check whether the tool is executed on a 486 (i.e. probing for the AC flag and CPUID non-presence or model 4) might be a good idea, though. Checking L1 in CR0 is also low effort enough to include it.
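The decision logic of such a check could look like the Python sketch below. It only covers the decision table; the actual probing (toggling EFLAGS bits and executing CPUID) would have to be done in assembly and is not shown. The function name and parameters are made up, and "model 4" is read here as CPUID family 4, which is an assumption.

```python
EFLAGS_AC = 1 << 18  # alignment check flag, introduced with the 486
EFLAGS_ID = 1 << 21  # ID flag, writable only when CPUID exists


def looks_like_486(writable_eflags_bits, cpuid_family=None):
    """Decide from probe results whether the CPU is 486-class.

    `writable_eflags_bits` is a mask of the EFLAGS bits a probe managed
    to toggle; `cpuid_family` is the family reported by CPUID, or None
    if CPUID was not executed. Both would come from assembly probes
    that are not shown here.
    """
    if not writable_eflags_bits & EFLAGS_AC:
        return False                     # 386 or older: AC cannot be set
    if not writable_eflags_bits & EFLAGS_ID:
        return True                      # 486 without CPUID
    return cpuid_family == 4             # CPUID present: family must be 4
```

Note that a family check alone would not exclude the Cyrix 5x86, so the known failure case mentioned earlier would need separate handling.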

On the other hand, checking L1 size is considered out-of-scope. My algorithm is adapted to a cache of 4 ways of 4KB each, i.e. the Intel DX4 / Am5x86 cache. It also happens to work with caches that have fewer ways or a way size that divides 4K, so the 4 ways of 2K setup of the classic 486DX will work, as will the single way of 1K of the classic Cx486SLC. I don't need to know the cache size to tune the measurements, and I am confident my algorithm doesn't fail due to cache size constraints on any processor with a 486-type core (includes the Am5x86 and all the SLC chips as long as they have L1 enabled; excludes the PODP and Cx5x86), so determining the cache size doesn't provide any benefit to the multiplier determination.
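The reason a layout tuned for 4 ways of 4KB also covers the smaller caches can be shown with a little set-index arithmetic. The Python sketch below assumes that the test addresses are spaced 4 KB apart, which is a guess about the tool and not taken from its source; all names are made up.

```python
def set_index(address, way_size, line_size=16):
    """Cache set selected by `address`.

    `line_size` defaults to 16 bytes (Intel/AMD 486); the argument
    holds for any line size that divides the way size.
    """
    return (address % way_size) // line_size


def same_set(addresses, way_size):
    """True if all addresses compete for the same cache set."""
    return len({set_index(a, way_size) for a in addresses}) == 1


# Way sizes named in the post: Intel DX4 / Am5x86, classic 486DX, Cx486SLC.
WAY_SIZES = {"4 x 4K": 4096, "4 x 2K": 2048, "1 x 1K": 1024}

# Five addresses 4 KB apart: one more than the largest associativity,
# so at least one of them must be evicted in each of these caches.
probe = [0x10000 + i * 4096 for i in range(5)]

if __name__ == "__main__":
    for name, way_size in WAY_SIZES.items():
        print(name, same_set(probe, way_size))
```

Because 4096 is a multiple of every listed way size, addresses 4 KB apart always fall into the same set, so cycling through five of them forces read misses on all three cache layouts.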

On the other hand, FSB and core clock can be easily derived once I have a valid "stair structure", and might be displayed in an upcoming version.
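A sketch of that derivation in Python, assuming that one stair step corresponds to exactly one additional FSB clock per loop iteration. The loop iteration count is not quoted in this thread, so the value 4096 used in the usage note is purely hypothetical; the function names are made up as well.

```python
PIT_HZ = 1_193_182  # PC timer input clock in Hz


def fsb_mhz(step_ticks, iterations):
    """FSB clock estimate from the height of one stair step.

    Assumes one step equals exactly one extra FSB clock per loop
    iteration, so `iterations` FSB clocks take `step_ticks` timer ticks.
    """
    return iterations * PIT_HZ / step_ticks / 1e6


def core_mhz(step_ticks, iterations, multiplier):
    """Core clock is the FSB clock times the detected multiplier."""
    return fsb_mhz(step_ticks, iterations) * multiplier


if __name__ == "__main__":
    print(f"{fsb_mhz(122, 4096):.2f} MHz FSB")
    print(f"{core_mhz(122, 4096, 2):.2f} MHz core")
```

With the step height of around 122 ticks seen on the DX2 and the hypothetical 4096 iterations, this gives an FSB estimate close to 40 MHz and a core clock close to 80 MHz.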

Reply 17 of 43, by red-ray

mkarcher wrote on 2025-04-19, 21:04:

NT likely ignores my CLI instruction, there is interrupt-related noise in the timings.

Thank you for your detailed reply. I am pretty sure NT will ignore the CLI instruction, and I would be inclined to put the code into a driver; depending on exactly how long the code takes, I may also raise the IRQL. As there could be multiple CPUs I would also set the affinity; that said, I am not aware of SIV ever having been run on an i486 system with multiple CPUs.

I am wondering about reworking your code to be 32-bit and using it in SIV to work out the multiplier, would you be happy for me to do this?

I wonder, can the 32-bit CR0 be read when running in 16-bit mode? There is smsw that will read the low 16 bits, but Cache Disable is bit 30. smsw is quite handy as it works in user mode, I recently started using it in SIV to detect if there is an FPU on Windows 9x systems as 9x does not have IsProcessorFeaturePresent().

Reply 18 of 43, by mkarcher

red-ray wrote on 2025-04-19, 21:42:

Thank you for your detailed reply. I am pretty sure NT will ignore the CLI instruction, and I would be inclined to put the code into a driver; depending on exactly how long the code takes, I may also raise the IRQL. As there could be multiple CPUs I would also set the affinity; that said, I am not aware of SIV ever having been run on an i486 system with multiple CPUs.

My TODO list is currently:

  • replace the fixed ITERATIONS value in GM.ASM by a variable, and calibrate it to get a total runtime of around 1000 ticks (below 1ms). Possibly this means I also should adjust the tolerance in GETMUL.C that is used to decide whether values are considered to be equal.
  • try whether adding mov ah, [...+0Ch] after mov ah, [...] helps to remove the "flatlines", even if I set the base constant to zero. The approach is to load the first byte of the cache line, which will initiate a standard 4-cycle (so 5 clocks or more) burst on any kind of 486, even new Cyrix ones that do not include Intel's patented "interleaved burst" order, and then use a second read that hits a byte in the last DWORD of that burst. The aim is to not just start fetching the cache line via the FSB from L2 (or RAM), but to then wait for the fetch to be finished. This should ensure that the FSB is idle at the "test location" (after the repeated NOP instructions, before the first load), so that location definitely encounters the "resync delay", which is the primary value my detection approach needs to measure.
  • run each measurement repeatedly until I got 5 results that are within 1% of the minimum value I got. Limit at 100 tries (after calibration, this is around 0.1s). If the loop is not successful (i.e. I got a very small value that I couldn't reproduce, so no "5 low results"), consider the low value to be a spurious outlier (maybe an interrupt hit and the code took more than 55ms to execute, overflowing the timer, or maybe a timer mis-read; there are some 8254 clones that are known for read reliability issues, although I guess that my approach of stopping the timer first via the general output port and then reading it will eliminate most of the reliability issues). In case a suspected spurious low value is encountered, just retry the loop up to 5 times (worst case: 5 times without a clear result, i.e. 0.5s). Error out after 5 retries. If not errored out, return the average (or sum?) of the five low samples.
  • Up the number of different NOP amounts from 8 to 9, to make sure I see both the start and the end of a size-4 step in case the steps happen to be aligned to the measurement window (i.e. 4 low samples, 4 high samples now, which will become 4 low samples, 4 medium samples, 1 high sample).
  • Properly validate the "stair steps", and not blindly output the length of the first suspected step. Error out if inconsistent. Maybe this means I should increase the window size to 13, not just 9, to get two guaranteed complete steps at x4, which is needed to properly confirm that x4 seems valid.
  • Verify running on a 486 processor
  • Check whether CR0 indicates that L1 is active (if that is still required after adding the [+0ch] loads, also test whether aligning the whole test loop eliminates the "L1 must be on" restriction)
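The window sizes of 9 and 13 in the list above, and the idea of validating the stair steps, can be sketched as follows. This is an illustrative Python sketch and not planned code from the project; `window_for` and `validate_stairs` are hypothetical names, and the input is assumed to be a list of run lengths of roughly equal timings.

```python
def window_for(multiplier, complete_steps=1):
    """Samples needed to see `complete_steps` fully bounded steps.

    Worst case: the window starts exactly on a step boundary, so the
    first step cannot be told apart from a truncated one, and one
    extra sample is needed to see the last step end.
    """
    return (complete_steps + 1) * multiplier + 1


def validate_stairs(step_lengths):
    """Return the multiplier if the step lengths are consistent, else None.

    `step_lengths` lists the run lengths of roughly equal timings in
    measurement order. Inner steps must all have the same length; the
    first and last may be cut off by the window, so they only need to
    be no longer than the inner ones.
    """
    if len(step_lengths) < 3:
        return None
    inner = step_lengths[1:-1]
    if len(set(inner)) != 1:
        return None
    if step_lengths[0] > inner[0] or step_lengths[-1] > inner[0]:
        return None
    return inner[0]
```

For x4, `window_for(4)` gives 9 samples for one bounded step and `window_for(4, 2)` gives 13 for two, matching the numbers in the list. A pattern such as 4 low, 4 medium, 1 high validates as x4, while a pattern with inner steps of different lengths is rejected.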

I'm hoping that the third bullet (repeated measurement) will discard all samples in which the loop was interrupted, so I can get a clean result even without resorting to CLI, allowing the code to work on any operating system without requiring kernel level access. If the algorithm is ported to an NT driver, you can likely get away without all that stuff. Binding to a specific core is definitely a good idea, as someone might do asymmetric multiprocessing using two processors with different multipliers. I don't have a 486 multiprocessor machine to test something like that, though. The biggest iron I have from the 486 era is a Compaq ProSignia (original model, no number, internal Compaq model number 3080), which is still just single socket.
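The repeated-measurement idea from the third bullet could be sketched like this in Python. It follows the numbers given in the list (5 agreeing results, 1% tolerance, 100 tries, 5 rounds), but the function name, the parameter names and the choice to return the average are assumptions of this sketch; `measure` stands for whatever routine times one test loop.

```python
def robust_measure(measure, needed=5, rel_tol=0.01, max_tries=100, max_rounds=5):
    """Repeat `measure()` until `needed` results sit within `rel_tol`
    of the smallest value seen; return their average.

    If a round of `max_tries` calls ends without enough agreeing
    results, the minimum is treated as a spurious outlier and the
    round is restarted, up to `max_rounds` times.
    """
    for _ in range(max_rounds):
        samples = []
        for _ in range(max_tries):
            samples.append(measure())
            low = min(samples)
            close = [s for s in samples if s <= low * (1 + rel_tol)]
            if len(close) >= needed:
                return sum(close[:needed]) / needed
        # no reproducible minimum in this round: start over
    raise RuntimeError("no stable measurement")
```

Samples that were stretched by an interrupt end up well above the minimum and are simply ignored, which is why this approach might work without CLI.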

I'm afraid that I'm likely not able to work on those TODOs in April, though.

red-ray wrote on 2025-04-19, 21:42:

I am wondering about reworking your code to be 32-bit and using it in SIV to work out the multiplier, would you be happy for me to do this?

Yeah. Attribution in the README, about dialog or something like that would be nice. The preferred way would be to link to my GitHub user profile or directly to the 486mult project, but that isn't required. Legally, I wouldn't have a case anyway: I didn't apply for patent protection for that algorithm, and only specific implementations are copyrightable, so a rewrite of the same algorithm as 32-bit code would be very hard to prove to be a "derivative work", as the idea is so simple. If you adapt it to use NT kernel timing functions instead of direct I/O access (I expect the NT kernel has some QueryPerformanceCounter equivalent, and if you choose to use the retry-in-userspace variant, I just recommend QPC anyway), your implementation is just a straightforward implementation of a simple algorithm that doesn't share any "creative parts" with my 16-bit code.

red-ray wrote on 2025-04-19, 21:42:

I wonder, can the 32-bit CR0 be read when running in 16-bit mode? There is smsw that will read the low 16 bits, but Cache Disable is bit 30.

That's no issue. You can issue (pun intended) any 32-bit instruction from 16-bit mode by just prefixing it with the 32-bit prefix 66h. There might be instructions that will not work in real mode (there are instructions that do not make sense in real mode, like ARPL), though. Using these 32-bit instructions from real mode is the recommended way to initially configure the processor anyway. Reading the documentation of the CRx move instructions by Felix Cloutier, it even says that the 32-bit prefix is unnecessary, as this instruction is executed as 32-bit instruction even in 16-bit mode.
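Once the full 32-bit CR0 value has been read, interpreting it is a matter of bit masks. The Python sketch below only shows that decoding step (the read itself needs a privileged assembly instruction); the function names are made up.

```python
CR0_CD = 1 << 30  # Cache Disable
CR0_NW = 1 << 29  # Not Write-through


def l1_enabled(cr0):
    """True if CR0 does not have the Cache Disable bit set."""
    return not cr0 & CR0_CD


def visible_via_smsw(cr0):
    """The part of CR0 that SMSW returns: the low 16 bits only."""
    return cr0 & 0xFFFF
```

Since `visible_via_smsw(CR0_CD)` is 0, the cache state cannot be seen through SMSW, which is why the full register has to be read.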

Reply 19 of 43, by red-ray

mkarcher wrote on 2025-04-20, 06:45:

someone might do asymmetric multiprocessing using two processors with different multipliers.
I don't have a 486 multiprocessor machine to test something like that, though. The biggest iron I have from the 486 era is a Compaq ProSignia

Thank you for your reply and insight into 16-bit code, I first developed on an i386 back around 1990, but only ever developed 32-bit code.

In kernel mode it's KeQueryPerformanceCounter() which is much the same as in user mode, on my i486 system it's running at 1.193 MHz which should be fast enough, on recent systems it's 10.000 MHz or more.

I have in mind adding to the CLK/FSB speed section on the [Help] panel; are you happy with the text below? I feel it would be wise if I waited till you have refined your code before I start implementing it as 32-bit.

[Attachment: file.php?id=217368]

Your comment about asymmetric multiprocessing reminded me that I have only come across one such system and it was back in 2006, it had an AMD Athlon M8 (Thoroughbred) @ 2.00GHz and an AMD Duron M8 (Thoroughbred) @ 1.60GHz ! As I recall the guy said SIV was the only program that reported it correctly, using the save file test mode it still does, but I would like to know what currently happens on real asymmetric hardware.

Your ProSignia sounds like an interesting system, do you have NT installed on it? If so SIV save files from it would be very interesting. I am currently compiling a table of what GetSystemInfo() reports as wProcessorRevision for i486 CPUs that don't have CPUID, for my 486DX-33 SX729 CPUID 0404+ my AMD 486 DX2 it's FFD0.