Intel RapidCad compatibility issues

Reply 20 of 41, by rasz_pl

Posted on 2024-07-14, 04:16

Rank: l33t
Posts: 4208
Joined: 2017-06-04, 00:57

MSxyz wrote on 2024-07-13, 15:26:

With Quake 1.06, a 40MHz Am386DX + Cyrix 83D87 is capable of 2.0 frames per second on average. Swap the 386 for a Cyrix 486DLC and the framerate increases to 2.5. With the Intel RapidCAD, also running at 40 MHz, the framerate increases to 3.1.

for completeness you should compare to real 486DX 40MHz with similar graphic card/chipset/ram timings

MSxyz wrote on 2024-07-13, 15:26:
since having the FPU integral to the CPU saves a lot of cycles.

Is there something special about being integrated? I dont think so. CPU-FPU communication speed is down to protocol:
- 8087 FPU ran in lockstep with CPU snooping on main CPU bus and taking over when necessary, switching CPU-FPU by bus mastering CPU bus?
- 287/387 FPU uses different mode, something about message passing? special 0F0-0FF I/O ports? Switching CPU-FPU takes tens of cycles and is realized by raising exceptions? above my knowledge level
- Weitek 4167 were using yet another different communication mechanism, memory mapped 64KB window at ~3GB address. Communication at full external bus speed, much faster than 387
- afaik 486DX copro didnt change the way coprocessor communicates from 387? its as slow as 386-387?
- Pentium drastically changed CPU-FPU communication protocol again.

Quake with Weitek support would be interesting 😀

MSxyz wrote on 2024-07-13, 15:26:

To me this stuff is fascinating...

Definitely. I also love obscure outdated technical knowledge!

mkarcher wrote on 2024-07-13, 22:35:

Q_memcpy at that line is inside an #if 0 block, so it is not compiled.

how do you mean? the only ifdefs in common.c are for

#ifdef PARANOID
#if defined(_WIN32)
#if WINDED

and Q_memcpy is used all over the place
vga palette https://github.com/id-Software/Quake/blob/bf4 … /vid_dos.c#L285
draw_pic https://github.com/id-Software/Quake/blob/bf4 … ent/draw.c#L366
models/skins https://github.com/id-Software/Quake/blob/bf4 … t/model.c#L1358
It's roughly half and half between Q_memcpy and memcpy in those files, as if they couldn't make up their minds, abandoned the optimization halfway (no point optimizing something that is not in hot loops), or deemed it inconsequential (having realized that Watcom's memcpy generated the same code).

mkarcher wrote on 2024-07-13, 22:35:
I could not find any traces of FPU memcpy in Quake 1.08 for DOS, especially not for VID_Update, so the use of FPU memcpy in Quake is likely a myth.

I first read that meme in @Jo22 posts Re: 2D Acceleration - first chipsets and never understood where he got the idea from.
Now the question is - would that even speed up Quake? Here is an FPU memcpy implementation http://www.pennelynn.com/Documents/CUJ/HTML/1 … HAM1/DURHAM.HTM promising a 16% gain on memory-memory transfers. This number is with a pre-warmed cache, so not real world; without it, it's
"No Self-warming, float register copy, Pentium 90
(All cases were worse than memcpy)"
It took Intel another ~15 years to finally start optimizing and recommending "rep movsb" as guaranteed full cache speed memory move operation.

https://github.com/raszpl/FIC-486-GAC-2-Cache-Module for AT&T Globalyst
https://github.com/raszpl/386RC-16 memory board
https://github.com/raszpl/440BX Reference Design adapted to Kicad
https://github.com/raszpl/Zenith_ZBIOS MFM-300 Monitor

Reply 21 of 41, by MSxyz

Posted on 2024-07-14, 07:13

Rank: Member
Posts: 140
Joined: 2024-02-07, 16:42

rasz_pl wrote on 2024-07-14, 04:16:

for completeness you should compare to real 486DX 40MHz with similar graphic card/chipset/ram timings

Based on the results with a 50MHz 486DX (6.5 fps), frame rate at 40MHz should be around 5 FPS.

rasz_pl wrote on 2024-07-14, 04:16:

Is there something special about being integrated? I dont think so. CPU-FPU communication speed is down to protocol.

From memory, I remember something about an external x87 FPU needing around a dozen bus cycles just for communication with the CPU.

Reply 22 of 41, by mkarcher

Posted on 2024-07-14, 08:26

Rank: l33t
Posts: 3287
Joined: 2019-01-19, 16:29
Location: Germany

rasz_pl wrote on 2024-07-14, 04:16:

MSxyz wrote on 2024-07-13, 15:26:
since having the FPU integral to the CPU saves a lot of cycles.

Is there something special about being integrated? I dont think so. CPU-FPU communication speed is down to protocol:
- 8087 FPU ran in lockstep with CPU snooping on main CPU bus and taking over when necessary, switching CPU-FPU by bus mastering CPU bus?
- 287/387 FPU uses different mode, something about message passing? special 0F0-0FF I/O ports? Switching CPU-FPU takes tens of cycles and is realized by raising exceptions? above my knowledge level
- Weitek 4167 were using yet another different communication mechanism, memory mapped 64KB window at ~3GB address. Communication at full external bus speed, much faster than 387
- afaik 486DX copro didnt change the way coprocessor communicates from 387? its as slow as 386-387?
- Pentium drastically changed CPU-FPU communication protocol again.

The bottleneck in the 287/387 protocol is the front side bus. While the 8087 ran in lockstep and already had the instructions to execute at hand, the port-based protocol made the coprocessor just another target on the front side bus. The 286 had to fetch the FPU operation first, and then forward it to the 287 through a port. Should the FPU instruction require data, the 286 did fetch/store the data to memory verifying all the segment limit protected mode stuff, whereas the 287 would just communicate the data from/to the 286, so data also had to traverse the FSB twice.

The main reason the 486 had an integrated FPU, in my opinion, is access to the L1 cache. You can't access the 486 L1 cache from the FSB. Having the FPU integrated also means that you don't need to block the FSB for data transfer from the CPU to the FPU.

The issue with the FPU receiving commands through the FSB is the same for the Weitek 3167/4167. But their protocol used the address lines to transfer the command and the 32 data lines to transfer a 32-bit data value (for example a single-precision float) in the same FSB cycle. Furthermore, that protocol did not rely on the CPU to poll a "ready" bit, which was performed in microcode, IIRC. Cyrix made the EMC87, a variant of their 387 that included a Weitek-like (not compatible, just a similar idea) protocol, and goes to great lengths in their datasheet to point out the advantage of that protocol. I did look into that datasheet some time ago, and IIRC you would need dedicated mainboard support to make (best?) use of the MMIO protocol.
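To make the "command on the address lines, operand on the data lines" idea concrete, here is a minimal sketch. The window base and size are the "64KB window at ~3GB" figures mentioned earlier in the thread; the opcode/register bit positions and the `copro_*` names are purely hypothetical and are NOT the real Weitek or EMC87 encoding.

```c
#include <stdint.h>

/* Window location as described earlier in the thread (64 KB at ~3 GB). */
#define COPRO_WINDOW_BASE 0xC0000000u
#define COPRO_WINDOW_SIZE 0x00010000u

/* Hypothetical layout for illustration only (not a real chip's encoding):
 * address bits 15..10 select the operation, bits 6..2 select a register.
 * A single 32-bit store to that address then carries the operand on the
 * data lines, so command and data arrive in the same bus cycle. */
static uint32_t copro_encode(unsigned opcode, unsigned reg)
{
    return COPRO_WINDOW_BASE | ((opcode & 0x3Fu) << 10) | ((reg & 0x1Fu) << 2);
}

/* What the coprocessor side would recover from the address it snooped. */
static unsigned copro_opcode(uint32_t addr) { return (addr >> 10) & 0x3Fu; }
static unsigned copro_reg(uint32_t addr)    { return (addr >> 2) & 0x1Fu; }

/* Nonzero if the address falls inside the coprocessor window. */
static int copro_in_window(uint32_t addr)
{
    return (uint32_t)(addr - COPRO_WINDOW_BASE) < COPRO_WINDOW_SIZE;
}
```

With a port-based protocol the opcode and the operand need separate bus transfers; in a scheme like this one, a plain `mov [addr], eax` would deliver both at once, which is the advantage described above.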

mkarcher wrote on 2024-07-13, 22:35:

Q_memcpy at that line is inside an #if 0 block, so it is not compiled.

rasz_pl wrote on 2024-07-14, 04:16:

how do you mean? the only ifdefs in common.c are for
#ifdef PARANOID
#if defined(_WIN32)
#if WINDED

Seems we are looking at different source files? https://github.com/id-Software/Quake/blob/bf4 … t/common.c#L136 is an "#if 0".

rasz_pl wrote on 2024-07-14, 04:16:

mkarcher wrote on 2024-07-13, 22:35:
I could not find any traces of FPU memcpy in Quake 1.08 for DOS, especially not for VID_Update, so the use of FPU memcpy in Quake is likely a myth.

I first read that meme in @Jo22 posts Re: 2D Acceleration - first chipsets and never understood where he got the idea from.

Quake using the FPU for memcpy has been "common knowledge" since the 90s. I can't tell where I heard it first. Oftentimes it was exaggerated to "sucks on non-Pentium computers because the FPU of the Pentium is way faster. And by the way, it is mostly used for memcpy." We already know that "mostly for memcpy" is clearly wrong, because of the clever use of FPU/CPU multitasking during perspective correction. Yet the claim about using the FPU for memcpy at all might still have been true.

rasz_pl wrote on 2024-07-14, 04:16:

Now the question is - would that even speed up Quake? Here is an FPU memcpy implementation http://www.pennelynn.com/Documents/CUJ/HTML/1 … HAM1/DURHAM.HTM promising a 16% gain on memory-memory transfers. This number is with a pre-warmed cache, so not real world; without it, it's
"No Self-warming, float register copy, Pentium 90
(All cases were worse than memcpy)"

That is a different use case to what was believed to be in Quake. First, it is about copying data to (write-back cached) memory instead of uncached PCI memory. As I already explained, the key point of FPU memcpy is to drive 64-bit FSB cycles to the FSB, which is primarily important if the target is on the PCI bus. Second, they chickened out and used FILD/FIST because they were afraid of floating point exceptions. The fastest FPU memcpy is reportedly obtained by using FLD/FST to load/store 64-bit floating point values. All 64-bit floating point values have equivalent representations in the internal 80-bit format of the x87 FPU, so there could be a lossless conversion, and just loading/storing the data does not need any calculation performed on them, so having some NaN values mixed in does not matter. There is one fine detail, though: According to the IEEE spec, converting from 64 bit to 80 bit is considered a floating point operation, and if a "signalling NaN" is involved in any kind of operation, an exception is raised. The idea is that signalling NaNs represent special values the FPU can't handle, and operations involving them are to be emulated in software. As the FPU may not assume anything about the bit layout of a signalling NaN, it may not silently transform a 64-bit signalling NaN into an 80-bit signalling NaN. You can mask this exception, though. In that case, the FPU converts any signalling NaN into a "quiet NaN" by flipping one bit. Using an appropriate color palette, the effect of this occasional bit-flip could be masked, so FLD/FST-based memcpy is still viable for suitable video data, although it is not viable for general-purpose data.

Guess how I know? I tried running some version of 3DMark on a 486, and it errored out on loading textures. The reason is that they use "the fastest available memcpy" to copy around data from their compressed archives. And "the fastest available" in their case is interpreted as "FPU memcpy on non-MMX processors, MMX memcpy on Pentium MMX/Pro/II, SSE memcpy on Pentium III". Using FPU memcpy (FLD/FST based) on compressed data is a very bad idea, though...

Reply 23 of 41, by rasz_pl

Posted on 2024-07-14, 11:02

Rank: l33t
Posts: 4208
Joined: 2017-06-04, 00:57

mkarcher wrote on 2024-07-14, 08:26:

The main reason the 486 had an integrated FPU, in my opinion, is access to the L1 cache. You can't access the 486 L1 cache from the FSB. Having the FPU integrated also means that you don't need to block the FSB for data transfer from the CPU to the FPU.

do you know if 486 still internally uses I/O transactions as main cpu-fpu communication mechanism? Looking at 486 block diagrams its still a separate unit, but now connected to internal 64bit data bus (under control of BIU) so it suggests it does have direct cache access.

hmm RapidCad doesnt have L1 at all, therefore MSxyz your Quake result makes sense when compared to 486 with disabled L1 [Intel RapidCAD, Floating Point Boost for 386DX Setups VS 486DX Benchmarks (Quake, Doom..)] CPU Galaxy https://www.youtube.com/watch?v=UM4NPh5qg8Y

mkarcher wrote on 2024-07-14, 08:26:

Seems we are looking at different source files?

indeed
QW/client/ https://github.com/id-Software/Quake/blob/bf4 … t/common.c#L136
WinQuake/ https://github.com/id-Software/Quake/blob/bf4 … e/common.c#L154

mkarcher wrote on 2024-07-14, 08:26:

That is a different use case to what was believed to be in Quake. First, it is about copying data to (write-back cached) memory instead of uncached PCI memory. As I already explained, the key point of FPU memcpy is to drive 64-bit FSB cycles to the FSB, which is primarily important if the target is on the PCI bus.

would it make any difference with PCI target (vga card) almost an order of magnitude slower than theoretical PCI throughput? Does PCI have built-in FIFOs or will VGA card ultimately throttle CPU to its max write speed during the transfer?

mkarcher wrote on 2024-07-14, 08:26:

...NaN...
Guess how I know? I tried running some version of 3DMark on a 486, and it errored out on loading textures. The reason is that they use "the fastest available memcpy" to copy around data from their compressed archives. And "the fastest available" in their case is interpreted as "FPU memcpy on non-MMX processors, MMX memcpy on Pentium MMX/Pro/II, SSE memcpy on Pentium III". Using FPU memcpy (FLD/FST based) on compressed data is a very bad idea, though...

So is FLD/FST possible or not? :] and if it is then why would 'compressed data' be of any significance? either its transparent or its not, right?

Reply 24 of 41, by MSxyz

Posted on 2024-07-14, 12:04

Rank: Member
Posts: 140
Joined: 2024-02-07, 16:42

What I don't understand is why Quake v106 runs fine (even if it's predictably slow - RapidCad is a 486DX with L1 cache disabled and using the 386 bus), but with v108 I get this stuttering and temporary freezes, to the point that the timedemo registers 0.8 - 1.0 fps instead of the 3.1 - 3.3 I'm expecting (v108 is usually around 5% faster than v106). A shame I don't have a VGA video capture card, otherwise I would post a comparison video between v106 and v108. It's as if v108 has changed something in the rendering routines that isn't properly executed by the RapidCAD.

Reply 25 of 41, by rasz_pl

Posted on 2024-07-14, 12:19

Rank: l33t
Posts: 4208
Joined: 2017-06-04, 00:57

Would be worth asking some Quake hacking nerds. maybe someone like https://github.com/neozeed (made Quake 2 dos port) would know, or ViTi95 (fastdoom) knows someone optimizing Quake (there was someone crazy enough to rewrite quake for 486 some time ago)

hmm are you testing with sound? sound code has huge potential for this type of glitches

Reply 26 of 41, by MSxyz

Posted on 2024-07-14, 12:42

Rank: Member
Posts: 140
Joined: 2024-02-07, 16:42

rasz_pl wrote on 2024-07-14, 12:19:

hmm are you testing with sound? sound code has huge potential for this type of glitches

Nope. Nothing installed but the video card and the disk controller card. For this kind of benchmark I disable everything that is not strictly necessary. I don't have sound, mouse, CDROM, COM/LPT. Pure DOS 6.22 with HIMEM and nothing else. I've had success running Quake DOS on any CPU from 386+387 up to (and including) a Pentium IV Willamette. Never seen anything like that.

-Tried at different CPU speeds 25/33/40 MHz
-Tried with different VGA chips: ET4000, WD90C30, T9000B ISA and ET4000W32 VLB
-Tried on two (actually 3 but one had other issues, so I don't count it) MBs with different chipsets.

So it's either a software glitch or my RapidCAD is somehow faulty, but in such a subtle way that it shows up only in very specific instances.

I've bought a fourth 386 motherboard off eBay and I will be testing it as soon as it arrives. If I can reproduce the same bugs even with this one, I will call it a day and just record my experience in my personal "diary of an oldskool nerd". No way I'm going to look for and buy another RapidCAD. These things sell for over 300 $/€ and the one I have is a perfect display case specimen, as both the silkscreen and the engravings are perfect.

Reply 27 of 41, by Disruptor

Posted on 2024-07-14, 13:03

Rank: Oldbie
Posts: 1750
Joined: 2018-03-22, 18:31
Location: European Union

rasz_pl wrote on 2024-07-14, 11:02:

So is FLD/FST possible or not? :] and if it is then why would 'compressed data' be of any significance? either its transparent or its not, right?

Not all data patterns are copied correctly. mkarcher mentioned 'compressed data' because in his example the result of an FPU memcpy was garbage. The verification code of the decompressor detected corrupted data. mkarcher found out that some data patterns were affected by a wrong copy through the FPU.
But if you examine this issue in detail you can avoid being affected by this kind of data corruption.

Last edited by Disruptor on 2024-07-14, 15:36. Edited 1 time in total.

Reply 28 of 41, by mkarcher

Posted on 2024-07-14, 15:24

Rank: l33t
Posts: 3287
Joined: 2019-01-19, 16:29
Location: Germany

rasz_pl wrote on 2024-07-14, 11:02:

mkarcher wrote on 2024-07-14, 08:26:

The main reason the 486 had an integrated FPU, in my opinion, is access to the L1 cache. You can't access the 486 L1 cache from the FSB. Having the FPU integrated also means that you don't need to block the FSB for data transfer from the CPU to the FPU.

do you know if 486 still internally uses I/O transactions as main cpu-fpu communication mechanism? Looking at 486 block diagrams its still a separate unit, but now connected to internal 64bit data bus (under control of BIU) so it suggests it does have direct cache access.

I don't know, but as the block diagram I look at has a direct connection from the instruction decoder to the FPU, but no connection from the BIU to the FPU, I can't see how the FPU would receive instructions through I/O cycles, but instead it will get them directly from the decoder. According to Cyrix, this is a big win, because the instruction dispatch was a bottleneck they addressed in their EMC87 as well. The same is true for the data path. Having direct instruction decoder / cache access starts to get extremely important with clock-multiplied versions.

rasz_pl wrote on 2024-07-14, 11:02:

hmm RapidCad doesnt have L1 at all, therefore MSxyz your Quake result makes sense when compared to 486 with disabled L1 [Intel RapidCAD, Floating Point Boost for 386DX Setups VS 486DX Benchmarks (Quake, Doom..)] CPU Galaxy https://www.youtube.com/watch?v=UM4NPh5qg8Y

No L1, but integrated FPU. That's an interesting design choice. I expect a 486DLC + 387 to beat the RapidCad in most applications. Maybe Intel did want to sell a CAD upgrade to 386 customers without creating too strong competition to their 486 processors? Maybe they used the RapidCad as a way to recycle dies with broken L1?

rasz_pl wrote on 2024-07-14, 11:02:

So is FLD/FST possible or not? :] and if it is then why would 'compressed data' be of any significance? either its transparent or its not, right?

Maybe I was not clear enough. FLD/FST is not transparent. No discussion needed about that fact. So memcpy based on FLD/FST is actually not memcpy. On the other hand, looking at the non-transparency in detail reveals that it only applies to certain bit patterns: if the high byte of a QWORD is 0x7F or 0xFF, and the next lower byte is in the range 0xF0..0xF7. In that case, that next lower byte is increased by 8, to 0xF8..0xFF. If you know this property beforehand, you can prepare for it, for example by only using 254 colors, omitting 0x7F and 0xFF, or by using a palette in which 0xF0 and 0xF8, 0xF1 and 0xF9 and so on are sufficiently similar such that an occasional replacement is not noticeable in animated graphics.
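The bit-pattern rule above can be checked without touching an FPU. The following is a minimal integer-only sketch of what a masked-exception FLD/FSTP round trip is expected to do to one 64-bit pattern, based on the IEEE 754 double layout (1 sign bit, 11 exponent bits, 52 fraction bits). The function names are made up for illustration, and the model assumes the invalid-operation exception is masked. One refinement falls out of the layout: a pattern with byte 0xF0 in that position and all lower bytes zero encodes infinity rather than a NaN, so it should pass through unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of FLD m64 followed by FSTP m64 on one 64-bit pattern, assuming the
 * invalid-operation exception is masked. No FPU instruction is executed. */
static uint64_t fld_fst_roundtrip(uint64_t bits)
{
    const uint64_t exp_mask  = 0x7FF0000000000000ULL; /* 11 exponent bits */
    const uint64_t frac_mask = 0x000FFFFFFFFFFFFFULL; /* 52 fraction bits */
    const uint64_t quiet_bit = 0x0008000000000000ULL; /* top fraction bit */

    int exp_all_ones = (bits & exp_mask) == exp_mask;
    int frac_nonzero = (bits & frac_mask) != 0;

    /* Signalling NaN: exponent all ones, fraction non-zero, quiet bit clear.
     * Quieting sets bit 3 of the second-highest byte, i.e. adds 8 to it. */
    if (exp_all_ones && frac_nonzero && (bits & quiet_bit) == 0)
        return bits | quiet_bit;
    return bits; /* everything else, including infinity, is unchanged */
}

/* Count the 8-byte groups of a buffer that such a copy would alter.
 * Bytes are assembled little-endian, as on x86; a tail shorter than
 * 8 bytes is ignored. */
static size_t count_changed_qwords(const unsigned char *buf, size_t len)
{
    size_t changed = 0;
    for (size_t i = 0; i + 8 <= len; i += 8) {
        uint64_t q = 0;
        for (int b = 7; b >= 0; b--)
            q = (q << 8) | buf[i + b];
        if (fld_fst_roundtrip(q) != q)
            changed++;
    }
    return changed;
}
```

Running `count_changed_qwords` over a frame buffer or a texture gives a quick estimate of how many pixels such a copy would alter.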

Reply 29 of 41, by MSxyz

Posted on 2024-07-14, 16:21

Rank: Member
Posts: 140
Joined: 2024-02-07, 16:42

mkarcher wrote on 2024-07-14, 15:24:

No L1, but integrated FPU. That's an interesting design choice. I expect a 486DLC + 387 to beat the RapidCad in most applications

It does.
Other than Quake, all other benchmarks, synthetic or not, have the 486DLC easily surpassing the RapidCAD by a decent margin. The L1 cache plays a big role in the performance of the 486. I'm spending the weekend away from home, but tomorrow I can post some data...

Reply 30 of 41, by pshipkov

Posted on 2024-07-15, 03:37

Rank: l33t
Posts: 2190
Joined: 2018-10-11, 05:08

It is fascinating indeed to be able to peek back in time through hardware components in the physical world and/or code in the digital one.
I wonder what the story of the Q_memcpy function is. It seems to be inactive in the main branch, which makes sense, since it replicates what Watcom's memcpy does anyway.
Was it carried over from previous projects and kept for code compatibility? Kind of unlikely. Its signature is the same as the standard C memcpy, so a quick search/replace and it's over.
Maybe it was needed for building the code on a platform with more rudimentary compilers? There were console ports - Nintendo, SPS, etc. I never looked at what's there in terms of software toolchains.
This will need some searching and reading online.
Anyhow.

As for the RapidCAD CPU, I also ran a bunch of tests with it.
Notes here.
In short. Intel's RapidCad does not overclock very well. 45MHz seems to be the upper limit, but gets sketchy.
On a clock-to-clock basis:
RapidCad's CPU is about 20% faster than 386DX and massively faster than 387.
TI SXL2 (and IBM BL3) offer much faster CPU performance than RapidCad.
Depending on which FPU tests/apps you look at, RapidCAD's FPU performance is similar to or massively better than a DLC+387 combo.

retro bits and bytes

Reply 31 of 41, by rasz_pl

Posted on 2024-07-15, 04:17

Rank: l33t
Posts: 4208
Joined: 2017-06-04, 00:57

mkarcher wrote on 2024-07-14, 15:24:

I don't know, but as the block diagram I look at has a direct connection from the instruction decoder to the FPU, but no connection from the BIU to the FPU, I can't see how the FPU would receive instructions through I/O cycles, but instead it will get them directly from the decoder

yep, I was looking at some artistic impression from google instead of Intel datasheet 🙁

mkarcher wrote on 2024-07-14, 15:24:

No L1, but integrated FPU. That's an interesting design choice. I expect a 486DLC + 387 to beat the RapidCad in most applications.

feipoa couldnt get any of 386 socketed 486s to beat 3.1 fps in Quake Re: FPU 387 questions even at double mhz, so 486 definitely has faster FPU-CPU communication

mkarcher wrote on 2024-07-14, 15:24:

No discussion needed about that fact. So memcpy based on FLD/FST is actually not memcpy. On the other hand, looking at the non-transparency in detail reveals that it only applies to certain bit patterns: if the high byte of a QWORD is 0x7F or 0xFF, and the next lower byte is in the range 0xF0..0xF7. In that case, that next lower byte is increased by 8, to 0xF8..0xFF. If you know this property beforehand, you can prepare for it, for example by only using 254 colors, omitting 0x7F and 0xFF, or by using a palette in which 0xF0 and 0xF8, 0xF1 and 0xF9 and so on are sufficiently similar such that an occasional replacement is not noticeable in animated graphics.

oh wow, that is hardcore insane 😁 Those special cases would have had huge implications for shading/texturing code and would probably end up not worth it, especially considering FILD/FIST was measurably slower to begin with.

I still cant believe it took Intel 23 frickin years to acknowledge programmers need a guaranteed dedicated fast path for moving data around.
Enhanced REP MOVSB (ERMSB) was introduced in 2012 with Ivy Bridge, and even then it was still slower than unrolled movs, not to mention hacky SSE2 variants. Skylake almost brought it in line for bigger transfers.
2019 Fast Short REP MOVSB (FSRM) was supposed to be the bees knees. And then we got Reptar https://lock.cmpxchg8b.com/reptar.html :]
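For reference, this is roughly what a bare `rep movsb` copy looks like from C. It is a minimal sketch assuming GCC or Clang extended inline assembly on x86/x86-64, with a plain `memcpy` fallback elsewhere; `copy_rep_movsb` is a made-up name, and whether it beats the library `memcpy` depends entirely on the CPU generation, as described above.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes with a single REP MOVSB on x86 (GCC/Clang inline asm),
 * falling back to memcpy on other compilers or architectures.
 * Relies on the direction flag being clear, which the usual x86 calling
 * conventions guarantee on function entry. Regions must not overlap. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    const void *s = src;
    /* EDI/RDI = destination, ESI/RSI = source, ECX/RCX = byte count;
     * all three are modified by the instruction, hence "+" constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}
```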

Speaking of History of dedicated fast paths for moving data:

- 1982 Intel 186/286 'rep movsw' at 2 cycles per byte. Brilliant, then intel drops the ball for 20 years 😮

- Commodore never did anything to help programmers, 6502 had nothing. Even the latest 1988 barely used in anything 65CE02 didnt address fast copy.

- 1986 WDC W65C816 had dedicated 'Move Memory' MVN/MVP, except it was hilarious 7 cycles per byte

- 1987 NEC TurboGrafx-16/PC Engine dedicated 6502 clone by HudsonSoft HuC6280 implemented dedicated data move instructions 'Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII)'
at hysterical 6 cycles per byte plus 17 cycles startup. (17 + 6x) = ~160KB/s at 7.16 MHz CPU. For comparison IBM XT with 4.77 MHz NEC V20 does >300KB/s

- 1993 Pentium 'rep movsd' at theoretical 4 bytes per cycle

- 1997 Pentium MMX ?

Reply 32 of 41, by MSxyz

Posted on 2024-07-15, 05:35

Rank: Member
Posts: 140
Joined: 2024-02-07, 16:42

OK, here's some data from last week's tests:

Motherboard: Unknown Manufacturer BC3486UL v1.2
Chipset: UMC UM82C481/UM82C482
BIOS Date: 12/12/91
CPU: AMD Am386DXL 40 MHz / Texas Instruments 486DLC 40 MHz / Intel RapidCAD OC 40 MHz
FPU: Cyrix Fastmath 83D87 / Internal
External Cache: 8 x 32KB, 15ns
Main Memory: 8 x 1MB, 30 Pin, 60ns
I/O Card: Acer SAB-560 Maximum AT
Video card: Genoa Systems 7900 - Tseng Labs ET4000AX 1MB ISA

386 benchmarks

Wolf3D (286 ver.) : 1925 Ticks - 25.8 FPS
Doom (screenblocks=10) : 10304 RealTicks - 7.2 FPS
Quake v1.06 (Mode 0) : 496.5 Seconds - 2.0 FPS
NSSI Dhrystone : 11943
NSSI Whetstone : 2604
Norton SysInfo 8 : 43.1

486DLC benchmarks

Wolf3D (286 ver.) : 1467 Ticks - 33.9 FPS
Doom (screenblocks=10) : 6677 RealTicks - 11.2 FPS
Quake v1.06 (Mode 0) : 392.0 Seconds - 2.5 FPS
NSSI Dhrystone : 17281
NSSI Whetstone : 5646
Norton SysInfo 8 : 65.1

RapidCAD benchmarks

Wolf3D (286 ver.) : 1787 Ticks - 27.8 FPS
Doom (screenblocks=10) : 8893 RealTicks - 8.4 FPS
Quake v1.06 (Mode 0) : 315.6 Seconds - 3.1 FPS
NSSI Dhrystone : 12355
NSSI Whetstone : 4323
Norton SysInfo 8 : 42.2
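The FPS columns above can be reproduced from the raw timings. The sketch below assumes the Quake runs are `timedemo demo1` (969 frames) and the Doom runs are `-timedemo demo3` (2134 gametics at 35 tics per second). Neither demo is named in the post; these are assumptions that happen to match all six listed values after rounding.

```c
/* Assumed demo lengths (not stated in the post, chosen because they
 * reproduce the table): Quake demo1 = 969 frames, Doom demo3 = 2134 gametics. */
#define QUAKE_DEMO_FRAMES  969
#define DOOM_DEMO_GAMETICS 2134
#define DOOM_TICRATE       35

/* Quake timedemo reports elapsed seconds: fps = frames / seconds. */
static double quake_fps(double seconds)
{
    return QUAKE_DEMO_FRAMES / seconds;
}

/* Doom timedemo reports realtics: fps = gametics * 35 / realtics. */
static double doom_fps(long realtics)
{
    return (double)DOOM_DEMO_GAMETICS * DOOM_TICRATE / (double)realtics;
}
```

For example, `quake_fps(315.6)` comes out at about 3.07 and `doom_fps(8893)` at about 8.40, in line with the RapidCAD rows.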

Reply 33 of 41, by douglar

Posted on 2024-07-15, 18:30

Rank: l33t
Posts: 2092
Joined: 2019-11-04, 15:37

MSxyz wrote on 2024-07-15, 05:35:

OK, here's some data from last week's tests:

Sorry if I missed this earlier in the thread, but what BIOS & Motherboard are you using?

Did you need to do any tweaking with Cyrix.exe for the 486SLC ?

Your numbers came out a touch better than the scores that I got with MRBIOS 1.65 on a micromation-tech-80386 with a 10MHz ISA bus and running c:\cyrix\cyrix.exe -e -b -xA000,128

Re: 486DLC cache coherency blues and headaches

I was testing with an ISA ATI Mach64 (109- 19301-10) that is fast in DOS but probably not the very fastest for DOS.

Reply 34 of 41, by mkarcher

Posted on 2024-07-15, 19:11

Rank: l33t
Posts: 3287
Joined: 2019-01-19, 16:29
Location: Germany

rasz_pl wrote on 2024-07-15, 04:17:

feipoa couldnt get any of 386 socketed 486s to beat 3.1 fps in Quake Re: FPU 387 questions even at double mhz, so 486 definitely has faster FPU-CPU communication

I suppose the main bottleneck is FPU-to-memory transfer. On a 387, the data has to pass the FSB twice. On the RapidCad, it passes the FSB just once. In a real 486 with L1 and the FPU in one package, it wouldn't even cross the FSB at all if you get an L1 hit. That's also likely why the 486DLC+387 combo wins in the NSSI Whetstone benchmark, but loses in Quake. Quake pushes a lot of data through the FPU during the geometry calculation, while the calculations themselves are quite simple. Quake does not even try to keep values in FPU registers. Random example: Take a look at line 156 and 157, they are:

		VectorSubtract (hull->clip_mins, mins, offset);
		VectorAdd (offset, ent->v.origin, offset);

The functions VectorSubtract and VectorAdd take two input vectors by reference (first two arguments) and store the result in the output vector (third argument) which is also taken by reference. In the Quake source code, vec3_t is an alias for a 3-element float array, so C's usual pass-by-reference semantics for arrays apply here. The disassembly (with the register that points to the "hull" object renamed to "rHull", the register that points to the "ent" object renamed to "rEnt", the register that points to the "mins" vector renamed to "rMins" and the register that points to the "offset" vector renamed to "rOffset") looks like this:

                fld     [rHull+hull.mins.x]
                fsub    [rMins+vec3.x]
                fstp    [rOffset+vec3.x]
                fld     [rHull+hull.mins.y]
                fsub    [rMins+vec3.y]
                fstp    [rOffset+vec3.y]
                fld     [rHull+hull.mins.z]
                fsub    [rMins+vec3.z]
                fstp    [rOffset+vec3.z]
                fld     [rOffset+vec3.x]
                fadd    [rEnt+edict.v_origin.x]
                fstp    [rOffset+vec3.x]
                fld     [rOffset+vec3.y]
                fadd    [rEnt+edict.v_origin.y]
                fstp    [rOffset+vec3.y]
                fld     [rOffset+vec3.z]
                fadd    [rEnt+edict.v_origin.z]
                fstp    [rOffset+vec3.z]

This code is needlessly spilling the intermediate offset vector into memory. That's cheap on a Pentium if you hit the L1WB cache. That's expensive on the RapidCad, as the data is transferred across the FSB two more times than required, and it's terrible on a 386/387 system, as the unnecessary load/store pushes the data four times across the FSB (387 to 386, 386 to L2, L2 to 386 and finally 386 to 387). No amount of clock multiplication is going to fix that.
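To illustrate the spill at source level, here is a minimal sketch comparing the two-call form with a fused form that keeps the intermediate in a local. The macro bodies and the function names are assumptions written for this example, modeled on the calls quoted above rather than copied from the Quake source. Whether a mid-90s compiler would actually keep the temporary in an FPU register is also an assumption.

```c
typedef float vec3_t[3];

/* Helpers in the style of the calls quoted above (definitions assumed):
 * each one reads two vectors from memory and writes a third back. */
#define VectorSubtract(a, b, c) \
    do { (c)[0] = (a)[0] - (b)[0]; (c)[1] = (a)[1] - (b)[1]; \
         (c)[2] = (a)[2] - (b)[2]; } while (0)
#define VectorAdd(a, b, c) \
    do { (c)[0] = (a)[0] + (b)[0]; (c)[1] = (a)[1] + (b)[1]; \
         (c)[2] = (a)[2] + (b)[2]; } while (0)

/* Two passes: offset is stored after the subtract and reloaded for the add. */
static void offset_two_pass(const vec3_t clip_mins, const vec3_t mins,
                            const vec3_t origin, vec3_t offset)
{
    VectorSubtract(clip_mins, mins, offset);
    VectorAdd(offset, origin, offset);
}

/* Fused: the intermediate difference lives in a local, so each component
 * needs only one store to the output vector. */
static void offset_fused(const vec3_t clip_mins, const vec3_t mins,
                         const vec3_t origin, vec3_t offset)
{
    for (int i = 0; i < 3; i++) {
        float diff = clip_mins[i] - mins[i];
        offset[i] = diff + origin[i];
    }
}
```

On an x87 the fused form may keep the intermediate at extended precision instead of rounding it to a 32-bit float, so the last bit of the result can differ from the two-pass version for some inputs.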

Note: This observation obviously does not apply to the hand-optimized assembler code, just to the compiler generated FPU stuff.

rasz_pl wrote on 2024-07-15, 04:17:

mkarcher wrote on 2024-07-14, 15:24:

... for example by only using 254 colors, omitting 0x7F and 0xFF, or ...

oh wow, that is hardcore insane 😁 Those special cases would have had huge implications for shading/texturing code and would probably end up not worth it, especially considering FILD/FIST was measurably slower to begin with.

Not at all for 256-color Quake. Quake uses a color map (loaded from "colormap.lmp") that maps the internal pixel representation at different light levels to the 256 colors of the VGA card (loaded from "palette.lmp"). You could easily generate a slightly lower quality 254-color palette and adjust the mapping table accordingly. This will not require any changes to the Quake rendering engine.
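A rough sketch of the table adjustment described above: find the closest permitted palette entry for indices 0x7F and 0xFF, then rewrite every colormap byte that points at them. It assumes the palette is 256 RGB triples (768 bytes) and the colormap is a flat byte array of palette indices; the function names are made up, and loading or saving the .lmp files is left out.

```c
#include <limits.h>
#include <stddef.h>

/* Squared RGB distance between palette entries a and b (3 bytes each). */
static long rgb_dist2(const unsigned char *pal, int a, int b)
{
    long d = 0;
    for (int ch = 0; ch < 3; ch++) {
        long diff = (long)pal[3 * a + ch] - (long)pal[3 * b + ch];
        d += diff * diff;
    }
    return d;
}

/* Indices that must never become the high byte of a QWORD. */
static int is_forbidden(int idx) { return idx == 0x7F || idx == 0xFF; }

/* Closest permitted palette entry to idx (lowest index wins ties). */
static int nearest_allowed(const unsigned char *pal, int idx)
{
    int best = -1;
    long best_d = LONG_MAX;
    for (int c = 0; c < 256; c++) {
        if (is_forbidden(c))
            continue;
        long d = rgb_dist2(pal, idx, c);
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}

/* Rewrite the colormap in place; returns the number of entries changed. */
static size_t remap_colormap(unsigned char *colormap, size_t n,
                             const unsigned char *pal)
{
    unsigned char sub7f = (unsigned char)nearest_allowed(pal, 0x7F);
    unsigned char subff = (unsigned char)nearest_allowed(pal, 0xFF);
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        if (colormap[i] == 0x7F)      { colormap[i] = sub7f; changed++; }
        else if (colormap[i] == 0xFF) { colormap[i] = subff; changed++; }
    }
    return changed;
}
```

Anything drawn without going through the colormap would need the same substitution applied to its pixel data separately.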

Reply 35 of 41, by jakethompson1

Posted on 2024-07-16, 00:10

Rank: Oldbie
Posts: 1768
Joined: 2015-11-17, 04:16

mkarcher wrote on 2024-07-14, 08:26:

The issue with the FPU receiving commands through the FSB is the same for the Weitek 3167/4167. But their protocol used the address lines to transfer the command and the 32 data lines to transfer a 32-bit data value (for example a single-precision float) in the same FSB cycle. Furthermore, that protocol did not rely on the CPU to poll a "ready" bit, which was performed in microcode, IIRC. Cyrix made the EMC87, a variant of their 387 that included a Weitek-like (not compatible, just a similar idea) protocol, and goes to great lengths in their datasheet to point out the advantage of that protocol. I did look into that datasheet some time ago, and IIRC you would need dedicated mainboard support to make (best?) use of the MMIO protocol.

Speaking of those, is there any reason you couldn't retrofit local bus video onto an ISA 486 via that socket? I guess you would need a ribbon cable off to an ISA card to supply any missing signals from the ISA side.

Reply 36 of 41, by rasz_pl

Posted on 2024-07-16, 00:40

rasz_pl Offline

Rank l33t

Rank: l33t
Posts: 4208
Joined: 2017-06-04, 00:57

mkarcher wrote on 2024-07-15, 19:11:

VectorSubtract (hull->clip_mins, mins, offset);
VectorAdd (offset, ent->v.origin, offset);
Note: This observation obviously does not apply to the hand-optimized assembler code, just to the compiler generated FPU stuff.

Knowing Abrash, all the non-hand-optimized code lies outside the heaviest hot loops. SV_HullForEntity, for example, is for clipping presumably dynamic objects (models, camera). There are maybe up to what, 10-20 3D objects max on the screen at any time in Q1? The highest cost in Q1 was most likely texturing, followed by calculating/rotating PVS or maybe span/edge calculations, then the BSP tree walk, then all the single-% stuff.

jakethompson1 wrote on 2024-07-16, 00:10:

Speaking of those, is there any reason you couldn't retrofit local bus video onto an ISA 486 via that socket? I guess you would need a ribbon cable off to an ISA card to supply any missing signals from the ISA side.

Re: VLB with 386DX-40 ?
Re: Diy VL bus
Re: What hasn’t been done?

https://github.com/raszpl/FIC-486-GAC-2-Cache-Module for AT&T Globalyst
https://github.com/raszpl/386RC-16 memory board
https://github.com/raszpl/440BX Reference Design adapted to Kicad
https://github.com/raszpl/Zenith_ZBIOS MFM-300 Monitor

Reply 37 of 41, by MSxyz

Posted on 2024-07-16, 06:31

MSxyz Offline

Rank Member

Rank: Member
Posts: 140
Joined: 2024-02-07, 16:42

douglar wrote on 2024-07-15, 18:30:

MSxyz wrote on 2024-07-15, 05:35:

OK, here's some data from last week tests:

Sorry if I missed this earlier in the thread, but what BIOS & Motherboard are you using?

It's in the post above 😀 it's a <unknown manufacturer> BC3486UL https://theretroweb.com/motherboards/s/cachin … ch-cor-bc3486ul

It's a dual 386 - 486 motherboard that accepts 386DX - 486DLC - 486SX - 486DX from 20 to 50MHz. AMI BIOS from late '91; I'm not using any Cyrix utility. The motherboard has dedicated jumpers for the 486DLC.

It has a UMC 82C480 chipset. Pretty fast chipset, although it seems it uses the L2 cache in "dirty write back mode" and also the main DRAM timings are not that great (and BIOS settings don't seem to affect them, although there are plenty of options to tweak). I have an ECS FX3000 motherboard that uses the same chipset that is even faster, although it's for 486 only.

Reply 38 of 41, by Disruptor

Posted on 2024-07-16, 08:04

Disruptor Offline

Rank Oldbie

Rank: Oldbie
Posts: 1750
Joined: 2018-03-22, 18:31
Location: European Union

jakethompson1 wrote on 2024-07-16, 00:10:

Speaking of those, is there any reason you couldn't retrofit local bus video onto an ISA 486 via that socket? I guess you would need a ribbon cable off to an ISA card to supply any missing signals from the ISA side.

rasz_pl wrote on 2024-07-16, 00:40:

Re: VLB with 386DX-40 ?
Re: Diy VL bus
Re: What hasn’t been done?

At least I have an ET4000/W32 graphics card for the ISA bus.

Reply 39 of 41, by mkarcher

Posted on 2024-07-16, 16:02

mkarcher Offline

Rank l33t

Rank: l33t
Posts: 3287
Joined: 2019-01-19, 16:29
Location: Germany

MSxyz wrote on 2024-07-16, 06:31:

It has a UMC 82C480 chipset. Pretty fast chipset, although it seems it uses the L2 cache in "dirty write back mode" and also the main DRAM timings are not that great (and BIOS settings don't seem to affect them, although there are plenty of options to tweak). I have an ECS FX3000 motherboard that uses the same chipset that is even faster, although it's for 486 only.

You can add a x1 dirty tag RAM to the UM82C480 to get it into sensible write back mode. IIRC there is at least one VOGONS thread about that.
