A brief comparison of 386 FPUs

Reply 100 of 148, by feipoa

Posted on 2019-05-05, 01:02

Rank: l33t++
Posts: 10273
Joined: 2011-03-07, 13:54
Location: Canada

pshipkov wrote:
actual render times are not in the render dialog (gray background), but when Esc, then at the bottom of the screen -white text on cyan background. figured i should clarify that.

I'm not sure I follow. I only ever see one time stamp. I must press the space bar to get the time displayed. If I press esc, it goes back to the line drawing and I don't see any time displayed. Might help if you provide photos of what you are conveying.

pshipkov wrote:
I welcome your intention to run the tests on a native 386 cpu+mobo, otherwise upgrade kits and stuff make these hybrid systems that tend to "blur the picture".

The AMI Mark V Baby Screamer is a native 386 motherboard and chipset. The ALi m1429 is a 386/486 hybrid board, but my board only has a PGA-132 + FPU. I generally have the most interest in upgrade modules rather than a true 386DX. We differ in this regard.

Plan your life wisely, you'll be dead before you know it.

Reply 101 of 148, by noshutdown

Posted on 2019-05-05, 03:37

Rank: Oldbie
Posts: 1302
Joined: 2010-07-23, 17:04
Location: China

feipoa wrote:

I'm not sure I follow. I only ever see one time stamp. I must press the space bar to get the time displayed. If I press esc, it goes back to the line drawing and I don't see any time displayed. Might help if you provide photos of what you are conveying

The render time is in the cyan zone at the bottom of the screen when you return to the line drawing (3D editor).

Reply 102 of 148, by pshipkov

Posted on 2019-05-05, 06:11

Rank: l33t
Posts: 2206
Joined: 2018-10-11, 05:08

@Feipoa
A proper VLSI 33x board!
Did you ever try to OC it?
I am sorry to hijack the thread, I'm just really curious whether this chipset is capable of going beyond the 40s.
Thanks.

retro bits and bytes

Reply 103 of 148, by feipoa

Posted on 2019-05-05, 08:10

Rank: l33t++
Posts: 10273
Joined: 2011-03-07, 13:54
Location: Canada

Okay, I see the number that you are referring to now. What then is the duration I have underlined in green? The numbers are pretty close, so my percentages shown in the previous posts should still be within reason. 10:46 vs. 11:02

I have only run the AMI Mark V Baby Screamer with an 80 MHz crystal oscillator, so with an FSB of 40 MHz. The manual is pretty explicit about cache speed: for 33 MHz, 15 ns is required; for 40 MHz, 12 ns is required. It needs ten 64Kx4 pieces and one 64Kx1 piece. For a 45 MHz FSB, I'm guessing 10 ns would be required? I doubt 10 ns parts exist in these SRAM configurations.

I do have an 87.5 MHz crystal oscillator, but I have not yet tried it in this board, or any board. I originally bought the 87.5 MHz oscillator in the hope of running the SXL2-66 at 87.5 MHz. I suppose I could also use it to run an SXL-40 at 43.75 MHz, but the ISA bus clock would then be at 10.9 MHz. I'm not sure if my ISA cards can handle that speed reliably, not sure if my SRAM is fast enough, and not sure if the VLSI chipset can handle this. I have added it to my to-do list though.
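
The clock arithmetic in this post can be sketched roughly as below. It assumes the FSB is half the oscillator frequency (80 MHz giving 40 MHz) and that the ISA clock is FSB/4 (the divider implied by 43.75 MHz giving about 10.9 MHz). The SRAM estimate just scales the manual's 40 MHz / 12 ns figure linearly with the clock period, which is a rule-of-thumb assumption and not something taken from the manual.

```python
def bus_clocks(osc_mhz, isa_divider=4):
    """Return (FSB MHz, ISA MHz) for a given crystal oscillator frequency."""
    fsb = osc_mhz / 2          # FSB is half the oscillator frequency
    isa = fsb / isa_divider    # assumed FSB/4 ISA divider
    return fsb, isa


def scaled_sram_ns(target_fsb_mhz, ref_fsb_mhz=40.0, ref_sram_ns=12.0):
    """Scale a known SRAM rating linearly with the clock period (rough guess only)."""
    return ref_sram_ns * ref_fsb_mhz / target_fsb_mhz


if __name__ == "__main__":
    for osc in (80.0, 87.5):
        fsb, isa = bus_clocks(osc)
        print(f"{osc} MHz oscillator: FSB {fsb} MHz, ISA {isa:.1f} MHz")
    print(f"45 MHz FSB: roughly {scaled_sram_ns(45.0):.1f} ns SRAM")
```

Under that scaling a 45 MHz FSB lands a little above 10 ns, which is at least consistent with the 10 ns guess.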

The attachment 3DS4_rendering_1.jpg is no longer available

The attachment 3DS4_rendering_2.jpg is no longer available

Reply 104 of 148, by noshutdown

Posted on 2019-05-05, 09:48

Rank: Oldbie
Posts: 1302
Joined: 2010-07-23, 17:04
Location: China

feipoa wrote:
Okay, I see the number that you are referring to now. What then is the duration I have underlined in green? The numbers are pretty close, so my percentages shown in the previous posts should still be within reason. 10:46 vs. 11:02

The time that you underlined in green is the time of the previous frame you rendered. This matters especially when rendering an animation of many frames: while you are rendering the 3rd frame, it displays the time for the 2nd frame.

Reply 105 of 148, by galanopu

Posted on 2021-04-26, 02:43

Rank: Newbie
Posts: 97
Joined: 2020-10-28, 08:45
Location: EU

My 386 FPU collection is very basic compared to what I see here.
I have 4 of them, so I tested them and made this video:
https://www.youtube.com/watch?v=Tv9LVhlo7EM

The extra thing is that I also tried some overclocking... Enjoy.

Last edited by galanopu on 2021-04-26, 16:10. Edited 1 time in total.

Let's mod everything! Check my youtube channel:
https://www.youtube.com/channel/UCZ6ULBqIKhxuNslAbqFNJUg
Interested in my devices? Check my store:
https://migronelectronics.bigcartel.com

Reply 106 of 148, by Deunan

Posted on 2021-04-26, 10:33

Rank: l33t
Posts: 2088
Joined: 2018-05-29, 12:32

It goes to show that on a 386 any fast FPU is mostly limited by the CPU<->NPU bus, and by the CPU having to fetch everything from memory, which places a hard limit on performance. A DLC or SXL chip with its internal cache enabled can feed the NPU from its cache; that's why there is a considerable improvement in results on such setups.

I'd be very wary of conformance test results. These programs were often written for Intel silicon and expect that to be the norm, rather than the actual standard (which in some places is not defined well enough to make a judgement). Also, ULSI chips tend to ignore the precision settings and calculate everything in extended precision. Technically that violates IEEE 754, but the reason for the setting in the first place was to save some time in programs that don't need the extra precision. If the NPU can compute the result at the highest precision with no performance penalty, should we demand that it adhere to the standard? Especially if that might actually slow it down, by having to implement the core differently?

And then there's the problem of accuracy in transcendental functions, which is not really covered well by IEEE. In other words, the fastest chip is not always the best one, and the slowest doesn't have to be the most accurate.
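
To make the precision setting concrete: the x87 precision control field selects a 24-, 53- or 64-bit significand for results. The sketch below only models rounding a single exact value to those widths using exact fractions; `round_to_bits` is a made-up helper for illustration and says nothing about how any particular chip behaves internally.

```python
from fractions import Fraction


def round_to_bits(x, bits):
    """Round a positive Fraction to `bits` significant binary digits (ties to even)."""
    e = 0
    while Fraction(2) ** (e + 1) <= x:   # find e with 2**e <= x < 2**(e+1)
        e += 1
    while Fraction(2) ** e > x:
        e -= 1
    scale = Fraction(2) ** (bits - 1 - e)
    return Fraction(round(x * scale)) / scale


if __name__ == "__main__":
    exact = Fraction(1, 3)
    for name, bits in (("single", 24), ("double", 53), ("extended", 64)):
        err = abs(round_to_bits(exact, bits) - exact)
        print(f"{name:8s} ({bits} bits): error {float(err):.3e}")
```

A chip that always works at 64 bits returns the last of these regardless of the setting, so a conformance test expecting the 24- or 53-bit rounding would flag it even though the result is closer to the exact value.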

Reply 107 of 148, by elianda

Posted on 2021-04-26, 10:44

Rank: l33t
Posts: 2515
Joined: 2006-04-21, 16:56
Location: Hannover / Germany

Are there any known applications that make use of the specific fast operations in certain FPUs?

Like the 4x4 matrix transformation in the IIT FPUs?

Retronn.de - Vintage Hardware Gallery, Drivers, Guides, Videos. Now with file search
Youtube Channel
FTP Server - Driver Archive and more
DVI2PCIe alignment and 2D image quality measurement tool

Reply 108 of 148, by pshipkov

Posted on 2021-04-26, 15:33

Rank: l33t
Posts: 2206
Joined: 2018-10-11, 05:08

Considering the extensive (and available) list of supported video cards in these old 3D applications, we can conclude that developers made a similar effort regarding FPUs, especially if support for them was available in common compilers and toolchains.
Looking at the example Elianda provided - the construct is so simple, and does not convolute the code, that I would use it to maximize the perf of my application on computers with IIT FPUs.

There are plenty of FPUs and stuff in the "retro bits and bytes" link.

retro bits and bytes

Reply 109 of 148, by Deunan

Posted on 2021-04-27, 08:44

Rank: l33t
Posts: 2088
Joined: 2018-05-29, 12:32

elianda wrote on 2021-04-26, 10:44:

Like the 4x4 matrix transformation in the IIT FPUs?

I recognize that README file, it's from the IIT demo disk. And there's a simple demo program for 4x4 matrix multiplication on there too.

Then there's the well-known document called coproc.txt that I often recommend, since it answers a lot of questions that people have about early NPUs. It says the IIT 4x4 operation was not really widely supported and mentions two apps that made use of it, as well as the basis for that claim:

As desirable as the F4X4 instruction may
seem, however, there are very few applications that make use of it when
an IIT coprocessor is detected at run time (among them Schroff
Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
[25]).
(...)
[25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
1991, Seiten 214-237

Personally I see one big problem with the 4x4 operation - it needs all the data to be fed to the FPU and then the results to be read back. AFAIR you need to start with an empty stack, use IIT-specific stack extension instructions, and some of the input arguments are overwritten to store the result. This, coupled with the rather slow CPU-NPU comm channel, limits the usefulness of such instructions. Weitek worked around that by having their NPU register space memory-mapped, at the cost of even lower compatibility with typical x87 code.
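
For reference, this is what a 4x4 matrix * vector transform computes, as a plain Python sketch. It is not the IIT instruction or its calling convention, just the arithmetic: 16 matrix elements and 4 vector elements go in, 4 results come out, and every one of those values has to cross the CPU-NPU channel.

```python
def transform(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-element column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


if __name__ == "__main__":
    # Translation by (10, 20, 30) applied to the point (1, 2, 3) in homogeneous form.
    translate = [
        [1.0, 0.0, 0.0, 10.0],
        [0.0, 1.0, 0.0, 20.0],
        [0.0, 0.0, 1.0, 30.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    print(transform(translate, [1.0, 2.0, 3.0, 1.0]))
```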

Long story short: it took MMX to finally have some direct access to the FPU register space, and even that was flawed due to the cost of switching between MMX and x87 modes. SSE finally made the FPU on the x86 family somewhat saner by today's standards.

Reply 110 of 148, by Deunan

Posted on 2021-05-11, 23:40

Rank: l33t
Posts: 2088
Joined: 2018-05-29, 12:32

I finally had some time last weekend to compare the IIT 4C87DLC-40 against the Cyrix CX-83D87-40. Tests were done on an Am386DX-40 system. I plan to also try a Cx486DLC-40 as the CPU, and another IIT 3C87-40 sometime later - to see what, if any, differences there are.

For the benchmark I'm using my own programs that calculate fractals, because a) the usual benchmarks suck, b) I'm too lazy to look for proper cracks of old 3D Studio 4, and c) I also made versions that work on 286 systems, and non-PC machines like FM-Towns. Point is, my code is really using the x87 a lot and not just additions and multiplications. I had hoped this would bring out any differences between the chips and it really does so.

The Cyrix chip is considerably faster. I won't bother you with all the details, but the most demanding benchmark (which is written in C, so it also stresses the CPU a bit more than the hand-crafted assembly) took 18m:18s.09 on the Cyrix and 28m:49s.16 on the IIT at 640x480 resolution. Note this code does not use any of the IIT extensions; I first need to come up with something that would require matrix operations, or would greatly benefit from the multiple stack banks - and one could argue that would be cherry picking. But I will do it, eventually.
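
To put a number on "considerably faster", the two times above can be converted to seconds and divided. The `to_seconds` helper below is only an illustration for the `18m:18s.09` notation used in this post.

```python
import re


def to_seconds(stamp):
    """Convert a time written like '18m:18s.09' to seconds."""
    m = re.fullmatch(r"(\d+)m:(\d+)s\.(\d+)", stamp)
    if m is None:
        raise ValueError(f"unrecognised time: {stamp!r}")
    minutes, seconds, frac = m.groups()
    return int(minutes) * 60 + float(f"{seconds}.{frac}")


if __name__ == "__main__":
    cyrix = to_seconds("18m:18s.09")
    iit = to_seconds("28m:49s.16")
    print(f"Cyrix {cyrix:.2f} s, IIT {iit:.2f} s, IIT/Cyrix = {iit / cyrix:.2f}")
```

That works out to the IIT taking a bit over one and a half times as long as the Cyrix on this particular benchmark.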

In case someone wants a point of reference for their own systems, I've also run the Quake 1.06 demo from the DOSBENCH package at 320x200 using an ISA Trident TVGA8900D with 1MiB of VRAM (for 32-bit internal bus operation). Only 8MiB of RAM, though, and using a CF card. This test was done earlier, however, at 33MHz because I was also testing a Ti486DLC-33 CPU.
IIT score on 386DX-33: 969 frames 717.7 seconds 1.4 fps
IIT score on 486DLC-33: 969 frames 585.8 seconds 1.7 fps
Cyrix score on 386DX-33: 969 frames 602.0 seconds 1.6 fps
Cyrix score on 486DLC-33: 969 frames 474.8 seconds 2.0 fps
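
The fps figures are simply frames divided by seconds, so the four results above can be cross-checked and turned into relative speedups. The dictionary layout below is just one way to hold the numbers from this post.

```python
# (FPU, CPU) -> (frames, seconds, reported fps), copied from the results above.
RESULTS = {
    ("IIT", "386DX-33"): (969, 717.7, 1.4),
    ("IIT", "486DLC-33"): (969, 585.8, 1.7),
    ("Cyrix", "386DX-33"): (969, 602.0, 1.6),
    ("Cyrix", "486DLC-33"): (969, 474.8, 2.0),
}


def fps(frames, seconds):
    return frames / seconds


def speedup(slow_key, fast_key):
    """How many times faster fast_key finished the same demo than slow_key."""
    return RESULTS[slow_key][1] / RESULTS[fast_key][1]


if __name__ == "__main__":
    for (fpu, cpu), (frames, seconds, reported) in RESULTS.items():
        print(f"{fpu:5s} on {cpu}: {fps(frames, seconds):.2f} fps (reported {reported})")
    for cpu in ("386DX-33", "486DLC-33"):
        print(f"Cyrix vs IIT on {cpu}: {speedup(('IIT', cpu), ('Cyrix', cpu)):.2f}x")
```

Using the seconds rather than the rounded fps gives a finer comparison, since the one-decimal fps values hide most of the difference.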

And it might also be of interest that a 16-bit version of my benchmark, run in 320x200, took 87m:21s.044 on a 12MHz 80286 with an Intel D80287-10 - but only 41m:40s.592 with an Intel C80287XL. The XL chip is not only faster, it also runs with a 1:2 clock divider (so effectively at the same speed as the CPU) instead of the usual 1:3 in sync mode. And since it is actually a 387SX core in a 40-pin package, it can also do some things more easily (like calculating sin(x) directly, or accepting a wider range of input parameters), so in that mode it takes 33m:51s.858.
My code also does fixed-point calculations on the CPU only, and on the 286 it takes the XL for the NPU to be faster than the CPU (while providing all the extra accuracy of 80-bit floating point registers).
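
For anyone unfamiliar with the CPU-only fixed-point path mentioned above, here is a minimal sketch of fixed-point multiplication. The Q16.16 layout and the helper names are assumptions picked for illustration, not the format the benchmark actually uses.

```python
FRAC_BITS = 16          # assumed Q16.16 layout, chosen only for illustration
ONE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a float to fixed point, rounding to the nearest step."""
    return int(round(x * ONE))


def to_float(a):
    """Convert a fixed-point value back to a float."""
    return a / ONE


def fx_mul(a, b):
    """Multiply two fixed-point values; the raw product carries 2*FRAC_BITS fraction bits."""
    return (a * b) >> FRAC_BITS


if __name__ == "__main__":
    a, b = to_fixed(1.5), to_fixed(2.25)
    print(to_float(fx_mul(a, b)))
```

Everything here is integer shifts and multiplies, which is why it can compete with a slow NPU, at the price of far less range and precision than the 80-bit x87 registers.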

Reply 111 of 148, by feipoa

Posted on 2021-05-12, 01:58

Rank: l33t++
Posts: 10273
Joined: 2011-03-07, 13:54
Location: Canada

Nice work. Hopefully you will release these executables when you feel they are mostly complete.

I've heard that the Cyrix 287XL was substantially faster than the Intel 287, and your results clearly demonstrate that (twice as fast). I have one of those Cyrix 287XLs sitting in my bin, but it belongs to someone else. I'll probably benchmark it when I get started on 286 testing.

I also have the Weitek and that Cyrix hybrid EMC87-33. It would be nice to complete the 387 FPU results with these added one day.

Reply 112 of 148, by pentiumspeed

Posted on 2021-05-12, 02:16

Rank: l33t
Posts: 3197
Joined: 2017-05-17, 23:17
Location: Great Northern: Canada.

If you have a Compaq 486 25 or 33, or a 386 i series (this is a desktop computer), chances are you have a motherboard that can take either a cacheless 486 or a 386 with a 387 socket. In 386 mode, it has a zero-wait-state bus hosted by a 16K cache controller called the 395DX 33, the huge chipset. This 395DX, the 386 socket and the 387 socket are all on the zero-wait-state bus. The Motorola video chipset on these Compaq i desktops is as fast as a WD 90C31: I benchmarked Doom on it with a 486DX2 66 and got around 17 fps, if I recall correctly.

There is an online PDF datasheet from Intel on the 395DX cache controller. Search for it.

Benchmark that on this one.

The board looks like this:
https://www.ebay.ca/itm/142058297635?chn=ps&n … 3RoCB80QAvD_BwE

If anyone has a Compaq Deskpro M with a 386DX 33 card: this also takes either a Weitek 3167 or a 387, and has the same 395DX 33 cache controller. Benchmark that one as well.

This is a 25MHz card, but it is exactly the same card as the 33MHz one and has a Weitek 3167 socket.

https://www.ebay.ca/itm/143133465839?chn=ps&n … TBoCqRQQAvD_BwE

Cheers,

Great Northern aka Canada.

Reply 113 of 148, by pshipkov

Posted on 2021-05-12, 06:40

Rank: l33t
Posts: 2206
Joined: 2018-10-11, 05:08

Deunan wrote on 2021-04-27, 08:44:

Personally I see one big problem with the 4x4 operation - it needs all the data to be fed to FPU and then the results be read back. AFAIR you need to start with empty stack, use IIT-specific stack extension instructions, and some of the input arguments are over-written to store the result. This, coupled with the rather slow CPU-NPU comm channel, limits the usefulness of such instructions. Weitek worked around that by having their NPU register space memory-mapped, at the cost of even lower compatibility with typical x87 code.

Not sure I follow. 😀
In layman's terms - to compute anything on a 387-compatible FPU you need to populate its data registers from system memory anyway.
The compiler can hide that, or it can be exposed in code like IIT's M4X4.

As for the function signature - standard accessor stuff.
It is preferable to returning by value.
Anyway.

@feipoa
About 287 FPUs. Perf goes in this order:
Cyrix CX-82s87-np-sv
Cyrix 287XL+
IIT 2c87
Intel 287XL
then the rest.

Reply 114 of 148, by feipoa

Posted on 2021-05-12, 06:49

Rank: l33t++
Posts: 10273
Joined: 2011-03-07, 13:54
Location: Canada

Pardon my memory, it is the Cyrix CX-82S87-NP-SV that I have sitting in a bin (photo below). How much faster is the CX-82S87-NP-SV compared to the Cyrix 287XL?

The attachment Cyrix_FasMath_CX-82S87-NP-SV.JPG is no longer available

Reply 115 of 148, by pshipkov

Posted on 2021-05-12, 07:01

Rank: l33t
Posts: 2206
Joined: 2018-10-11, 05:08

Nice, these are super rare.
Check this.

In my tests the perf difference compared to the Cyrix 287XL+ is very small and does not show in simple benchmarks like NSI, NSSI, LM.
You will need something more serious to see it. Similar to @Deunan, I use fractal rendering programs for that.

Reply 116 of 148, by pshipkov

Posted on 2021-05-12, 07:03

Rank: l33t
Posts: 2206
Joined: 2018-10-11, 05:08

Hmm, I thought I posted Cyrix 287XL+ data there as well, but looking now - it is missing.
I will add it at some point soon.

Reply 117 of 148, by Deunan

Posted on 2021-05-12, 08:53

Rank: l33t
Posts: 2088
Joined: 2018-05-29, 12:32

pshipkov wrote on 2021-05-12, 06:40:

In layman terms - to compute anything on 387 compatible FPU you need to populate its data registers from system memory anyway.
Compiler can hide that, or it can be exposed in code like the IIT's M4X4.

What I wanted to say is that the 386 and earlier CPUs talk to their co-processors via I/O. Now, the DX at least has a 32-bit bus for that, but still, it takes at least 2 cycles for the CPU to fetch each 32-bit word (assuming alignment and no wait states) and then a few more to transfer that to the NPU. Rinse and repeat for all the arguments. Loading the entire 4x4 matrix (usually in double-precision format) and the vector via this narrow pipe takes many cycles. In contrast, the 486 FPU has access to the internal cache, and later SSE-enabled CPUs just load and move around the NPU registers like any others.

There is a massive NPU performance difference on the 386SX vs the 386DX, exactly because it takes twice as long to transfer all the data, and these 4x4 operations would suffer even more. The faster the NPU, the more impact this narrow path has.
So sure, the IIT is faster at matrix * vector multiplication with its extensions; I'm just pointing out that the way the x87 was bolted onto the rest of the system by Intel seriously limits any such inventions. IIT had a great idea, but in the end if you really wanted performance you'd pay for a 486DX, or maybe think about a Weitek if your software and mobo supported it.

And the fact that there doesn't seem to be much software support for the IIT seems to confirm my theory - that improvement came a bit late and didn't really offer enough of a performance difference. Then Cyrix made a faster NPU in general and Intel made the 486.
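
A back-of-the-envelope version of the transfer cost described above, as a sketch. It assumes double-precision (8-byte) operands and at least 2 cycles per bus-width memory fetch on both the DX and the SX, and it counts only the memory fetches; the extra cycles for handing each word to the NPU and reading the results back are not included.

```python
def operand_words(bus_bits, element_bytes=8, matrix_elems=16, vector_elems=4):
    """Bus-width words needed to load a 4x4 matrix plus a 4-element vector."""
    total_bytes = (matrix_elems + vector_elems) * element_bytes
    return total_bytes * 8 // bus_bits


def min_fetch_cycles(bus_bits, cycles_per_word=2):
    """Lower bound on memory-fetch cycles (assumed 2 cycles per word, no wait states)."""
    return operand_words(bus_bits) * cycles_per_word


if __name__ == "__main__":
    for name, bus_bits in (("386DX", 32), ("386SX", 16)):
        print(f"{name}: {operand_words(bus_bits)} words, "
              f"at least {min_fetch_cycles(bus_bits)} cycles just for fetches")
```

The SX needs twice as many bus transfers for the same operands, which matches the "twice as long" point above.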

Reply 118 of 148, by pshipkov

Posted on 2021-05-12, 13:51

Rank: l33t
Posts: 2206
Joined: 2018-10-11, 05:08

Right on.

Reply 119 of 148, by JohnBourno

Posted on 2022-03-22, 12:58

Rank: Newbie
Posts: 24
Joined: 2021-07-01, 19:48

Very interesting thread! I'm currently trying to optimize my overclocked 386 and noticed that the PCPlayer benchmark is faster with my IIS FPU. So now I'm waiting for my FastMath to arrive...

Phido wrote on 2019-03-22, 01:37:

Which is what really caught the clone cpu makers out, FPU's up to that point were essentially only for the small professional market. Quake made it a requirement and highly valued a pipelined one.

What is interesting is that without an FPU Duke Nukem 3D sometimes gets slowed down significantly. It drops from 20 frames per second to 7 fps in some scenes. It turns out that Duke3D uses floating point operations for rendering slopes. So without an FPU, every time you have some slopes in your viewport the CPU has to do all the extra work and the game gets almost unplayable. Just by adding an IIS FPU, the fps only drops to 15 instead of 7. So quite playable.
