I finally had some time last weekend to compare the IIT 4C87DLC-40 against the Cyrix CX-83D87-40. Tests were done on an Am386DX-40 system. I also plan to try a Cx486DLC-40 as the CPU, and another IIT 3C87-40, sometime later, to see what differences there are, if any.
For the benchmark I'm using my own programs that calculate fractals, because a) the usual benchmarks suck, b) I'm too lazy to look for proper cracks of old 3D Studio 4, and c) I also made versions that work on 286 systems and on non-PC machines like the FM Towns. The point is, my code really uses the x87 a lot, and not just for additions and multiplications. I had hoped this would bring out any differences between the chips, and it really does.
The Cyrix chip is considerably faster. I won't bother you with all the details, but the most demanding benchmark (which is written in C, so it also stresses the CPU a bit more than the hand-crafted assembly) took 18m:18s.09 on the Cyrix and 28m:49s.16 on the IIT at 640x480 resolution, so the IIT needed roughly 57% more time. Note that this code does not use any of the IIT extensions. To exercise those I first need to come up with something that requires matrix operations, or that would greatly benefit from the multiple stack banks, and one could argue that would be cherry-picking. But I will do it, eventually.
In case someone wants a point of reference for their own systems, I've also run the Quake 1.06 demo from the DOSBENCH package at 320x200, using an ISA Trident TVGA8900D with 1MiB of VRAM (for 32-bit internal bus operation). The system has only 8MiB of RAM, though it does use a CF card as the disk. This test was done earlier, however, at 33MHz, because I was also testing a Ti486DLC-33 CPU.
IIT score on 386DX-33: 969 frames 717.7 seconds 1.4 fps
IIT score on 486DLC-33: 969 frames 585.8 seconds 1.7 fps
Cyrix score on 386DX-33: 969 frames 602.0 seconds 1.6 fps
Cyrix score on 486DLC-33: 969 frames 474.8 seconds 2.0 fps
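For anyone who wants to compare their own timedemo output against these lines, the fps figure is simply frames divided by seconds, and the relative speed of two runs is the ratio of their times. A minimal sketch in C (the helper names fps and speedup are made up for this example):

```c
/* Average frame rate of a timedemo run. */
double fps(int frames, double seconds)
{
    return frames / seconds;
}

/* Speed-up factor when the same run drops from old_s to new_s seconds. */
double speedup(double old_s, double new_s)
{
    return old_s / new_s;
}
```

Plugging in the numbers above, speedup(717.7, 602.0) is about 1.19 and speedup(585.8, 474.8) is about 1.23, so in Quake the Cyrix is roughly 19% to 23% faster than the IIT, a smaller gap than in the fractal benchmark.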
It might also be of interest that a 16-bit version of my benchmark, run at 320x200, took 87m:21s.044 on a 12MHz 80286 with an Intel D80287-10, but only 41m:40s.592 with an Intel C80287XL. The XL chip is not only faster, it also runs with a 1:2 clock divider (so effectively at the same speed as the CPU) instead of the usual 1:3 in sync mode. And since it is actually a 387SX core in a 40-pin package, it can also do some things more easily (like calculating sin(x) directly, or accepting a wider range of input parameters), so in that mode it takes 33m:51s.858.
My code also does fixed-point calculations on the CPU only, and on the 286 it takes the XL for the NPU to be faster than the CPU (while providing all the extra accuracy of 80-bit floating-point registers).
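For readers unfamiliar with the fixed-point approach, here is a minimal sketch of how a fractal iteration can run on the CPU alone. The 8.24 format, the fix_t type and all function names are assumptions for illustration; a real 286 version would have to build the 32-bit multiply out of 16-bit operations.

```c
#include <stdint.h>

#define FRAC_BITS 24                 /* assumed 8.24 format: 8 integer bits, 24 fraction bits */

typedef int32_t fix_t;

fix_t fix_from_double(double v)
{
    return (fix_t)(v * (1 << FRAC_BITS));
}

/* Multiply two fixed-point values using a 64-bit intermediate.
   Right-shifting a negative value is implementation-defined in C;
   this relies on the usual arithmetic shift. */
fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * b) >> FRAC_BITS);
}

/* Escape-time Mandelbrot iteration using integer arithmetic only. */
int mandel_fixed(fix_t cr, fix_t ci, int max_iter)
{
    const fix_t four = 4 << FRAC_BITS;
    fix_t zr = 0, zi = 0;
    int n;

    for (n = 0; n < max_iter; n++) {
        fix_t zr2 = fix_mul(zr, zr);
        fix_t zi2 = fix_mul(zi, zi);

        if (zr2 + zi2 > four)        /* |z|^2 > 4: the point escaped */
            break;

        zi = 2 * fix_mul(zr, zi) + ci;
        zr = zr2 - zi2 + cr;
    }
    return n;
}
```

With only 24 fraction bits the picture falls apart at much shallower zoom levels than with the 64-bit mantissa of the x87, which is the accuracy trade-off mentioned above.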