On the 4x4 instruction: you load the 4x4 transformation matrix into the alternate register sets, then leave it resident while you load the fairly small 4x1 vector. Typically the transformation matrix is reused many times, so this saves a huge amount of communication overhead. For the time, that was a very big win. You also presumably (I haven't tried it) get a pretty good chunk of time while the FPU is busy, during which you might be able to do something useful with the integer unit. That would be easier than trying to work a couple of integer instructions in between normal FPU instructions.
Often you can get away with a 3x3 matrix for 3D graphics, which would be a bit faster (9 multiplies and 6 adds per vector instead of 16 and 12), but they didn't support that.
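To make the reuse pattern concrete, here is a rough sketch in Python. It only models the data flow (matrix loaded once, many small vectors pushed through it); it does not emit or emulate any IIT-specific opcode, and the names `transform`, `transform_many` and `TRANSLATE` are made up for illustration.

```python
# Plain-Python model of the data flow only: one 4x4 matrix stays
# "resident" while many 4x1 vectors are transformed by it.

def transform(matrix, vec):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-element vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transform_many(matrix, vectors):
    # The matrix is passed in once and reused for every vector; only the
    # small 4-element vectors change from one iteration to the next.
    return [transform(matrix, v) for v in vectors]

# Hypothetical example matrix: translate by (10, 20, 30) using
# homogeneous coordinates (this is what the fourth row/column buys you).
TRANSLATE = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]

points = [
    [1.0, 2.0, 3.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
results = transform_many(TRANSLATE, points)
```

A 3x3 matrix would cover rotation and scaling but not the translation shown here, which is one reason the 4x4 form is the general case.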
A downside: operating systems don't know about the extra register sets, so they don't save them. It didn't matter much at the time.
Deunan wrote on 2021-04-27, 08:44:
Personally I see one big problem with the 4x4 operation: it needs all the data to be fed to the FPU and then the results to be read back. AFAIR you need to start with an empty stack and use IIT-specific stack extension instructions, and some of the input arguments are overwritten to store the result. This, coupled with the rather slow CPU-NPU communication channel, limits the usefulness of such instructions. Weitek worked around that by having their NPU register space memory-mapped, at the cost of even lower compatibility with typical x87 code.
Long story short: it took MMX to finally get some direct access to the FPU register space, and even that was flawed due to the cost of switching between MMX and x87 modes. SSE finally made the FPU on the x86 family somewhat saner by today's standards.