First post, by Ailicec
I'm interested in this from an optimization point of view, but thought the emulation crowd would also find it useful for cycle-accurate emulation, and might know about it already. It may also apply to some extent to the 286 and 486.
When the 386 encounters a coprocessor instruction, it hands it to the 387. The 386 should then be able to execute integer instructions without waiting for the 387 to finish. This suggests that interleaved 386/387 code could, and arguably should, be faster than non-interleaved code.
Unfortunately, I've never found much detail on how much overlap there really is. For example, take FDIV, which is about 80 cycles on the 387. For how many of those cycles is the 386 itself tied up before it can do something else? If it's 10, there are 70 cycles of opportunity; if it's 60, it isn't worth the trouble. The manual also says the 387 can accept "certain" instructions while it is still busy. That's good, but which ones?
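To make the arithmetic concrete, here is a minimal C sketch of the idealized model I have in mind. It is not measured behavior: it assumes the 386 is tied up for a fixed number of clocks (`cpu_busy`, the unknown I'm asking about) and is then completely free until the 387 finishes. The function names and the numbers are purely illustrative.

```c
#include <assert.h>

/* Idealized model (an assumption, not measured 386/387 behavior):
   the 386 is busy for cpu_busy clocks handing off the instruction,
   then free until the 387 finishes after fpu_latency clocks. */
static int overlap_window(int fpu_latency, int cpu_busy)
{
    assert(cpu_busy >= 0 && cpu_busy <= fpu_latency);
    return fpu_latency - cpu_busy;
}

/* Clocks until the FPU result is ready AND integer_work clocks of
   integer code have run, when the integer code is interleaved. */
static int interleaved_clocks(int fpu_latency, int cpu_busy, int integer_work)
{
    int window = overlap_window(fpu_latency, cpu_busy);
    /* Integer work that fits in the window is hidden; the rest adds time. */
    return cpu_busy + (integer_work > window ? integer_work : window);
}

/* Same work with no overlap at all: wait for the FPU, then run integer code. */
static int serial_clocks(int fpu_latency, int integer_work)
{
    return fpu_latency + integer_work;
}
```

With the FDIV figure above, `overlap_window(80, 10)` gives 70 and `overlap_window(80, 60)` gives 20. Under this model, 50 clocks of integer work would cost `interleaved_clocks(80, 10, 50)` = 80 clocks interleaved versus `serial_clocks(80, 50)` = 130 serial, which is why the real value of `cpu_busy` matters so much.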
I've been through the hardware and software reference manuals and they provide little past what I've summarized - just that concurrent execution is possible. The hardware manual does give how many bus cycles are needed for memory operand transfers - 2 to 5 (since a bus cycle is two clocks, that's 4 to 10 clocks).
I attempted some measurements by doing something like an FMUL or FDIV followed by NOPs, but the results were inconclusive. (Short story: some NOPs seemed "free", but it didn't work like you'd expect, i.e. adding a NOP might actually make the code run faster.) Also, my Intel and IIT chips seemed to behave differently in this regard.
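For anyone wanting to repeat this, here is a rough C sketch of one way to read such a NOP sweep. It assumes you already have an array of measured clock totals, where entry k is the time for the FPU instruction followed by k NOPs; the timing itself is not shown, and `free_nops` is just a name I made up. It counts how many leading NOPs did not push the total above the zero-NOP baseline, so a run that gets slightly faster with an extra NOP still counts as "free".

```c
#include <assert.h>
#include <stddef.h>

/* totals[k] = measured clocks for the FPU instruction followed by k NOPs
   (totals[0] is the baseline with no NOPs). How the totals are measured
   is outside this sketch.
   Returns the number of leading NOPs whose total stayed at or below the
   baseline, i.e. NOPs that appear to have been hidden behind the FPU. */
static size_t free_nops(const unsigned long *totals, size_t n)
{
    size_t k = 0;

    assert(totals != NULL || n == 0);
    if (n == 0)
        return 0;

    /* Stop at the first NOP count that costs more than the baseline. */
    while (k + 1 < n && totals[k + 1] <= totals[0])
        k++;
    return k;
}
```

This only looks at the first point where the total rises, so it won't explain the odd cases where the timings jump around, but it does give a single number to compare between the Intel and IIT chips.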
The 287 is probably similar. The 486 manuals give a bit more info - of course the FPU is integrated, but there's still a concurrency possibility, and the manual gives some hints on how tied up the integer unit is. On some instructions, like FSIN, apparently the integer unit is completely busy helping the FPU, while on others it is available after a few cycles.
I'm glad for any help!