VOGONS


First post, by Ailicec

User metadata
Rank Newbie

I'm interested in this from an optimization point of view, but thought the emulation crowd would also find it useful for cycle accurate emulation, and might know about it already. It may also apply to some extent to the 286 and 486.

When the 386 encounters a coprocessor instruction, it hands it off to the 387. The 386 should then be able to execute integer instructions without waiting for the 387 to finish. This invites writing interleaved 386/387 code, which could (and should) be faster than non-interleaved code.

Unfortunately, I've never found much detail on how much overlap there really is. For example, take FDIV, which is about 80 cycles on the 387. How many cycles is the 386 preoccupied with this before it can do something else? If it's 10, there's 70 cycles of opportunity; if it's 60, it isn't worth the trouble. The manual also says the 387 can accept "certain" instructions while it is still busy. That's good, but which ones?
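
To put numbers on that trade-off, here's a quick back-of-envelope in Python. The handoff costs are made-up values for illustration, not anything measured:

```python
# Back-of-envelope overlap model. All cycle counts here are
# illustrative assumptions, not measurements.

def overlap_window(fpu_latency, cpu_busy):
    """Cycles of free integer work while the FPU finishes an op."""
    return max(fpu_latency - cpu_busy, 0)

# FDIV ~80 cycles on the 387; two hypothetical handoff costs:
print(overlap_window(80, 10))  # cheap handoff -> 70 cycles of opportunity
print(overlap_window(80, 60))  # expensive handoff -> only 20 cycles
```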

I've been through the hardware and software reference manuals, and they provide little beyond what I've summarized - just that concurrent execution is possible. The hardware manual does give the number of bus cycles needed for memory operand transfers: 2 to 5 (since a bus cycle is two clocks, that's 4-10 clocks).

I attempted some measurements by running an FMUL or FDIV followed by NOPs, but the results were inconclusive. (Short story: some NOPs seemed "free", but it didn't behave the way you'd expect, i.e. adding a NOP might actually make the loop faster.) Also, my Intel and IIT chips seemed to behave differently in this regard.

The 287 is probably similar. The 486 manuals give a bit more info - the FPU is integrated, of course, but there's still a concurrency opportunity, and the manual gives some hints on how tied up the integer unit is. On some instructions, like FSIN, the integer unit is apparently completely busy helping the FPU, while on others it becomes available after a few cycles.

I'd be glad of any help!

Reply 1 of 7, by idspispopd

User metadata
Rank Oldbie

I remember reading something about the Cyrix coprocessors. Apparently Cyrix optimized the 83D87 so much that it didn't really matter any more, because the CPU couldn't feed it any faster - which is why they made the 387+, which is nearly as fast but cheaper to produce (and allows higher clocks).

Some interesting points from http://wiretap.area.com/Gopher/Library/Techdo … /Cpu/coproc.txt

"Unlike the Intel 387DX, the 83D87 (and all other 387-compatible chips as well) does not support asynchronous operation of CPU and coprocessor.
...
The traditional 387 CPU-coprocessor interface via IO ports has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87 executes some operations like addition and multiplication in much less time, its performance is actually limited by the CPU-coprocessor interface."

Reply 2 of 7, by Ailicec

User metadata
Rank Newbie

The "asynchronous" terminology usually refers to the fact that the Intel 387 could be run at a different speed than the 386: anywhere from 10/16 to 16/10 of the 386 clock. This feature was hardly ever used. The clones probably simplified their design a lot by skipping it.

I've seen similar numbers for the communication overhead elsewhere too. I wonder where that figure came from, and whether it's accurate. A book on Google Books states it as 15 cycles. (I wish I knew where they got their info.) http://books.google.com/books?id=-ZnJTTQ3-NYC … verhead&f=false The preview skips a potentially juicy page on the 387, as well as the bibliography.

As an aside, another rarely used feature, hinted at in the hardware manual, is minimal support for multiple 387s with one 386. There's a 387 chip-enable signal called STEN; I think you could use an IO port to switch between several 387s. As cool as that would be, the overhead stated above, plus more overhead for switching between them, makes the idea a non-starter. Too bad.

Reply 3 of 7, by Harekiet

User metadata
Rank DOSBox Author

I always assumed that the 386 can happily continue working while the 387 does its thing and signals the 386 when its result is ready. Programmers would have to use FWAIT if they wanted to wait on the last FPU operation, with the 386 keeping track of the last FPU operation's state internally so it can finish writing the result to memory when done.

Reply 4 of 7, by Ailicec

User metadata
Rank Newbie

The coordination is automatic (unlike the 8087, which needed those FWAITs or bad things would happen). But it's up to the programmer not to issue two float ops in a row.

I reran my tests from last year. The CPU is an AMD-40 (running at 33 MHz). The tests look like this (forgive my rather poor assembly skills; pointers are welcome). WORKOP is "fdivp;" for these tests. The number of NOPs varies by line.

        asm (
            "movl $1000000, %ecx;"      // 1 million iterations
            "fldl JunkConst;"
            "fldl JunkConst;"
            ".align 4;"
            "looptop2:"

            WORKOP
            "nop;nop;"                  // two here; varies from 0 to 10
            WORKOP
            "nop;nop;"
            WORKOP
            "nop;nop;"
            WORKOP
            "nop;nop;"

            "dec %ecx;"
            "jnz looptop2;"
        );

Intel 387DX:
fop Loop 0 nop (ms):7957.840332
fop Loop 1 nop (ms):8323.036133
fop Loop 2 nop (ms):8050.189453
fop Loop 3 nop (ms):7990.254883
fop Loop 4 nop (ms):8139.306152
fop Loop 5 nop (ms):8078.847168
fop Loop 10 nop (ms):8230.615234

IIT 4c87
fop Loop 0 nop (ms):3148.498047
fop Loop 1 nop (ms):3540.601074
fop Loop 2 nop (ms):3721.844971
fop Loop 3 nop (ms):3784.253418
fop Loop 4 nop (ms):4296.504395
fop Loop 5 nop (ms):4599.374023
fop Loop 10 nop (ms):6324.138184

Conclusions: On the Intel, it looks like you get a couple of integer instructions for free after the FDIVP. But the progression is not very linear; sometimes an extra NOP actually speeds things up. Maybe it's an alignment thing.
The IIT appears to be a lot faster, and its NOPs are not "free", suggesting the FPU turns results around as fast as the CPU can request them, i.e. it's bottlenecked on FPU-CPU communication. However, I find it surprising that there's not at least some dead time after an FDIVP. The only data sheet I have, for a 3x87, gives 44 clocks for FDIV. That really ought to leave some dead time where the NOPs could fit for free.
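
Here's a toy model of what I think the loop is doing. The ~3 clocks per NOP and the 16-clock handoff are assumptions for illustration, not measurements:

```python
# Toy model of the WORKOP + NOPs experiment. Assumes ~3 clocks per NOP
# on a 386 and a handoff/overhead cost of 16 clocks -- both assumptions
# for illustration, not measured values.

def slot_cycles(fpu_busy, handoff, n_nops, nop_cost=3):
    """Cycles per WORKOP+NOPs slot: the CPU-side work (handing off the
    next op plus running the NOPs) overlaps the FPU's busy time."""
    cpu_side = handoff + n_nops * nop_cost
    return max(fpu_busy, cpu_side)

def free_nops(fpu_busy, handoff, nop_cost=3):
    """Largest NOP count that doesn't lengthen the loop."""
    n = 0
    while slot_cycles(fpu_busy, handoff, n + 1, nop_cost) == fpu_busy:
        n += 1
    return n

# With an ~80-cycle FDIV and a 16-cycle handoff, about 21 NOPs fit free:
print(free_nops(80, 16))  # -> 21
```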

Strangely, if I run Quake, which is the closest thing I have to an FPU-heavy program, the IIT is slightly slower (maybe 0.5%) than the Intel. Hmm. The IIT ought to blow the 387 away on paper - most ops are two to four times faster - but it actually ends up slightly slower.

Reply 5 of 7, by idspispopd

User metadata
Rank Oldbie
Ailicec wrote:

Strangely, if I run Quake, which is the closest thing I have to an FPU-heavy program, the IIT is slightly slower (maybe 0.5%) than the Intel. Hmm. The IIT ought to blow the 387 away on paper - most ops are two to four times faster - but it actually ends up slightly slower.

In that case you might want to have a look at the chapters about Quake in Michael Abrash's Black Book, especially chapter 63
http://www.phatcode.net/res/224/files/html/ch63/63-01.html
http://downloads.gamedev.net/pdf/gpbb/gpbb63.pdf
https://github.com/jagregory/abrash-black-book/releases (epub/mobi)

Perhaps it's FXCH that makes the difference?

Reply 6 of 7, by Ailicec

User metadata
Rank Newbie

The Pentium is much better documented regarding overlap, latency and throughput. The FXCH trick is basically that you can start an op like FMUL, swap its destination out of the way with FXCH during the same cycle, and start a new FMUL the next cycle. It gets around the x87's requirement to work through the top of the stack, which got in the way of pipelining.
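
As a cycle-count sketch of why that matters: the 3-cycle FMUL latency, the one-issue-per-cycle rate, and the free paired FXCH below are assumptions for illustration, not figures from this thread:

```python
# Illustrative cycle counts for the Pentium FXCH trick. Assumes a
# 3-cycle FMUL latency, one FMUL issued per cycle when independent,
# and an FXCH that pairs for free -- all assumptions for illustration.

FMUL_LATENCY = 3

def dependent_chain(n_muls):
    """n FMULs where each waits on the previous result: latency-bound."""
    return n_muls * FMUL_LATENCY

def interleaved_chains(n_muls):
    """The same n FMULs split across independent chains, rotated to the
    stack top with (free) FXCH so one can start per cycle."""
    # one issue per cycle, plus draining the last op's remaining latency
    return n_muls + (FMUL_LATENCY - 1)

print(dependent_chain(9))     # -> 27 cycles, serialized
print(interleaved_chains(9))  # -> 11 cycles with FXCH interleaving
```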

I'd like to see the same kind of info for the 386/387. At this point I'm thinking there aren't any cycle-accurate emulators for them, because there's no info floating around on what to do. Given that the floating-point ops take variable cycle counts, and there were several clone FPU makers, it's probably pointless anyway.

I reran my experiments this weekend after fixing some problems. I've put the results in terms of how many NOPs you can insert after various floating-point ops before the loop runs slower.

Intel 387DX:
fadd: 2
fmul: 2
fdiv: 18
fsqrt: 26

IIT 4c87
fadd: 3
fmul: 3
fdiv: 5
fsqrt: 15
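
Assuming roughly 3 clocks per NOP on a 386 (my assumption, not something measured above), those free-NOP counts translate into rough overlap windows:

```python
# Rough conversion of the measured free-NOP counts into overlap cycles,
# assuming ~3 clocks per NOP on a 386 (an assumption for illustration).

NOP_CLOCKS = 3

free_nops = {
    "Intel 387DX": {"fadd": 2, "fmul": 2, "fdiv": 18, "fsqrt": 26},
    "IIT 4c87":    {"fadd": 3, "fmul": 3, "fdiv": 5,  "fsqrt": 15},
}

for chip, ops in free_nops.items():
    for op, n in ops.items():
        print(f"{chip} {op}: ~{n * NOP_CLOCKS} overlappable clocks")
```

For example, the Intel's 18 free NOPs after FDIV would be ~54 clocks of overlap out of an ~80-clock FDIV, which is consistent with a double-digit handoff overhead.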

Reply 7 of 7, by idspispopd

User metadata
Rank Oldbie
Ailicec wrote:

The Pentium is much better documented on the overlap, latency and throughput. The FXCH trick is basically that you can start an op like FMUL, swap it out of the way using FXCH during the same cycle, and start a new FMUL the next cycle. It gets around the x87's need to use the top of the stack, which got in the way of pipelining.

No need to tell me - that's exactly what Abrash exploited in Quake, and it's why Quake ran great on the Pentium/PPro/PII but slower on AMD and Cyrix CPUs.
What I wanted to say is that you could try benchmarking the code examples in that chapter with the Intel and IIT FPUs, to see whether the IIT is slower there.