VOGONS


Questions about fpu

First post, by kekko

Rank Member

Recently I noticed that the FPU emulation is not integrated into
the dynamic core. I'd like to mess around with the code and
need some tips on writing a dynamic FPU, something
like a list of the implementation steps.
Is there anyone who can help me?
Thanks in advance!

Reply 1 of 30, by Qbix

Rank DOSBox Author

Well, an asm FPU core is present in the CVS,

but it's not integrated in the dynamic CPU.

The dynamic CPU is a complex thing.
It will take a while before you understand how it works, let alone add your own generated code to it.

Reply 2 of 30, by wd

Rank DOSBox Author

The dynamic core does not handle fpu opcodes, but closes
the block and calls the normal core.

When integrating the FPU into the dynamic core, you can
use the stuff from fpu_instructions_x86.h. I'd stick to
that layout at first because it does not interfere with
other code in DOSBox (not the emulated program) that uses
the FPU. You likely won't get significantly tighter code
because of the stack layout of the FPU (at least that
was the point where I gave up on the idea of an almost 1:1
translation where you just execute the FPU opcodes in a
row and nothing more).

Having games like Quake and Carmageddon in mind that make
heavy use of the fpu and are rather slow with the dynamic
core compared to the normal core, this could indeed be an
improvement 😀

wd

Reply 3 of 30, by kekko

Rank Member

Thanks guys.
The idea of a dynamic FPU started when I tried some of my
old code, a 3D software rasterizer which uses linear and
perspective correction for the triangles in 3D space.
The linear function uses 16:16 fixed-point math, and with
my XP2600+ I can push DOSBox to over 70k cycles with it.
The perspective-corrected one uses floating-point routines
written in asm but is A LOT slower (far more than the
difference between the two techniques on a real CPU).

So I was wondering why it was so slow, and looking at the code
I noticed that the FPU was not dynamically recompiled, so I
decided to play with the code and add this feature.

Any help will be highly appreciated!

Reply 4 of 30, by swaaye

Rank Moderator

I posted a thread the other day about the dynamic core's behaviour in many of the end-of-the-road DOS games, like Dark Forces and X-Wing. They run more consistently and better overall with the normal core for me, on an Athlon64. Dark Forces in particular, with Gulikoza's Feb05 CVS builds, runs a lot better with the normal core.

So, this must mean Dark Forces uses the FPU a lot. Too bad the x87 FPU is so awful; it must be very hard to emulate it well. Does emulation of this benefit from SIMD? I mean, even the Athlon's amazing x87 FPU isn't really all that hot, and emulating x87 down to a real x87 sounds icky 😀

edit: I just saw that Qbix replied to my thread saying it may be self modifying code.

Reply 5 of 30, by wd

Rank DOSBox Author

Use the following (don't forget to include fpu.h) in decoder.h:

static void dyn_fpu_esc0(void) {
    dyn_get_modrm();
    if (decode.modrm.val >= 0xc0) {
        gen_call_function((void*)&FPU_ESC0_Normal,"%Id",decode.modrm.val);
    } else {
        dyn_fill_ea();
        gen_call_function((void*)&FPU_ESC0_EA,"%Id%Dd",decode.modrm.val,DREG(EA));
        gen_releasereg(DREG(EA));
    }
}

And add a
    case 0xd8: dyn_fpu_esc0(); break;
in CreateCacheBlock().
This should work fine for opcodes 0xd8 to 0xde, as they don't
use any registers (no mapping to dynamic regs needed), 0xdf
(FPU_ESC7_Normal) uses ax so it doesn't work that way.

Then you can try to replace the gen_call_functions with code
that directly generates the stuff from fpu_instructions_x86.h
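
For readers following along, the `>= 0xc0` test above separates the register form of an FPU escape opcode from the memory form. A minimal sketch of how the ModRM byte splits into fields is shown here; `ModRM`, `split_modrm` and `is_register_form` are illustrative names and are not part of DOSBox.

```cpp
#include <cassert>
#include <cstdint>

// Fields of an x86 ModRM byte as used by the FPU escape opcodes.
struct ModRM {
    std::uint8_t mod;   // bits 7..6: 3 means register form
    std::uint8_t reg;   // bits 5..3: selects the operation group
    std::uint8_t rm;    // bits 2..0: ST(i) index in register form
};

// Hypothetical helper: split a ModRM byte into its three fields.
inline ModRM split_modrm(unsigned val) {
    assert(val <= 0xff);
    return ModRM{ static_cast<std::uint8_t>(val >> 6),
                  static_cast<std::uint8_t>((val >> 3) & 7),
                  static_cast<std::uint8_t>(val & 7) };
}

// True when the operand is an FPU stack register, i.e. the case the
// snippet above routes to the *_Normal handler.
inline bool is_register_form(unsigned val) {
    return val >= 0xc0;   // equivalent to mod == 3
}
```

For example, the byte 0xc1 after opcode 0xd8 has mod 3, group 0 and rm 1, which is FADD ST,ST(1).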

wd

Reply 6 of 30, by kekko

Rank Member

ok, completed a first try with wd's precious tips:

#define DYN_FPU_ESC(code) { \
    dyn_get_modrm(); \
    if (decode.modrm.val >= 0xc0) { \
        gen_call_function((void*)&FPU_ESC ## code ## _Normal,"%Id",decode.modrm.val); \
    } else { \
        dyn_fill_ea(); \
        gen_call_function((void*)&FPU_ESC ## code ## _EA,"%Id%Dd",decode.modrm.val,DREG(EA)); \
        gen_releasereg(DREG(EA)); \
    } \
}

for every opcode from 0xd8 to 0xde:

    case 0xd8:
        DYN_FPU_ESC(0);
        break;

I got a noticeable speed gain in a couple of '95-'96 games (on the
order of 1/4 to 1/5). Now let's start working!
Thanks again to wd.
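
The macro relies on the preprocessor's `##` operator to build the handler name from the escape number. A small standalone sketch of that mechanism follows; the four `FPU_ESC*` functions here are stand-ins that only return a label, and `PICK_ESC` and `dispatch` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <string>

// Stand-ins for the real handlers; they only report which one was selected.
inline std::string FPU_ESC0_Normal() { return "ESC0_Normal"; }
inline std::string FPU_ESC0_EA()     { return "ESC0_EA"; }
inline std::string FPU_ESC1_Normal() { return "ESC1_Normal"; }
inline std::string FPU_ESC1_EA()     { return "ESC1_EA"; }

// `##` pastes the macro argument into the identifier before compilation,
// so PICK_ESC(1, true) names FPU_ESC1_Normal.
#define PICK_ESC(code, is_reg) \
    ((is_reg) ? FPU_ESC ## code ## _Normal() : FPU_ESC ## code ## _EA())

// Hypothetical dispatcher shaped like the switch in CreateCacheBlock().
inline std::string dispatch(unsigned opcode, unsigned modrm) {
    assert(opcode == 0xd8 || opcode == 0xd9);
    const bool is_reg = modrm >= 0xc0;
    switch (opcode) {
    case 0xd8: return PICK_ESC(0, is_reg);
    default:   return PICK_ESC(1, is_reg);
    }
}
```

Because the name is formed at preprocessing time, each `case` needs its own literal escape number; a runtime variable cannot be pasted this way.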

Reply 7 of 30, by wd

Rank DOSBox Author

Just checked Carmageddon, as it uses the 80-bit FPU opcodes
to render the screen. When including 0xdf in the dynamic
core it goes from about half a frame per second to fluid
speed (checked the videos; gameplay was much faster but
not too playable).

Let us know about any progress 😀

wd

Reply 9 of 30, by swaaye

Rank Moderator

Dark Forces seems particularly demanding. The game uses full 3D objects. I've read it has two 3D engines, one for the environment and another for these 3D objects so I'd imagine it hits the FPU. It really does not benefit at all from dynamic core in the current CVS.

There's a demo.
http://lucasarts.com/products/darkforces/splash.htm

Reply 10 of 30, by kekko

Rank Member

need a hand.
i've substituted:

    case 0xd8:
        dyn_get_modrm();
        if (decode.modrm.val >= 0xc0) {
            gen_call_function((void*)&FPU_ESC0_Normal,"%Id",decode.modrm.val);
        } else {
            dyn_fill_ea();
            gen_call_function((void*)&FPU_ESC0_EA,"%Id%Dd",decode.modrm.val,DREG(EA));
            gen_releasereg(DREG(EA));
        }
        break;

with:

    case 0xd8:
        dyn_get_modrm();
        if (decode.modrm.val >= 0xc0) {
            Bitu group=(decode.modrm.val >> 3) & 7;
            Bitu sub=(decode.modrm.val & 7);
            switch (group){
            case 0x00: /* FADD ST,STi */
                gen_call_function((void*)&FPU_FADD,"%Id%Id",TOP,STV(sub));
                break;
            ....
            }
        } else {
            dyn_fill_ea();
            gen_call_function((void*)&FPU_FLD_F32,"%Dd%Id",DREG(EA),8);
            Bitu group=(decode.modrm.val >> 3) & 7;
            Bitu sub=(decode.modrm.val & 7);
            switch (group){
            case 0x00: /* FADD ST,STi */
                gen_call_function((void*)&FPU_FADD,"%Id%Id",TOP,8);
                break;
            ...
            }
            gen_releasereg(DREG(EA));
        }
        break;

This should be transitional code on the way to native code, but I get some errors.
Is there something I should know about using gen_call_function?

Reply 12 of 30, by wd

Rank DOSBox Author

Don't know what you mean by your last posting (is the problem
resolved?), otherwise:

I think you're mixing recompilation-time variables and execution-time
variables. The FPU top-of-stack pointer changes when executing
the recompiled code, so you can't use immediates (the %Id) here.

Maybe the following helps (don't know if it works):
gen_protectflags();
gen_load_host(&TOP,DREG(TMPB),4);
gen_dop_word(DOP_MOV,true,DREG(TMPW),DREG(TMPB));
gen_dop_word_imm(DOP_ADD,true,DREG(TMPW),decode.modrm.rm);
gen_dop_word_imm(DOP_AND,true,DREG(TMPW),7);
gen_call_function((void*)&FPU_FADD,"%Dd%Dd",DREG(TMPB),DREG(TMPW));

Loads fpu.top, generates STV(sub==decode.modrm.rm) and
calls FPU_FADD (gen_releaseregs as needed).
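
The difference between a value baked in at recompilation time and one read at execution time can be shown with a toy model. This is only a sketch: `ToyFpu`, `stv`, `gen_with_immediate` and `gen_with_runtime_load` are made-up names, and a `std::function` stands in for a block of generated code.

```cpp
#include <cassert>
#include <functional>

// Toy FPU state: only the top-of-stack index matters here.
struct ToyFpu { unsigned top = 0; };

// Index of ST(i) for the current top, in the spirit of the STV() macro.
inline unsigned stv(const ToyFpu& fpu, unsigned i) { return (fpu.top + i) & 7; }

// "Generated code" that returns the register index it would operate on.
using Block = std::function<unsigned()>;

// Problematic: the index is evaluated while generating and stored as a constant.
inline Block gen_with_immediate(const ToyFpu& fpu, unsigned i) {
    const unsigned baked = stv(fpu, i);
    return [baked] { return baked; };
}

// Intended: the generated code reads fpu.top every time it runs.
// The ToyFpu object must outlive the returned block.
inline Block gen_with_runtime_load(const ToyFpu& fpu, unsigned i) {
    assert(i < 8);
    return [&fpu, i] { return stv(fpu, i); };
}
```

If `top` changes between generation and execution, the first block keeps returning the old index while the second follows the live value, which is the effect described for `%Id` versus `%Dd` arguments.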

@swaaye: if the game uses FPU escape codes very often, the
additions from above should give a decent speed increase,
just add them to decoder.h

wd

Reply 13 of 30, by kekko

Rank Member

ok, I'll explain better.
I had problems when mixing code:

case 0xd8:
    dyn_get_modrm();
    if (decode.modrm.val >= 0xc0) {
        Bitu group=(decode.modrm.val >> 3) & 7;
        Bitu sub=(decode.modrm.val & 7);
        switch (group){
        case 0x00: /* FADD ST,STi */
            gen_call_function((void*)&FPU_FADD,"%Id%Id",TOP,STV(sub));
            break;
        ......
case 0xd9:
    DYN_FPU_ESC(1);
    break;

Then I tried leaving only 0xd8 (letting 0xd9 to 0xdf go through the normal code),
and it works (?).
Now I have translated 0xd9:

    dyn_get_modrm();
    if (decode.modrm.val >= 0xc0) {
        Bitu group=(decode.modrm.val >> 3) & 7;
        Bitu sub=(decode.modrm.val & 7);
        switch (group){
        case 0x00: /* FLD STi */
            {
                //Bitu reg_from=STV(sub);
                gen_call_function((void*)&FPU_PREP_PUSH,"");
                gen_call_function((void*)&FPU_FST,"%Id%Id",STV(sub), TOP);
                break;
        ...

but now I get this error:

Exit to error: CPU_SetSegGeneral: Stack segment with invalid privileges

Your tips didn't help.
I'm attaching the header I'm working on.

EDIT: removed attachment. lots of errors in it. thanks again wd!

Last edited by kekko on 2005-04-12, 11:17. Edited 2 times in total.

Reply 14 of 30, by wd

Rank DOSBox Author

You can't use immediates in the code, and I don't think the FADD
you posted works in all cases (try some FPU test programs).

Likewise, in the 2nd case, the change made by FPU_PREP_PUSH
(which modifies fpu.top) won't have any effect because
FPU_FST always uses the value fpu.top had when
the dynamic code was generated.
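
To make the FLD ST(i) case concrete, here is a toy register stack in which both the source and the destination slot are resolved from the live value of `top`. It assumes a push decrements `top` modulo 8, as on a real x87; `ToyFpu`, `stv`, `prep_push` and `fld_sti` are illustrative names and not the DOSBox functions.

```cpp
#include <array>
#include <cassert>

// Toy x87 register file: eight slots plus a top-of-stack index.
struct ToyFpu {
    std::array<double, 8> regs{};
    unsigned top = 0;
};

// Physical slot of ST(i) for the current top.
inline unsigned stv(const ToyFpu& f, unsigned i) { return (f.top + i) & 7; }

// A push moves top down by one slot, wrapping from 0 to 7.
inline void prep_push(ToyFpu& f) { f.top = (f.top - 1) & 7; }

// FLD ST(i): push a copy of ST(i). The source slot is resolved before the
// push and the destination after it, both at execution time.
inline void fld_sti(ToyFpu& f, unsigned i) {
    assert(i < 8);
    const unsigned from = stv(f, i);
    prep_push(f);
    f.regs[f.top] = f.regs[from];
}
```

With `top` at 3 and ST(1) holding 2.5, `fld_sti(f, 1)` moves `top` to 2 and leaves 2.5 in the new ST(0); a destination index captured before the push would still point at slot 3.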

Will check the attached file when I come home.

wd

Reply 16 of 30, by wd

Rank DOSBox Author

Didn't find anything obvious, and the operation on the
FPU is correct according to fputest. The x86 FPU code modifies
ebx (and eax), so you could save them across the call
(cache_addb(0x53); and cache_addb(0x5b); for ebx,
I think).
Or try the normal FPU; maybe the regs are preserved there
or it behaves differently.
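
The two bytes mentioned are the one-byte x86 encodings of push ebx (0x53) and pop ebx (0x5b). A minimal sketch of wrapping emitted code in that pair is below; `CodeBuffer`, `emit_byte` and `emit_preserving_ebx` are stand-ins writing to a plain vector, not the real code cache functions.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using CodeBuffer = std::vector<std::uint8_t>;

// Stand-in for the emitter's byte writer.
inline void emit_byte(CodeBuffer& buf, std::uint8_t b) { buf.push_back(b); }

// Wrap whatever `emit_body` generates in push ebx / pop ebx so the
// register survives the emitted call.
inline void emit_preserving_ebx(CodeBuffer& buf,
                                const std::function<void(CodeBuffer&)>& emit_body) {
    assert(emit_body);
    emit_byte(buf, 0x53);   // push ebx
    emit_body(buf);
    emit_byte(buf, 0x5b);   // pop ebx
}
```

The body must leave the host stack balanced, otherwise the pop restores the wrong value.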

Also you shouldn't forget to release the TMPB/TMPWs
when they are no longer used (end of an opcode e.g.)

Which game(s) did you use to test?

wd

Reply 17 of 30, by kekko

Rank Member

Hi wd, thanks for replying. I'll try to push/pop the regs
and see what happens.
The main game I'm using is a Tomb Raider 1 demo
http://www.tombraiderchronicles.com/tr1/demo.html
with high detail (it uses floating-point perspective texturing).
I'm testing one opcode at a time (the others go through
the normal core).

Reply 18 of 30, by kekko

Rank Member

No way.
I'm trying to improve my knowledge of the recompiler.
Any help (docs, tips) will be appreciated as always.
Meanwhile, the first code (by wd) seems stable and
quite fast, so it could be added to CVS if you agree.

decoder.h:

#include "fpu.h"
#define DYN_FPU_ESC(code) { \
    dyn_get_modrm(); \
    if (decode.modrm.val >= 0xc0) { \
        gen_call_function((void*)&FPU_ESC ## code ## _Normal,"%Id",decode.modrm.val); \
    } else { \
        dyn_fill_ea(); \
        gen_call_function((void*)&FPU_ESC ## code ## _EA,"%Id%Dd",decode.modrm.val,DREG(EA)); \
        gen_releasereg(DREG(EA)); \
    } \
}

decoder.h (CreateCacheBlock):

    case 0xd8:
        DYN_FPU_ESC(0);
        break;
    case 0xd9:
        DYN_FPU_ESC(1);
        break;
    case 0xda:
        DYN_FPU_ESC(2);
        break;
    case 0xdb:
        DYN_FPU_ESC(3);
        break;
    case 0xdc:
        DYN_FPU_ESC(4);
        break;
    case 0xdd:
        DYN_FPU_ESC(5);
        break;
    case 0xde:
        DYN_FPU_ESC(6);
        break;
    case 0xdf:
        DYN_FPU_ESC(7);
        break;
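
The eight cases follow a regular pattern: the escape number is the low three bits of the opcode. A short sketch of that relationship, with the hypothetical helper names `fpu_escape_index` and `is_fpu_escape`:

```cpp
#include <cassert>

// Opcodes 0xd8..0xdf are the eight FPU escape opcodes; the low three bits
// give the N in FPU_ESCN_Normal / FPU_ESCN_EA.
inline unsigned fpu_escape_index(unsigned opcode) {
    assert(opcode >= 0xd8 && opcode <= 0xdf);
    return opcode & 7;
}

// True for any opcode in the 0xd8..0xdf range.
inline bool is_fpu_escape(unsigned opcode) {
    return (opcode & 0xf8) == 0xd8;
}
```

The macro version still needs one `case` per opcode because the handler name is pasted together by the preprocessor, but the index is handy for a table of function pointers or for logging.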