VOGONS


Ideas about speeding up the dynrec

First post, by M-HT

Rank Newbie

Hi,

I have some ideas for speeding up the dynrec, both the old (x86) one and the new (non-x86) one.

Idea 1:
When the instruction stream is being translated and the maximum number of instructions per block is reached, the code block is closed and exited.
The idea is to link the block to the following block instead (much like what is done for an unconditional short/near jump).
An implementation for the non-x86 dynrec is in the attached file decoder.h.
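
A minimal sketch of Idea 1, assuming a toy translator in which every guest instruction is one byte of straight-line code. It is not the code from the attached decoder.h; `Block`, `translate`, `link_fallthrough` and the limit `kMaxOps` are hypothetical names, and a real translator would emit host code instead of filling a struct.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical model of a translated block; not DOSBox's real data structure.
struct Block {
    uint32_t start = 0;      // guest address of the first instruction
    uint32_t end = 0;        // guest address just past the last instruction
    bool hit_limit = false;  // closed only because kMaxOps was reached
    Block* link = nullptr;   // next block to run without leaving the core
};

constexpr std::size_t kMaxOps = 32;  // assumed per-block instruction limit

// Translate up to kMaxOps one-byte "instructions" in [addr, code_end).
Block* translate(std::map<uint32_t, Block>& cache, uint32_t addr, uint32_t code_end) {
    Block& b = cache[addr];  // std::map keeps references stable across inserts
    b.start = addr;
    std::size_t ops = 0;
    while (addr < code_end && ops < kMaxOps) { ++addr; ++ops; }
    b.end = addr;
    b.hit_limit = (ops == kMaxOps && addr < code_end);
    return &b;
}

// Idea 1: a block that was closed by the limit is linked to the block starting
// at its end address, much like the target of an unconditional near jump.
void link_fallthrough(std::map<uint32_t, Block>& cache, Block* b, uint32_t code_end) {
    if (!b->hit_limit || b->link) return;
    auto it = cache.find(b->end);
    b->link = (it != cache.end()) ? &it->second : translate(cache, b->end, code_end);
}
```

Translating 40 bytes starting at address 0 would give a first block covering 0..32 that is linked to a second block covering 32..40.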

Idea 2:
If an instruction uses an immediate value (as an operand or in a memory access), the value is currently not encoded in the translated code but is read from the original instruction stream.
This helps with self-modifying code (SMC), because if only the immediate value is changed, the code doesn't need to be translated again.
But it's also slower than encoding the immediate value in the translated code, because the generated code is (slightly) longer and there's one more memory access when executing it.
The idea is to encode the immediate value in the translated code unless the immediate value was changed by SMC. In that case the immediate value is read from the original instruction stream (as it is now).
That means that when a code block is translated for the first time, the immediate value is encoded in the translated code. When SMC changes the immediate value, the code block is translated again, but this time the immediate value is read from the original instruction stream.

An implementation for the non-x86 dynrec (modified functions decode_fetchb_imm, decode_fetchw_imm and decode_fetchd_imm) is in the attached file decoder_basic.h.
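
The trade-off behind Idea 2 can be sketched as follows. This is only an illustration and not the code from decoder_basic.h: `ImmOperand` and `translate_imm` are hypothetical, and the "translated code" is modelled as a small struct that either carries the immediate or points back into the guest instruction stream.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of how a translated instruction obtains an 8-bit immediate.
struct ImmOperand {
    bool from_stream;       // true: re-read guest memory on every execution
    uint8_t embedded;       // value baked into the translated code
    const uint8_t* source;  // location of the immediate in the guest stream

    uint8_t value() const { return from_stream ? *source : embedded; }
};

// The first translation embeds the value; a retranslation caused by SMC on the
// immediate byte switches to reading the stream, as Idea 2 proposes.
ImmOperand translate_imm(const std::vector<uint8_t>& guest, std::size_t pos,
                         bool modified_by_smc) {
    return ImmOperand{modified_by_smc, guest[pos], &guest[pos]};
}
```

With the guest bytes `B0 05` (mov al, 5), an operand translated with `modified_by_smc == false` keeps returning 5 after the guest overwrites the 05, so the block has to be translated again. An operand translated with `modified_by_smc == true` follows every later change without retranslation, at the cost of one extra memory read per execution.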

I don't know whether these ideas (or my implementations) have any disadvantages, but if they do, I would like to know about them.

I also have a question related to idea 2. In the non-x86 dynrec, some instructions use the function dyn_dop_word_imm (which reads immediate values from the original instruction stream) and some use dyn_dop_word_imm_old (which encodes immediate values in the translated code). Why is that?

And lastly an observation: in the latest version of core_dynrec.cpp (in cvs) a new definition, POWERPC, was added, but it has the same value as ARMV4LE. A new include is also referenced (core_dynrec/risc_ppc.h), which isn't in cvs.

Attachments

  • decoder.h (18.23 KiB)
  • decoder_basic.h (42.94 KiB)

Reply 1 of 28, by wd

Rank DOSBox Author

M-HT wrote:

In the non-x86 dynrec, some instructions use the function dyn_dop_word_imm (which reads immediate values from the original instruction stream) and some use dyn_dop_word_imm_old (which encodes immediate values in the translated code). Why is that?

Speed. The SMC-aware functions are even heavier there, so only those really
needed for a good speedup are enabled.

in the latest version of file core_dynrec.cpp (in cvs) a new definition was added - POWERPC, but it has the same value as ARMV4LE. Also, new include is referenced (core_dynrec/risc_ppc.h)

Just ignore those (POWERPC should have a different value though...), as the
implementation is not working and I haven't heard from the people since then.

Reply 2 of 28, by wd

Rank DOSBox Author

When translating the instruction stream and the maximum number of instructions is reached, the code block is closed and exited.
The idea is to link the block to the following block (something like when doing unconditional short/near jump).

Implementation is fine, don't see any drawback in doing that.

If an instruction uses immediate value (as an operand or in memory access) the value is not encoded in the translated code, but it's read from original instruction stream.
This helps the self-modifying code (SMC), because if only the immediate value is changed the code doesn't need to be translated again.

Right, that's why a bunch of instructions have been modified to behave like
that, namely those that are very often modified in a class of games (Build
engine games like duke3d, for example).

That means that when a code block is translated for the first time, the immediate value is encoded in the translated code. When the SMC changes the immediate value, the code block is translated again, but this time the immediate value is read from original instruction stream.

The code is ok as well, and it might improve speed but you'd have to actually
test that with some games (duke3d/blood, terminator games for example).

Reply 3 of 28, by kekko

Rank Oldbie

Hi,
I have a question for wd about this point:

wd wrote:

When translating the instruction stream and the maximum number of instructions is reached, the code block is closed and exited.
The idea is to link the block to the following block (something like when doing unconditional short/near jump).

Implementation is fine, don't see any drawback in doing that.

Wasn't the interruption of block processing at max_opcodes necessary for DOSBox event handling? If I remember correctly, DOSBox must leave the dynamic core relatively often to return to the main loop and handle events; this came up when we talked about improving short jump handling some time ago.

Reply 5 of 28, by M-HT

Rank Newbie
wd wrote:

That means that when a code block is translated for the first time, the immediate value is encoded in the translated code. When the SMC changes the immediate value, the code block is translated again, but this time the immediate value is read from original instruction stream.

The code is ok as well, and it might improve speed but you'd have to actually
test that with some games (duke3d/blood, terminator games for example).

Ok, I'll try to compile and test it with duke3d

Reply 6 of 28, by M-HT

Rank Newbie

I compiled (both ideas) and tested it with duke3d.
It's working and I don't see any slowdown or other negative aspects.

The attached file is my implementation for the x86 dynrec.

Attachments

  • decoder.h (82.18 KiB)

Reply 7 of 28, by wd

Rank DOSBox Author

It's working and I don't see any slowdown or other negative aspects.

Same here, blood gets a small speedup.

Do you have any games in mind (arm recompiler) that benefit from the code?

Reply 8 of 28, by M-HT

Rank Newbie
wd wrote:

It's working and I don't see any slowdown or other negative aspects.

Same here, blood gets a small speedup.

Do you have any games in mind (arm recompiler) that benefit from the code?

I don't have any specific game in mind, but in theory all (or most) 32-bit code should benefit from this, because every memory access with a dword displacement gets faster.
I tested it with the ARM recompiler and gained 3% more speedup relative to the simpler core with version 0.73 (with version 0.72 it was 5% more).

Reply 9 of 28, by kekko

Rank Oldbie

Nice work. I just tried quake and got a nice +7 fps on timedemo demo1, but I had to disable sound because I've had some problems with cvs lately (DOSBox gives me the message "Exit to error: SB: 16bit irq pending").
duke3d also got a noticeable speedup, but I only checked the frame counter in the corner and didn't run a real test.

Reply 10 of 28, by wd

Rank DOSBox Author

I don't get these high improvements like kekko, but +1 fps (quake, from 13 to 14fps)
is nice imo especially as there should not be any drawback.

Code should be

static bool decode_fetchb_imm(Bitu & val) {
    if (decode.page.index<4096) {
        if (decode.page.invmap != NULL) {
            if (decode.page.invmap[decode.page.index] == 0) {
                val=(Bit32u)decode_fetchb();
                return false;
            }
        }
        HostPt tlb_addr=get_tlb_read(decode.code);
        if (tlb_addr) {
            val=(Bitu)(tlb_addr+decode.code);
            decode_increase_wmapmask(1);
            decode.code++;
            decode.page.index++;
            return true;
        }
    }
    val=(Bit32u)decode_fetchb();
    return false;
}

though (if-level changed).

Reply 11 of 28, by HunterZ

Rank l33t++

Would be interested to see whether this would buy any FPS increase in SkyNET while running in VESA 640x480. I'm not sure what the bottleneck is, but the framerate is consistently low while in indoors environments/levels.

Reply 13 of 28, by kekko

User metadata
Rank Oldbie
Rank
Oldbie
wd wrote:

I don't get these high improvements like kekko

heh, the difference mostly depends on the rig... from those figures I guess you may need a serious upgrade, wd 😜
Anyway I ran quake timedemo demo1 in vid_mode 0 with sound and it went from 78fps on 0.73 official to 86fps. Quite nice indeed.

Reply 15 of 28, by M-HT

Rank Newbie
wd wrote:

Code should be

static bool decode_fetchb_imm(Bitu & val) {
    if (decode.page.index<4096) {
        if (decode.page.invmap != NULL) {
            if (decode.page.invmap[decode.page.index] == 0) {
                val=(Bit32u)decode_fetchb();
                return false;
            }
        }
        HostPt tlb_addr=get_tlb_read(decode.code);
        if (tlb_addr) {
            val=(Bitu)(tlb_addr+decode.code);
            decode_increase_wmapmask(1);
            decode.code++;
            decode.page.index++;
            return true;
        }
    }
    val=(Bit32u)decode_fetchb();
    return false;
}

though (if-level changed).

No, this is wrong. When a code block is translated and there have been no changes due to SMC, decode.page.invmap is NULL, so this code would read the immediate value from the original code instead of encoding it in the translated instructions.
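
A minimal sketch of the decision described in this objection, reduced to a standalone predicate. It is not the code from the attached decoder_basic.h: `PageState` and `read_imm_from_stream` are hypothetical, the field types are assumptions, a nonzero invmap entry is assumed to mean "this byte was modified by SMC", and the TLB lookup from the quoted function is left out.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for the per-page decode state. Field names follow the
// quoted code; the types are assumptions.
struct PageState {
    std::size_t index;      // offset of the next byte within the 4096-byte page
    const uint8_t* invmap;  // per-byte SMC markers, or nullptr if none recorded
};

// Returns true when the immediate should be read from the instruction stream
// at run time, false when it should be encoded in the translated code.
bool read_imm_from_stream(const PageState& page) {
    if (page.index >= 4096) return false;      // outside the page: plain fetch
    if (page.invmap == nullptr) return false;  // no SMC seen: encode the value
    return page.invmap[page.index] != 0;       // only modified bytes stay indirect
}
```

The difference from the quoted version is the `nullptr` case: without an invalidation map the value is encoded, which is what the first translation of a block should do under Idea 2.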

Reply 17 of 28, by M-HT

Rank Newbie

I would like an opinion on this idea:
In dynrec, every memory access calls a function. This function checks whether the address can be accessed directly or needs a read/write handler, and then accesses the memory using the appropriate method.
The idea is to call a different function instead. It checks whether the address can be accessed directly. If yes, it increases a counter for this memory access (and accesses the memory). If not, it rewrites the code it was called from to call the old memory access function (and of course accesses the memory). If the counter reaches a certain threshold (meaning that all addresses used by this memory access so far were directly accessible), the function rewrites the code it was called from to access the memory directly (and not to call a function).
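
The state changes of one such memory access can be sketched as a small state machine. This is an illustration under assumptions, not M-HT's implementation: `AccessSite`, `SiteMode` and `on_access` are hypothetical names, and "rewriting the calling code" is modelled as a mode change. The threshold of 1024 is the value given later in this post.

```cpp
// Hypothetical model of one memory-access site in translated code.
enum class SiteMode { Counting, Direct, Checked };

struct AccessSite {
    SiteMode mode = SiteMode::Counting;
    int counter = 0;
};

constexpr int kThreshold = 1024;  // threshold value given in this post

// Called for each access while the site is still in Counting mode.
// `direct_ok` says whether this address can be accessed without a handler.
void on_access(AccessSite& site, bool direct_ok) {
    if (site.mode != SiteMode::Counting) return;  // site was already rewritten
    if (!direct_ok) {
        site.mode = SiteMode::Checked;  // rewrite: call the old checked function
    } else if (++site.counter >= kThreshold) {
        site.mode = SiteMode::Direct;   // rewrite: access memory directly
    }
}
```

A site that sees 1024 directly accessible addresses in a row ends up in `Direct` mode; a single handler-backed address before that point moves it to `Checked` for good.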

This method can break some games (so it could only be optional), but the question is: does it make sense to implement it, or is it a bad idea because it breaks a lot of games?

I know that some speed can be gained by inlining the (original) function, but I've done some testing on arm (not with the standard dynrec) and these are the results (less is better):

original function: 3m53.840s
inlined original function: 3m49.730s
new function with counter: 3m28.520s
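
Converted to relative savings, the three times above work out to roughly 1.8% for inlining and roughly 10.8% for the counter variant. A small sketch of that arithmetic (`saved_percent` and the constant names are only illustrative):

```cpp
// Relative time saved versus a baseline, in percent.
constexpr double saved_percent(double baseline_s, double new_s) {
    return (baseline_s - new_s) / baseline_s * 100.0;
}

constexpr double kOriginal = 3 * 60 + 53.840;  // 233.84 s, original function
constexpr double kInlined  = 3 * 60 + 49.730;  // 229.73 s, inlined original
constexpr double kCounter  = 3 * 60 + 28.520;  // 208.52 s, function with counter

constexpr double kInlinedSaving = saved_percent(kOriginal, kInlined);  // about 1.8
constexpr double kCounterSaving = saved_percent(kOriginal, kCounter);  // about 10.8
```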

BTW, the threshold that I'm using is 1024.

Reply 18 of 28, by wd

Rank DOSBox Author

You're talking about mem_readb_checked_drc et al? I think that would break
if something does a sweep read of the memory say from address 0x1000:0
to 0xb000:0 (so includes regular memory and graphics memory),
which exceeds your limit.
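
The sweep scenario can be simulated with a few lines. This is only a sketch under assumptions: everything below linear address 0xA0000 is treated as directly accessible, everything from the start of the VGA window at 0xA0000 as handler-backed, and the single access site uses the 1024 threshold from Reply 17. `first_bad_direct_read` is a hypothetical helper.

```cpp
#include <cstdint>

constexpr uint32_t kVideoStart = 0xA0000;  // start of the VGA memory window
constexpr uint32_t kThreshold = 1024;      // threshold mentioned in Reply 17

// Real-mode segment:offset to linear address.
constexpr uint32_t linear(uint16_t seg, uint16_t off) {
    return (uint32_t(seg) << 4) + off;
}

// Simulates a byte-wise sweep over [from, to) through a single access site.
// Returns the first address that is read directly although it needs a handler,
// or 0 if the sweep never gets into that situation.
uint32_t first_bad_direct_read(uint32_t from, uint32_t to) {
    uint32_t counter = 0;
    bool patched_direct = false;
    for (uint32_t addr = from; addr < to; ++addr) {
        bool needs_handler = addr >= kVideoStart;
        if (patched_direct) {
            if (needs_handler) return addr;  // direct access to handler memory
        } else if (needs_handler) {
            return 0;  // still counting: site falls back to the checked function
        } else if (++counter >= kThreshold) {
            patched_direct = true;  // site rewritten to direct access
        }
    }
    return 0;
}
```

For the sweep from 0x1000:0 to 0xb000:0 the site is rewritten after the first 1024 reads, long before the 0x90000 bytes of regular memory are exhausted, so the read at 0xA0000 goes wrong.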

Can you get rid of more than the call to get_tlb_read and the if (tlb_addr)?
As that's what would be saved by that (?).

I'm not sure if there might be a problem with for example code that's used
in v86 mode and real mode, the 1024/whatever limit may be nicely safe.

Reply 19 of 28, by M-HT

User metadata
Rank Newbie
Rank
Newbie
wd wrote:

You're talking about mem_readb_checked_drc et al? I think that would break
if something does a sweep read of the memory say from address 0x1000:0
to 0xb000:0 (so includes regular memory and graphics memory),
which exceeds your limit.

Yes, I'm talking about those functions.
I know it breaks in cases like that, but a standard program accesses global, stack and heap variables (all directly accessible) and graphics memory or other memory-mapped I/O (all accessed through a handler), so there shouldn't be any problem. Of course games often do non-standard things, but I think that applies more to older games than to newer ones, which need the speedup more.
That's why this method could only be optional (selectable in the .conf).

wd wrote:

Can you get rid of more than the call to get_tlb_read and the if (tlb_addr) ?
As that's what would be saved by that (?).

When I compare a call to the original function with direct memory access in the translated code, I save the call itself, the exception check, the check for direct access and sometimes also saving the state before the call and restoring it afterwards.
When I compare the inlined original function with direct memory access in the translated code, I save a compare instruction, a not-taken conditional branch and an unconditional branch (to skip the handler access code), and the translated code is shorter (maybe half). Either the branches are that expensive, or the saving just accumulates because the code is used so often, or the shorter code makes better use of the processor cache.

For word and dword accesses I also save the check for a page boundary. In this case the new function checks for the page boundary, and if the access crosses it, the counter is decreased. If the counter drops below the negative threshold, the function rewrites the code (where it was called from) to call the old memory access function (and accesses the memory). If the function rewrites the code to access the memory directly, the page boundary check is not used.
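
The word/dword variant can be sketched in the same way. Again this is an assumption-laden illustration and not the actual implementation: `WideAccessSite`, `crosses_page` and `on_wide_access` are hypothetical, pages are taken to be 4096 bytes, and all addresses are assumed to be directly accessible so that only the page-boundary logic is shown.

```cpp
#include <cstdint>

enum class SiteMode { Counting, Direct, Checked };

// Hypothetical model of one word/dword access site in translated code.
struct WideAccessSite {
    SiteMode mode = SiteMode::Counting;
    int counter = 0;
};

constexpr int kThreshold = 1024;
constexpr uint32_t kPageSize = 4096;

// True when an access of `size` bytes at `addr` spans two pages.
constexpr bool crosses_page(uint32_t addr, uint32_t size) {
    return (addr & (kPageSize - 1)) + size > kPageSize;
}

// Page-crossing accesses count against the site, all others count for it.
void on_wide_access(WideAccessSite& site, uint32_t addr, uint32_t size) {
    if (site.mode != SiteMode::Counting) return;  // site was already rewritten
    if (crosses_page(addr, size)) {
        if (--site.counter < -kThreshold) site.mode = SiteMode::Checked;
    } else if (++site.counter >= kThreshold) {
        site.mode = SiteMode::Direct;  // from here on no boundary check is done
    }
}
```

Once a site is in `Direct` mode, a later access that does cross a page boundary is no longer detected, which is one of the ways this optimisation can break a program.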

wd wrote:

I'm not sure if there might be a problem with for example code that's used
in v86 mode and real mode, the 1024/whatever limit may be nicely safe.

I don't know what code might be problematic; that's why I want opinions about this method.
I chose 1024 as the threshold value because my test program crashed with 256 and worked with 512, so I doubled the value for safety. And since the program itself isn't doing anything that could cause the crash (I know because I wrote it), the culprit was probably the DOS extender.