VOGONS


ARM (Thumb) Dynamic Core Code

First post, by Pickle

Rank Member

First, I'm not the author of the ARM dynamic core code, but I am the maintainer of the GP2X version of DOSBox.
A user on our GP2X forums, M-HT, has written the code for an ARM dynamic core.

This is the latest discussion about the code: http://www.gp32x.com/board/index.php?showtopic=43316&st=0

This is M-HT's webpage with the file changes: http://members.chello.sk/apauer/dosbox2/dosbox2.html

With the dynamic core there's about a 45%-50% increase in speed over the simple core. There are multiple versions of the dynamic core; in the latest version the code has a negative effect on x86 versions. In other words, the faster version of this core should only be used on ARM devices.
My hope is that by posting this here, other ARM devices might benefit from this work.

Reply 1 of 48, by wd

Rank DOSBox Author

I'll try to integrate it eventually, but according to the thread it changes other
parts of dosbox to get the most speed out of it. Might be better to have it
separated (as done with the other backends) for the moment.
Multiple versions are no problem, choice is done by #defines or ./configure
(the latter preferred if it's possible to choose the correct one).

Reply 3 of 48, by Pickle

Rank Member

Yeah wd, you're correct: according to M-HT, the latest and currently fastest version doesn't work well with the other dynamic cores. I haven't tried it on a PC, only the GP2X version.
But if you integrate it someday, that would be great.

Reply 5 of 48, by M-HT

Rank Newbie

I rewrote the newer Thumb recompiler to make it simple to integrate into the official DOSBox code.
The result can be seen on this page: http://members.chello.sk/apauer/dosbox3/dosbox3.html

Reply 6 of 48, by wd

Rank DOSBox Author

Thank you, seems fine.

If you're interested: crazyc has extended/modified the recompiler quite a lot,
yet mostly targeted at his mips backend. He's using the 0.71 dosbox sources as base,
nevertheless you might find some interesting ideas in it.
http://forums.ps2dev.org/viewtopic.php?t=3179 (0.71 patch).

Reply 9 of 48, by M-HT

Rank Newbie

I implemented the ideas and the result (and some explanation) can be seen on this page: http://members.chello.sk/apauer/dosbox5/dosbox5.html
I changed the recompiler to bring more speed to the ARM backend - the other backends should be working exactly as before.
Hopefully it can be integrated into CVS.
If you have any questions, feel free to ask.

Reply 10 of 48, by wd

Rank DOSBox Author

Thanks (what an awful load of changes to check 😉 ), but it seems to lack
some parts. Like the function dyn_pop_seg(Bit8u seg) can change the value
of the segment register (the recompiler does NOT emit a block finish on
that function) but it is not reflected in the cache hostreg you are using.
But i haven't checked it out in-depth yet.

Thanks for your work; maybe you can post some figures on how much faster
it is (if it's noticeable for some games).

Reply 11 of 48, by M-HT

Rank Newbie

wd wrote:

Thanks (what an awful load of changes to check 😉 ), but it seems to lack
some parts. Like the function dyn_pop_seg(Bit8u seg) can change the value
of the segment register (the recompiler does NOT emit a block finish on
that function) but it is not reflected in the cache hostreg you are using.
But i haven't checked it out in-depth yet.

Actually, in the hostreg(s), I keep the address of where the registers are stored in memory (&cpuregs and &Segs), so the value of the registers can be changed freely anywhere and anytime.
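
The point above can be illustrated with a minimal sketch. It is not taken from M-HT's patch: `EmuRegs`, `BasePointerCache`, `ValueCache` and `outside_helper` are hypothetical names, and the register layout is an assumption made only for illustration. It contrasts a host register that holds the address of the register file with one that holds a copy of a register's value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the emulated register file
// (NOT the real DOSBox cpuregs/Segs layout).
struct EmuRegs {
    std::uint32_t gpr[8];
};

// Approach described in the post: the "host register" holds only the
// ADDRESS of the register file, so every access goes through memory.
struct BasePointerCache {
    EmuRegs* base;  // plays the role of a host register holding &cpuregs

    std::uint32_t read(int idx) const {
        assert(idx >= 0 && idx < 8);
        return base->gpr[idx];
    }
    void write(int idx, std::uint32_t v) {
        assert(idx >= 0 && idx < 8);
        base->gpr[idx] = v;
    }
};

// Contrast: a cache that holds a copy of the VALUE goes stale as soon as
// something else modifies the register in memory.
struct ValueCache {
    std::uint32_t cached;
    explicit ValueCache(const EmuRegs& r, int idx) : cached(r.gpr[idx]) {}
};

// Hypothetical helper that changes a register outside the generated code,
// in the way a function such as dyn_pop_seg might.
void outside_helper(EmuRegs& regs, int idx, std::uint32_t v) {
    regs.gpr[idx] = v;
}

// The base-pointer scheme observes the outside change immediately.
bool base_pointer_sees_update() {
    EmuRegs regs{};
    BasePointerCache bp{&regs};
    outside_helper(regs, 0, 0x1234);
    return bp.read(0) == 0x1234;
}

// The value-caching scheme is left holding an outdated copy.
bool value_cache_goes_stale() {
    EmuRegs regs{};
    ValueCache vc(regs, 0);
    outside_helper(regs, 0, 0x1234);
    return vc.cached != regs.gpr[0];
}
```

Both functions return true: the pointer-based scheme needs no synchronisation when a helper writes to the register file, while a value cache would need an explicit reload.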

wd wrote:

Thanks for your work; maybe you can post some figures on how much faster
it is (if it's noticeable for some games).

I benchmarked the speedup of the dynamic core vs. the simple core, and these are the results for various versions of the code emitters.

Before the latest changes:
risc_armv4le-thumb-iw.h - ~49% speedup
risc_armv4le-thumb-niw.h - ~45% speedup
risc_armv4le-o3.h - ~29% speedup
risc_armv4le-s3.h - ~25% speedup
risc_armv4le-thumb.h - ~9% speedup

After the latest changes:
risc_armv4le-thumb-iw.h - ~55% speedup
risc_armv4le-thumb-niw.h - ~51% speedup
risc_armv4le-o3.h - ~38% speedup
risc_armv4le-s3.h - ~38% speedup
risc_armv4le-thumb.h - ~26% speedup

I expected a bit more speedup from the latest changes, but I think it's still worth it.

Reply 12 of 48, by wd

Rank DOSBox Author

M-HT wrote:

I keep the address of where the registers are stored in memory (&cpuregs and &Segs)

Ok it looked like you'd use something like crazyc to use a bunch of registers
for the emulated registers. Don't know if it'd be better to go that way.
The files you've uploaded seem fine nevertheless.

Reply 13 of 48, by M-HT

Rank Newbie

wd wrote:

Ok it looked like you'd use something like crazyc to use a bunch of registers
for the emulated registers. Don't know if it'd be better to go that way.
The files you've uploaded seem fine nevertheless.

Well, ARM doesn't have enough free registers to hold the emulated registers, so I think what I did was the next best thing.
The reason is that on ARM, to access memory (read/write), you must first put the memory address into a register and then use that register (with a possible small displacement) to access the memory. So instead of doing that in almost every emulated instruction which uses registers, I do it once when entering the recompiled code, which leads to some speedup.
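
As a rough illustration of why this helps, here is a toy cost model. It is a sketch under stated assumptions, not a measurement of the real emitter: it assumes that putting an address into a register costs exactly one emitted instruction and that each register access costs one load or store. `EmitCount`, `emit_naive`, `emit_with_base` and `saved` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Toy instruction counts for one recompiled block (assumed cost model).
struct EmitCount {
    std::size_t address_loads;    // instructions that put an address into a register
    std::size_t memory_accesses;  // ldr/str instructions using that register

    std::size_t total() const { return address_loads + memory_accesses; }
};

// Naive scheme: every access to an emulated register first loads the
// address of that register into a scratch host register.
EmitCount emit_naive(std::size_t register_accesses) {
    return {register_accesses, register_accesses};
}

// Scheme from the post: the base address is loaded once on block entry and
// each access becomes a single ldr/str with a small displacement.
EmitCount emit_with_base(std::size_t register_accesses) {
    return {1, register_accesses};
}

// Instructions saved per block under this model (needs at least one access).
std::size_t saved(std::size_t register_accesses) {
    assert(register_accesses >= 1);
    return emit_naive(register_accesses).total()
         - emit_with_base(register_accesses).total();
}
```

Under these assumptions a block with 10 register accesses shrinks from 20 emitted instructions to 11. The actual gain depends on how the real emitter materialises addresses.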

Reply 14 of 48, by wd

Rank DOSBox Author

Ok i'll see if we can maybe get both things into it, selectable (x86 has a very
limited register set, whereas for x64 there should be several more, so it might
be appealing to reg-cache those values).

Reply 15 of 48, by M-HT

Rank Newbie

x86 has two free registers - edi and ebp - both are preserved across calls. But x86 also has the original dynamic recompiler, which I think is faster than this one, so I see no need to optimize this one (but don't let it stop you).

x64 has more free registers - some are preserved across calls, some are not. But unlike ARM, x64 can access memory directly (using a memory address, not only a register), so the only difference would be shorter generated code. I'm not sure how much speed (if any) would be gained by using my changes on the x64 code emitter (the same is true for x86).

mips has free registers (crazyc is using them, and more than two), but I don't know enough about mips assembler to say whether my changes would help it or not.

Reply 16 of 48, by wd

Rank DOSBox Author

I was talking about the "use hostregs for emuregs" so mapping ~8 emulated
registers onto the two yet-free host registers. But as you say it could be done
the way you implemented it for arm.
The x86 drc is indeed superfluous but i don't want to abandon it yet as it
is the only thing i've got for testing (the x86 dynamic core is a lot faster but
not really portable).

Reply 17 of 48, by M-HT

Rank Newbie

Sorry, my mistake.

Actually I was considering your idea. One problem is, which registers to map onto host registers - my top choices were EAX, ESP and EBP. Another problem is, that whenever the registers are used outside the recompiled code, they have to be stored back to memory in case of reading and loaded back to host registers in case of writing - this seemed messy to maintain so I abandoned the idea (also my idea seemed cleaner).
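
The bookkeeping described above might look roughly like the following sketch. It is not code from DOSBox or from M-HT's patch: `HostCache`, `cache_load`, `cache_flush` and `outside_push` are hypothetical names, and the register file is a simplified stand-in.

```cpp
#include <cassert>
#include <cstdint>

// Simplified emulated register file (hypothetical, not the DOSBox layout).
struct EmuRegs {
    std::uint32_t gpr[8];
};

// Indices follow the usual x86 encoding order (EAX, ECX, EDX, EBX, ESP, ...).
enum { REG_EAX = 0, REG_ESP = 4, REG_EBP = 5 };

// One emulated register mapped onto a host register.
struct HostCache {
    std::uint32_t value;  // contents of the host register
    int emu_index;        // which emulated register it mirrors
    bool dirty;           // true if value differs from memory
};

// Load the host register from memory (needed after outside code WROTE it).
void cache_load(HostCache& c, const EmuRegs& r) {
    assert(c.emu_index >= 0 && c.emu_index < 8);
    c.value = r.gpr[c.emu_index];
    c.dirty = false;
}

// Store the host register to memory (needed before outside code READS it).
void cache_flush(HostCache& c, EmuRegs& r) {
    assert(c.emu_index >= 0 && c.emu_index < 8);
    if (c.dirty) {
        r.gpr[c.emu_index] = c.value;
        c.dirty = false;
    }
}

// Hypothetical helper outside the recompiled code that both reads and
// writes ESP, as a push-like operation would.
void outside_push(EmuRegs& r) {
    r.gpr[REG_ESP] -= 4;
}

std::uint32_t run_example() {
    EmuRegs regs{};
    regs.gpr[REG_ESP] = 0x1000;

    HostCache esp{0, REG_ESP, false};
    cache_load(esp, regs);

    esp.value -= 8;          // generated code adjusts ESP in the host register
    esp.dirty = true;

    cache_flush(esp, regs);  // store back: the helper reads ESP from memory
    outside_push(regs);
    cache_load(esp, regs);   // reload: the helper changed ESP in memory

    return esp.value;        // 0x1000 - 8 - 4
}
```

Every call site that touches a cached register needs a flush/reload pair like the one in `run_example`, which is the maintenance burden the post refers to.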

Reply 18 of 48, by wd

Rank DOSBox Author

M-HT wrote:

that whenever the registers are used outside the recompiled code, they have to be stored back to memory in case of reading and loaded back to host registers in case of writing

Yes, but ideally these spots are rare, like when exiting a block and when
interfacing with a few things like io-ports (as they don't trigger a block-exit).
But doesn't matter much, i didn't bother with it yet, and your code should be fine.

Reply 19 of 48, by M-HT

Rank Newbie

Well, not that rare if you count the emulated instructions that access memory using those registers, instructions which use the registers implicitly and, in the case of ESP, instructions that manipulate the stack - push, pop, etc.
But like you said, it doesn't matter at the moment.