VOGONS


Boxedwine update


First post, by danoon

Rank Member

I created my first GitHub release. It was hard to draw the line somewhere; there is always the next thing I want to work on, and testing for a release just isn't as much fun 😀

https://github.com/danoon2/Boxedwine/releases/tag/25R1.0

Simple web demo: http://boxedwine.org/v/25R1/test.html

Just to recap:

Boxedwine is an emulator that runs 32-bit Windows applications. It achieves this by running a 32-bit version of Wine, and emulating the Linux kernel and CPU. It is written in C++ with SDL and is supported on multiple platforms.

Boxedwine is open source and released under the terms of the GNU General Public License v2 (GPL).

It's been a long time since I posted here with binaries for people to try out. I think the last time, someone mentioned Nitemare 3D (1994), and since then it's been one of the games I test with.

If you have your favorite Windows 3.1 or Windows 95/98 game you would like to try, I would appreciate the feedback.

The last 2 years have been filled with pretty big refactorings. Previously, to generate fast code, each emulated Linux process had a single offset into 4GB of host memory address space. This allowed each emulated memory instruction to be translated into just 3 or fewer host instructions by the binary translator. It was super fast. But starting with Wine 6, they leaned heavily into shared memory. At first I resisted and tried to just generate a host hardware exception to catch this, but it became a mess and really slowed things down. Now it looks more like the DOSBox MMU: I have quick offset lookups for pages that allow direct reads/writes, and if that fails I exit the binary translator and run Boxedwine C++ code that calls the correct page handler. This means emulated Linux shared memory doesn't need to exit the binary translator code. But instead of 1-3 generated asm instructions for each emulated memory instruction, it now generates 10-20, with multiple reads into tables.
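A minimal C++ sketch of that DOSBox-style lookup. The names (`GuestMemory`, `PageHandler`, `ConstantPageHandler`), the 4K page size and the per-page `std::vector` tables are illustrative assumptions, not Boxedwine's actual classes; the real translator emits the equivalent check as host machine code inside each translated memory instruction rather than calling a C++ function.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed 32-bit guest address space split into 4K pages.
constexpr uint32_t kPageShift = 12;
constexpr uint32_t kPageSize = 1u << kPageShift;
constexpr uint32_t kPageCount = 1u << (32 - kPageShift);

// Slow-path interface: a handler knows how to service its page
// (plain RAM, shared memory, copy-on-write, ...).
struct PageHandler {
    virtual ~PageHandler() = default;
    virtual uint32_t readd(uint32_t address) = 0;
};

// Hypothetical handler for illustration: returns a fixed value and counts calls.
struct ConstantPageHandler : PageHandler {
    explicit ConstantPageHandler(uint32_t v) : value(v) {}
    uint32_t readd(uint32_t) override { ++calls; return value; }
    uint32_t value;
    int calls = 0;
};

struct GuestMemory {
    // Fast path: host pointer for pages that allow direct reads, nullptr otherwise.
    std::vector<uint8_t*> directRead = std::vector<uint8_t*>(kPageCount, nullptr);
    // Slow path: per-page handler used when the fast path cannot be taken.
    std::vector<PageHandler*> handlers = std::vector<PageHandler*>(kPageCount, nullptr);

    uint32_t readd(uint32_t address) {
        uint32_t page = address >> kPageShift;
        uint32_t offset = address & (kPageSize - 1);
        uint8_t* host = directRead[page];
        // Direct access only if the page allows it and all 4 bytes stay inside it.
        if (host != nullptr && offset <= kPageSize - sizeof(uint32_t)) {
            uint32_t value;
            std::memcpy(&value, host + offset, sizeof(value));
            return value;
        }
        // Leave the fast path and let the page's handler do the work.
        assert(handlers[page] != nullptr);
        return handlers[page]->readd(address);
    }
};
```

The fast path is a table read, a null check, a bounds check and a load; anything else (a null entry or a read straddling two pages) falls back to the page's handler.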

I also replaced my custom winex11.drv for Wine. I used to maintain my own version, so each time Wine changed theirs I had to change mine. This was fine up to about Wine 5. After that they really accelerated the amount of change they pushed into winex11.drv, due to Vulkan and Wayland support plus just normal changes. I decided to drop my custom winex11.drv and instead emulate X11. It took some time, but it wasn't as hard as I would have thought. I didn't have to implement or even allow a window manager; Wine has an option to decorate the windows itself, so I just used that.

One of the more fun apps I finally got running was Cinebench 11.5 (2010), the last Cinebench to support Win32. The fun part is that it runs well directly on my main Windows machine, so I can compare Boxedwine's binary translator against native execution. On my Intel Core i7-14700 with 28 threads, the machine scored 52.2 natively, and Boxedwine on the same machine scored 6.40. I believe the scores are linear, so I'm pretty happy that Boxedwine runs at about 12% of host speed for this test.
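Assuming, as the post does, that Cinebench 11.5 scores scale linearly with throughput, the 12% figure is simply the ratio of the two scores:

```latex
\frac{\text{Boxedwine score}}{\text{native score}} = \frac{6.40}{52.2} \approx 0.123 \approx 12\%
```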

https://github.com/danoon2/Boxedwine

Reply 1 of 4, by AdriftWithAdvic

Rank Newbie

Hey, just wanted to say cool project! I'm curious about how it works so well and would like to learn more about it. If you're interested in discussing this further, here's my email: floatingadvice152@gmail.com.

Reply 2 of 4, by jmarsh

Rank Oldbie
danoon wrote on Yesterday, 19:23:

At first I resisted and tried to just generate a host hardware exception to catch this, but it became a mess and really slowed things down. Now it looks more like the DOSBox MMU: I have quick offset lookups for pages that allow direct reads/writes, and if that fails I exit the binary translator and run Boxedwine C++ code that calls the correct page handler.

The way DOSBox does memory accesses always bothered me. Ignoring the fact that the page-fault handling is wrong by design (and so fragile enough to break in many circumstances), having to perform page-access checks for every single access is very sub-optimal compared to just catching an exception when things go wrong. But thanks to MS using a terrible unwinding method in their 64-bit ABI, exceptions are basically uncatchable when they occur in dynamically generated code.

Reply 3 of 4, by danoon

Rank Member
jmarsh wrote on Today, 20:52:

The way DOSBox does memory accesses always bothered me. Ignoring the fact that the page-fault handling is wrong by design (and so fragile enough to break in many circumstances), having to perform page-access checks for every single access is very sub-optimal compared to just catching an exception when things go wrong. But thanks to MS using a terrible unwinding method in their 64-bit ABI, exceptions are basically uncatchable when they occur in dynamically generated code.

By uncatchable, do you mean you can't wrap the call that runs the dynamic code in a try/catch? I haven't tried that.

For Boxedwine, in the exception handler I check whether the host instruction pointer is in generated code. If it's not, it might be a C++ try/catch, in which case I let it continue. If it is in generated code, I know there are only 2 places allowed to generate exceptions: jumping to new code and emulated memory reads/writes. In both cases I update the current emulated instruction pointer in the generated code before the exception can happen. That way the Boxedwine code is in a valid state during the exception, and the exception handler on Windows (Linux is a tiny bit different) just runs the current emulated instruction manually with the normal non-JIT CPU core. Then it returns to the dynamic code for the next emulated instruction.
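The handler's decision logic can be sketched as follows. This is not Boxedwine's code and it does not install a real signal handler or vectored exception handler; `CodeRange`, `GuestCpu`, `onHostFault` and the `interpretOne` callback (standing in for the normal non-JIT CPU core) are hypothetical names used only to show the order of checks described above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical host address range holding JIT-generated code.
struct CodeRange { uintptr_t begin; uintptr_t end; };

// Hypothetical guest CPU state. Generated code stores the guest EIP here
// before any instruction that is allowed to fault.
struct GuestCpu { uint32_t eip = 0; };

enum class FaultAction {
    ContinueSearch,  // not in generated code: let a C++ try/catch or the OS see it
    ResumeJit        // guest instruction was emulated; resume at the next one
};

bool inGeneratedCode(const std::vector<CodeRange>& jit, uintptr_t hostIp) {
    for (const CodeRange& r : jit) {
        if (hostIp >= r.begin && hostIp < r.end) return true;
    }
    return false;
}

// `interpretOne` runs the guest instruction at cpu.eip and advances cpu.eip.
FaultAction onHostFault(const std::vector<CodeRange>& jit, uintptr_t hostIp,
                        GuestCpu& cpu,
                        const std::function<void(GuestCpu&)>& interpretOne) {
    assert(interpretOne);
    if (!inGeneratedCode(jit, hostIp)) {
        return FaultAction::ContinueSearch;
    }
    // Only guest memory accesses and jumps to new code may fault, and both
    // have already synced cpu.eip, so the guest state is consistent here.
    interpretOne(cpu);
    return FaultAction::ResumeJit;
}
```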

For emulated memory read/write instructions, my goal was to handle them in a way that doesn't touch flags on x64. Saving and restoring flags just so I can do a cmp is really expensive. There are 2 parts to this: the page permissions, and making sure the access doesn't cross a page boundary.

For the page permission check, I read the host memory offset for the emulated page from a table and do a test read with it; if the offset is null, the read throws an exception.
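A sketch of the permission check, assuming a per-page table of host pointers where null means "no direct access". Real generated code just dereferences the entry and lets the host MMU fault; here a C++ exception (`GuestPageFault`, a hypothetical type) stands in for that hardware fault, so the sketch needs an explicit null test that the generated x64 code does not. `PermissionTable` is likewise a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

constexpr uint32_t kPageShift = 12;
constexpr uint32_t kPageSize = 1u << kPageShift;

// Stand-in for the hardware access violation the real generated code relies on.
struct GuestPageFault : std::runtime_error {
    explicit GuestPageFault(uint32_t a) : std::runtime_error("guest page fault"), address(a) {}
    uint32_t address;
};

// Host base address for each guest page, or nullptr when the page must not be
// accessed directly (unmapped, wrong permissions, or needing special handling).
struct PermissionTable {
    std::vector<uint8_t*> hostPage;
    explicit PermissionTable(size_t pages) : hostPage(pages, nullptr) {}

    uint32_t readd(uint32_t address) const {
        uint8_t* base = hostPage.at(address >> kPageShift);
        // Generated code has no compare here: a null base simply faults.
        if (base == nullptr) {
            throw GuestPageFault(address);
        }
        uint32_t offset = address & (kPageSize - 1);
        assert(offset <= kPageSize - sizeof(uint32_t));  // page crossing is a separate check
        uint32_t value;
        std::memcpy(&value, base + offset, sizeof(value));
        return value;
    }
};
```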

For the page boundary check, I have an optimization for systems that support 4K pages: on those systems it allocates every other page on the host, so a read/write that crosses a boundary throws an exception. So for the x64 binary translator, there are no branches for emulated read/write instructions and thus no flag saving. This 4K page optimization gave about a 20% performance boost in Quake 2.
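The guard-page trick can be checked with a little address arithmetic. This assumes a hypothetical layout in which guest page `n` lives at host offset `2 * n * 4096` and the host page after it is left inaccessible; Boxedwine's real layout may differ. Presumably the trick needs 4K host pages because the guard has to sit directly after each 4K guest page.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kPageSize = 4096;  // guest page size, equal to the host page size here

// Hypothetical layout: guest page n is mapped at host offset 2 * n * kPageSize,
// and the host page right after it is an inaccessible guard page.
constexpr uint64_t hostOffset(uint32_t guestAddress) {
    return (guestAddress / kPageSize) * 2 * kPageSize + guestAddress % kPageSize;
}

constexpr bool isGuardPage(uint64_t hostOff) {
    return (hostOff / kPageSize) % 2 == 1;
}

// An access of `size` bytes (1..kPageSize) would fault on real hardware
// exactly when its last byte lands in the guard page.
constexpr bool accessFaults(uint32_t guestAddress, uint32_t size) {
    return isGuardPage(hostOffset(guestAddress) + size - 1);
}

// The condition the x64 translator no longer has to test explicitly.
constexpr bool crossesGuestPage(uint32_t guestAddress, uint32_t size) {
    return guestAddress % kPageSize + size > kPageSize;
}

static_assert(!accessFaults(0x1FFC, 4), "last 4 bytes of a page stay in the page");
static_assert(accessFaults(0x1FFD, 4), "one byte spills into the guard page");
```

Because an access of at most 4096 bytes can spill at most into the following host page, hitting the guard is the same condition as crossing a guest page boundary, so the generated x64 code needs no compare or branch for it.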

On the 16K-page Mac M chips, since this is Arm64, I didn't use hardware flags, so it's not as expensive to branch to detect a page-crossing read/write. The branch in the generated code seems to be close to free, since it's almost never taken and the M chips seem to have good branch prediction.
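On Arm64 the same crossing test becomes an explicit compare-and-branch. A minimal sketch, assuming the guest keeps 4K pages even on a 16K-page host (so the check uses the guest page size) and a single contiguous host window (`hostBase`); `readd` and `slowRead` are hypothetical names, not Boxedwine's API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>

constexpr uint32_t kGuestPageSize = 4096;  // assumed guest page size, even on a 16K-page host
constexpr uint32_t kReadSize = sizeof(uint32_t);

// True when an access of `size` bytes starting at `address` spills into the next guest page.
constexpr bool crossesPage(uint32_t address, uint32_t size) {
    return (address & (kGuestPageSize - 1)) + size > kGuestPageSize;
}

// Shape of a translated 32-bit read: one compare-and-branch that is almost
// never taken, then a direct load from the host window.
uint32_t readd(const uint8_t* hostBase, uint32_t address,
               const std::function<uint32_t(uint32_t)>& slowRead) {
    assert(hostBase != nullptr);
    if (crossesPage(address, kReadSize)) {
        return slowRead(address);  // rare: page-crossing access handled in C++
    }
    uint32_t value;
    std::memcpy(&value, hostBase + address, sizeof(value));
    return value;
}
```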


Reply 4 of 4, by jmarsh

Rank Oldbie
danoon wrote on Today, 21:40:

By uncatchable, do you mean you can't wrap the call that runs the dynamic code in a try/catch? I haven't tried that.

Not on 64-bit Windows. The way unwinding is performed, every function must have unwind annotations (stored in a separate section of the binary) that describe how to unwind the stack for that function. If that metadata is missing, the OS gives up trying to unwind/find a handler and just terminates the app. There are functions to "inject" unwind metadata for dynamically generated code, but they're so cumbersome to use that it isn't worth the effort.