In other news, I sat down with one of the new ARM64-based M1 Apple Macbooks and got DOSBox-X to run on it as well. There are a few considerations for those machines, which I thought I'd share so SVN can compile for them too.
One is how to compile with SDL2. You're going to need to modify configure.ac and then run autogen.sh, because SDL2's configure script assumes that Darwin plus ARM means iOS (that you're compiling for the iPad or iPhone), which is wrong. Remove the part of the case statement that tries to match *arm*darwin*, leaving only the *ios* part.
The other is how to get dynamic core (dynrec) to work on ARM64. In its default state, dynamic core will crash because ARM-based Mac OS X enforces a policy known as "write-xor-execute", meaning that mprotect() will allow you to map read + execute, or read + write, but not read + write + execute. DOSBox-X was able to work around this by attempting the read+write+execute first (as SVN does), and if that fails, doing two mprotect() tests to determine whether the failure is because of "write-xor-execute" or because executable mappings are flat out not allowed. If it's W^X, then it sets a flag and maps the cache read+write for the initial dynamic core work before then mapping it read+execute. From that point on, whenever it needs to add a cache block, it remembers that W^X is in effect and mprotects the cache back to read+write during the modification before mapping it back to read+execute. With that modification, dynrec works on the Macbook just fine.
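Roughly, the fallback described above looks something like this. This is only a sketch with made-up names (cache, cache_size), not the actual DOSBox-X code, and it assumes the cache is an mmap()'d, page-aligned region:

#include <sys/mman.h>
#include <cstddef>

// Hypothetical sketch of the W^X detection and toggling described above.
static bool wxe_in_effect = false;

static bool cache_init_protection(void *cache, size_t cache_size) {
    // First try read+write+execute in one mapping, as SVN does.
    if (mprotect(cache, cache_size, PROT_READ | PROT_WRITE | PROT_EXEC) == 0)
        return true;

    // RWX refused: if RW and RX each succeed on their own, the refusal
    // is a write-xor-execute policy rather than a hard "no exec" failure.
    if (mprotect(cache, cache_size, PROT_READ | PROT_WRITE) == 0 &&
        mprotect(cache, cache_size, PROT_READ | PROT_EXEC) == 0) {
        wxe_in_effect = true;
        // Start out read+write for the initial dynamic core work.
        return mprotect(cache, cache_size, PROT_READ | PROT_WRITE) == 0;
    }
    return false; // executable memory is simply not available
}

// Wrapped around every later cache block modification:
static void cache_open_for_write(void *cache, size_t cache_size) {
    if (wxe_in_effect) mprotect(cache, cache_size, PROT_READ | PROT_WRITE);
}
static void cache_close_for_write(void *cache, size_t cache_size) {
    if (wxe_in_effect) mprotect(cache, cache_size, PROT_READ | PROT_EXEC);
}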
I recall, though I can't confirm, that ARM64-based Linux distributions for the Raspberry Pi have the same W^X policy.
Hopefully this information will help DOSBox SVN improve itself for these new environments.
I'm well aware x86 builds also run on the M1 Macbooks (as demonstrated by LGR on Twitter), but still, there's better performance to be had as native ARM OS X code.
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
Quick 'thank you' for this, Jon and the other contributors.
Neither VirtualBox nor VMPlayer would let me install Win98SE in a virtual machine on my Ryzen PC - both were rather faster, but both hit illegal operation / invalid page fault failures in Regsvr32 (and something else in VirtualBox), so the install never actually completed.
I've found the only way to install and run Windows 95 and Windows 98 in VirtualBox without crashing is to turn OFF the CPU-based virtualization extensions, forcing VirtualBox to use software emulation. VT-x, VirtualBox, and Windows 95/98 don't mix for some reason.
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
Which compiler do you use for building? If it works on a Pentium 4, it’s likely built for SSE2 (-mfpmath=sse in gcc, might be set by default in recent mingw releases).
If you compile with build-mingw-lowend, does it help?
That script was originally designed to enable compiling on lower end systems by disabling the MT32 emulation (which tended to use SSE instructions).
There is code in src/gui/render.cpp to conditionally use SSE2 to speed up previous/current frame comparisons, but that should be conditional on CPUID reporting SSE2.
Can you use a debugger to point at the code that is faulting on Pentium III systems?
I've never really tested DOSBox-X on anything below a Pentium 4 though.
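For reference, a runtime gate of the kind mentioned above (only taking the SSE2 path when CPUID reports it) could look roughly like this; an illustrative sketch, not the actual code in src/gui/render.cpp:

#include <cstdio>
#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <cpuid.h>
#endif

// SSE2 support is reported in CPUID leaf 1, EDX bit 26.
static bool cpu_has_sse2() {
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return (regs[3] & (1 << 26)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (edx & (1u << 26)) != 0;
#endif
}

int main() {
    std::printf("SSE2: %s\n", cpu_has_sse2() ? "yes" : "no");
    return 0;
}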
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
This typecast is optimized out by VS2019's compiler in Release builds, causing only the low 32 bits to be written in any case and a crash when executing the dynamically generated code.
Here is a hint about x86_64 builds running in Apple M1:
"The system prevents you from mixing arm64 code and x86_64 code in the same process. Rosetta translation applies to an entire process, including all code modules that the process loads dynamically."
This typecast is optimized out by VS2019's compiler in Release builds, causing only the low 32 bits to be written in any case and a crash when executing the dynamically generated code. DOSBox-X fixes the issue by explicitly masking the pointer by 0xFFFFFFFF for that test instead of relying on typecasting.
Casting to use sign/zero-extension is always preferable over masking because it puts the result in a different register (eliminating a move) to set up the comparison.
I've checked that code with values > 4GB before (by forcing the base address of the binary) and it worked, so it must be a VS regression.
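To illustrate the two styles being discussed, here is a hypothetical stand-in for the real check (deciding whether a pointer fits in the low 32 bits), not the actual dynrec code:

#include <cstdint>

// Cast-based form of such a test; per the posts above, a cast-based
// check was mis-optimized by VS2019 in Release builds.
static bool fits_in_32bits_cast(const void *p) {
    return (uintptr_t)(uint32_t)(uintptr_t)p == (uintptr_t)p;
}

// Mask-based form: the explicit 0xFFFFFFFF mask described above.
static bool fits_in_32bits_mask(const void *p) {
    return ((uintptr_t)p & 0xFFFFFFFFu) == (uintptr_t)p;
}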
The other is how to get dynamic core (dynrec) to work on ARM64. In its default state, dynamic core will crash because ARM-based Mac OS X enforces a policy known as "write-xor-execute", meaning that mprotect() will allow you to map read + execute, or read + write, but not read + write + execute. DOSBox-X was able to work around this by attempting the read+write+execute first (as SVN does), and if that fails, doing two mprotect() tests to determine whether the failure is because of "write-xor-execute" or because executable mappings are flat out not allowed. If it's W^X, then it sets a flag and maps the cache read+write for the initial dynamic core work before then mapping it read+execute. From that point on, whenever it needs to add a cache block, it remembers that W^X is in effect and mprotects the cache back to read+write during the modification before mapping it back to read+execute. With that modification, dynrec works on the Macbook just fine.
Constantly updating the mapping every time a block is written is slow.
Using two mappings, one using RW protection and the other using RX like this patch, is more efficient. QEMU recently adopted the same method referencing a presentation from Apple's head of security engineering.
The current ARM64 dynrec isn't really worth bothering with anyway, running the dyn_x86 core using emulation will give better performance.
(There is still a glaring bug regardless, being that the cache memory is allocated with malloc - meaning it is undefined behavior to use mprotect on it.)
How do you set up DOSBox-X to show integer scaled window in fullscreen? I can do that easily in original DOSBox but I can't get it to work here.
To be perfectly clear what I want to accomplish: I want to scale 320x200 game to 1280x800 so it's pixel perfect with even pixels and I want to display that 1280x800 window in 1920x1080 fullscreen with black borders around it. So far no matter what fullscreen resolution I set in the config file, or whether I use autofit or aspect ratio correction it never scales that way.
The official DOSBox lets me do it by using OpenGLnb and setting fullresolution=1280x800 in dosboxTex1.conf. If I do the same in X it just displays a 1280x800 window.
How do you set up DOSBox-X to show integer scaled window in fullscreen? I can do that easily in original DOSBox but I can't get it to work here.
To be perfectly clear what I want to accomplish: I want to scale 320x200 game to 1280x800 so it's pixel perfect with even pixels and I want to display that 1280x800 window in 1920x1080 fullscreen with black borders around it. So far no matter what fullscreen resolution I set in the config file, or whether I use autofit or aspect ratio correction it never scales that way.
The official DOSBox lets me do it by using OpenGLnb and setting fullresolution=1280x800 in dosboxTex1.conf. If I do the same in X it just displays a 1280x800 window.
To be fair, DOSBox-X disabled changing the video mode in fullscreen because modern displays seem to take their time to re-display the desktop on mode changes. They aren't the CRTs of old.
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
jmarsh wrote on 2020-11-28, 12:38:
This typecast is optimized out by VS2019's compiler in Release builds, causing only the low 32 bits to be written in any case and a crash when executing the dynamically generated code. DOSBox-X fixes the issue by explicitly masking the pointer by 0xFFFFFFFF for that test instead of relying on typecasting.
Casting to use sign/zero-extension is always preferable over masking because it puts the result in a different register (eliminating a move) to set up the comparison.
I've checked that code with values > 4GB before (by forcing the base address of the binary) and it worked, so it must be a VS regression.
The other is how to get dynamic core (dynrec) to work on ARM64. In its default state, dynamic core will crash because ARM-based Mac OS X enforces a policy known as "write-xor-execute", meaning that mprotect() will allow you to map read + execute, or read + write, but not read + write + execute. DOSBox-X was able to work around this by attempting the read+write+execute first (as SVN does), and if that fails, doing two mprotect() tests to determine whether the failure is because of "write-xor-execute" or because executable mappings are flat out not allowed. If it's W^X, then it sets a flag and maps the cache read+write for the initial dynamic core work before then mapping it read+execute. From that point on, whenever it needs to add a cache block, it remembers that W^X is in effect and mprotects the cache back to read+write during the modification before mapping it back to read+execute. With that modification, dynrec works on the Macbook just fine.
Constantly updating the mapping every time a block is written is slow.
Using two mappings, one using RW protection and the other using RX like this patch, is more efficient. QEMU recently adopted the same method referencing a presentation from Apple's head of security engineering.
The current ARM64 dynrec isn't really worth bothering with anyway, running the dyn_x86 core using emulation will give better performance.
(There is still a glaring bug regardless, being that the cache memory is allocated with malloc - meaning it is undefined behavior to use mprotect on it.)
Constantly updating the map is fast enough on the Macbook so far. I seem to get a 5-10% CPU load reduction according to top.
It looks like there is a Darwin/mach-specific task remap function to make exactly that kind of split mapping, because Apple themselves use it on iOS to JIT compile JavaScript. However, portability is a concern and there are Linux systems with the same restriction, so I may have to either implement both or just use the shmget() shared memory file handle mmap trick that Linux supports.
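A rough sketch of that shared-memory double-mapping trick, using POSIX shm_open rather than SysV shmget for brevity (names here are made up, and as noted later in the thread, macOS and some hardened Linux setups refuse PROT_EXEC on such shared-memory mappings, which is where the mach remap comes in):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Map the same cache backing store twice: one writable view for the
// code generator, one executable view for running the generated code.
struct DualMapping { void *rw; void *rx; };

static bool dual_map_cache(size_t size, DualMapping *out) {
    // Backing object: POSIX shared memory, unlinked immediately.
    int fd = shm_open("/dosbox-x-cache-demo", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return false;
    shm_unlink("/dosbox-x-cache-demo");
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return false; }

    void *rw = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *rx = mmap(NULL, size, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
    close(fd);
    if (rw == MAP_FAILED || rx == MAP_FAILED) {
        if (rw != MAP_FAILED) munmap(rw, size);
        if (rx != MAP_FAILED) munmap(rx, size);
        return false;
    }

    out->rw = rw;  // emit/patch code here...
    out->rx = rx;  // ...and execute it from here, at the same offsets
    return true;
}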
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
This typecast is optimized out by VS2019's compiler in Release builds, causing only the low 32 bits to be written in any case and a crash when executing the dynamically generated code. DOSBox-X fixes the issue by explicitly masking the pointer by 0xFFFFFFFF for that test instead of relying on typecasting.
Casting to use sign/zero-extension is always preferable over masking because it puts the result in a different register (eliminating a move) to set up the comparison.
I've checked that code with values > 4GB before (by forcing the base address of the binary) and it worked, so it must be a VS regression.
Masking should be optimized properly by the compiler if the mask accomplishes the same thing; at least GCC seems to be smart enough about it.
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
Here is a hint about x86_64 builds running in Apple M1:
"The system prevents you from mixing arm64 code and x86_64 code in the same process. Rosetta translation applies to an entire process, including all code modules that the process loads dynamically."
Yes, obviously. I doubt anyone would expect to be able to do something of this kind. You also can't mix i686 and x86_64 code inside one process.
As jmarsh and I have been banging our heads against the ARM64 code, there is also a big problem with Apple's tightened security on the M1. While the compiled binary will work nicely on your own machine, an actual app bundle needs entitlements and codesigning with a developer account to make it work on other people's machines. This was not fun to test again...
Based on Re: dynrec vs. secure platforms - opinions wanted we've been testing https://github.com/DominusExult/buildbot/blob … dosbox_wx.patch (with the entitlements at https://github.com/DominusExult/buildbot/blob … itlements.plist). In the end it seems the way the patch handles the cache tempfile is never going to work on Apple Silicon from an app bundle.
Because the Apple M1's emulation will not mix and match architectures, it may be worthwhile to improve the Voodoo emulation, perhaps even add Banshee. Here is also a 3dfx Voodoo patch against previous code in dosbox-x for hints. There was a recent update to that code in dosbox-x that reintroduced the faulty mipmapping emulation for the non-OpenGL path. It may be worthwhile to pursue other patches given the above comments.
I already modified SDL1.2 in-tree to do almost exactly that for Big Sur to fix the lack of audio. It took a bit of browsing around Apple's developer site, but I figured it out in about half an hour.
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.
In that case, it seems they deprecated the old API in favor of a new one by renaming the functions. 😀 It seems the native Big Sur build has no major practical advantage yet over the Intel build running under the emulation layer, although your native build is important to maintain even if that is the case.
It looks like there is a Darwin/mach-specific task remap function to make exactly that kind of split mapping, because Apple themselves use it on iOS to JIT compile JavaScript. However, portability is a concern and there are Linux systems with the same restriction, so I may have to either implement both or just use the shmget() shared memory file handle mmap trick that Linux supports.
The vm_remap call is only required on macOS because it pretty much disallows mmap'ing anything (shared memory, a tempfile, etc.) as executable; mmap'ing a shared file at two different addresses is otherwise the portable *nix way to map the same memory at two locations.
Shared memory doesn't work because Apple mounts it noexec (and several Linux distributions are doing the same).
The patch from the other thread has been tested and works on SELinux, although it has a few cosmetic issues that need fixing (and could be made more secure by eliminating the offset variable).
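For the macOS side, here is a sketch of roughly how the mach remap being discussed can alias a region, using the mach_vm_remap variant. This is illustrative only (hypothetical names, minimal error handling), and whether the RX protection is actually permitted also depends on the entitlements/codesigning situation described above:

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <sys/mman.h>
#include <cstddef>

// Create a read+execute alias of an existing read+write JIT cache
// region within our own task. 'rw_base' and 'size' are hypothetical.
static void *make_rx_alias(void *rw_base, size_t size) {
    mach_vm_address_t rx_addr = 0;             // let the kernel pick the address
    vm_prot_t cur = VM_PROT_NONE, max = VM_PROT_NONE;

    kern_return_t kr = mach_vm_remap(mach_task_self(), &rx_addr, size, 0,
                                     VM_FLAGS_ANYWHERE,
                                     mach_task_self(),
                                     (mach_vm_address_t)rw_base,
                                     FALSE /* share, don't copy */,
                                     &cur, &max, VM_INHERIT_NONE);
    if (kr != KERN_SUCCESS) return NULL;

    // Restrict the alias to read+execute; the original view stays read+write.
    if (mprotect((void *)rx_addr, size, PROT_READ | PROT_EXEC) != 0) return NULL;
    return (void *)rx_addr;
}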
You also can't mix i686 and x86_64 code inside one process.
You can on windows. All 32-bit apps get loaded with a 64-bit code segment accessible to them, for communicating with the kernel (it's how WoW64 works). Since the segment value is hardcoded and x86 has unprivileged instructions to get a segment's base and limit, it's trivial to find it and map new code into it...
You also can't mix i686 and x86_64 code inside one process.
You can on windows. All 32-bit apps get loaded with a 64-bit code segment accessible to them, for communicating with the kernel (it's how WoW64 works). Since the segment value is hardcoded and x86 has unprivileged instructions to get a segment's base and limit, it's trivial to find it and map new code into it...
Back in the Windows 95/98/ME days there were hidden ordinal entry points in KERNEL32.DLL that allowed Win32 code to call Win16 functions. Not everything was 32-bit at the time, so the API was there to let Win32 call down to the Win16 underworld when needed. So at least under Windows 9x/ME, you *could* mix 16-bit and 32-bit code, or at least call 16-bit code.
On the Windows NT kernel side of things, a 16-bit Windows 3.x application running under NTVDM.EXE could easily make calls to Win32 code using the WOW interface.
DOSBox-X project: more emulation better accuracy. DOSLIB and DOSLIB2: Learn how to tinker and hack hardware and software from DOS.