VOGONS


MSYS2/mingw-w64 for Win64 DOSBox

Topic actions

Reply 20 of 37, by fr500

User metadata
Rank Newbie
Rank
Newbie
kjliew wrote:
I have made some progress in debugging this and was able to start WIN64 DOSBox with dynarec core into the DOSBox's DOS prompt. H […]
Show full quote

I have made some progress in debugging this and was able to start WIN64 DOSBox with dynarec core into the DOSBox's DOS prompt. However, launching a DOS program still ended up in segmentation fault.
Summarizing the issues so far:
1. In gen_load_param_reg, the encoding of "mov R8, reg&7" and "mov R8, reg&7" are incorrect. From GDB disassembly, they are actually "mov reg, R8" and mov reg, R9".
2. For WIN64 ABI, space is allocated on the call stack as a shadow store for callees to save those registers. There are cases that gen_call_function_setup/raw were thrashing the return address in stack, and this was a major issue for segmentation fault. One of the instruction is the "INT #imme8".

With those issues fixed, WIN64 DOSBox would boot up into the prompt on dynrec core. I believe there are more cases of stack thrashing for decoding other instructions. It is very tedious to track each of them down. Hopefully, this would give some insight to the DOSBox devs who are familiar with dynrec core to deal with issue #2 above once and for all.

Correction:
Actually, not segmentation fault, it was a hung. That's what make it tough to debug. Otherwise, segmentation fault would have landed into GDB and stack trace could be helpful sometimes....

Hey, do you have a diff?
Have you made further progress?

Reply 21 of 37, by PgrAm

User metadata
Rank Newbie
Rank
Newbie

I believe I have a solution for issue #1, gen_mov_regs will not correctly generate a mov where r8 - r15 are the destination register. instead I made something like this:

// move a full register from reg_src to reg_dst, where dst is r8-r15 and src is rax-rdi
static void gen_mov_regs_reverse(HostReg reg_dst, HostReg reg_src)
{
cache_addb(0x89); // mov reg_dst,reg_src
cache_addb(0xc0 + (reg_src << 3) + reg_dst);
}

// load a host-register as param'th function parameter
static void INLINE gen_load_param_reg(Bitu reg,Bitu param) {
// move a register into a 64bit param reg, {inputregs}!={outputregs}
switch (param) {
case 0: // mov param1,reg&7
gen_mov_regs(FC_OP1,reg&7);
break;
case 1: // mov param2,reg&7
gen_mov_regs(FC_OP2,reg&7);
break;
#if defined (_MSC_VER)
case 2: // mov r8,reg&7
cache_addb(0x49);
gen_mov_regs_reverse(0,reg&7);
break;
case 3: // mov r9,reg&7
cache_addb(0x49);
gen_mov_regs_reverse(1,reg&7);
break;
#else
case 2: // mov rdx,reg&7
gen_mov_regs(HOST_EDX,reg&7);
break;
case 3: // mov rcx,reg&7
gen_mov_regs(HOST_ECX,reg&7);
break;
#endif
default:
E_Exit("R(eg) >4 params unsupported");
break;
}
}

This appears to generate the code correctly but the issue of stack trashing remains. (I'm working with MSVC btw), this gets me to the prompt but the prompt is unresponsive, any attempt to load a program (via autoexec), results in a segfault.

Currently working on new DOS game, Chuck Jones: Space Cop of the Future : https://chuckjonesdevblog.wordpress.com

WARNING: contains rocket powered El Caminos

Reply 22 of 37, by PgrAm

User metadata
Rank Newbie
Rank
Newbie

more progress, prompt is responding after I fixed gen_call_function_setup to allocate windows' required shadow space. Here's my solution:

// generate a call to a function with paramcount parameters
// note: the parameters are loaded in the architecture specific way
// using the gen_load_param_ functions below
static Bit64u INLINE gen_call_function_setup(void * func,Bitu paramcount,bool fastcall=false) {

// align the stack
cache_addb(0x48);
cache_addw(0xc48b); // mov rax,rsp

cache_addb(0x48);
cache_addw(0xec83); // sub rsp,0x08
cache_addb(0x08); // 0x08==return address pushed onto stack by call

cache_addb(0x48);
cache_addw(0xe483); // and esp,0xfffffffffffffff0
cache_addb(0xf0);

cache_addb(0x48);
cache_addw(0xc483); // add rsp,0x08
cache_addb(0x08);

// stack is 16 byte aligned now

cache_addb(0x50); // push rax (==old rsp)

#if defined (_MSC_VER)
cache_addb(0x48);
cache_addw(0xec83); // sub rsp,0x20
cache_addb(0x20); // allocate windows shadow space
#endif

// returned address relates to where the address is stored in gen_call_function_raw
Bit64u proc_addr=(Bit64u)cache.pos-4;

// Do the actual call to the procedure
cache_addb(0x48);
cache_addb(0xb8); // mov reg,imm64
cache_addq((Bit64u)func);

cache_addw(0xd0ff);

#if defined (_MSC_VER)
cache_addb(0x48);
cache_addw(0xc483); // add rsp,0x20
cache_addb(0x20); // allocate windows shadow space
#endif

// restore stack
cache_addb(0x5c); // pop rsp

return proc_addr;
}

Needs more work though, it immediately segfaults when I try to run a real program, there may be more going on

Currently working on new DOS game, Chuck Jones: Space Cop of the Future : https://chuckjonesdevblog.wordpress.com

WARNING: contains rocket powered El Caminos

Reply 23 of 37, by kjliew

User metadata
Rank Oldbie
Rank
Oldbie

@PgrAm @fr500
Attached the patch for my earlier debugging of dynrec CPU core on WIN64 ABI. I am on MSYS2/mingw-x86_64.

This patch fixes the segmentation faults so far, but the core is still broken. Segmentation faults are good for debugging because gdb stack trace can tell some clues on them. After the patch, DOSBox will just hang, some time the console will spit out lots of illegal read/write memory location, but no more faults. This makes debugging very difficult. I am not sure other than arguments passing during function invocation, what else WIN64 ABI would make a difference.

Update: Remove partial fix. See subsequent post for the fix which actually works.

Last edited by kjliew on 2018-11-09, 02:12. Edited 1 time in total.

Reply 24 of 37, by PgrAm

User metadata
Rank Newbie
Rank
Newbie

Thanks @kjliew!

I've gotten a bit further with fixing some issues, (eg. gen_load_param_mem generated incorrect prefixes), Now I'm able to run some simple programs. Succeeded with a program that sets the screen to mode 13h then exits.

Now I will proceed with making some simple assembly language programs as test cases for further debugging to isolate problem areas. More complex programs appear to get stuck in some kind of infinite loop and some of my watcom applications detected a stack overflow.

Here is my diff (I changed _MSC_VER to _WIN64 so hopefully it'll work with mingw64)

Attachments

  • Filename
    risc_x64.diff
    File size
    4.74 KiB
    Downloads
    76 downloads
    File license
    Fair use/fair dealing exception

Currently working on new DOS game, Chuck Jones: Space Cop of the Future : https://chuckjonesdevblog.wordpress.com

WARNING: contains rocket powered El Caminos

Reply 25 of 37, by kjliew

User metadata
Rank Oldbie
Rank
Oldbie

@PgrAm
Thanks, PgrAm!
I got it working finally!!! Each of us having missed 1 part of the fix. By merging our debugging results, we made it! My fix which yours was missing is coded in GCC way though. You need to adapt it for VS, should not be too difficult.

I got PCPBENCH.EXE a DOS4GW running nicely. Will check out more games.

Attachments

  • Filename
    win64_dynrec-2.diff
    File size
    6.69 KiB
    Downloads
    78 downloads
    File comment
    WIN64 working core_dynrec
    File license
    Fair use/fair dealing exception

Reply 26 of 37, by PgrAm

User metadata
Rank Newbie
Rank
Newbie

@kjliew, Awesome!

I applied your changes and came up with a diff that should work with both compilers, I can confirm these changes work in my debug build, however they segfault in release mode. What optimization switches are you compiling with?

I'm also having trouble with adlib detection, seems to work randomly 50% of the time (when using dynrec64).

Attachments

  • Filename
    risc_x64.diff
    File size
    8.47 KiB
    Downloads
    74 downloads
    File comment
    mostly working win64 dynrec
    File license
    Fair use/fair dealing exception

Currently working on new DOS game, Chuck Jones: Space Cop of the Future : https://chuckjonesdevblog.wordpress.com

WARNING: contains rocket powered El Caminos

Reply 27 of 37, by kjliew

User metadata
Rank Oldbie
Rank
Oldbie
PgrAm wrote:

I can confirm these changes work in my debug build, however they segfault in release mode. What optimization switches are you compiling with?

I am getting similar results with different optimization level.
With -O0, so far everything worked. I was also testing for 3Dfx capabilities using Tomb Raider 1 and Blood. They all worked.
With -Og, so far everything worked. It measured ~2x faster than -O0.
With -O1 or -O2, simple DOS .COM work. Most would segmentation fault.

GDB trace showed MOV instruction read from memory address for eg. mov rax, [rsi + rbx * 4], and the address did not look correct. One of the registers is '0' and the memory location is too low for a WIN64 application.

Update: -Og work after fixing the gen_call_function_setup() to make sure stack remain 16-byte aligned. So just ignore the paramcount and always allocate stack space for 4 args.

Reply 28 of 37, by PgrAm

User metadata
Rank Newbie
Rank
Newbie

@kjliew

Think I solved the last problem, windows ABI expects the callee to save RSI, but *nix does not. Simply modified gen_run_code to handle this and BAM! my release build works!

static void gen_run_code(void) {
cache_addb(0x53); // push rbx
#ifdef _WIN64
cache_addb(0x57); // push rdi
cache_addb(0x56); // push rsi
#endif
cache_addw(0xd0ff+(FC_OP1<<8)); // call
#ifdef _WIN64
cache_addb(0x5e); // pop rsi
cache_addb(0x5f); // pop rdi
#endif
cache_addb(0x5b); // pop rbx
}


EDIT: you gotta save RDI too

Adlib detection is still spotty though not sure why yet

Final diff attached

Attachments

  • Filename
    risc_x64.diff
    File size
    8.87 KiB
    Downloads
    96 downloads
    File comment
    Working Windows x64 dynrec core patch
    File license
    Fair use/fair dealing exception
Last edited by PgrAm on 2018-11-09, 16:28. Edited 1 time in total.

Currently working on new DOS game, Chuck Jones: Space Cop of the Future : https://chuckjonesdevblog.wordpress.com

WARNING: contains rocket powered El Caminos

Reply 29 of 37, by kjliew

User metadata
Rank Oldbie
Rank
Oldbie

@PgrAm
Awesome work! I can confirm now my GCC -O2 optimized works. PCPBENCH showed another 20% improvement from -Og. So that finally concluded the Win64 core_dynrec support that made both Windows and Linux 64-bit support at parity.
I think the adlib detection is normal for very fast dynamic cpu core. I think GODS and DUNE2 would also fail MT32 detection when running dynamic core at max 105%.

@QBix
DOSBox devs, please include this fix into SVN.

Reply 30 of 37, by kjliew

User metadata
Rank Oldbie
Rank
Oldbie

@PgrAm I think you had included a redundant windows shadow space allocation in gen_call_function_setup().

Only the this part is needed:

@@ -398,6 +411,12 @@ static Bit64u INLINE gen_call_function_s

cache_addb(0x50); // push rax (==old rsp)

+#if defined (_WIN64)
+ cache_addb(0x48);
+ cache_addw(0xec83); // sub rsp,0x20
+ cache_addb(0x20); // allocate windows shadow space
+#endif
+
// returned address relates to where the address is stored in gen_call_function_raw
Bit64u proc_addr=(Bit64u)cache.pos-4;

@@ -408,6 +427,12 @@ static Bit64u INLINE gen_call_function_s

cache_addw(0xd0ff);

+#if defined (_WIN64)
+ cache_addb(0x48);
+ cache_addw(0xc483); // add rsp,0x20
+ cache_addb(0x20); // allocate windows shadow space
+#endif
+
// restore stack
cache_addb(0x5c); // pop rsp

Reply 31 of 37, by PgrAm

User metadata
Rank Newbie
Rank
Newbie

@kjliew Nice catch, but its actually just the comment is wrong I think, the second one should say "deallocate" instead of allocate. Now that you mention it though, I realize that I could optimize that a bit and save an instruction, right after the function call by restoring the stack pointer directly(which will deallocate the shadow space too) instead of using pop.

#if defined (_WIN64)
//restore our original stack pointer
cache_addd(0x24648b48); //mov rsp, [rsp+0x20]
cache_addb(0x20);
#else //system v abi
// restore stack
cache_addb(0x5c); // pop rsp
#endif

return proc_addr;

EDIT:

Took a look at it again and found I could optimize it even more (saves two additional instructions), here's the whole function:

// generate a call to a function with paramcount parameters
// note: the parameters are loaded in the architecture specific way
// using the gen_load_param_ functions below
static Bit64u INLINE gen_call_function_setup(void * func,Bitu paramcount,bool fastcall=false) {

// align the stack
cache_addb(0x48);
cache_addw(0xc48b); // mov rax,rsp

cache_addb(0x48);
cache_addw(0xec83); // sub rsp,0x08
#if defined (_WIN64)
cache_addb(0x28);// 0x08==return address & shadow space pushed onto stack by call
#else
cache_addb(0x08); // 0x08==return address pushed onto stack by call
#endif

cache_addb(0x48);
cache_addw(0xe483); // and esp,0xfffffffffffffff0
cache_addb(0xf0);

#if defined (_WIN64)
//save our stack pointer
cache_addd(0x24448948); //mov [rsp+0x20], rax (==old rsp)
cache_addb(0x20);
#else
cache_addb(0x48);
cache_addw(0xc483); // add rsp,0x08
cache_addb(0x08);

// stack is 16 byte aligned now
cache_addb(0x50); // push rax (==old rsp)
#endif

// returned address relates to where the address is stored in gen_call_function_raw
Bit64u proc_addr=(Bit64u)cache.pos-4;

// Do the actual call to the procedure
cache_addb(0x48);
cache_addb(0xb8); // mov reg,imm64
cache_addq((Bit64u)func);
cache_addw(0xd0ff);

#if defined (_WIN64)
//restore our original stack pointer
cache_addd(0x24648b48); //mov rsp, [rsp+0x20]
cache_addb(0x20);
#else
// restore stack
cache_addb(0x5c); // pop rsp
#endif

return proc_addr;
}

Currently working on new DOS game, Chuck Jones: Space Cop of the Future : https://chuckjonesdevblog.wordpress.com

WARNING: contains rocket powered El Caminos

Reply 32 of 37, by kjliew

User metadata
Rank Oldbie
Rank
Oldbie

@PgrAm To ease upstream acceptance and enhance future patchability, I usually refrain from unnecessarily touching the underlying code and keep the changes simple. Here's my final patch for your reference. It will be up to DOSBox devs to decide how to integrate the changes. Our combined debugging efforts had provided all the answers to this long dragging issue of 64-bit build for Windows. Cheers 😀 !

Attachments

  • Filename
    risc_x64.diff
    File size
    6.92 KiB
    Downloads
    82 downloads
    File license
    Fair use/fair dealing exception

Reply 33 of 37, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

I see that you changed the defines. Although that should be safe given that it is the x64 core, which shouldn't get compiled on win32.
Looking over the patch, it the ADD and friends changes are a bit surprising, we could have made that depend on FC1_OP and such, but given that there only 2 versions, it might be too much to do that.

Water flows down the stream
How to ask questions the smart way!

Reply 34 of 37, by Qbix

User metadata
Rank DOSBox Author
Rank
DOSBox Author

commited these changes, a bit modified as that is how I had those functions in my local tree.

Water flows down the stream
How to ask questions the smart way!

Reply 35 of 37, by jmarsh

User metadata
Rank Oldbie
Rank
Oldbie

Revisiting this; shouldn't it be possible to move all the stack adjustment stuff (the win64 shadow space and 16-byte alignment) into gen_run_code so that it only gets done once instead of for every function call? Alignment should be a piece of cake, the shadow space might be a bit trickier because the return address gets pushed in the way when the CALL FC_OP1 is done...

Reply 36 of 37, by jmarsh

User metadata
Rank Oldbie
Rank
Oldbie

Turn out yes, it is possible without too much effort. See attached patch.
I tweaked a few other things too, some cosmetic cleanups and relaxed some 64-bit register usage to 32-bit (for smaller code size). There's still more optimizing that could be done e.g. implementing the DRC_USE_REGS_ADDR/DRC_USE_SEGS_ADDR stuff but I've had enough of dealing with x64 machine code for a while.

Haven't tested on windows but on linux my crappy first-gen atom sees an improvement in PCPBENCH from 5.1 to 5.4fps.

Attachments

  • Filename
    risc_x64.diff
    File size
    11.75 KiB
    Downloads
    62 downloads
    File license
    Fair use/fair dealing exception

Reply 37 of 37, by Dominus

User metadata
Rank DOSBox Moderator
Rank
DOSBox Moderator

I've tested on OS X with output opengl (surface is insanely slow) and these changes give me five more fps in pcpbench (98.5) than SVN (93.5) and ten more than 0.74-2 (88.6)

Windows 3.1x guide for DOSBox
60 seconds guide to DOSBox
DOSBox SVN snapshot for macOS (10.4-11.x ppc/intel 32/64bit) notarized for gatekeeper