VOGONS


Reply 20 of 50, by peterfirefly

Rank Newbie

superfury, is it possible to build your code without installing your submodules? I am not going to run 'make install' on some random code (and certainly not as root!).

Is it possible to build your code without using the SDL2 code you put into one of your submodules? That is, can I use whatever upstream SDL2 version I have installed instead?

Reply 21 of 50, by peterfirefly

Rank Newbie
peterfirefly wrote on 2025-02-18, 12:48:
superfury wrote on 2025-02-18, 12:24:
danoon wrote on 2025-02-18, 00:47:

As for performance, I tried profilers and they rarely lead me to any thing useful. Sometimes I would just randomly break the program to see what it was doing. If it ended up in the same spot a lot, I would take a closer look at it. But I think you are right to focus on memory. Most of my biggest performance improvements were around memory. Of course the other big one was x86 flag calculation.

That is interesting that PSP Homebrew doesn't support c++. I started out with Java then moved to C for my project. I moved to c++ maybe 10 years ago. It was a big effort.

It could be because I'm using the normal C compiler though (CFLAGS etc instead of CXXFLAGS).

GCC really doesn't like to be invoked as 'gcc' for C++ code. The reason is that it is really a wrapper that invokes various other programs, and it generates different command lines for them depending on how it is invoked. For C++ it adds extra flags plus extra include/library paths. It is possible to get those right manually, but 1) it is not easy and 2) it isn't stable across versions.

It seems like PSP Homebrew supports Rust. It would be really weird if it didn't support C++.

Another gcc (Linux, really) follow up...

If you are building for Linux, you might be leaving a little performance on the table by not using 'static' on functions that are only used from within a compilation unit.

Linux uses ELF and ELF supports symbol interposing. That means that all functions by default can be overridden by loading a dynamic library later. This can be done from the command line using the LD_PRELOAD environment variable, for example.

It is really useful for logging/debugging and in certain cases for bug fixing.

Most of the time, though, it just makes the code a little bit slower and bigger.

It works by having all non-static function calls go through a table of function pointers. That is a little inefficient but the big killer is that it prevents inlining.

Play around with a disassembler to see the difference in the generated code 😀

#include <stdio.h>

#if 1
# define ST static
#else
# define ST
#endif

int x = 27; // perhaps make this static as well...

ST int g()
{
    return x;
}

void f()
{
    printf("%d\n", g());
}

Reply 22 of 50, by superfury

Rank l33t++

The common emulator framework is mandatory. The tools repository is used for PSP compilation only (basically just its Makefile and some Windows batch files to call make and copy to emulator directories etc.).
The SDL repository is simply there to optionally compile it with Bitbucket's pipelines.

So for a basic Linux build, all you need is the commonemuframework repository, but not the tools repository inside of it.

The commonemuframework repository contains the basic SDL/SDL2/SDL3 wrappers and various I/O support UniPCemu uses. It's also used by the other project I started it with: a Game Boy emulator (based on https://imrannazar.com/GameBoy-Emulation-in-JavaScript) that I worked on before starting UniPCemu, whose names partly remain in various function naming (like wb and rb etc.) and in the basic loop stuff. I never got the Game Boy emulation fully running (just some sprites rendered when selected manually on all glyph locations, probably a Game Boy GPU framebuffer issue somehow). I basically use that repository to keep all generic stuff that isn't emulator-specific (like the main loop, I/O of sound, video, mice, keyboard, other input support, text rendering etc.). All specific hardware (CPU, video cards, audio cards, other emulated hardware), on the other hand, is inside the main repository.
The SDL repository commits are actually straight dumps of various SDL2 versions' .zip or .tar.gz source archives (with the root directory extracted into it). Those are straight from the release page back on libsdl.org during their hg days (before they switched to git). Nowadays you can get them from the git repository by checking out a specific version, or from the release page's source .tar.gz for a version (taking the contents of the folder inside it).

So basically:
- The UniPCemu repo handles the x86 emulation and its hardware.
- commonemuframework handles all I/O and some generic support functions.
- tools (inside commonemuframework) handles PSP compiler support (optional, not required for non-PSP building). Its makefile is called from the 'psp' target inside commonemuframework's PSP makefile (like "make psp build" on the command line).
- commonemuframework-SDL2 is a dump of some official SDL2 source releases (from libsdl.org) for Bitbucket's pipelines (disabled right now inside bitbucket-pipelines.yml and set to manual mode). If enabled, or run manually from the pipeline, it's used to build a Linux executable using SDL2 from said subrepository, using Bitbucket's pipeline system (in multiple steps).

So just the main UniPCemu and commonemuframework (without tools inside it) are required to be checked out to build for Linux. Originally it was all in one repository, but I split it up so it could be used in both my emulators on multiple platforms with ease (and so both get bugfixes and improvements easily; hence why I called it a framework, a bit like .NET if you think about it). I can assure you that I wrote all the code in both those repositories myself, although the MD5 code is adjusted from a generic codebase online, and the PSP makefile inside the tools repository is taken from the devkit, with split build directories and a neater display of compiler commands added by me. You can in theory even replace it with the official PSPSDK's build.mak, although all object files will then end up in the source code directories and a lot more compiler commands become visible.
The common emulator framework also handles stuff that's commonly used, like most SDL1/2 makefiles with all required stuff.
On Linux you'd need to run configure etc. first, as the default Makefile is configured for MSYS2/MinGW on Windows. That basically configures it for your build system (replaces paths used, compiler commands etc.). It's the basic autogen.sh, configure, make combination, although make will need the 'linux' target so the correct Linux-compatible makefile is used.

The common emulator framework also supplies most makefiles for the specific platforms (Android, Linux, PSP(partly, a wrapper for tools/build.mak), Windows(mingw/msys2), switch, vita).

So I keep the stuff that's platform or SDL-specific inside the common emulator framework and the main emulator code (that is called from commonemuframework's main.c) in the root of the UniPCemu repository.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 23 of 50, by superfury

Rank l33t++

OK. Baresifter either finished prematurely or crashed (it reset the CPU).

The last entry logged into its log is:

26:12:21:23.06736: EXC 0E OK | 0F A8

So somewhere from opcode 0F A9 and up it crashed somehow (which it shouldn't; it didn't use to do that). It ran for 26 hours so far it seems (26 hours, 12 minutes, 21 seconds, 23/100ths of a second and 673.6uS from what I can see).


Reply 24 of 50, by peterfirefly

Rank Newbie
superfury wrote on 2025-02-19, 18:54:

The common emulator framework is mandatory. The tools repository is used for PSP compilation only (basically just it's Makefile and some windows batch files to call make and copy to emulator directories etc.).
The SDL repository is simply there to optionally compile it with bitbucket's pipelines.

So for basic linux, all you need is the commonemuframework repository, but not the tools repository inside of it.

peterfirefly wrote on 2025-02-19, 06:56:

superfury, is it possible to build your code without installing your submodules? I am not going to run 'make install' on some random code (and certainly not as root!).

Is it possible to build your code without using the SDL2 code you put into one of your submodules? That is, can I use whatever upstream SDL2 version I have installed instead?

So how do I build it without installing your commonemuframework library/header files? Will I have to make a separate WSL2 VM just for you?

superfury wrote on 2025-02-19, 18:54:

On linux you'd need to run configure etc. first, as the default Makefile is configured for MSYS2/mingw on Windows. That basically will configure it for your build system (replace paths used, compiler commands etc.). It's the basic autogen.sh, configure, make combination. Although make will need stuff like the linux target for the correct linux-compatible makefile to be used.

$ cd UniPCemu
$ ./configure
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C... yes
checking whether gcc accepts -g... yes
checking for gcc option to enable C11 features... none needed
configure: creating ./config.status
config.status: creating Makefile
$ make
Use get-properties <propertyname> to print single properties of a project
/home/firefly/repos/unipcemu/UniPCemu/../commonemuframework/Makefile.multiplatform:34: *** Please specify a platform (psp,win,linux,android) and action ((re)build or (re)clean), e.g. make win build or make win clean. Optional other targets(besides build/clean) are (re)debug, rebuild, clean, SDL2(static) to use (static) SDL2 linking instead of dynamic SDL linking, (re)profile, analyze(2), x64(for MinGW64 only), mingw32 for MSYS2 32-bit compilation, mingw64 for MSYS2 64-bit compilation. Stop.
$ make linux build
/bin/bash: line 1: sdl-config: command not found
/bin/bash: line 1: sdl-config: command not found
make: Nothing to be done for 'linux'.
Compiling ../commonemuframework/basicio/fopen64.c
In file included from ../commonemuframework/headers/types.h:124,
from ../commonemuframework/basicio/fopen64.c:21:
../commonemuframework/headers/types_linux.h:105:10: fatal error: SDL/SDL.h: No such file or directory
105 | #include <SDL/SDL.h> //SDL library!
| ^~~~~~~~~~~
compilation terminated.
make: *** [../commonemuframework/Makefile.linux:262: ../../projects_build/UniPCemu/linux/___/commonemuframework/basicio/fopen64.o] Error 1

Doesn't quite work...

And if I try to build commonemuframework separately -- I still don't know if I'm supposed to do that, btw.

$ cd commonemuframework
$ make -f Makefile.linux
fatal: No names found, cannot describe anything.
make: sdl-config: No such device
make: sdl-config: No such device
cat: /../.git/HEAD: No such file or directory
Creating profiling executable prof...
Creating release executable copy of prof...
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/gcrt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
collect2: error: ld returned 1 exit status
make: *** [Makefile.linux:234: ../../projects_build//linux/prof] Error 1

I have SDL2 installed:

$ apt search libsdl2-dev
Sorting... Done
Full Text Search... Done
libsdl2-dev/jammy-updates,now 2.0.20+dfsg-2ubuntu1.22.04.1 amd64 [installed]
Simple DirectMedia Layer development files

SDL "non-2" is not an option:

$ apt search libsdl-dev
Sorting... Done
Full Text Search... Done

I am running Ubuntu 22.04:

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"

Reply 25 of 50, by superfury

Rank l33t++

It's calling sdl-config because you've configured it (using the make command) to compile using the default SDL 1.2 library.
Add "SDL2" (without quotes) to the command line of the make command.
So "make linux build SDL2".
That should make it use SDL2 instead.

Makefile.linux can't be called directly, as it requires the input from the Makefile.files file to give it information on the files to compile.
Change the directory to the UniPCemu directory (that's located in the root of the UniPCemu main repository) and run the above make command from there (after running autogen.sh and the configure command).
The configure command (with autogen.sh to generate it) basically creates a Makefile in the UniPCemu directory itself (not committed to git) that configures the other makefiles to call the correct gcc and related programs.
The linux target makes Makefile.multiplatform select the .linux file, where the .linux file takes the input from Makefile.files (which includes the Makefile.multiplatform) to start the compilation process based on the settings provided in the Makefile itself (that's written by the ./configure command).
The Makefile itself is just a simple container to make the configure command provide environment settings to the other makefiles. Said default Makefile in git simply defaults to the mingw toolchain-compatible settings (which I'm using on MSYS2 on Windows).
Android (Android Studio) builds exploit this by including Makefile.files directly and instructing the multiplatform layer not to include any platform-specific Makefile and just return the file list, which is then used by the Android NDK Makefile to compile the files inside Android Studio. (There's a compatibility layer as well for the old Android NDK command-line build method originally used in SDL 2.0.0, which has officially been deprecated by Android Studio nowadays, although still supported in theory if you use those old toolchains, assuming SDL2/3 is still compatible with that.)


Reply 26 of 50, by peterfirefly

Rank Newbie
superfury wrote on 2025-02-19, 20:43:

It's calling sdl-config because you've configured it (using the make command) to compile using the default SDL 1.2 library.
Add "SDL2" (without quotes) to the command line of the make command.
So "make linux build SDL2".
That should make it use SDL2 instead.

'make linux build SDL2' does the trick!

There are some warnings, most of them about snprintf() and also a few about unused variables.

Then I had to play "hunt the executable" for a while.

I cloned your repo into ~/repos/unipcemu so the executable ended up as ~/repos/projects_build/UniPCemu/UniPCemu.

Please don't ever do anything like that. Make your build stay in its lane. There's a reason why I didn't just want to run "make install", you know.

And -- please! -- don't ever do anything like that unannounced and undocumented!

Reply 27 of 50, by peterfirefly

Rank Newbie

Danoon, I just built Boxedwine.

Everything was easy, everything worked the way it should. Zero problems -- until "make_shared<T[]> not supported".

Turns out I used gcc 11. Once I switched to gcc 12, everything Just Worked™.

I do get some warnings:

$ cd Boxedwine/project/linux
$ make
uname -m is x86_64
MAKEFLAGS is -j 3
make[1]: Entering directory '/home/firefly/repos/Boxedwine/project/linux'
MAKEFLAGS
In file included from ../../include/platform.h:143,
from ../../include/boxedwine.h:60,
from /home/firefly/repos/Boxedwine/source/util/synchronization.cpp:1:
../../include/log.h: In instantiation of ‘void kpanic(const char*, Args&& ...) [with Args = {}]’:
/home/firefly/repos/Boxedwine/source/util/synchronization.cpp:147:11: required from here
../../include/log.h:30:34: warning: format not a string literal and no format arguments [-Wformat-security]
30 | auto size = std::snprintf(nullptr, 0, format, std::forward<Args>(args)...);
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../include/log.h:32:22: warning: format not a string literal and no format arguments [-Wformat-security]
32 | std::snprintf(msg.str(), size + 1, format, std::forward<Args>(args)...);
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from ../../include/platform.h:143,
from ../../include/boxedwine.h:60,
from /home/firefly/repos/Boxedwine/source/util/player.cpp:1:
../../include/log.h: In instantiation of ‘void klog(const char*, Args&& ...) [with Args = {}]’:

(There are lots of those)

And then it fails:

In file included from /usr/include/c++/11/memory:77,
from ../../include/boxedwine.h:8,
from /home/firefly/repos/Boxedwine/source/ui/utils/readIcons.cpp:1:
/usr/include/c++/11/bits/shared_ptr.h: In instantiation of ‘std::shared_ptr<_Tp> std::allocate_shared(const _Alloc&, _Args&& ...) [with _Tp = unsigned char []; _Alloc = std::allocator<unsigned char []>; _Args = {int&}]’:
/usr/include/c++/11/bits/shared_ptr.h:878:39: required from ‘std::shared_ptr<_Tp> std::make_shared(_Args&& ...) [with _Tp = unsigned char []; _Args = {int&}]’
/home/firefly/repos/Boxedwine/source/ui/utils/readIcons.cpp:340:40: required from here
/usr/include/c++/11/bits/shared_ptr.h:860:21: error: static assertion failed: make_shared<T[]> not supported
860 | static_assert(!is_array<_Tp>::value, "make_shared<T[]> not supported");
| ^~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/11/bits/shared_ptr.h:860:21: note: ‘!(bool)std::integral_constant<bool, true>::value’ evaluates to false
/usr/include/c++/11/bits/shared_ptr.h: In instantiation of ‘std::shared_ptr<_Tp> std::allocate_shared(const _Alloc&, _Args&& ...) [with _Tp = unsigned char []; _Alloc = std::allocator<unsigned char []>; _Args = {int}]’:
/usr/include/c++/11/bits/shared_ptr.h:878:39: required from ‘std::shared_ptr<_Tp> std::make_shared(_Args&& ...) [with _Tp = unsigned char []; _Args = {int}]’
/home/firefly/repos/Boxedwine/source/ui/utils/readIcons.cpp:371:62: required from here
/usr/include/c++/11/bits/shared_ptr.h:860:21: error: static assertion failed: make_shared<T[]> not supported
/usr/include/c++/11/bits/shared_ptr.h:860:21: note: ‘!(bool)std::integral_constant<bool, true>::value’ evaluates to false
make[1]: *** [makefile:138: Build/MultiThreaded//home/firefly/repos/Boxedwine/source/ui/utils/readIcons.cpp.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/firefly/repos/Boxedwine/project/linux'
make: *** [makefile:51: multiThreaded] Error 2

Tried again with gcc 12:

$ make clean
$ CC=gcc-12 CXX=g++-12 make
...
same warnings as with gcc 11.
...
make[1]: Leaving directory '/home/firefly/repos/Boxedwine/project/linux'

It works!

Versions:

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc-12 --version
gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"

In other words, project/linux/buildInstructions.txt should probably be updated ("you need to make sure you have gcc version 8 or higher").
Yes, BUILD.md does say "You need to have GCC 12 or higher." I read both, and "version 8" was what stayed in my memory.

"This means running Debian 12 or Ubuntu 23 or higher" does not seem to be true. "apt install gcc-12 g++-12" works fine on Ubuntu 22.04.

Reply 28 of 50, by peterfirefly

Rank Newbie

Running Boxedwine didn't really go well...

Finding the executable was easy.

The GUI is nice. I love that it checks for a Wine and offers to download one for me (Wine 9.0).

Unfortunately, I haven't been able to make it run anything.

I tried installing some of the demos. That didn't go well.

I tried Age of Empires first. I got "alloc64kBlock: failed to commit memory : Cannot allocate memory" and the install failed.

I tried Abiword. It installed but launching it gave me the same error.

I tried Diablo II shareware. Same error, won't even install.

It also seems to be very, very slow.

Quitting Boxedwine by closing the window sorta worked: it took a really long time to react and this was what was printed on the terminal:

$ ./boxedwine
Starting ...
alloc64kBlock: failed to commit memory : Cannot allocate memory
alloc64kBlock: failed to commit memory : Cannot allocate memory
alloc64kBlock: failed to commit memory : Cannot allocate memory
Boxedwine shutdown
Segmentation fault (core dumped)

I thought I maybe got the "Normal" CPU, perhaps even without caching of decode blocks.

Seems like BT_FLAGS (in the makefile) isn't used unless I specify 'multiThreaded' on the command line when I run make.

So I tried rebuilding Boxedwine with "CC=gcc-12 CXX=g++-12 make multiThreaded" and doing it again. It still didn't work.

I tried installing Wine 6.0. Still didn't work.

Clicking the close button of the window didn't work anymore. After several minutes, I had to give up by using "killall -9 boxedwine". Just "killall boxedwine" doesn't work for some reason.
The weird thing is that the GUI runs a "please wait" animation flawlessly, so why doesn't it react when I click the "close window" button?

Is there a way to get the executable to tell me how it was compiled? Something like uname, /etc/lsb-release, compiler, compiler version, important #define's?

Reply 29 of 50, by peterfirefly

Rank Newbie

danoon, I've been reading through some of your code 😀

It's fairly readable but, boy, there is a lot of it!

My first question was how you handle the x86 memory model on ARM hosts (only relevant for SMP, of course). A quick glance didn't spot any memory barriers in armv8btAsm.cpp, for example.

I wrote an AMD64 disassembler years ago (64-bit only) and that experience taught me that x86 decoding isn't nearly as complicated and slow as people kept saying. I haven't written an emulator so there might be some nuances I don't get yet.

Your decoder is a lot less table-driven than my old AMD64 disassembler was. Lotsa classes... lots of abstraction.

How slow/fast is it? How much binary code does it compile to? Is I$ (instruction cache) pressure an issue?

Looks like your decoder system pretends that prefixes are instructions -- which also explains why you talk about LOCK as if it were an instruction:

https://github.com/danoon2/Boxedwine/blob/mas … CPUemulation.md

The normal CPU does not handle the "lock" instruction in x86.

How do you detect instructions > 15 bytes (due to repeated prefixes) so you can generate an exception for that?

I love the franken-architecture of your emulator: WINE on top of an emulated Linux ABI + emulated OpenGL + emulated X11 + emulated OSS, all of it next to an x86 emulator!
It must have been a lot of work. Seriously, a lot of work.

That clears up a mystery regarding the software TLB you sketched out the other day.

Having 2^20 entries makes a TLB invalidation really slow, so it can't happen too often. I figured you might have had a mechanism to track which parts were in use in order to make TLB invalidation cheaper. It doesn't seem so.

But your franken-architecture is what saves you! Since you are not running a real Linux kernel, the "kernel" won't do any TLB flushes internally + won't need to do any direct mapping of all the physical memory. There is no X11 process so there's no TLB invalidation when switching between the X11 server and the running Windows program.

One of the upsides is that you don't need to emulate any hardware whatsoever. This means there's no memory-mapped I/O, so the software TLB can be simpler.

The downside is that you need to handle memory-mapping of files, which leads to some extra complexity in the software TLB.

Reply 30 of 50, by peterfirefly

Rank Newbie
superfury wrote on 2025-02-18, 23:07:

Also, slightly related, replacing much of those caches with added function pointers and flags is too much. The PSP only has some 16 or so MB of memory, it barely has any memory left with all hardware installed (and a small 1MB soundfont too), so I'm already trying to conserve memory as much as possible while trying to make it faster.

I forgot the PSP was that old! Yeah, 2^20 entries is a lot when you only have 16-32 MB...

That's not a problem, though. I only ran with danoon's example because it was so straightforward to use a 2^20-entry direct-mapped software TLB.

The important part is to use a single software TLB (and avoid using function pointers and extra calls where possible), not the specific way the software TLB is constructed.

It is not at all hard to design a much smaller software TLB, so let's do that!

First, let's make two remarks regarding correctness and performance.

correctness:
INVLPG is allowed to invalidate more than one page mapping from the TLB.

performance:
Having a smaller TLB that gets invalidated a little eagerly may not be slow in practice.

The 2^20-entry direct-mapped software TLB is that big because it covers the entire 4GB virtual address space. We don't actually need to cover more than a small fraction of the virtual address space. Being direct-mapped is nice, though. A real HW TLB benefits from being 2-way, 4-way, or even more. A software TLB likely doesn't (because hardware is really good at doing things in parallel and software is really good at doing things that are complex).

The software TLB I'm going to sketch out will still be direct-mapped but it won't be able to cover all 4GB. It will only have, say, 2048 entries.

The first trick to a smaller software TLB is to use a radix tree.

ASCII art of a radix tree for 2^20 page entries w/ 2^7-entry leaf nodes.

virtaddr:

  3322222222221111111111
  10987654321098765432109876543210
  ------                           [31:26]  6-bit index into root node
        -------                    [25:19]  7-bit index into mid-level node
               -------             [18:12]  7-bit index into leaf node
                      ------------ [11:0]  12-bit offset into page

  Root node            Mid-level node
  +----+               +-----+            Leaf node
  |  0 |-------------->|  0  |            +-----+
  |  1 |               |  1  |----------->|  0  |----> ptr to page RAM inside emulator
   ....                 .....             |  1  |
  | 63 |               | 127 |             .....
  +----+               +-----+            | 127 |
                                          +-----+

The node sizes are up to you. The number of levels in the tree is up to you. This is just an example.

A radix tree allows you to have a sparse representation of what could, in principle, be a full 2^20-entry direct-mapped software TLB.

Going through the tree for every memory access is not necessary -- the next
code fetch is likely going to use the same leaf node as the previous code fetch,
the next data fetch is likely going to use the same leaf node as the previous
data fetch, etc.

Tracking the last leaf node used for fetch/read/write -- or even fetch + read
for each segment + write for each segment -- will likely speed things up a bit.

The actual physical addresses are not going to be used often, unless you are doing debugging, logging, or emulating hardware breakpoints. So why store them in the same leaf node array as the ptr to the page RAM and the flags/function pointers? The CPU caches will probably like it if the physical addresses get their own leaf node array.

You don't need to store a full 4/8-byte pointer to host memory in each leaf node entry. Assuming you allocate all the host memory needed in a single block, you can get by with just using an offset relative to that block (and since you are using 4KB pages, that offset can count pages instead of bytes). If the system you emulate won't have much memory, you don't need a lot of bits for that. 4-8MB is 1024-2048 pages, which only takes 12-13 bits. If you use 16 bits, you can handle 65536 pages (256 MB). There are only about a handful of different memory types so you can represent that with a small integer. That means you can pack the whole thing into 32 bits -- or maybe 16 bits for the "pointer" to host memory + 8 bits for the flags/memtype.

The number of expected live TLB entries is quite small. 2048 entries is enough to map 4MB twice. Windows maps every page roughly twice, I believe: once in a direct mapping that allows the kernel to access every page, and once for the pages mapped by the currently running userspace program.

With 128 pages per leaf node, this gives us up to 16 leaf nodes.

Worst case for the tree would then be:

  • 1 root node
  • 16 mid-level nodes
  • 16 leaf nodes

The pointers in the root node could be normal pointers -- or just an index into an array of mid-level nodes.

The pointers in the mid-level nodes could be normal pointers -- or just an index into an array of leaf nodes.

Software TLB budget so far:

a leaf node array for page offsets:     2048*2 = 4096 bytes
a leaf node array for physaddrs:        2048*4 = 8192 bytes
a leaf node array for flags/funcptridx: 2048*1 = 2048 bytes
----------------------------------------------------------
subtotal: 14336 bytes

a root node:        64*{1|4|8}                      = 64/256/512 bytes
16 mid-level nodes: 16*128*{1|4|8} = 2048*{1|4|8}   = 2048/8192/16384 bytes
----------------------------------------------------------
subtotal: 2112-16896 bytes

total: 16448-31232 bytes

So how do we implement TLB invalidation? Since this TLB is a lot smaller than the previous suggestion, it doesn't cost so much to invalidate the whole thing, if necessary.

If we want to make full invalidations cheaper, we can track which leaf nodes
are in use, either by putting those on a linked list or by having a bit map
that says which ones are in use.

Such a linked list (or bitmap) doesn't have to operate on the leaf nodes directly,
that is, it can operate on smaller clusters of pages or even individual pages.

Since even a full bitmap that covers the entire 2048 entries is quite small (256 bytes), and it is natural to step through more than one bit at a time (say 32 bits at a time = 64 steps), using a bitmap doesn't cost much.

Maybe use clusters of 8 pages? 16 pages? I dunno, I just wanted to be explicit
about there being a degree of design freedom here.

Here's a sketch of how different kinds of memory would work:

normal RAM:           canFetch, canRead, canWrite
normal ROM:           canFetch, canRead  -- function ptr for write
unmapped:                                -- function ptrs for fetch + read + write
unmapped*:            canFetch, canRead  -- function ptr for write
BIOS:                 canRead            -- function ptr for fetch + write
VRAM:                                    -- function ptr for fetch + read + write
MMIO:                                    -- function ptr for fetch + read + write
ROM mapped over RAM:  canRead, canWrite  -- function ptr for fetch
MMIO mapped over ROM:                    -- function ptr for fetch + read + write

The alternative unmapped* would map all unmapped pages to the same 4KB page of
0xFF bytes. Reads would be full speed, writes would be ignored.

Normal ROM (if it exists in your system) would not trap fetches from certain
addresses. Fetches and reads would just be normal inline reads. Writes would
be ignored.

BIOS traps fetches from certain addresses + ignores writes.

VRAM is weird so everything has to go through function ptrs.

MMIO can be used for all other memory-mapped I/O (such as APIC).

Shouldn't be hard to fit all this into 8 bits or less.

The memory type number would be used to index into an array of function pointers -- but they wouldn't be used most of the time. Most memory accesses would have the relevant canXXX flag turned on so they would just be normal reads/writes.

This software TLB sketch is a lot smaller than what Boxedwine uses. Boxedwine uses a 32-bit word for each TLB entry: 20 bits map to host memory, 12 bits hold flags and memory types. Boxedwine doesn't use physical addresses for anything.
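A 32-bit entry split 20/12 like that could be packed along these lines (a sketch; the helper names are made up):

```c
#include <stdint.h>

/* High 20 bits select a host 4KB page, low 12 bits hold flags and the
 * memory type. */
static inline uint32_t tlb_pack(uint32_t host_page, uint32_t flags)
{
    return (host_page << 12) | (flags & 0xFFFu);
}

/* Combine the entry's page with the low 12 address bits. */
static inline uint32_t tlb_host_addr(uint32_t entry, uint32_t vaddr)
{
    return (entry & ~0xFFFu) | (vaddr & 0xFFFu);
}

static inline uint32_t tlb_flags(uint32_t entry)
{
    return entry & 0xFFFu;
}
```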

Reply 31 of 50, by peterfirefly

User metadata
Rank Newbie
Rank
Newbie

superfury, I have no idea what to do here.

I can barely read the font used on top of the screen. I am not sure I am getting all the text in the bottom right corner or if something has been cut off.

(I also don't understand why it takes about a second to shut down various subsystems before the app quits. What does it do that would take more than 10ms?)

Reply 32 of 50, by superfury

User metadata
Rank l33t++
Rank
l33t++

You can use the parameter fullscreenwindow to make it larger.

One of the reasons some subsystems take longer is due to saving data into the settings file, which is pretty slow. Also the modem's packet server takes longer due to sending network commands with longer timeouts (to deallocate IP(X) addresses).

The RAM already uses a sort of paging-ish table that's cached by the RAM module (one entry for reads, one entry for writes and one entry for code fetches). All are doubled to provide for DMA as well (for reads and for writes only, though), so 5 out of 8 entries are used at a time.
Writing to RAM invalidates the read caches that match the locations. It checks for overlap on all CPU caches, which is very slow.
The VGA RAM and BIOS/Option ROMs are of the 'uncached' type. The BIOS ROM due to being flash (thus stuff included like status bits etc.) and VGA RAM due to being dynamic.
VGA RAM can also be mapped over normal RAM (like with ET4000 chipsets).
For all cases flash ROM writes invalidate the caches too.

One reason the BIU is relatively slow on IPS clocking mode is that it dumbly fetches data till the PIQ is filled to 12 bytes (max instruction length) each instruction. Thus a lot of instruction reads happen.

The single-address block caching might also be slower due to only having one entry per fetch/read/write type. Thus random access kills it. They essentially constantly get invalidated each read/write on x86, as each instruction addresses a different RAM/ROM block.

Imagine a typical function startup in the BIOS:
- read BIOS ROM for instruction (instruction cache, no problem)
- instruction pushes data on stack
(Both different addresses, so invalidates the simple 1-entry cache each time)
But usually instructions jump all over, loops etc, thus destroying caches. Or moves data between 2 RAM/ROM locations, destroying a 1-entry cache on every read and write from memory.
Usually made worse with lots of accesses to different parts of RAM and ROM interleaved. So usually the read/write caches don't last more than 1 instruction, 2 at most.
Unless you'd make the caches use multiple entries like the CPU's own TLB has (4-way, for example, in my x86 CPU emulation's paging unit). That lessens it somewhat, though it makes write invalidation heavy too, for every byte/word/dword written.
Linear access speed is pretty quick, though, as everything mostly uses the caches properly. It's just the random accesses that kill it there.
And didn't even include Compaq and Inboard memory remapping during runtime.
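For what it's worth, the 4-way idea mentioned above could be sketched like this (names illustrative; round-robin replacement instead of LRU to keep it short; note the tags are zero-initialised, so real code would pre-fill them with ~0u or treat page 0 specially):

```c
#include <stdint.h>

#define SETS 64
#define WAYS 4

typedef struct {
    uint32_t tag[WAYS];     /* guest page number */
    uint32_t data[WAYS];    /* cached translation */
    uint8_t  next;          /* round-robin victim index */
} tc_set;

static tc_set cache[SETS];

/* Returns 1 and fills *out on a hit, 0 on a miss. */
static int tc_lookup(uint32_t page, uint32_t *out)
{
    tc_set *s = &cache[page % SETS];
    for (int w = 0; w < WAYS; w++)
        if (s->tag[w] == page) { *out = s->data[w]; return 1; }
    return 0;
}

static void tc_insert(uint32_t page, uint32_t data)
{
    tc_set *s = &cache[page % SETS];
    s->tag[s->next] = page;
    s->data[s->next] = data;
    s->next = (uint8_t)((s->next + 1) % WAYS);
}
```

With 4 ways, a loop that touches a handful of pages no longer evicts its own translations on every access the way a 1-entry cache does.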

During startup/teardown it also renders text frames, which, although optimized, is very slow (in the order of milliseconds per frame or worse). It redraws the whole video output screen cache (last frame) with black borders if needed, followed by the various text layers on top (some 3 layers). The text layers are processed in 3 stages based on each other: 1. text/color to font mapping, 2. font border drawing, 3. rendering transparently onto the current frame output, done in background-to-foreground order. Finally (also slow) it pushes to SDL1/2/3 for rendering using SDL 1.2-compatible rendering methods. I've optimized it a lot, but it's still not very fast compared to others (no hw acceleration after all).

Last edited by superfury on 2025-02-20, 20:46. Edited 2 times in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 33 of 50, by peterfirefly

User metadata
Rank Newbie
Rank
Newbie
superfury wrote on 2025-02-20, 20:04:

You can use the parameter fullscreenwindow to make it larger.

Thank you. I'll try that tomorrow 😀

I would like to get to the point where I can actually run some 32-bit x86 code so I can instrument your memory translation and see how costly it is + get some numbers for how often INVLPG and full TLB invalidation happens.

Ideally implement the scheme I suggested in parallel w/ your existing scheme being the oracle it is tested against. Then I could also get an idea of how big the actual working set (of pages) gets and how many page mappings usually need to get invalidated.

We'll see. But I need to figure out how to actually run some code first.

superfury wrote on 2025-02-20, 20:04:
One of the reasons some subsystems take longer is due to saving data into the settings file, which is pretty slow. Also the mode […]
Show full quote

One of the reasons some subsystems take longer is due to saving data into the settings file, which is pretty slow. Also the modem's packet server takes longer due to sending network commands with longer timeouts (to deallocate IP(X) addresses).

The RAM already uses a sort of paging-ish table that's cached by the RAM module (one entry for reads, one entry for writes and one entry for code fetches). All are doubled to provide for DMA as well (for reads and for writes only, though), so 5 out of 8 entries are used at a time.
Writing to RAM invalidates the read caches that match the locations. It checks for overlap on all CPU caches, which is very slow.
The VGA RAM and BIOS/Option ROMs are of the 'uncached' type. The BIOS ROM due to being flash (thus stuff included like status bits etc.) and VGA RAM due to being dynamic.
VGA RAM can also be mapped over normal RAM (like with ET4000 chipsets).
For all cases flash ROM writes invalidate the caches too.

One reason the BIU is relatively slow on IPS clocking mode is that it dumbly fetches data till the PIQ is filled to 12 bytes (max instruction length) each instruction. Thus a lot of instruction reads happen.

The single-address block caching might also be slower due to only having one entry per fetch/read/write type. Thus random access kills it. They essentially constantly get invalidated each read/write on x86, as each instruction addresses a different RAM/ROM block.

Are you talking about the 128-bit things? Or are you talking about cached address translations?

If the latter, I wonder if it would be useful to have an extra cached block for read-modify-write access? It should fit nicely into your existing scheme -- if it's an address translation cache, that is.

Reply 34 of 50, by superfury

User metadata
Rank l33t++
Rank
l33t++
peterfirefly wrote on 2025-02-20, 20:41:
Thank you. I'll try that tomorrow :) […]
Show full quote
superfury wrote on 2025-02-20, 20:04:

You can use the parameter fullscreenwindow to make it larger.

Thank you. I'll try that tomorrow 😀

I would like to get to the point where I can actually run some 32-bit x86 code so I can instrument your memory translation and see how costly it is + get some numbers for how often INVLPG and full TLB invalidation happens.

Ideally implement the scheme I suggested in parallel w/ your existing scheme being the oracle it is tested against. Then I could also get an idea of how big the actual working set (of pages) gets and how many page mappings usually need to get invalidated.

We'll see. But I need to figure out how to actually run some code first.

superfury wrote on 2025-02-20, 20:04:
One of the reasons some subsystems take longer is due to saving data into the settings file, which is pretty slow. Also the mode […]
Show full quote

One of the reasons some subsystems take longer is due to saving data into the settings file, which is pretty slow. Also the modem's packet server takes longer due to sending network commands with longer timeouts (to deallocate IP(X) addresses).

The RAM already uses a sort of paging-ish table that's cached by the RAM module (one entry for reads, one entry for writes and one entry for code fetches). All are doubled to provide for DMA as well (for reads and for writes only, though), so 5 out of 8 entries are used at a time.
Writing to RAM invalidates the read caches that match the locations. It checks for overlap on all CPU caches, which is very slow.
The VGA RAM and BIOS/Option ROMs are of the 'uncached' type. The BIOS ROM due to being flash (thus stuff included like status bits etc.) and VGA RAM due to being dynamic.
VGA RAM can also be mapped over normal RAM (like with ET4000 chipsets).
For all cases flash ROM writes invalidate the caches too.

One reason the BIU is relatively slow on IPS clocking mode is that it dumbly fetches data till the PIQ is filled to 12 bytes (max instruction length) each instruction. Thus a lot of instruction reads happen.

The single-address block caching might also be slower due to only having one entry per fetch/read/write type. Thus random access kills it. They essentially constantly get invalidated each read/write on x86, as each instruction addresses a different RAM/ROM block.

Are you talking about the 128-bit things? Or are you talking about cached address translations?

If the latter, I wonder if it would be useful to have an extra cached block for read-modify-write access? It should fit nicely into your existing scheme -- if it's an address translation cache, that is.

Actually it's both:
- The MMU caches the address translation for RAM reads/writes/fetches (table generated at the top of the module). That's much like the x86 CPU paging unit's lookups, but compressed to 5 bytes (and 4 or 5 bits of access rights; from the top of my head, I don't remember exactly).
- The BIU does the same, but for the actual read data instead (up to 128 bits of aligned data, with bytes before the requested address being discarded. For example, memory location 1 will cache bytes 1-15, with byte 0 being SHR'ed out). Subsequent reads read from the cache; any other read location invalidates the entire read cache for the data in the BIU. You'll see that BIU_directrb spends quite a lot of time on those (mainly due to the PIQ FIFO buffer filling in Dosbox-style IPS clocking mode).
Memory (RAM) writes are heavy too, due to constant BIU invalidation (and checks for overlapping on each BIU read/write/instruction cache).
The reads from RAM and hardware also try to return data as aligned as possible (so 128-bit aligned, else 64-bit aligned, else 32-bit, etc.). Usually just the 128-bit alignments are hit (although reading 2x64 bits in reality), effectively SHRing the low 4 bits of the address offset away (for example, index 2+ is SHR 16, index 3+ is SHR 24, etc.), depending on where the CPU reads from and its closest alignment.
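A sketch of that one-line BIU data cache in C (names made up; a plain byte array stands in for emulated RAM, and byte indexing into the line stands in for the SHR):

```c
#include <stdint.h>
#include <string.h>

/* One 16-byte line, filled from the 128-bit-aligned block containing
 * the requested address; bytes below the request are skipped over. */
static uint8_t ram[0x10000];      /* stand-in for emulated RAM */
static uint8_t line[16];
static uint32_t line_base = ~0u;  /* base address of cached block, ~0 = empty */

static uint8_t biu_read8(uint32_t addr)
{
    uint32_t base = addr & ~15u;  /* 128-bit alignment */
    if (base != line_base) {      /* miss: refill the line */
        memcpy(line, &ram[base], 16);
        line_base = base;
    }
    return line[addr & 15u];      /* the "SHR" step */
}

/* Any write overlapping the cached line must invalidate it. */
static void biu_write8(uint32_t addr, uint8_t v)
{
    ram[addr] = v;
    if ((addr & ~15u) == line_base)
        line_base = ~0u;
}
```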

Edit: The paging-like displacement and flags LUT is 491555 bytes large in total, split into 32 bits of displacement + 1 byte (1 upper bit of displacement, and 5 out of 7 upper bits used for flags: ROM (bit 6), unmapped (bit 7), etc.). The value is literally a displacement to be subtracted from the memory address (or added, which is another flag, used when the lowest RAM addresses are a special reserved block mapped at FE0000 for example, but placed at the start of the memory buffer due to priorities (reserved memory has higher allocation priority than normal RAM)).
The usual mode is subtractive mode, where the LUT value (excluding flags) is subtracted from the address to obtain the address into the malloc'ed RAM buffer.
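A sketch of the subtractive/additive displacement idea (names and bit positions illustrative; offsets into the RAM buffer are returned instead of raw pointers to keep the arithmetic well-defined):

```c
#include <stdint.h>

/* Per-block LUT entry: a 32-bit displacement plus a flags byte.
 * ROM/unmapped bits would tell the caller to take a slow path; the
 * additive flag covers blocks (e.g. reserved RAM mapped at FE0000)
 * that live at the start of the malloc'ed buffer. */
enum { BLK_ADDITIVE = 1u << 5, BLK_ROM = 1u << 6, BLK_UNMAPPED = 1u << 7 };

typedef struct {
    uint32_t disp;   /* displacement applied to the guest address */
    uint8_t  flags;
} blk_entry;

/* Translate a guest physical address to an offset into the RAM buffer. */
static uint32_t translate(uint32_t addr, const blk_entry *e)
{
    if (e->flags & BLK_ADDITIVE)
        return addr + e->disp;   /* additive mode */
    return addr - e->disp;       /* usual subtractive mode */
}
```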

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 35 of 50, by peterfirefly

User metadata
Rank Newbie
Rank
Newbie
peterfirefly wrote on 2025-02-20, 20:41:
superfury wrote on 2025-02-20, 20:04:

You can use the parameter fullscreenwindow to make it larger.

Thank you. I'll try that tomorrow 😀

$ ./UniPCemu fullscreenwindow
Segmentation fault (core dumped)

Reply 36 of 50, by superfury

User metadata
Rank l33t++
Rank
l33t++
peterfirefly wrote on 2025-02-21, 09:15:
[…]
Show full quote
peterfirefly wrote on 2025-02-20, 20:41:
superfury wrote on 2025-02-20, 20:04:

You can use the parameter fullscreenwindow to make it larger.

Thank you. I'll try that tomorrow 😀

$ ./UniPCemu fullscreenwindow
Segmentation fault (core dumped)

That's weird. I run it on SDL 2.30.9 / SDL_net 2.2.0 (on Windows 10) without any crashes there, though that's on Windows 10/11 MSYS2/mingw64 builds...
Edit: Just checked Ubuntu 18.04.6. It runs fine there.
Can't check latest Ubuntu inside Virtualbox, as it crashes during the first boot (with the graphic driver complaining about being unable to run properly). I don't get any segmentation fault.
Perhaps you could set the UNIPCEMU environment variable and point to a writable location (the current directory isn't writable)?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 37 of 50, by peterfirefly

User metadata
Rank Newbie
Rank
Newbie
superfury wrote on 2025-02-21, 16:34:
That'd weird. I run it on SDL 2.30.9 / SDL_net 2.2.0 (on Windows 10) without any crashes there? Though that's on Windows 10/11 M […]
Show full quote
peterfirefly wrote on 2025-02-21, 09:15:
[…]
Show full quote
peterfirefly wrote on 2025-02-20, 20:41:

Thank you. I'll try that tomorrow 😀

$ ./UniPCemu fullscreenwindow
Segmentation fault (core dumped)

That's weird. I run it on SDL 2.30.9 / SDL_net 2.2.0 (on Windows 10) without any crashes there, though that's on Windows 10/11 MSYS2/mingw64 builds...
Edit: Just checked Ubuntu 18.04.6. It runs fine there.
Can't check latest Ubuntu inside Virtualbox, as it crashes during the first boot (with the graphic driver complaining about being unable to run properly). I don't get any segmentation fault.
Perhaps you could set the UNIPCEMU environment variable and point to a writable location (the current directory isn't writable)?

The current directory when I launch it is writable.
Setting UNIPCEMU to that very same directory makes no difference (whether I add a trailing / or not). I still get a segmentation fault.

Maybe it doesn't like WSL2 -- but in that case it is still doing something wrong. GUI programs run fine under WSL2 unless they are buggy.

Reply 38 of 50, by superfury

User metadata
Rank l33t++
Rank
l33t++
peterfirefly wrote on 2025-02-21, 18:42:
The current directory when I launch it is writable. Setting UNIPCEMU to that very same directory makes no difference (whether I […]
Show full quote
superfury wrote on 2025-02-21, 16:34:
That'd weird. I run it on SDL 2.30.9 / SDL_net 2.2.0 (on Windows 10) without any crashes there? Though that's on Windows 10/11 M […]
Show full quote
peterfirefly wrote on 2025-02-21, 09:15:
[…]
Show full quote
$ ./UniPCemu fullscreenwindow
Segmentation fault (core dumped)

That's weird. I run it on SDL 2.30.9 / SDL_net 2.2.0 (on Windows 10) without any crashes there, though that's on Windows 10/11 MSYS2/mingw64 builds...
Edit: Just checked Ubuntu 18.04.6. It runs fine there.
Can't check latest Ubuntu inside Virtualbox, as it crashes during the first boot (with the graphic driver complaining about being unable to run properly). I don't get any segmentation fault.
Perhaps you could set the UNIPCEMU environment variable and point to a writable location (the current directory isn't writable)?

The current directory when I launch it is writable.
Setting UNIPCEMU to that very same directory makes no difference (whether I add a trailing / or not). I still get a segmentation fault.

Maybe it doesn't like WSL2 -- but in that case it is still doing something wrong. GUI programs run fine under WSL2 unless they are buggy.

Perhaps it somehow fails setting up the GUI through SDL? But that shouldn't happen?
Or perhaps an SDL library issue? Can't verify with the latest Ubuntu (due to video driver issues on the latest versions). The GUI doesn't have different code inside UniPCemu across platforms, so perhaps something else is causing it? Does it also happen without command line parameters, or with something else in there instead? So is it caused by the fullscreenwindow parameter itself or by parameter parsing in general?
Said parameter itself when matched will just set a flag to 1 to make the video layer initialization call some extra SDL functions to perform some automatic scaling. That's inside emu/gpu/gpu.c, inside the common emulator framework, function updateWindow.
It just performs an alternate path on the PSP (IS_PSP defined). All other code is used for all platforms.
It runs fine on my older Ubuntu installation, with and without the parameter.
Maybe a WSL+SDL2+scaling problem inside SDL2?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io