VOGONS


First post, by superfury

User metadata
Rank l33t++

Are there any free online tutorials etc. that might help with my development of my emulator in some way?

I built the UniPCemu emulator itself using just some basic knowledge: basic C from cplusplus.com (my father taught me a bit of Turbo Pascal 6.0 in my childhood, when I was 11 or so, and I remember reading through some book back then, PC Intern I think, which had Pascal and assembly code in it). I also knew some basic assembly, and from reading through the 386 programmer's reference manuals (up till the Pentium) I know basic x86 assembly as well (no FPU or MMX etc. though).

So although I know a bit of simple stuff (think linked lists, pointers to functions/objects, arithmetic and some other basics, besides the constructs of the language itself), I don't know much of the actual (advanced?) programming stuff. Usually I just look things up, but perhaps that's not enough to really improve it in a large way (like making it much faster without losing accuracy or functionality).

Is there any free website etc. out there that teaches more advanced programming stuff I'd need to know to become a better programmer for this? Like learning more advanced algorithms etc. that you wouldn't find everywhere already (which are mostly the basics from what I can see from simple googling etc.)?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 50, by GloriousCow

User metadata
Rank Member

I think if you've written a Pentium emulator you're doing pretty good. Most programmers don't know "advanced programming stuff" either by that measurement. Ask your average computer science grad to write a red-black tree from scratch some ten years after graduation!

If performance is your main concern, you can look into profiling tools and methods, and things like SIMD, JIT recompilation and GPU acceleration. These are things where 'free online tutorials' of any sufficient quality are scarce but there is information available.

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc

Reply 2 of 50, by superfury

User metadata
Rank l33t++
GloriousCow wrote on 2024-07-17, 19:19:

I think if you've written a Pentium emulator you're doing pretty good. Most programmers don't know "advanced programming stuff" either by that measurement. Ask your average computer science grad to write a red-black tree from scratch some ten years after graduation!

If performance is your main concern, you can look into profiling tools and methods, and things like SIMD, JIT recompilation and GPU acceleration. These are things where 'free online tutorials' of any sufficient quality are scarce but there is information available.

Too bad it's broken right now though (the CPU, afaik). i4x0 platforms still POST and MS-DOS 6.22 boots, but the XT BIOS can't POST. Windows 9x also got broken (can't boot anymore). Test suites oddly check out though (test386.asm). And I know it's somewhere in the recent changes: CPU IRQ handling, SMM improvements (i4x0 only though) and related bugfixes (I think Windows 9x failed to boot sometime after those changes to SMM, but I'm not exactly sure when).

It's weird that the i4x0 POSTs and boots MS-DOS 6.22, but the Generic Super PC/Turbo XT BIOS can't POST at all (crashing before display set-up (video BIOS ROM) is reached, seemingly looping on itself somehow). I assume it has to do with CPU commit behaviour etc. as it's applied now, but I can't seem to pinpoint the problem there.

As for why I'm asking about this in relation to performance: it might be accurate, but it's still very slow compared to other emulators from what I can see. It reaches no more than about 700 KIPS at most (which obviously is in IPS clocking mode, taking the app-measured speed multiplied by the emulated instruction speed that's set up for optimal efficiency), and XT cycle-accurate mode runs at no more than ~50% realtime speed with the current code (a 4.77MHz base clock effectively driving everything, as the CPU is the master clock in UniPCemu as it were; the only difference between IPS and cycle-accurate mode is that the timing transferred to other components is based on the new-instruction state (the instruction-executed boolean giving 1 or 0 cycles) instead of normally ticked cycles (0 or more)). Which sounds really slow?

I'm already profiling it a lot, but it effectively keeps boiling down to one set of function call trees (functions and their called functions): the memory mapped I/O read functions. Those handle memory mapped I/O and installed RAM devices (or simply put, mapped RAM blocks).
One of the core culprits there is the RAM memory mapping itself, which already uses a small cache for memory translation: effectively a sort of PAE-ish lookup table (only ROMmed, present and buffer address are implemented) that's cached in an 8-entry (effectively 1-entry) cache (4 entries for the CPU and 4 for DMA, of which only 3 are used: 1 for data reads, 1 for data writes, 1 for instruction fetches (CPU only)).
So said cache can only cache 1 address block at a time for reads, writes and code fetches (separated). So it can have block #0 read data, block #1 write data and block #2 code data cached, for both any emulated CPU (shared between all cores) and the emulated DMA chips (CPU and DMA split due to access rights differences).
Implementing a full CPU-like paging mechanism there might be overkill, I think? It can be modified to do that though, if it improves speed. Random access kills the current cache right now (causing a full page walk for every different memory page (64K granularity)).
And that's just physical memory. Video memory and other MMIO have their own hooks that are called first, in a loop for every memory access, and will prevent the RAM layer from being processed if their mapping is used (for example accessing the BIOS (flash) ROM, video RAM mapping or option ROMs, which basically boils down to range checks and specialized handling if it matches).

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 3 of 50, by peterfirefly

User metadata
Rank Newbie
GloriousCow wrote on 2024-07-17, 19:19:

Ask your average computer science grad to write a red-black tree from scratch some ten years after graduation!

The only hard part of implementing red-black trees is deletion. Sometimes you don't even need/want it because you are using a persistent form + some kind of garbage collection (maybe just arenas that can be discarded/reset en bloc afterwards).

Insert/lookup are remarkably nice to implement in languages with pattern match (ML, Haskell -- and Rust).

I actually played with red-black trees once because I felt the invariants were a bit too cumbersome to maintain in C (especially for deletion!) and because I felt that maintaining them took too much time (all that subtree rotation all the time!). I implemented Andersson's AA trees (a variant with looser invariants) and tried some of my own looser invariants.

A year or two later, Sedgewick came out with Left-leaning red-black trees that were better than what I had come up with.

"I coulda been a contender!"

Reply 4 of 50, by superfury

User metadata
Rank l33t++

Don't even know those red-black trees. I guess they're some kind of hashing-based search algorithm? There's only one such lookup in my emulator (the CPU TLB), but it's a simple lookup table array (for fast retrieval) combined with a doubly linked list for fast LRU retrieval. There are 2 lookup tables: one of 1MB (for 4KB entries) and one of 1KB (for 4MB entries). They point their memory block (1-4GB) to the used cache entry (1 byte of data to specify a used entry (+1), with 0 being uncached). Mapping 4KB unmaps 4MB and vice versa (although writing, and during that the unmapping of 4MB from 4KB, is relatively slow, due to the TLB fully walking its array (needing to unmap all relevant 4KB entries for the 4MB range)). Although it's only 64 entries to be walked at most (the total amount of TLB entries, matching the top 10 bits of the address only), and that's of course not including unused entries (only present cached entries).
So its memory usage is the TLB entries themselves, the pointers for the doubly linked lists (head and tail), 1MB for 4KB lookups and 2KB for 4MB/2MB lookups. (https://bitbucket.org/superfury/unipcemu/raw/ … rs/cpu/paging.h)

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 5 of 50, by peterfirefly

User metadata
Rank Newbie
superfury wrote on 2025-02-17, 02:04:

Don't even know those red-black trees. I guess they're some kind of hashing-based search algorithm?

No, they are binary trees that are guaranteed to stay balanced (so the path from the root to the leaves is the same(ish) length no matter which path we choose). An unbalanced binary tree degenerates to a linked list with extra steps. A balanced one has a height of log_2(number of entries), modulo some rounding.

If we only need a static tree, we can build it so it is balanced. Usually, we want a dynamic tree, so we need a mechanism to keep it balanced. We could rebalance the entire tree from scratch every now and then, but it turns out it's not that hard to just keep it balanced. Red-black trees are one such type of balanced binary tree. They are called "red-black" because the nodes in the tree have a colour: they are either "red" or "black".

Why? Because a red-black tree is emulating a 2-3 tree or a 2-3-4 tree using only binary nodes. What are those? Balanced trees where the interior nodes can have 2 or 3 (or 4) sub-trees. That makes it a lot easier to keep them balanced, because you can often just insert a sub-tree without having to propagate changes upwards to the root -- because you go from 2 subtrees to 3, or from 3 to 4. Occasionally, a node is full so we can't insert an extra sub-tree. In that case, we split it. Red-black trees do the exact same thing, but the role of the interior nodes that can have 2-4 sub-trees is played by 1-3 interior nodes that collectively do the job. One of the colours -- I can never remember whether it's red or black -- marks the node group as representing a 2-3-4 node, which means that only one of the 1-3 nodes will have that colour. The other nodes in the group (if any) have the other colour.
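
For a concrete picture, here is a minimal sketch in C of a red-black node and a (purely iterative) lookup. The names and key type are just illustrative, and the hard part -- rebalancing on insert/delete -- is deliberately left out:

#include <stddef.h>
#include <stdint.h>

typedef enum { RED, BLACK } rb_color;

typedef struct rb_node {
    struct rb_node *left, *right;
    rb_color color;            /* the extra bit that encodes the 2-3-4 grouping */
    uint32_t key;
    void *value;
} rb_node;

/* Lookup is plain binary-search-tree code; the colours only matter
   when inserting or deleting (to keep the tree balanced). */
static void *rb_lookup(const rb_node *root, uint32_t key)
{
    while (root != NULL) {
        if (key == root->key)
            return root->value;
        root = (key < root->key) ? root->left : root->right;
    }
    return NULL; /* not found */
}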

This is a nice textbook that covers useful standard data structures and algorithms:

https://en.wikipedia.org/wiki/Introduction_to_Algorithms

I have the first edition as a paperback and it looks like I should get the new 4th ed, as mine is 35 years old now! (I think I've "only" had it for 28 years.)

Reply 6 of 50, by DaveDDS

User metadata
Rank Oldbie

FWIW, as someone who's written a few emulators (for small systems), if you
don't absolutely need "portability", I highly recommend writing CPU emulation
code in assembly language.

The reason for this is quite simple - one of the hardest things to do when
emulating a CPU is the processor "flags" - and since these are VERY specific
to the CPU architecture, I've not seen any high-level/portable language which
provides direct access to them - meaning that after almost every emulated
instruction you have to do a number of tests to determine what flags would
have been set/cleared and store those results accordingly ... this can get
VERY slow. Yes, there are some techniques to mitigate this, but IMHO they
can get complex, consume more code/instruction cycles and aren't 100%
effective (all the while causing the emulated CPU speed to "waggle").
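
One well-known mitigation, sketched here in C with made-up names (not claiming this is what any emulator in this thread actually does), is "lazy flags": store the operands and result of the last ALU operation and only compute a flag when the emulated code actually reads it:

#include <stdint.h>

/* Remember what the last ALU operation did instead of computing flags now. */
static struct {
    uint32_t op1, op2, result;
    enum { LF_ADD8, LF_SUB8 /* ...one per operation/size... */ } op;
} lf;

static uint8_t emu_add8(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a + b);
    lf.op1 = a; lf.op2 = b; lf.result = r; lf.op = LF_ADD8;
    return r;                       /* no flag computation here */
}

/* Only pay for a flag when a Jcc/PUSHF/etc. actually needs it. */
static int get_carry_flag(void)
{
    switch (lf.op) {
    case LF_ADD8: return (uint16_t)lf.op1 + (uint16_t)lf.op2 > 0xFF;
    case LF_SUB8: return lf.op1 < lf.op2;
    }
    return 0;
}

static int get_zero_flag(void) { return lf.result == 0; }
static int get_sign_flag(void) { return (lf.result & 0x80) != 0; }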

--- My experience(s) ---
My first system of any capability was a MITS Altair - I got it in the late 1970s
with an 8080 processor, 64K RAM (eventually), and whopping SSSD (90k) floppy
drives - it was a thing to behold back then... It ran "NorthStar DOS".

and ... I wrote a LOT of software under it... Much of what eventually became
DDS tools etc. started life on my Altair!

A few years later I moved to a homebuilt system based on the Motorola 6809
(a much more capable CPU IMHO) ... I wrote my own DOS for it: CUBIX,
and not long after, my first emulator: SIM80 ran 8080 code and translated
NorthStar DOS calls into CUBIX calls - this let me run almost all of the
code I'd written on the Altair without change... I just had to move it!
---
I eventually wrote full/accurate emulators for both my Altair and D6809
systems... Although I've passed them on... I can still boot up and run my
exact systems "from the day".
---
In doing "Daves Old Computers", I had a few more early systems that I
liked enough to write emulations for:

Mil MOD8 (8008)
Heathkit H8 (8080)
NorthStar Horizon (Z80)

These are accurate emulations of the systems I had, and can boot up and
run the various OSs they supported. (Ok, the MOD8 doesn't have disks/OS,
but it can run Scelbi BASIC)
---
These were all done under DOS - they are 16-bit 8086 executables, and run
much faster than the original systems ... I've put in "slow down" options
to reflect "the real experience".

Today, I run them under DosBox ... and:
Windows(I7)->DosBox(8086)->D6809(6809)->SIM80(8080)->AltairBasic
(3 levels of emulation) is STILL faster than the original Altair!
---
To give you an idea of how complex these CPU emulations were, here are
the #lines in the .ASM files (not terribly accurate as these include
copious comments and some debugging features):

SIM80.ASM 1406 (incl NSDOS translation)
8080.ASM 1391 & 1471
6809.ASM 2767
Z80.ASM 2193
8008.ASM 860

---
The most popular of my DDS products was Micro-C .. a C development toolset
targeting very small embedded systems. But some embedded processors are so
small/limited that they don't make good C targets ... rather than do "compiled
stack" and generate a lot of code calling compiler subroutines, I took a
different approach to accommodate these:

I "invented" my own CPU - "C-FLEA" - two main design goals:
1) be a fairly optimal target for my C compiler
2) be as simple as possible to implement on very small systems.
4 registers (Accumulator, Index, StackPtr, ProgramCounter)
No FLAGS! (ok - 1 hidden one set by CMP)
CMP sets ACC = 0(FALSE) or 1(TRUE)
LT, GT, LE, GE, ULT, UGT, ULE, UGE are instructions following CMP
to do less/greater than comparisons (via hidden flag)

CFLEA.ASM is 773 lines, and this is the one I did write a C version of (thanks to NO
FLAGS) - <200 lines, and it doesn't call any library functions.
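
To illustrate why "no flags" keeps a C emulator core so small: CMP can just squirrel away its operands (or one hidden flag), and the following LT/GT/etc. instruction turns that into 0 or 1 in the accumulator, so there is no per-instruction flag bookkeeping at all. A hypothetical sketch, not the actual C-FLEA code:

#include <stdint.h>

/* Hypothetical interpreter state for a flag-less CPU of this shape. */
typedef struct {
    uint16_t acc, ix, sp, pc;
    uint16_t cmp_a, cmp_b;       /* hidden state left behind by CMP */
} cflea_state;

static void op_cmp(cflea_state *s, uint16_t operand)
{
    s->cmp_a = s->acc;           /* CMP just records what to compare */
    s->cmp_b = operand;
}

/* The condition instructions turn the recorded compare into 0/1 in ACC. */
static void op_lt (cflea_state *s) { s->acc = (int16_t)s->cmp_a <  (int16_t)s->cmp_b; }
static void op_ge (cflea_state *s) { s->acc = (int16_t)s->cmp_a >= (int16_t)s->cmp_b; }
static void op_ult(cflea_state *s) { s->acc = s->cmp_a <  s->cmp_b; }
static void op_uge(cflea_state *s) { s->acc = s->cmp_a >= s->cmp_b; }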
---
I used CFLEA to make DVM (Daves Virtual Machine) which I've ported to Win32
and Linux - which lets me take a lot of my existing sources (I still write a
lot of stuff in Micro-C) and compile a version which will run on Windows or
Linux!
---
I also wrote Emily52, a DDS product that was an 8051/52 emulator and in-circuit
debugger - 8052.ASM is < 2000 lines.

Dave ::: https://dunfield.themindfactory.com ::: "Daves Old Computers"->Personal


Reply 7 of 50, by peterfirefly

User metadata
Rank Newbie
superfury wrote on 2024-07-16, 22:14:
Are there any free online tutorials etc. that might help with my development of my emulator in some way? […]
Show full quote


I agree completely with the most glorious of all cows (:

GloriousCow wrote on 2024-07-17, 19:19:

I think if you've written a Pentium emulator you're doing pretty good. Most programmers don't know "advanced programming stuff" either by that measurement.

You are doing much better than almost all official comp. sci. graduates. So your code is slow? And it sucks? Theirs is worse!

I have read your entire stream-of-consciousness thread "UniPCemu progress". One thing that really stands out is the number of bugs you have to fight, how hard they are to find, and how often fixing one just causes another bug somewhere else. As you yourself recently wrote:

superfury wrote on 2024-07-18, 02:31:

Too bad it's broken right now though (the CPU afaik). i4x0 platforms still POST, MS-DOS 6.22 boots, but the XT BIOS can't POST. Windows 9x also got broken (can't boot anymore). Testsuites oddly check out though (test386.asm). And I know it's in recent changes since CPU IRQ handling, SMM improvements (i4x0 only though) and related bugfixes somehow (I think Windows 9x failed to boot sometime after those changes to SMM, but not exactly sure when).

Sounds to me like what you really need is better tests and more tests + much better diagnostics + better code structure.

This is something all comp. sci. and software engineering students take classes on but usually with zero positive lasting impact. At "best", they end up cargo culting what they saw in those classes (which are often quite badly taught and often teach bad things).

And yet... it is something that really matters in practice and some people are really good at it whereas most people are really bad at it.

Nobody seems to know how to teach it well 🙁

(Just like nobody seems to know how to teach good prose writing. Journalistic writing can be taught reasonably well but literary writing can't.)

What I can tell you is what worked for me.

Glenford J. Myers, "The Art of Software Testing". The 1st edition is from 1979 and I was lucky enough to stumble upon a Danish translation at the local library as a teen. There is a new edition from 2022 that might be better/more relevant. You can also find scans of the old version on the net or track down a used copy.

Reading that book shocked me. It made it crystal clear that I had insufficient paranoia when it came to my own programming abilities. If you do nothing else, just take the quiz in the beginning of the book (without cheating).

I didn't start writing automated unit tests until some years later, in the early 90's. That taught me a few things:

  1. writing unit tests wasn't hard and it wasn't slow
  2. most of the bugs were actually found during the writing of the tests
  3. the act of writing the tests itself led to some nice cleanups of the main code
  4. most of the tests could and should be table-driven (see the sketch below)
  5. once I had unit tests for a module, that module was done. I don't think I ever had another bug in such a module.
  6. error handling code is really hard to test
  7. unit tests should be combined with coverage analysis (to see what actually gets tested)
  8. my modules were sometimes quite hard to test and I had to adjust my coding style to make my code (much!) easier to test
  9. it was super easy/fast/safe to modify my code once it had unit tests
  10. modifying the tests to match was also easy/fast/safe

I also learned a zeroth item: my code was a lot less buggy than most people's -- but it didn't matter. It was still buggy.
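
To make item 4 concrete: a table-driven test in C can be just an array of cases plus one loop. The emu_add8/get_carry_flag functions below are hypothetical stand-ins for whatever emulator routine is under test:

#include <stdio.h>
#include <stdint.h>

/* Functions under test (stand-ins for real emulator routines). */
uint8_t emu_add8(uint8_t a, uint8_t b);
int get_carry_flag(void);

static const struct {
    uint8_t a, b, expected_sum;
    int expected_carry;
} cases[] = {
    { 0x00, 0x00, 0x00, 0 },
    { 0x7F, 0x01, 0x80, 0 },
    { 0xFF, 0x01, 0x00, 1 },
    { 0xFF, 0xFF, 0xFE, 1 },
};

int main(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        uint8_t sum = emu_add8(cases[i].a, cases[i].b);
        int carry = get_carry_flag();
        if (sum != cases[i].expected_sum || carry != cases[i].expected_carry) {
            printf("case %zu: %02X+%02X -> %02X C=%d (expected %02X C=%d)\n",
                   i, cases[i].a, cases[i].b, sum, carry,
                   cases[i].expected_sum, cases[i].expected_carry);
            failures++;
        }
    }
    printf("%d failure(s)\n", failures);
    return failures != 0;
}

Adding a new case is then one line; the loop and the reporting never change.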

I learned a little from the Agile people:

  1. planning is overrated
  2. program development is iterative
  3. it's all about the control loop -- faster iterations beat slow iterations. Good measurements (feedback) beat bad measurements (feedback). Make those two good enough and the "controller" (you) doesn't have to be good.
  4. You Ain't Gonna Need It
  5. Do the simplest thing that might possibly work
  6. refactor
  7. make really, really, really sure you have a test suite
  8. make really, really, really sure you can run the most important subset of it really fast

What I did not learn from the Agile people: Test-Driven Development (or Test-First Development). I think it works fine if you are just modifying things a little bit within a pre-existing structure (but then you can probably do fine without writing the tests first). I think it fails horribly in most cases where the structure isn't there yet (and you don't even have an inkling of how to structure it).

I learned a lot from an article in the November 1992 issue of Computer Language, a long extinct American magazine:

Fisher, Jeff and Gipson, Dale, "In Search of Elegance", Computer Language, November 1992, pp 37 - 46.

I tried to track it down so I could link to it but, alas, no such luck.

And then there's this:

https://en.wikipedia.org/wiki/Code_Complete

A nice book but I wouldn't trust his ideas of how to make C/C++ code look good. The higher-level stuff is good, though.

There's the classic:

https://en.wikipedia.org/wiki/The_Mythical_Man-Month

I'll just give you one quote: "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."

On cargo culting:

https://en.wikipedia.org/wiki/Cargo_cult
https://calteches.library.caltech.edu/51/2/CargoCult.pdf
https://en.wikipedia.org/wiki/Cargo_cult_programming

Reply 8 of 50, by peterfirefly

User metadata
Rank Newbie
superfury wrote on 2024-07-16, 22:14:

Is there any free website etc. out there that teaches more advanced programming stuff I'd need to know to become a better programmer for this? Like learning more advanced algorithms etc. that you wouldn't find everywhere already (which are mostly the basics from what I can see from simple googling etc.)?

There's the book on algorithms I linked to above.

There's also the following wiki pages:

https://en.wikipedia.org/wiki/Hash_table
https://en.wikipedia.org/wiki/Bloom_filter
https://en.wikipedia.org/wiki/Sentinel_value
https://en.wikipedia.org/wiki/Sentinel_node
https://en.wikipedia.org/wiki/Radix_sort
https://en.wikipedia.org/wiki/Radix_tree

and:

https://cr.yp.to/critbit.html

Varghese and Lauck, "Hashed and hierarchical timing wheels: data structures for the efficient implementation of a timer facility", 1987
https://dl.acm.org/doi/abs/10.1145/41457.37504

Varghese and Lauck published two articles on timing wheels. The content is about the same but the text is different. I think their second paper started out as an edited version of their first.
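
Since timer queues come up constantly in emulators (PIT, FDC delays, DMA, ...), here is a minimal single-level timing wheel in the spirit of that paper. All names are hypothetical; timers further away than one wheel revolution simply stay in their slot until they are actually due:

#include <stdint.h>
#include <stddef.h>

#define WHEEL_SLOTS 256                    /* must be a power of two here */

typedef struct emu_timer {
    struct emu_timer *next;
    uint64_t expiry_tick;                  /* absolute tick at which it fires */
    void (*callback)(void *ctx);
    void *ctx;
} emu_timer;

static emu_timer *wheel[WHEEL_SLOTS];
static uint64_t current_tick;

/* O(1) insert: hash the expiry tick into a slot. */
static void timer_add(emu_timer *t, uint64_t ticks_from_now)
{
    t->expiry_tick = current_tick + ticks_from_now;
    size_t slot = t->expiry_tick & (WHEEL_SLOTS - 1);
    t->next = wheel[slot];
    wheel[slot] = t;
}

/* Advance one tick: only one slot has to be looked at. */
static void timer_tick(void)
{
    current_tick++;
    size_t slot = current_tick & (WHEEL_SLOTS - 1);
    emu_timer **pp = &wheel[slot];
    while (*pp) {
        emu_timer *t = *pp;
        if (t->expiry_tick <= current_tick) {
            *pp = t->next;                 /* unlink and fire */
            t->callback(t->ctx);
        } else {
            pp = &t->next;                 /* more than one revolution away, leave it */
        }
    }
}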

superfury wrote on 2024-07-18, 02:31:
I'm already profiling it a lot, but it effectively keeps boiling down to one set of function call trees (functions and it's call […]
Show full quote


Essentially all non-trivial emulators end up having to do some kind of memory address translation at run-time -- sometimes several times per instruction (code fetch + operands + maybe some DMA in the background). Maybe there is a virtual to physical translation (within the emulated architecture). Maybe there is a non-trivial mapping from the emulated physical addresses to the actual addresses inside the emulator itself. Maybe reads/writes involve actions and therefore involve calling various functions depending on the address. Maybe the emulator doesn't execute one instruction at a time (because it caches decoding information or JITs), in which case emulated code addresses need to be converted to addresses inside the emulator's data structures AND emulated writes (also from devices) have to be checked to see if they invalidate some of the cached information.

This can easily become a big bottleneck.

And yet, most emulated code doesn't do anything sneaky. The code fetches are normal (and inside the same page as the last code fetch), the data reads are normal (and not from code or devices), the data writes are also normal (not code or devices or non-mapped addresses etc). You should be able to emulate much faster than 700 thousand instructions per second.

Something I find useful sometimes is to single step through slow code. It hammers home how much useless work the code does in a way that a profiler doesn't.

If you want to document the internals of your emulator, then the fetch-decode-execute loop (including the address translations) is your top priority. I looked at the link you gave me but it left me none the wiser. I also looked a bit around on my own and I still don't know what's going on. There are plenty of comments but they are all at an ant's level: "I just passed a stick. There's also a stick over there, next to that stick behind the stick. No, not that stick, the other stick."

Reply 9 of 50, by danoon

User metadata
Rank Member

superfury wrote:
Video memory and other MMIO have their own hooks that are called before in a loop for every memory access and will prevent the RAM layer from being processed if their mapping is used (for example accessing the BIOS (flash) ROM or video RAM mapping or option ROMs, which basically boils down to range checks and specialized handling if it matches).

If I read that right, are you really doing a loop for every emulated memory read/write?

Assuming you are only emulating 32-bit, you could create an array of page entries, 1 million large, and put a function pointer or virtual objects in there to handle the read/write. That way for each emulated read/write you don't need to check if it requires special access. But keep in mind, indirect jumps from function pointers or virtual calls are slower. So I have another cache that holds just the raw ram pointer if it doesn't need anything special. I got this idea from Dosbox. My memory read looks like

U32 KMemory::readd(U32 address) {
    if ((address & 0xFFF) < 0xFFD) {
        int index = address >> 12;

        MMU& mmu = data->mmu[index];
        if (mmu.canReadRam)
            return *(U32*)(&(ramPageGet((RamPage)mmu.ramIndex)[address & 0xFFF]));

        return data->mmu[index].getPage()->readd(&data->mmu[index], address);
    } else {
        return readb(address) | (readb(address + 1) << 8) | (readb(address + 2) << 16) | (readb(address + 3) << 24);
    }
}

Because in most cases the first if is true, the branch prediction of the host CPU makes that check essentially free. And in a lot of cases mmu.canReadRam is also true, so that is mostly free too. Then the rest of it is just pointer math, no virtual function calls in the happy path. This ends up being very fast.

For special pages, like host mapped memory to do pass through opengl or vulkan, mmu.canReadRam will always be false and I will call the abstract Page class version of read. I do this for pages that have code too so that I can keep track of writes to invalidate the JIT.

https://github.com/danoon2/Boxedwine

Reply 10 of 50, by superfury

User metadata
Rank l33t++

I don't use any c++ though, since not all my toolchains work with it (like for the PSP homebrew devkit).

Each memory call (8/16/32-bit) is passed through 3 stages:
1. MMIO (VGA, BIOS ROM or other registered hardware, using a do-while loop). It basically asks the device if it responds; if not, it falls through to step 2. It uses function pointers to query a device though. VGA and BIOS calls are hardcoded as higher priority, bypassing the loop (just 2 consecutive function calls with a success result). BIOS ROMs are in a loop though (due to multiple ROMs being dynamically mapped).
2. Use an MMU 1-entry paging-like cache to translate the memory address into a buffer address for RAM (with flags supplying unmapping and ROM attributes, much like a 1-entry paging TLB). But it's actually 3 entries, split by access type (1 entry for data reads, 1 entry for data writes and 1 entry for code reads). This is effectively doubled due to providing DMA data read and write caching too at this level (due to DMA also having separate access rights to RAM on some chipsets (like the i4x0, where some UMA blocks are prohibited)). VRAM is also special-cased for DMA (required for some demos like EGA Fantasy Land too, which does floppy FDC->VRAM using DMA).

In all cases (ROM/RAM/MMIO) it tries to read as much as it can (up to 128 bits using 2 64-bit reads). It instructs the BIU to cache the read data. Writes check that cache and invalidate it when the range matches. Hardware can identify a read access as uncacheable (for instance any video RAM and the APIC) to prevent caching issues (since VRAM is latched and dynamic ((S)VGA memory specifics), and the same for the APIC (status registers etc.)). The last read data of one of the 2 types (data or code) is thus constantly cached into the BIU on reads. Uncacheable reads do that too, but they turn reads from the cache into a LIFO-like stream: the cache is used as long as the byte read is the next byte after the previous one, like with word/dword reads. Reading anything but the next byte (index), or reading another address, causes the cache to be re-read if it's an uncacheable-marked entry. So 32-bit reads from the APIC/VRAM (although VRAM only gives 1-byte results, thus clearing the cache for every byte) are cached only until the read completes. Although if you manage to get the BIU to read 2 words from the APIC without addressing a different address in-between, you would get both from the cache (due to [n] and [n+2] being consecutive bytes). But words [n] and [n+2] will be separate reads from real APIC RAM (caching 32 bits (at n) and 16 bits (at n+2) respectively, to be discarded afterwards, due to 32 bits aligned being cached and the 16-bit read at n+2 clearing the cache after reading its second byte).
Basically what the BIU does with 'uncached' accesses is cache them anyway. Then upon every read it moves the start 1 byte ahead and shifts the data down by one byte (thus caching the next byte onwards plus 1 junk byte (zero)). It then adjusts the end address of the map to discard that shifted-in byte.
The shifting is done using a custom 128-bit shift-right (like SHR in x86) implemented with 2 64-bit halves. Although relatively 'slow' on 32-bit or 64-bit CPUs, this can cache up to 128 bits for every CPU read, speeding up increasing memory accesses, like BIU fetches (which may be big chunks loaded in one jump: Dosbox-style IPS clocking mode reads up to max-instruction-length bytes in one go, while cycle-accurate mode reads in byte/word/dword chunks, depending on the CPU and BIU (CS segment cache +) 'EIP' alignment) and 16/32/64-bit data reads. 64-bit reads/writes are of course only used where applicable (PAE paging for example, and protected mode where 64-bit chunks are needed (segment descriptors etc.)).
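
For reference, a byte-granular right shift of a 128-bit value held in two 64-bit halves can be written in portable C along these lines (a generic sketch, not UniPCemu's exact code):

#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Shift right by whole bytes (0..16); zeros are shifted into the top of 'hi'. */
static u128 shr128_bytes(u128 v, unsigned bytes)
{
    unsigned bits = bytes * 8;
    u128 r;
    if (bits == 0) {
        r = v;
    } else if (bits < 64) {
        r.lo = (v.lo >> bits) | (v.hi << (64 - bits));
        r.hi = v.hi >> bits;
    } else if (bits < 128) {
        r.lo = v.hi >> (bits - 64);
        r.hi = 0;
    } else {
        r.lo = r.hi = 0;
    }
    return r;
}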
The CPU's paging TLB and segment descriptor caches work as is known. Segment descriptor information uses precalcs plus live checks for every access (checking the first and last byte only for 16/32/64-bit accesses). Paging checks are reduced to a single paging unit check if the access isn't on 2 pages. Paging basically checks twice for code: once in the prefetch (with early abort if not in the TLB) and once in every call of an instruction (EU) and memory translation call (BIU). Basically it's made so that the EU part loads it and performs checks (R/O etc.) and the BIU blindly uses it (except if it's not in the TLB). The BIU reaching a prefetch without a TLB entry cached makes it halt prefetching until the EU loads it into the TLB. Neither paging unit loads/stores nor segment descriptor loads use the BIU right now, although they can be interrupted by bus locking, which always happens when loading those from RAM (for the duration of the instruction, like XCHG does as well).

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 11 of 50, by danoon

User metadata
Rank Member

As for performance, I tried profilers and they rarely led me to anything useful. Sometimes I would just randomly break the program to see what it was doing. If it ended up in the same spot a lot, I would take a closer look at it. But I think you are right to focus on memory. Most of my biggest performance improvements were around memory. Of course the other big one was x86 flag calculation.

That is interesting that PSP Homebrew doesn't support C++. I started out with Java then moved to C for my project. I moved to C++ maybe 10 years ago. It was a big effort.

https://github.com/danoon2/Boxedwine

Reply 12 of 50, by superfury

User metadata
Rank l33t++
danoon wrote on 2025-02-18, 00:47:

As for performance, I tried profilers and they rarely lead me to any thing useful. Sometimes I would just randomly break the program to see what it was doing. If it ended up in the same spot a lot, I would take a closer look at it. But I think you are right to focus on memory. Most of my biggest performance improvements were around memory. Of course the other big one was x86 flag calculation.

That is interesting that PSP Homebrew doesn't support c++. I started out with Java then moved to C for my project. I moved to c++ maybe 10 years ago. It was a big effort.

It could be because I'm using the normal C compiler though (CFLAGS etc instead of CXXFLAGS).

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 13 of 50, by peterfirefly

User metadata
Rank Newbie
superfury wrote on 2025-02-17, 23:23:
I don't use any c++ though, since not all my toolchains work with it (like for the PSP homebrew devkit). […]
Show full quote


Can you link to the code that implements 1 + 2? Files + line numbers?

My exegesis of the above is this:

stage 0: accesses that originate from the CPU and use virtual addresses have their addresses translated to physical addresses. Some accesses don't go through CS/DS/ES/SS/FS/GS -- descriptor loads, table walks, funny protected mode stuff -- so they don't go through translation. All accesses that originate from the CPU have their access rights (and bases) checked.

This is done via a software TLB backed by a table walk. The TLB also implicitly adds the segment base (I think).

We are now working with physical addresses.

stage 1: we need to find out what chunk of memory we are accessing + what kind it is. Or if it even is memory at all -- there might not be anything at that address.
A specialized read/write routine is called for the particular chunk and "memory" type at that address. That allows for side effects for memory mapped I/O and for the twisted memory mapping on EGA/VGA cards.
I think page straddling is also handled here. I don't know if all unaligned accesses are handled here.

This is done by first asking the BIOS, VGA, RAM, ROM subsystems if they own the address.

BIOS calls are emulated by having a special memory type for the BIOS and having its read handler switch on the address. This allows interrupt chaining to work (a program hooks an interrupt, the interrupt handler then uses JMPF or PUSHF+CALLF to invoke the original interrupt handler).

I think maybe the same code is used for all accesses? So data reads and data writes also check whether they hit BIOS memory?

I think there are three function classes per memory type:

bool is_it_a_hit(uint32_t addr)
handle_read{8|16|32}(uint32_t addr)
handle_write{8|16|32}(uint32_t addr, uint{8|16|32}_t x)

I might have misunderstood that and there is only a single read and write function per memory type and the different access sizes are handled later by your 128-bit cache.

I think the memory subsystems are asked sequentially:

first BIOS
then VRAM (EGA/VGA/etc)
then RAM, possibly more than once because the RAM chunks aren't contiguous (because of the hole between 640KB and 1MB)
then each extra ROM (without the magic BIOS call trapping)

An alternative would be to use the physical address to find the correct subsystem and just invoke the handle_read/write functions directly.
This could be done by having a sorted array of start addresses for each chunk and then use binary search, followed by a range check once the chunk is found.

https://en.wikipedia.org/wiki/Binary_search

If you are willing to split this code into three versions (fetch, read, write) you can have a last_chunk index that you check first. Only if that fails do you do the binary search.
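
A sketch of that idea in C (all names hypothetical): a sorted table of chunk descriptors, a last-hit cache checked first, and binary search only on a miss:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t start, end;                       /* [start, end) physical range */
    uint8_t (*read8)(uint32_t addr);           /* handler for this chunk */
} mem_chunk;

/* Must be sorted by 'start' and non-overlapping. */
extern const mem_chunk chunks[];
extern const size_t num_chunks;

static size_t last_chunk;                      /* one per access type if you split fetch/read/write */

static const mem_chunk *find_chunk(uint32_t addr)
{
    const mem_chunk *c = &chunks[last_chunk];
    if (addr >= c->start && addr < c->end)
        return c;                              /* fast path: same chunk as last time */

    size_t lo = 0, hi = num_chunks;            /* binary search on the sorted start addresses */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < chunks[mid].start)
            hi = mid;
        else
            lo = mid + 1;
    }
    if (lo == 0)
        return NULL;                           /* below the first chunk */
    c = &chunks[lo - 1];
    if (addr >= c->end)
        return NULL;                           /* falls into a hole between chunks */
    last_chunk = lo - 1;
    return c;
}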

stage 2: go from physical address + subsystem/chunk identification to a memory pointer (if it's relevant).
I don't know if that happens inside the individual memory subsystems or outside of them.

There is a software TLB involved + a 128-bit cache.

---

The above uses two software TLBs + a lot of (indirect?) function calls + a loop. Why not replace that with a single software TLB?

Why not have a translation in the software TLB from virtual address to a tuple like (physaddr, memptr, flags)? And have flags like "yes, we can write directly to whatever memptr points to" and "yes, we can read directly from whatever memptr points to". Perhaps also a "this entry is valid" flag and "generate exception when executing/reading/writing" flags?

The code danoon sketches above does exactly that (I bet).

It checks if the 32-bit read is guaranteed to be fully inside a page. If not, it takes a really dumb and slow path that is certain to be correct.
If it is, it checks the software TLB to see if we can read from raw memory. If yes, we do just that. No need to even look at the physical address.
If not, we use a function pointer call (in this case a virtual method on whatever object handles this page) to do the work for us.

Easy peasy. Fast. No need for 128-bit cache.

The software TLB in this case is extremely simple: just an array of 2^20 entries, with each entry corresponding to a 4K page (2^12 bytes).
It needs to be a bit bigger if you use PAE but the principle is the same.
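
In C, the same idea danoon sketched above could look roughly like this (a sketch under the assumptions in this post; one entry per 4K page, separate tables per access type if desired):

#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  (1u << 20)                  /* 32-bit address space, 4K pages */

typedef struct {
    uint8_t *host_page;                        /* host address of this guest page's backing memory */
    uint32_t flags;                            /* bit 0: direct read OK, bit 1: direct write OK, ... */
} soft_tlb_entry;

static soft_tlb_entry tlb_read[NUM_PAGES];

/* Slow path: segmentation/paging checks, MMIO dispatch, TLB fill, exceptions. */
uint32_t slow_read32(uint32_t vaddr);

static inline uint32_t fast_read32(uint32_t vaddr)
{
    if ((vaddr & 0xFFF) <= 0xFFC) {            /* stays inside one page? */
        soft_tlb_entry *e = &tlb_read[vaddr >> PAGE_SHIFT];
        if (e->flags & 1) {                    /* plain RAM, directly readable */
            /* memcpy is the strictly portable way to do this unaligned read */
            return *(uint32_t *)(e->host_page + (vaddr & 0xFFF));
        }
    }
    return slow_read32(vaddr);                 /* MMIO, ROM traps, page faults, page straddling */
}

Note that at 2^20 entries the table itself costs on the order of 8-16 MB depending on entry layout, which matters on something like the PSP; a smaller hashed software TLB trades a little speed for memory.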

This should give you really good performance as long as there isn't a lot of TLB invalidation going on. Old games often run on DOS extenders that set up the mapping once before the game runs, so that should be fine.

But if you want to run Windows... especially a trigger happy early 32-bit Windows that does TLB invalidation all the time... It'll be slow!

You can handle that by tracking which parts of the software TLB need clearing when the entire TLB is invalidated (write to CR3 + hw task switch).
Just build a linked list of virtual pages tracked by the software TLB. The tracking doesn't have to be precise as long as it doesn't miss any pages. That means we can use clusters of, say, 2^4 pages and only track those clusters. There are 2^16 such clusters (if we support 2^20 pages) so the "pointers" fit in two bytes.

We can handle INVLPG (invalidation of a single page) by either not changing any tracking at all (= by being conservative) or by adding two modifications. One is a 16-bit bit array that indicates which pages inside the cluster are mapped. The other is a way to turn the singly linked list into a doubly linked list, which allows for unlinking a cluster if all its pages have been invalidated. The doubly linked list still only needs a 16-bit pointer if it uses the xor trick:

https://en.wikipedia.org/wiki/XOR_linked_list
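
The trick is to store prev XOR next in a single field; given the index you came from, you can recover the one you are going to. A tiny sketch with 16-bit indices, matching the cluster idea above (hypothetical names):

#include <stdint.h>

#define NIL 0xFFFF                       /* sentinel index meaning "no cluster" */

typedef struct {
    uint16_t link;                       /* prev_index XOR next_index */
    uint16_t mapped_pages;               /* 16-bit bitmap: which pages of the cluster are mapped */
} cluster;

static cluster clusters[1 << 16];

/* Walking the list: to step forward you need the index you came from. */
static uint16_t next_cluster(uint16_t prev, uint16_t cur)
{
    return clusters[cur].link ^ prev;    /* link == prev ^ next, so link ^ prev == next */
}

Unlinking a cluster then only needs the indices of its two neighbours, which you already have while walking.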

Huge pages can be implemented by lazily building up clusters.

If you want to run AMD64 code, you are going to need to track so many more pages. A simple array is completely out of the question -- but a radix tree is not. You should be able to still get the same speed as for 32-bit by tracking the current last-level node (page array) for each type of access (fetch, read, write).

You might get a bit more mileage out of not just having such a single-level cache per access type but by access x segment: fetch, read_cs, read_ds, read_es, read_ss, read_fs, read_gs, write_cs, write_ds, write_es, write_ss, write_fs, write_gs.

Accesses that don't start with a virtual address (descriptor load/protected mode stuff) can't use this software TLB -- but they should be rare enough that it doesn't really matter. You may want to have a secondary software TLB just for those but I doubt it'll be necessary.

Reply 14 of 50, by peterfirefly

User metadata
Rank Newbie
superfury wrote on 2025-02-18, 12:24:

It could be because I'm using the normal C compiler though (CFLAGS etc instead of CXXFLAGS).

GCC really doesn't like to be invoked as 'gcc' for C++ code. The reason is that it is really a wrapper that invokes various other programs and that it generates different command lines for them depending on how it is invoked. It adds extra flags + extra include/library paths for C++. It is possible to get those right manually but 1) it is not easy 2) it isn't stable.

It seems like PSP Homebrew supports Rust. It would be really weird if it doesn't support C++.

Reply 15 of 50, by superfury

User metadata
Rank l33t++
peterfirefly wrote on 2025-02-18, 12:36:
Can you link to the code that implements 1 + 2? Files + line numbers? […]
Show full quote

Look into:
https://bitbucket.org/superfury/unipcemu/src/ … mu/mmuhandler.c
function MMU_INTERNAL_directrb_realaddr handles a read from memory (the main offender due to its heavy checks and memory accesses).
MMU_IO_readhandler within it handles the MMIO space (BIOS ROM, video RAM, option ROMs). It also has extra logic for architecture-specific handling of memory ordering etc. (like some things taking priority, split memory spaces etc.).
MMU_INTERNAL_directrb handles the actual RAM accesses (which are essentially 3 blocks: one at 0KB+, one at 1MB+ and one at 16MB+). It's also a pointer to a function, to switch between slow debugger accesses (with logging) and fast cached accesses for normal use (not debugging).

The main ROM handling is found inside bios/biosrom.c (function BIOS_readhandler).

Those read functions return the data read (1 byte), with the full up-to-128-bits read being stored inside a global variable for the CPU to read and cache into itself (function BIU_directrb inside cpu/biu.c).
(S)VGA memory reads (expensive though almost uncalled) are inside hardware/vga/mmu.c (function VGAmemIO_rb).

They all return up to 128-bit values from RAM (or 1 byte for VGA and 4 bytes for APIC (both being uncacheable)).

The CPU (EU/BIU) itself has a paging unit TLB which is very fast from what I see (way faster than the direct memory read/write functions, seeing as the hot path is in the memory read/write functions in the BIU instead of those TLB lookups (which come before that)). It's basically a doubly linked list with head and tail pointing to the MRU and LRU items. Of course it's actually 8x4 (2 address bits select one set of 8 entries for every possible bit combination of the address, because it's a 4-way TLB). Those are used for fast cache clearing (head pointer) and LRU retrieval (when there are no free entries). Two sets of head/tail pointers are kept for every way: one head & tail pair points to its allocated (cached) entries and the other pair to its free entries. Allocating one moves it from the free pair to the head of the used pair (updating the heads/tails and the old head chain). Freeing moves an entry back to the free list in the same way (it's moved to the head as well, but it doesn't matter whether it lands at the head or tail end, as it's free).
Another big table (a 1MB table for 4KB entries and a 4KB table for 2MB/4MB entries) maps a 4KB or 2MB/4MB address to one of the used entries directly (by supplying a raw TLB entry index). Index 0 means unmapped, 1+ means array entry + 1 (so 1 is entry #0). The entry's pointer is then used to update the MRU status (by moving it to the head of the list).
All those lookup table entries are simple byte values that point to one of the 64 entries in the TLB directly, to get its data.
The initialization function of the CPU creates the lists at the same time as the entries are initialized, while storing the list number (the set number, where 2 sets are used for 4KB and 2/4MB respectively, so raw entry indexes 0-31 are 4KB and 32-63 are 2/4MB). Each raw entry has a pointer to the stored TLB data (its address, protection data etc.). It also has the next/previous links for the MRU/LRU handling and a byte indicating which way it's stored in (again to update the lists properly).
Special care is taken for 4KB vs 2/4MB conflicts. When mapping either in (writing a TLB entry, not when updating MRU), the range of the opposite kind (4KB for 2/4MB and 2/4MB for 4KB) is quickly scanned using the used linked lists, with a mask supplied for fast matching, returning matching entries to the free pool.
Said unmapping and freeing is simply a matter of clearing its byte in the lookup table (to unmap its address mapping lookup) and moving the entry from the used linked list to the free linked list of its set (to make it allocatable again).
Oh, and the linked list item also contains a flag indicating if it's currently on the free or used list (to move it properly during mapping, freeing and MRU movements).
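
As a generic illustration of that kind of bookkeeping (not UniPCemu's actual structures), moving an entry between a free list and the MRU end of a used list with plain doubly linked lists looks roughly like this:

#include <stddef.h>

typedef struct tlb_entry {
    struct tlb_entry *prev, *next;
    /* ...tag, physical address, protection flags... */
} tlb_entry;

typedef struct {
    tlb_entry *head, *tail;              /* head = MRU, tail = LRU */
} tlb_list;

static void list_unlink(tlb_list *l, tlb_entry *e)
{
    if (e->prev) e->prev->next = e->next; else l->head = e->next;
    if (e->next) e->next->prev = e->prev; else l->tail = e->prev;
    e->prev = e->next = NULL;
}

static void list_push_head(tlb_list *l, tlb_entry *e)
{
    e->prev = NULL;
    e->next = l->head;
    if (l->head) l->head->prev = e; else l->tail = e;
    l->head = e;
}

/* On a TLB hit: move the entry to the MRU end of the used list. */
static void touch_entry(tlb_list *used, tlb_entry *e)
{
    list_unlink(used, e);
    list_push_head(used, e);
}

/* On allocation: take a free entry if there is one, else evict the LRU used one. */
static tlb_entry *alloc_entry(tlb_list *used, tlb_list *free_list)
{
    tlb_entry *e;
    if (free_list->head) {
        e = free_list->head;
        list_unlink(free_list, e);
    } else {
        e = used->tail;                  /* evict the least recently used entry */
        list_unlink(used, e);
    }
    list_push_head(used, e);
    return e;
}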

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 16 of 50, by superfury

User metadata
Rank l33t++

Also, slightly related: replacing much of those caches with added function pointers and flags costs too much memory. The PSP only has some 16 or so MB of memory; it barely has any memory left with all hardware installed (and a small 1MB soundfont too), so I'm already trying to conserve memory as much as possible while trying to make it faster.

Also, the emulated CPU's own paging unit might be fast, but it's separated from the hardware subsystems like RAM, ROM, MMIO etc. Also there's the special case of hardware being mapped on top of other hardware (effectively: APIC > MMIO (ROMs & VGA) > RAM), so if any of those doesn't prevent the lower tiers when it responds, issues happen. Not even mentioning the i4x0 chipset's split of RAM space and MMIO (BIOS, VGA and plugin cards) and special memory handling (for areas A0000-FFFFF) with write protection, all being controlled by various hardware at a very small paging level (think 32KB from what I remember). And BIOS and hardware accesses can mingle with those addresses too, mapping over it and responding to reads but not writes for example. That might be used for ROM shadowing (although the i4x0 just moves the ROM to low RAM and extracts it, as it's actually compressed I think). The Compaq Deskpro 386 just blatantly copies it from high addresses (the rev J.6 ROM (from PCJS, which really helped with my first 386 implementation and first 32-bit apps) at FFFE0000) to low ones (F0000+) I think.

Also, so far I have managed to get Windows 9x and 3.x running again without (visible) errors.
I didn't check if NT 4 runs again now.
MS-DOS still runs. As does the XT 8088, I think, from what I remember. The cause for that one was somewhere inside the fetching I think (when adjusting the EU/BIU algorithms to become more accurate).
OS/2 Warp 4 seems to run, but when it starts to copy files over during the GUI part of setup (directly after configuring networking) it gives me a cmd.exe crash in some DLL. After terminating it, the setup basically becomes a screen with a mouse cursor, where I can only use the mouse to move and keyboard ctrl-alt-del to reboot the machine.

Currently I'm running Sandsifter on it to verify that instruction fetching behaves properly for all instructions, just in case. That'll take quite a bit of time (probably about 2 days), as it's running the 3 MIPS CPU at 10% realtime speed (the emulation parses 1/10 of an emulated second during each real second), so about 300 KIPS of actual emulated speed each real second, including other hardware (mostly the VGA drawing a static screen (some 10% CPU usage) and a blinking cursor xd). The CPU emulation is using about 30% CPU time I think, with all other hardware and the main execution loop taking the remainder, according to the last time I profiled it at least.

Last edited by superfury on 2025-02-19, 01:47. Edited 1 time in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 17 of 50, by danoon

User metadata
Rank Member

That is pretty impressive to get anything to work with 16MB of RAM. I'm happy to get notepad running with Wine only using 300MB 😀

https://github.com/danoon2/Boxedwine

Reply 18 of 50, by superfury

User metadata
Rank l33t++
danoon wrote on 2025-02-19, 01:31:

That is pretty impressive to get anything to work with 16MB of RAM. I'm happy to get notepad running with Wine only using 300MB 😀

Though 16MB is an estimate. It depends on the executable size and PSP model. Newer models have an extra 32MB afaik. On the original model the executable takes about 4MB I think, with about 20MB used for all memory allocations (about 1MB RAM left for the MMU I think, I'd need to verify to make sure how much exactly). UniPCemu keeps about 1MB free for dynamic stuff required during runtime (like file allocations and such for file opening, multithreading and its stacks etc.). So eventually there are only a few MB, if any, left for the mapped RAM (which is the lowest-priority allocation, malloc-ed last, after all other hardware).
Soundfonts also subtract from the RAM though: basically 1MB with the AWE GM one, which could be saved if disabled.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 19 of 50, by peterfirefly

User metadata
Rank Newbie
superfury wrote on 2025-02-18, 21:34:
Look into: https://bitbucket.org/superfury/unipcemu/src/ … mu/mmuhandler.c function MMU_INTERNAL_directrb_realaddr handles a rea […]
Show full quote

Look into:
https://bitbucket.org/superfury/unipcemu/src/ … mu/mmuhandler.c
function MMU_INTERNAL_directrb_realaddr handles a read from memory (the main offender due to it's heavy checks and memory accesses).
MMU_IO_readhandler within it handles the MMIO space (BIOS ROM, Video RAM, option ROMs). It also has extra logic for architecuture-specific handlings of memory order etc. (like some things taking priority, split memory spaces etc.).
MMU_INTERNAL_directrb handles the actual RAM accesses (which are 3 blocks essentially: one at 0KB+, one at 1MB+ and one at 16MB+). It's also a pointer to a function to handle slow debugger accesses (with logging) vs fast cached accesses for normal use (not debugging).

The main ROM handling is found inside bios/biosrom.c (function BIOS_readhandler).

Those read functions return the data read (1 byte), with the full up-to-128-bits read being stored inside a global variable for the CPU to read and cache into itself (function BIU_directrb inside cpu/biu.c).
(S)VGA memory reads (expensive though almost uncalled) are inside hardware/vga/mmu.c (function VGAmemIO_rb).

Thank you! I'll take a look. I just made a local checkout of the repo and I'll try to see if I can understand the code better.