superfury wrote on 2025-02-17, 23:23:
I don't use any C++ though, since not all my toolchains work with it (like for the PSP homebrew devkit).
Each memory call (8/16/32-bit) is passed through 3 stages:
1. MMIO (VGA, BIOS ROM or other registered hardware, using a do while loop). It basically asks the device if it responds; if not, it falls through to step 2. Uses function pointers to query a device though. VGA and BIOS calls are hardcoded as higher priority, bypassing the loop (just 2 consecutive function calls with success result). BIOS ROMs are in a loop though (due to multiple ROMs being dynamically mapped).
2. Use a MMU 1-entry paging-like cache to translate the memory address into a buffer address for RAM (with flags supplying unmapping, ROM attributes, much like a 1-entry Paging TLB). But it's actually 3 entries, split by access type (1 entry for data read, 1 entry for data write and 1 entry for code read). This is effectively doubled due to providing DMA data reads and write caching too at this level (due to DMA also having separate access rights to RAM on some chipsets, like i4x0 for some UMA blocks being prohibited). VRAM is also special cased for DMA (required for some demos like EGA Fantasy Land too, which does floppy FDC->VRAM using DMA).
Can you link to the code that implements 1 + 2? Files + line numbers?
My exegesis of the above is this:
stage 0: accesses that originate from the CPU and use virtual addresses have their addresses translated to physical addresses. Some accesses don't go through CS/DS/ES/SS/FS/GS -- descriptor loads, table walks, funny protected mode stuff -- so they don't go through translation. All accesses that originate from the CPU have their access rights (and bases) checked.
This is done via a software TLB backed by a table walk. The TLB also implicitly adds the segment base (I think).
We are now working with physical addresses.
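Something like this is what I picture for stage 0 (a minimal C sketch with made-up names, not your actual code; page_walk is assumed to exist elsewhere and do the real page-table walk):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stage-0 translation: linear -> physical via a small software TLB. */
typedef struct {
    uint32_t linear_tag;   /* linear address of the page this entry maps */
    uint32_t phys_page;    /* physical address of the page frame (4K-aligned) */
    bool     valid;
    bool     writable;
    bool     user;
} soft_tlb_entry;

#define SOFT_TLB_ENTRIES 256                 /* direct-mapped, size picked arbitrarily */
static soft_tlb_entry stage0_tlb[SOFT_TLB_ENTRIES];

/* Assumed to exist elsewhere: the real page-table walk.  On success it fills
   in the entry's phys_page and permission bits; on failure the caller raises #PF. */
extern bool page_walk(uint32_t linear, bool write, bool user, soft_tlb_entry *e);

static bool translate(uint32_t segbase, uint32_t offset, bool write, bool user,
                      uint32_t *phys)
{
    uint32_t linear = segbase + offset;                     /* segmentation first */
    uint32_t tag    = linear & ~0xFFFu;                     /* 4K page */
    soft_tlb_entry *e = &stage0_tlb[(linear >> 12) % SOFT_TLB_ENTRIES];

    if (!e->valid || e->linear_tag != tag) {                /* miss: walk the tables */
        if (!page_walk(linear, write, user, e))
            return false;                                   /* page fault */
        e->linear_tag = tag;
        e->valid = true;
    }
    if ((write && !e->writable) || (user && !e->user))
        return false;                                       /* protection fault */
    *phys = e->phys_page | (linear & 0xFFFu);
    return true;
}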
stage 1: we need to find out what chunk of memory we are accessing + what kind it is. Or if it even is memory at all -- there might not be anything at that address.
A specialized read/write routine is called for the particular chunk and "memory" type at that address. That allows for side effects for memory mapped I/O and for the twisted memory mapping on EGA/VGA cards.
I think page straddling is also handled here. I don't know if all unaligned accesses are handled here.
This is done by first asking the BIOS, VGA, RAM, ROM subsystems if they own the address.
BIOS calls are emulated by having a special memory type for the BIOS and having its read handler switch on the address. This allows interrupt chaining to work (a program hooks an interrupt, the interrupt handler then uses JMPF or PUSHF+CALLF to invoke the original interrupt handler).
I think maybe the same code is used for all accesses? So data reads and data writes also check whether they hit BIOS memory?
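Purely my guess at the shape that trapping could take (hypothetical names, not the real UniPCemu code): the BIOS region's read handler recognizes a few magic addresses, runs the native service, and hands back an IRET opcode so the guest returns normally.

#include <stdint.h>

/* Hypothetical native implementations of a couple of BIOS services. */
extern void bios_int10_video(void);
extern void bios_int13_disk(void);

#define BIOS_TRAP_BASE 0xF0000u             /* made-up trap area in the BIOS segment */
static uint8_t bios_rom[0x10000];           /* the ROM image itself */

/* Read handler for the BIOS memory type.  Normal addresses return plain ROM
   bytes; the trap addresses run a native handler and return IRET (0xCF).
   Because the interrupt vectors point at these trap addresses, a program that
   hooks an interrupt and later chains to the saved vector (JMPF or PUSHF+CALLF)
   ends up fetching the trap byte again, so chaining just works.  A real
   implementation would probably only trigger on instruction fetches, not data reads. */
static uint8_t bios_read8(uint32_t addr)
{
    switch (addr) {
    case BIOS_TRAP_BASE + 0x10: bios_int10_video(); return 0xCF;
    case BIOS_TRAP_BASE + 0x13: bios_int13_disk();  return 0xCF;
    default:                    return bios_rom[addr & 0xFFFFu];
    }
}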
I think there are three function classes per memory type:
bool is_it_a_hit(uint32_t addr)
handle_read{8|16|32}(uint32_t addr)
handle_write{8|16|32}(uint32_t addr, uint{8|16|32}_t x)
I might have misunderstood that, and there is only a single read and write function per memory type, with the different access sizes handled later by your 128-bit cache.
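In C that would amount to something like this per memory type (hypothetical, just to make the shape concrete):

#include <stdint.h>
#include <stdbool.h>

/* One table of handlers per memory type (BIOS, VGA, RAM, extra ROMs).
   Whether the three access sizes really exist as separate functions or are
   synthesized from a single byte handler is exactly what I'm unsure about. */
typedef struct {
    bool     (*is_it_a_hit)(uint32_t addr);
    uint8_t  (*read8)(uint32_t addr);
    uint16_t (*read16)(uint32_t addr);
    uint32_t (*read32)(uint32_t addr);
    void     (*write8)(uint32_t addr, uint8_t x);
    void     (*write16)(uint32_t addr, uint16_t x);
    void     (*write32)(uint32_t addr, uint32_t x);
} mem_handlers;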
I think the memory subsystems are asked sequentially (roughly as in the loop sketched after this list):
first BIOS
then VRAM (EGA/VGA/etc)
then RAM, possibly more than once since the RAM chunks aren't contiguous (there's the hole between 640KB and 1MB)
then each extra ROM (without the magic BIOS call trapping)
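Spelled out as a loop (still hypothetical, and reusing the mem_handlers struct from the sketch above):

#include <stdint.h>
#include <stddef.h>

/* Assumes the mem_handlers struct from the previous sketch. */
extern mem_handlers bios_mem, vram_mem, ram_mem;
extern mem_handlers extra_roms[];
extern size_t       extra_rom_count;

/* Probe the subsystems in fixed priority order until one claims the address. */
static uint8_t phys_read8(uint32_t addr)
{
    if (bios_mem.is_it_a_hit(addr)) return bios_mem.read8(addr);
    if (vram_mem.is_it_a_hit(addr)) return vram_mem.read8(addr);
    if (ram_mem.is_it_a_hit(addr))  return ram_mem.read8(addr);
    for (size_t i = 0; i < extra_rom_count; i++)
        if (extra_roms[i].is_it_a_hit(addr))
            return extra_roms[i].read8(addr);
    return 0xFF;                              /* open bus: nobody responded */
}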
An alternative would be to use the physical address to find the correct subsystem and just invoke the handle_read/write functions directly.
This could be done by having a sorted array of start addresses for each chunk and then use binary search, followed by a range check once the chunk is found.
https://en.wikipedia.org/wiki/Binary_search
If you are willing to split this code into three versions (fetch, read, write) you can have a last_chunk index that you check first. Only if that fails do you do the binary search.
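Roughly like this (sketch, all names invented; fetch and write would keep their own cached index):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical chunk table, sorted by start address, no overlaps. */
typedef struct {
    uint32_t start;
    uint32_t length;
    uint8_t (*read8)(uint32_t addr);
    void    (*write8)(uint32_t addr, uint8_t x);
} mem_chunk;

extern mem_chunk chunks[];
extern size_t    chunk_count;

static size_t last_read_chunk;      /* checked first, before the binary search */

/* Find the chunk containing addr: check the cached index, then binary search
   for the last chunk whose start <= addr, then range check it. */
static mem_chunk *find_chunk(uint32_t addr)
{
    mem_chunk *c = &chunks[last_read_chunk];
    if (addr - c->start < c->length)
        return c;                               /* cache hit */

    size_t lo = 0, hi = chunk_count;            /* binary search */
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (chunks[mid].start <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;                            /* below the first chunk */
    c = &chunks[lo - 1];
    if (addr - c->start >= c->length)
        return NULL;                            /* in a hole between chunks */
    last_read_chunk = (size_t)(c - chunks);
    return c;
}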
stage 2: go from physical address + subsystem/chunk identification to a memory pointer (if it's relevant).
I don't know if that happens inside the individual memory subsystems or outside of them.
There is a software TLB involved + a 128-bit cache.
---
The above uses two software TLBs + a lot of (indirect?) function calls + a loop. Why not replace that with a single software TLB?
Why not have a translation in the software TLB from virtual address to a tuple like (physaddr, memptr, flags)? And have flags like "yes, we can write directly to whatever memptr points to" and "yes, we can read directly from whatever memptr points to". Perhaps also a "this entry is valid" flag and "generate exception when executing/reading/writing" flags?
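Concretely, an entry could look like this (my sketch of the idea, field names made up):

#include <stdint.h>

/* One entry per 4K virtual page.  memptr points straight into the host buffer
   backing that page, if there is one. */
typedef struct {
    uint32_t physaddr;      /* physical address of the page */
    uint8_t *memptr;        /* host pointer to the page, or NULL */
    uint32_t flags;         /* see below */
} tlb_entry;

#define TLB_VALID        (1u << 0)  /* entry has been filled in */
#define TLB_DIRECT_READ  (1u << 1)  /* reads may go straight through memptr */
#define TLB_DIRECT_WRITE (1u << 2)  /* writes may go straight through memptr */
#define TLB_FAULT_EXEC   (1u << 3)  /* fetching from this page raises an exception */
#define TLB_FAULT_READ   (1u << 4)
#define TLB_FAULT_WRITE  (1u << 5)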
The code danoon sketches above does exactly that (I bet).
It checks if the 32-bit read is guaranteed to be fully inside a page. If not, it takes a really dumb and slow path that is certain to be correct.
If it is, it checks the software TLB to see if we can read from raw memory. If yes, we do just that. No need to even look at the physical address.
If not, we use a function pointer call (in this case a virtual method on whatever object handles this page) to do the work for us.
Easy peasy. Fast. No need for 128-bit cache.
The software TLB in this case is extremely simple: just an array of 2^20 entries, with each entry corresponding to a 4K page (2^12 bytes).
It needs to be a bit bigger if you use PAE but the principle is the same.
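I'd expect the read path to come out roughly like this (hedged sketch, assuming the tlb_entry above and the flat 2^20-entry table; a miss simply lands in the handler, which can refill the entry):

#include <stdint.h>
#include <string.h>

/* Assumes tlb_entry and its flags from the earlier sketch. */
extern tlb_entry soft_tlb[1 << 20];                /* one entry per 4K virtual page */
extern uint32_t  slow_read32(uint32_t vaddr);      /* page-crossing, dumb-but-correct path */
extern uint32_t  handler_read32(tlb_entry *e, uint32_t vaddr);  /* MMIO / miss path */

static uint32_t read32(uint32_t vaddr)
{
    if ((vaddr & 0xFFFu) > 0xFFCu)                 /* straddles a page boundary? */
        return slow_read32(vaddr);

    tlb_entry *e = &soft_tlb[vaddr >> 12];
    if (e->flags & TLB_DIRECT_READ) {              /* raw host memory */
        uint32_t v;
        memcpy(&v, e->memptr + (vaddr & 0xFFFu), sizeof v);
        return v;                                  /* never even computed the physical address */
    }
    return handler_read32(e, vaddr);               /* indirect call for everything else */
}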
This should give you really good performance as long as there isn't a lot of TLB invalidation going on. Old games often run on DOS extenders that set up the mapping once before the game runs, so that should be fine.
But if you want to run Windows... especially a trigger-happy early 32-bit Windows that does TLB invalidation all the time... It'll be slow!
You can handle that by tracking which parts of the software TLB need clearing when the entire TLB is invalidated (write to CR3 + hardware task switch).
Just build a linked list of virtual pages tracked by the software TLB. The tracking doesn't have to be precise as long as it doesn't miss any pages. That means we can use clusters of, say, 2^4 pages and only track those clusters. There are 2^16 such clusters (if we support 2^20 pages) so the "pointers" fit in two bytes.
We can handle INVLPG (invalidation of a single page) by either not changing any tracking at all (= by being conservative) or by adding two modifications. One is a 16-bit bit array that indicates which pages inside the cluster are mapped. The other is a way to turn the singly linked list into a doubly linked list, which allows for unlinking a cluster if all its pages have been invalidated. The doubly linked list still only needs a 16-bit pointer if it uses the xor trick:
https://en.wikipedia.org/wiki/XOR_linked_list
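A sketch of the cluster tracking (my reading of it; the 16-bit per-cluster bitmap and the XOR-linked refinement for INVLPG are left out, and since there are exactly 2^16 clusters there is no spare 16-bit value for a list terminator, so I keep a count instead):

#include <stdint.h>
#include <string.h>

#define PAGES          (1u << 20)
#define CLUSTER_SHIFT  4                     /* 2^4 pages per cluster */
#define CLUSTERS       (PAGES >> CLUSTER_SHIFT)

extern tlb_entry soft_tlb[PAGES];            /* as sketched earlier */

static uint16_t cluster_next[CLUSTERS];      /* singly linked list, 16-bit links */
static uint8_t  cluster_linked[CLUSTERS];
static uint16_t cluster_head;
static uint32_t cluster_count;               /* number of linked clusters */

/* Call whenever a TLB entry is filled in: remember its cluster. */
static void tlb_track(uint32_t page)
{
    uint32_t c = page >> CLUSTER_SHIFT;
    if (!cluster_linked[c]) {
        cluster_linked[c] = 1;
        cluster_next[c]   = cluster_head;
        cluster_head      = (uint16_t)c;
        cluster_count++;
    }
}

/* Full flush (MOV CR3, task switch): only touch clusters that were actually used. */
static void tlb_flush_all(void)
{
    uint16_t c = cluster_head;
    for (uint32_t i = 0; i < cluster_count; i++) {
        uint16_t next = cluster_next[c];
        cluster_linked[c] = 0;
        memset(&soft_tlb[(uint32_t)c << CLUSTER_SHIFT], 0,
               (1u << CLUSTER_SHIFT) * sizeof(tlb_entry));
        c = next;
    }
    cluster_count = 0;
}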
Huge pages can be implemented by lazily building up clusters.
If you want to run AMD64 code, you are going to need to track so many more pages. A simple array is completely out of the question -- but a radix tree is not. You should be able to still get the same speed as for 32-bit by tracking the current last-level node (page array) for each type of access (fetch, read, write).
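For 64-bit that could be a two-level radix tree with the last leaf cached per access type, something like this (sizes picked arbitrarily; assumes the tlb_entry struct from earlier, error handling omitted):

#include <stdint.h>
#include <stdlib.h>

#define LEAF_BITS 18                          /* low 18 bits of the page number */
#define LEAF_SIZE (1u << LEAF_BITS)

static tlb_entry *tlb_top[1u << 18];          /* top level: 2^18 leaf pointers */

/* Last leaf used, cached per access type (0 = fetch, 1 = read, 2 = write). */
static struct {
    uint64_t   tag;
    tlb_entry *leaf;
} last_leaf[3];

static tlb_entry *tlb_lookup64(uint64_t vaddr, int access)
{
    /* Canonical addresses are unique in their low 48 bits, so masking is enough. */
    uint64_t page = (vaddr & ((1ull << 48) - 1)) >> 12;
    uint64_t top  = page >> LEAF_BITS;
    uint64_t low  = page & (LEAF_SIZE - 1);

    if (last_leaf[access].leaf && last_leaf[access].tag == top)
        return &last_leaf[access].leaf[low];  /* as fast as the flat 32-bit array */

    tlb_entry *leaf = tlb_top[top];
    if (!leaf)                                /* allocate leaves lazily */
        leaf = tlb_top[top] = calloc(LEAF_SIZE, sizeof(tlb_entry));

    last_leaf[access].tag  = top;
    last_leaf[access].leaf = leaf;
    return &leaf[low];
}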
You might get a bit more mileage out of having such a single-level cache not just per access type but per access x segment: fetch, read_cs, read_ds, read_es, read_ss, read_fs, read_gs, write_cs, write_ds, write_es, write_ss, write_fs, write_gs.
Accesses that don't start with a virtual address (descriptor load/protected mode stuff) can't use this software TLB -- but they should be rare enough that it doesn't really matter. You may want to have a secondary software TLB just for those but I doubt it'll be necessary.