VOGONS


First post, by superfury

User metadata
Rank l33t

I notice in my profiler that most execution time actually seems to be spent purely on handling memory accesses in UniPCemu. That's mostly prefetching and the memory checking itself(protection checks), according to the profiler. Both are already heavily optimized to use as much precalculated data as possible, but the profiler still says they're the hot paths. And that's the emulator running at 3 MIPS on an i7-4790K@4GHz CPU, which can run Dosbox at the same speed(3 MIPS) with only a few percent CPU usage, while UniPCemu takes a whopping 15-20% CPU time to do the same work!

Does anyone have any tips on how to optimize it further? I've basically already precalculated all values used wherever possible, afaik. One thing that does make the hot path a bit heavy is that IPS clocking mode(vs cycle-accurate mode) keeps the prefetch buffer fully filled for each instruction, causing relatively many RAM accesses to keep it filled up, but that's required for software that uses self-modifying code.

Last edited by superfury on 2020-03-19, 22:26. Edited 1 time in total.

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 1 of 13, by Stenzek

User metadata
Rank Newbie

Cache page table entries in a TLB-like data structure. This made a significant difference in my x86 emulator.

However, if you're going for "true" emulation then you're limited to what the CPU you're emulating has in terms of a TLB.
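A minimal sketch of the idea, assuming 4KB pages and a direct-mapped cache(all names here are hypothetical, not from any particular emulator): cache the translated frame per linear page, so repeated accesses to the same page skip the page-table walk entirely.

```c
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 64 /* must be a power of two */

typedef struct {
    uint32_t tag;   /* linear page number, or all-ones when invalid */
    uint32_t frame; /* cached physical page base */
} TinyTLBEntry;

static TinyTLBEntry tiny_tlb[TLB_ENTRIES];

static void tiny_tlb_flush(void)
{
    memset(tiny_tlb, 0xFF, sizeof(tiny_tlb)); /* all tags become invalid */
}

static void tiny_tlb_fill(uint32_t linear, uint32_t physframe)
{
    uint32_t vpn = linear >> 12; /* 4KB pages */
    TinyTLBEntry *e = &tiny_tlb[vpn & (TLB_ENTRIES - 1)];
    e->tag = vpn;
    e->frame = physframe;
}

/* Returns 1 and stores the physical address on a hit, 0 on a miss
   (the caller would then walk the page tables and call tiny_tlb_fill). */
static int tiny_tlb_lookup(uint32_t linear, uint32_t *physical)
{
    uint32_t vpn = linear >> 12;
    TinyTLBEntry *e = &tiny_tlb[vpn & (TLB_ENTRIES - 1)];
    if (e->tag != vpn) return 0;
    *physical = e->frame | (linear & 0xFFF);
    return 1;
}
```

The key property is that the hit path is a mask, an index and one compare, with no list traversal at all.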

Reply 2 of 13, by danoon

User metadata
Rank Member

I can speak for Dosbox a little bit: one thing it does to speed up memory access is check whether a 2- or 4-byte read/write crosses a page boundary. If it does, it is split up into separate 1-byte reads/writes. If it stays on the same page, all the permission checks, etc., are only performed once.
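A toy sketch of that technique(the backing store and helper names are made up; real Dosbox code is more involved): a 16-bit read only falls back to two individually-checked byte accesses when it actually crosses a 4KB boundary.

```c
#include <stdint.h>

static uint8_t ram[8192]; /* toy backing store: two 4KB pages */

/* Toy "slow path": in a real emulator this would run the full protection
   check before every single byte access. */
static uint8_t mem_readb_checked(uint32_t addr)
{
    /* ...per-byte permission checks would go here... */
    return ram[addr & 0x1FFF];
}

/* Toy "map": in a real emulator this validates the page once and returns
   a host pointer to it, so further bytes on that page need no checks. */
static uint8_t *validate_and_map_page(uint32_t addr)
{
    /* ...one permission check for the whole page would go here... */
    return &ram[addr & 0x1000];
}

static uint16_t mem_readw(uint32_t addr)
{
    if (((addr ^ (addr + 1)) & ~0xFFFu) == 0) {
        /* Same 4KB page: one check covers both bytes. */
        uint8_t *page = validate_and_map_page(addr);
        uint32_t off = addr & 0xFFFu;
        return (uint16_t)(page[off] | ((uint16_t)page[off + 1] << 8));
    }
    /* Page-crossing: split into two independently-checked byte accesses. */
    return (uint16_t)(mem_readb_checked(addr) |
                      ((uint16_t)mem_readb_checked(addr + 1) << 8));
}
```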

http://www.boxedwine.org/

Reply 3 of 13, by superfury

User metadata
Rank l33t

I'm already using a doubly-linked list for the TLB's emulation, which sped it up quite a bit. That's just reading a pointer for most accesses, only moving other items to the head of the queue when it's not the most recent(or not allocated at all) entry. It will still slow down a bit in protected mode, but that's mainly due to the memory access overhead for its structures(descriptor fetching(8 bytes)&writeback(1 byte) and 4-byte paging entries).

The main issue is the way it tries to keep the prefetch queue fully filled for each instruction in IPS clocking mode, mainly to allow for SMC(Self-Modifying Code). That mode uses Dosbox-style clocking and effectively disables the cycle-accurate BIU, performing the physical memory/hardware access all in one go, in up to 4(dword) separate byte accesses(to allow for hardware mapped at any aligned or unaligned address).

Also, the prefetching process during IPS clocking mode tries to reduce its memory checks as much as possible: it only checks the first byte(at CS:EIP or the next instruction) and the last byte(if different) against the segment limits, and perhaps the page before it if that fails(not mapped in the TLB).

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 4 of 13, by superfury

User metadata
Rank l33t

I know that for each instruction executed, it will prefetch the instruction length's worth of bytes from memory when starting the next instruction, to keep the prefetch buffer filled.

So at worst, a 12-byte instruction will make it fetch 12 bytes from memory, filling the PIQ.
The worst case scenario happens with jumps, calls and interrupts, which flush the entire queue, making it fetch the entire queue's worth of data(16 bytes on the 80386, 32 on the 80486, a whopping 64 on the Pentium!).

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 5 of 13, by superfury

User metadata
Rank l33t

Just optimized the prefetching process to not fetch more than the maximum instruction length(as supported by the protection mechanism) when handling IPS clocking mode on a 80286+.

So a 80286 will only fetch until it reaches 10 bytes filled in the PIQ, and a 80386 and up until 15 bytes in the PIQ(instead of 16(80386), 32(80486) or 64(Pentium!)).

That should improve speed by a little bit(quite a bit on 80486 and Pentium!).
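The clamp described above could be sketched roughly like this(hypothetical names; the real UniPCemu code differs):

```c
#include <stdint.h>

/* Max instruction length the protection mechanism allows: 10 bytes on a
   80286, 15 bytes on a 80386 and up. */
static uint8_t max_instruction_length(int cpu_type)
{
    return (cpu_type >= 386) ? 15 : 10;
}

/* How many bytes to prefetch in IPS clocking mode: only top the queue up
   to the maximum instruction length instead of the full PIQ size
   (16/32/64 bytes on the 386/486/Pentium). */
static uint8_t piq_fill_target(int cpu_type, uint8_t piq_size, uint8_t piq_filled)
{
    uint8_t target = max_instruction_length(cpu_type);
    if (target > piq_size) target = piq_size; /* can't exceed the real queue */
    return (piq_filled >= target) ? 0 : (uint8_t)(target - piq_filled);
}
```

On a Pentium-sized 64-byte queue this cuts the per-instruction refill from up to 64 fetched bytes down to at most 15.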

Edit: Only sped it up a little bit. Now at 3 MIPS, it runs Windows 95 at around 19-20% speed(Intel i7-4790K@4.0GHz). Sometimes it will dip a bit, though going no lower than 15%(depending on hard disk accesses and possibly other factors). It still tops out at 36%.

Tried playing some videos from the Windows 95 CD-ROM(at 32K colors ET4000). That ran at about the same speed.

POST at 30%, dipping to 25% when doing quick display updates until reaching 9.9MB checked, then 34-36% until 16MB RAM checked. Then 29-32% during FDC checks. It jumps to ~48% when counting down the Plop BIOS ROM timer(for pressing D to disable it, it's probably using HLT or something like that to wait for interrupts?). Then at 27-29% waiting for the F1 key to resume booting(1782-Disk Controller Failure due to the XT-IDE 80386 ROM(Which is a known issue of that ROM)). Then 27% until booting Windows 95.

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 6 of 13, by Stenzek

User metadata
Rank Newbie
superfury wrote on 2020-03-18, 10:01:

I'm already using a doubly-linked list for the TLB's emulation, which sped it up quite a bit. That's just reading a pointer for most accesses, only moving other items to the head of the queue when it's not the most recent(or not allocated at all) entry. It will still slow down a bit in protected mode, but that's mainly due to the memory access overhead for its structures(descriptor fetching(8 bytes)&writeback(1 byte) and 4-byte paging entries).

Could you use an array/lookup table here instead? While yeah, you get the temporal locality when you hit the same page a bunch of times, as soon as it's anything else that's a lot of work to scan the list when compared to a table lookup.
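For comparison, scanning a small fixed array of ways per set instead of chasing list pointers could look like this(a hypothetical layout, not UniPCemu's actual structures):

```c
#include <stdint.h>
#include <string.h>

#define TLB_SETS 8
#define TLB_WAYS 8

typedef struct {
    uint32_t tag[TLB_WAYS];  /* page tags, 0 = invalid (toy convention) */
    uint32_t data[TLB_WAYS]; /* cached translation/attributes */
} TLBSet;

static TLBSet tlb_sets[TLB_SETS];

/* Linear scan over a small contiguous array: the whole set usually fits in
   one or two cache lines, so even a full miss touches little memory,
   unlike walking a linked list of separately-stored nodes. */
static int tlb_set_lookup(uint32_t set, uint32_t tag, uint32_t *data)
{
    TLBSet *s = &tlb_sets[set & (TLB_SETS - 1)];
    for (int way = 0; way < TLB_WAYS; ++way) {
        if (s->tag[way] == tag) {
            *data = s->data[way];
            return 1;
        }
    }
    return 0;
}
```

The LRU bookkeeping can then live in a separate tiny age array(or be dropped for random/round-robin replacement), keeping the lookup itself branch-light.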

Reply 7 of 13, by superfury

User metadata
Rank l33t
Stenzek wrote on 2020-03-20, 11:18:
superfury wrote on 2020-03-18, 10:01:

I'm already using a doubly-linked list for the TLB's emulation, which sped it up quite a bit. That's just reading a pointer for most accesses, only moving other items to the head of the queue when it's not the most recent(or not allocated at all) entry. It will still slow down a bit in protected mode, but that's mainly due to the memory access overhead for its structures(descriptor fetching(8 bytes)&writeback(1 byte) and 4-byte paging entries).

Could you use an array/lookup table here instead? While yeah, you get the temporal locality when you hit the same page a bunch of times, as soon as it's anything else that's a lot of work to scan the list when compared to a table lookup.

Won't a full array of memory pages be inaccurate compared to a real CPU? UniPCemu currently has about 8 pages for each memory area(depending on the high bits of the linear memory address).

Most of the page allocation/deallocation/LRU handling is just a call to this function:

//Move a TLB entry index from an old list to a new list!
OPTINLINE void Paging_moveListItem(TLB_ptr *listitem, TLB_ptr **newlist_head, TLB_ptr **newlist_tail, TLB_ptr **oldlist_head, TLB_ptr **oldlist_tail)
{
	if (likely(*newlist_head == listitem)) return; //Don't do anything when it's already at the correct spot!

	//First, remove us from the old head list!
	if (listitem->prev) //Do we have anything before us?
	{
		((TLB_ptr *)listitem->prev)->next = listitem->next; //Remove us from the previous item of the list!
	}
	else //We're the head, so remove us from the list!
	{
		*oldlist_head = listitem->next; //Remove us from the head of the list and assign the new head!
	}

	if (listitem->next) //Did we have a next item?
	{
		((TLB_ptr *)listitem->next)->prev = listitem->prev; //Remove us from the next item of the list!
	}
	else //We're the tail?
	{
		*oldlist_tail = listitem->prev; //Remove us from the tail of the list and assign the new tail!
	}

	listitem->next = NULL; //We don't have a next!
	listitem->prev = NULL; //We don't have a previous!

	/* Now, we're removed from the old list and a newly unmapped item! */

	//Now, insert us into the start of the new list!
	if (*newlist_head) //Anything in the new list already?
	{
		(*newlist_head)->prev = listitem; //We're at the start of the new list, so point the head to us, us to the head and make us the new head!
		listitem->next = *newlist_head; //Our next is the old head!
		*newlist_head = listitem; //We're the new head!
	}
	else //We're the new list?
	{
		*newlist_head = listitem; //We're the new head!
		*newlist_tail = listitem; //We're the new tail!
	}
}

It then manages them using the following functions:

OPTINLINE TLB_ptr *allocTLB(sbyte set) //Allocate a TLB entry!
{
	TLB_ptr *result;
	if (CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]) //Anything available?
	{
		result = CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]; //What item are we allocating, take it from the free list!
		//Now take the item from the pool and move it to the used list!
		CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]->allocated = 1; //We're allocated now!
		Paging_moveListItem(result, //What item to take!
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set], //destination head
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set], //destination tail
			&CPU[activeCPU].Paging_TLB.TLB_freelist_head[set], //source head
			&CPU[activeCPU].Paging_TLB.TLB_freelist_tail[set]); //source tail
		return result; //Give the result!
	}
	return NULL; //Nothing to allocate!
}

OPTINLINE void freeTLB(sbyte set, TLB_ptr *listitem) //Make an entry available again!
{
	if (listitem->allocated) //Are we allocated at all?
	{
		listitem->allocated = 0; //Mark us as freed!
		Paging_moveListItem(listitem, //What item to take!
			&CPU[activeCPU].Paging_TLB.TLB_freelist_head[set], //destination head
			&CPU[activeCPU].Paging_TLB.TLB_freelist_tail[set], //destination tail
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set], //source head
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]); //source tail
	}
}

OPTINLINE void Paging_setNewestTLB(sbyte set, TLB_ptr *listitem) //Tick a TLB entry, making it the most recently used!
{
	if (listitem->allocated) //Are we allocated at all?
	{
		Paging_moveListItem(listitem,
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set],
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set],
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set],
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]); //Move us to the start of the TLB used list to mark us as the most recently accessed!
	}
	else //We're not allocated, but marked as newest? Allocate us!
	{
		//Now take the item from the pool and move it to the used list!
		listitem->allocated = 1; //We're allocated now!
		Paging_moveListItem(listitem, //What item to take!
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set], //destination head
			&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set], //destination tail
			&CPU[activeCPU].Paging_TLB.TLB_freelist_head[set], //source head
			&CPU[activeCPU].Paging_TLB.TLB_freelist_tail[set]); //source tail
	}
}

OPTINLINE TLBEntry *Paging_oldestTLB(sbyte set) //Find a TLB to be used/overwritten!
{
	TLB_ptr *listentry;
	if (CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]) //Anything not allocated yet?
	{
		if ((listentry = allocTLB(set))) //Allocated from the free list?
		{
			return listentry->entry; //Give the index of the resulting entry that's been allocated!
		}
	}
	else //Allocate from the tail(LRU), since everything is allocated!
	{
		if (CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]) //Gotten a tail? We're used, so take the LRU!
		{
			listentry = CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]; //What entry to take: the LRU!
			Paging_setNewestTLB(set, listentry); //This is the newest TLB now!
			return listentry->entry; //What index is the LRU!
		}
	}
	byte indexsize, whichentry;
	indexsize = /*(8>>((EMULATED_CPU>=CPU_80486)?1:0))*/ 4;
	whichentry = (set*indexsize)+3; //Which one?

	return &CPU[activeCPU].Paging_TLB.TLB[whichentry]; //Safety: return the final entry! Shouldn't happen under normal circumstances.
}

Then, reading the TLB for a given memory address is the following:

//RWDirtyMask: mask for ignoring set bits in the tag, use them otherwise!
byte Paging_readTLB(byte *TLB_way, uint_32 logicaladdress, byte W, byte U, byte D, byte A, byte S, uint_32 WDMask, uint_32 *result, byte updateAges)
{
	INLINEREGISTER uint_32 TAG, TAGMask;
	INLINEREGISTER TLB_ptr *curentry;
	sbyte TLB_set;
	TLB_set = Paging_TLBSet(logicaladdress, S); //Auto set?
	curentry = CPU[activeCPU].Paging_TLB.TLB_usedlist_head[TLB_set]; //What TLB entry to apply?
	if (likely(curentry)) //Valid entries to search?
	{
		TAGMask = ~WDMask; //Store for fast usage to mask the unused tag bits off!
		TAGMask &= (curentry->entry->addrmask | 0xFFF); //The full search mask, with the address width(KB vs MB) applied!
		TAG = Paging_generateTAG(logicaladdress, W, U, D, A, S); //Generate a TAG!
		TAG &= TAGMask; //Premask the search tag for faster comparison!

		for (; curentry;) //Check all entries that are allocated!
		{
			if (likely((curentry->entry->TAG & TAGMask) == TAG)) //Found and allocated?
			{
				*result = curentry->entry->data; //Give the stored data!
				Paging_setNewestTLB(TLB_set, curentry); //Set us as the newest TLB!
				if (unlikely(TLB_way)) //Requested way?
				{
					*TLB_way = curentry->index; //What way was found!
				}
				return 1; //Found!
			}
			curentry = (TLB_ptr *)(curentry->next); //Next entry!
		}
	}
	return 0; //Not found!
}

That's most of the active work on the paging mechanism while using already-loaded TLB memory. It's just a call to Paging_readTLB to get the most recent entry(usually), perhaps up to 7 less recent ones(when the set is fully filled), or 8 when the address isn't in the TLB. During most memory accesses the found case is reached on the first entry; otherwise up to 7 more entries are processed, 8 if it's not found while the set is full(and 0 when not found and the TLB set is empty).

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 8 of 13, by superfury

User metadata
Rank l33t

Also, what I'm currently testing with is mostly the BIOS POST running(which doesn't enable paging at all). I already see 36.58% being spent on executing the virtual CPU, of which 16.99% on reading opcodes and prefixes(and 12.37% on executing the instructions themselves). Of that, 12.79% is fetching the opcode bytes themselves, mostly 4.37% on memory access checking(including non-paging memory for 1.99%, the prefetch queue(0.48%), keeping the current opcode(0.08%) and determining the current default operand/address sizes for the instruction(0.13%)). Of said opcode byte fetching, 9.17% is spent purely on fetching the opcodes from memory into the PIQ.

Of said function(BIU_dosboxTick, 9.17% total), 3.52% is its own code, 2.33% the bus access for memory(anything read from RAM/ROM/MMIO), and 1.22% checking the paging unit's address(checkMMUaccess(), which calls the MMU's checkDirectMMUaccess(), which does nothing for this code: paging is disabled, so it simply returns 0. Said function would usually call CPU_Paging_checkPage, which in turn calls readTLB up to 2 times(once for 4MB(Pentium only) and once for 4KB pages, finishing on the very first one found)). Then there's 0.94% writing the PIQ data to the queue, 0.47% checking the PIQ's free size and finally 0.51% calculating the linear memory address from the segment:offset pair.

Then, including normal instruction memory reads, the bus access(2.33%) becomes 3.12%, of which 1.44% is reading the BIOS ROM in memory. That's only one function executing: BIOS_readhandler. Of that, 0.10% is determining it's the ROM, 0.08% checking for a linear ROM(custom ROM), 0.12% determining the ROMs used(U18/19, U13/15, U27/47 or U34/35), 0.12% checking for the doubled ROM subtraction, 0.08% determining the ROM number(13 or 15) and checking for out-of-range of the allocated ROM, and finally(the heaviest part) 0.27% reading the byte from the ROM and storing it in the result, plus 0.47% returning success(which is simply a "return 1;") to make the MMU not map to RAM or NULL(which is the case with the ROM area) instead.

Edit: Just saw a little optimization(just a little one): The check for the PIQ's free size is done twice(once at the start, once further down). Those can be easily combined into 1 call only(it doesn't change in the meanwhile anyway).
Edit: Said optimization brings checking the PIQ free size down to 0.27%(previously 0.47%).
The function itself is now brought down to 2.83% for its own contents and 7.6% for the entire function(previously 3.52% and 9.17% respectively).

Still, 7.6% is almost a tenth of the entire execution time, which is quite a lot?
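The duplicate-check removal mentioned in the edit is just hoisting an invariant query; schematically(with made-up names):

```c
#include <stdint.h>

typedef struct {
    uint32_t size;   /* total PIQ capacity */
    uint32_t filled; /* bytes currently buffered */
} Fifo;

static uint32_t fifo_freesize(const Fifo *f)
{
    return f->size - f->filled;
}

/* Before: the free size was queried at the start and again further down,
   even though nothing in between changes it. */
static uint32_t fill_piq_before(Fifo *piq, uint32_t want)
{
    if (fifo_freesize(piq) == 0) return 0; /* first query */
    uint32_t room = fifo_freesize(piq);    /* second, redundant query */
    return (want < room) ? want : room;
}

/* After: query once and reuse the result. */
static uint32_t fill_piq_after(Fifo *piq, uint32_t want)
{
    uint32_t room = fifo_freesize(piq);    /* single query */
    if (room == 0) return 0;
    return (want < room) ? want : room;
}
```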

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 9 of 13, by superfury

User metadata
Rank l33t

Another small PIQ improvement: optimized 8-bit FIFO buffer reads(~0.2% less) and writes(~0.6% less) by simplifying the free-size check(needed to be able to read/write a byte from the buffer). Instead of the full multibyte check, it now checks a simple buffer full/empty state rather than the exact number of bytes(seeing as 1+ bytes free/filled is enough to write a byte at the tail or read a byte from the head). The full/empty case(head==tail) was also optimized to check the precalculated free-space count(indicating the amount of free bytes in that case) against 0(less to read from RAM) instead of against the buffer size(one read from RAM).
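A sketch of that simplified single-byte check(field names made up): when head != tail the buffer is partially filled, so there is always at least one byte of room and one byte of data; only the ambiguous head == tail case needs the precalculated free count, compared against 0.

```c
#include <stdint.h>

typedef struct {
    uint8_t data[64];
    uint32_t head, tail;      /* read from head, write at tail */
    uint32_t laststatus_free; /* precalc: free bytes when head == tail
                                 (0 = full, buffer size = empty) */
} ByteFifo;

/* Can we write one byte? head != tail always means partially filled;
   only head == tail consults the precalc, compared against 0 (full)
   rather than recomputing an exact byte count. */
static int fifo_canwrite_byte(const ByteFifo *f)
{
    if (f->head != f->tail) return 1;
    return f->laststatus_free != 0; /* nonzero free space = not full */
}

/* Can we read one byte? The head == tail case is readable only when the
   buffer is completely full (free count == 0). */
static int fifo_canread_byte(const ByteFifo *f)
{
    if (f->head != f->tail) return 1;
    return f->laststatus_free == 0; /* full, so there is data to read */
}
```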

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 10 of 13, by superfury

User metadata
Rank l33t

Hmmmm... Can't seem to get it past 37% speed(at 3MIPS in IPS clocking mode). Pretty much optimized everything that has to do with the PIQ reading memory(as well as the PIQ function fetching opcodes into the FIFO).

The heaviest part still seems to be just the prefetching process reading the bus(bus->MMIO handlers/RAM handler). In the case of the BIOS, most time is just spent reading the ROM array for the specified ROM and returning the byte value of 1(two lines of code). Those combined take almost 1% of CPU time(0.41% for the read, 0.57% for the return) in the total picture.

In total, it spends 8.43% on reading the opcodes from memory(that's 10% of total emulation processing!). Then 15.62% in total on all instruction reading and 12.03% on opcode handling itself, for 27.65% in total on opcode reading/execution. Then some more small support(state committing, basic housekeeping), for a total of 33.43% on the entire CPU emulation core.
Then 16.09% on the VGA, 5% on DMA, 4% on UART, 2.8% on Sound Blaster, 2.3% on Game Blaster, 1.84% on the PIT, 1.83% on ATA/ATAPI, 1.67% on generic audio, 0.98% on the 8042, with the remainder all below that(modem, covox, CMOS, adlib). 83.85% in total.

Edit: Inlining the PIQ filling routine after optimizing it bumps the maximum speed up a bit again: 38%(28% in real mode).

Windows 95 is now running at 20-22% speed(dipping as low as 7% when the HDD/CD-ROM LEDs are blinking(screen updates taking a heavy toll)). Sometimes it drops below that average to 19% while doing nothing on the desktop.

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 11 of 13, by danoon

User metadata
Rank Member

I feel your pain with this; I spent a couple of months optimizing just memory reads and writes for my program. I don't have any numbers off the top of my head and it may not apply to modern processors, but in general I avoid indirect jumps as much as possible. So I avoid things like function pointers and virtual C++ functions in performance-critical areas. I follow Dosbox for my memory reads and writes. The happy path, which is also the most likely path, has only 2 if statements and no indirect jumps. When I first started Boxedwine, every read/write would map to a page and each page could be a different type (subclass), and thus an indirect jump would occur. Even though I liked the look of that code(it was easy to follow and debug), I eventually replaced it with how Dosbox does it.
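The difference danoon describes, sketched with toy handlers(real Dosbox/Boxedwine code is more involved): per-page function pointers force an indirect call on every access, while a flat check keeps the common RAM case on a direct, predictable branch.

```c
#include <stdint.h>

static uint8_t ram[4096];
static uint8_t mmio_reg;

/* Indirect-jump style: every access dispatches through a per-page pointer,
   which the branch predictor handles worse than a direct branch. */
typedef uint8_t (*ReadHandler)(uint32_t addr);
static uint8_t ram_read(uint32_t addr)  { return ram[addr & 0xFFF]; }
static uint8_t mmio_read(uint32_t addr) { (void)addr; return mmio_reg; }

/* Flat-check style: the happy path (plain RAM) is one predictable compare
   and a direct load; only the rare MMIO case leaves that path. */
static uint8_t read_flat(uint32_t addr)
{
    if (addr < 4096) {      /* check: is it ordinary RAM? */
        return ram[addr];   /* no indirect jump on the happy path */
    }
    return mmio_read(addr); /* rare path */
}
```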

http://www.boxedwine.org/

Reply 12 of 13, by superfury

User metadata
Rank l33t
danoon wrote on 2020-03-25, 01:54:

I feel your pain with this; I spent a couple of months optimizing just memory reads and writes for my program. I don't have any numbers off the top of my head and it may not apply to modern processors, but in general I avoid indirect jumps as much as possible. So I avoid things like function pointers and virtual C++ functions in performance-critical areas. I follow Dosbox for my memory reads and writes. The happy path, which is also the most likely path, has only 2 if statements and no indirect jumps. When I first started Boxedwine, every read/write would map to a page and each page could be a different type (subclass), and thus an indirect jump would occur. Even though I liked the look of that code(it was easy to follow and debug), I eventually replaced it with how Dosbox does it.

I just modified the 3 memory-mapped I/O handlers so that 2 of them(BIOS ROM and VGA VRAM) are directly called, while the BIOS ROM calls the option ROM handler itself(instead of using a separately registered handler).

I notice that this increases real mode speed by 1% at most; nothing else noteworthy improves with that optimization(the XT machine still uses one registered handler, the EMS memory handler, which only registers in certain cases(unlike the VGA and BIOS ROM, which are pretty much required to run)). Protected mode doesn't seem to have any improvements(still 38%).

2 if statements is pretty much impossible with my configuration. Multiple option ROMs can be dynamically installed, so it requires a loop to scan the areas used.
The BIOS ROM does have a few if-statements, but they pretty much amount to nothing, according to the profiler. The only thing that actually takes time for the BIOS ROM(more than 0.01%) is the actual read from the dynamically allocated ROM(which is the following):

	*value = BIOS_ROMS[segment][tempoffset]; //Give the value!
	return 1;

The return value means that the memory is mapped by the function(and not RAM), and value is a pointer to a buffer for the result of the read.
Both statements take about 0.5%.

Edit: I've just changed those *value statements to use a global byte variable instead. It looks like no speedup at all(a slowdown, if any)?

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases

Reply 13 of 13, by superfury

User metadata
Rank l33t

The TLB reads have been optimized for faster access, caching the MRU item.

Also optimized the PIQ fetching from memory to use a faster ordering of handling its variables. It's now at up to 38%/39% at 3 MIPS.
Windows 95 is now running at ~22% on average, sometimes dipping to 19% and sometimes shooting up to 29%(still 10% overhead on paging lookups for every block's byte?).

See cpu/paging.c's readTLB function and cpu/biu.c's CPU_fillPIQ function.

Although, in the end, most of the time is still used by the loop and memory protection checks in BIU_dosboxTick's byte fetch loop(which does nothing more than discard the memory access timings from the PIQ fetching routine(for cycle-accurate compatibility) and check that all memory accesses that don't fault(i.e. up to a page fault or other fault point being reached) are handled).

And the other heaviest point seems to be just the TLB fetching itself. To be exact, the generation of the TLB tag from the different bits(parameters as separate bytes) used to look up the specified TLB entry.

Edit: Managed to optimize TLB tag generation slightly by moving it to the caller instead of doing it during the TLB read itself(which perhaps improves speed because it now sits together with the caller's other address handling, letting the compiler optimize it better).

UniPCemu Git repository
UniPCemu for Android, Windows and PSP on itch.io
Older UniPCemu PC/Android/PSP releases