VOGONS


First post, by superfury

Rank: l33t++

I notice in my profiler that most execution time in UniPCemu seems to be spent purely on handling memory accesses: mostly prefetching and the memory checking itself (protection checks), according to the profiler. Both are already heavily optimized to use as much precalculated data as possible, but the profiler still says they're the hot paths. And that's with the emulator running at 3 MIPS on an i7-4790K @ 4GHz, a CPU which can run Dosbox at the same speed (3 MIPS) with only a few percent CPU usage, while UniPCemu takes a whopping 15-20% CPU time to do it!

Does anyone have any tips on how to optimize it further? I've basically already precalculated all values wherever possible, AFAIK. One thing that does make the hot path a bit heavy is that IPS clocking mode (vs. cycle-accurate mode) keeps the prefetch buffer fully filled for each instruction, causing relatively many RAM accesses to keep it filled up, but that's required for software that uses self-modifying code.

Last edited by superfury on 2020-03-19, 22:26. Edited 1 time in total.

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 1 of 22, by Stenzek

Rank: Newbie

Cache page table entries in a TLB-like data structure. This made a significant difference in my x86 emulator.

However, if you're going for "true" emulation then you're limited to what the CPU you're emulating has in terms of a TLB.

Reply 2 of 22, by danoon

Rank: Member

I can speak for Dosbox a little bit. One thing it does to speed up memory access is to check whether a 2- or 4-byte read/write crosses a page boundary. If it does, it gets split up into separate 1-byte reads/writes. If it's on the same page, all the permission checks etc. are only performed once.
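A minimal C sketch of that fast path (assuming a flat `ram` array and a stubbed `page_ok` permission check; the names are illustrative stand-ins, not Dosbox's actual internals):

```c
#include <stdint.h>

static uint8_t ram[0x3000];

/* Stand-in for the real permission/TLB check. */
static int page_ok(uint32_t addr) { (void)addr; return 1; }

/* True when a 'width'-byte access stays within one 4KB page:
   the XOR only leaves page-number bits set if the pages differ. */
static int same_page(uint32_t addr, uint32_t width)
{
    return ((addr ^ (addr + width - 1)) & ~(uint32_t)0xFFF) == 0;
}

/* 16-bit read: one permission check on the same-page fast path,
   per-byte checks on the page-crossing slow path. */
static uint16_t mem_readw(uint32_t addr)
{
    if (same_page(addr, 2)) {
        if (!page_ok(addr)) return 0xFFFF; /* fault path (stubbed) */
        return (uint16_t)(ram[addr] | (ram[addr + 1] << 8));
    }
    uint16_t lo = page_ok(addr) ? ram[addr] : 0xFF;
    uint16_t hi = page_ok(addr + 1) ? ram[addr + 1] : 0xFF;
    return (uint16_t)(lo | (hi << 8));
}
```

The XOR trick avoids a division or two comparisons: accesses on the same 4KB page differ only in the low 12 bits.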

http://www.boxedwine.org/

Reply 3 of 22, by superfury

Rank: l33t++

I'm already using a doubly-linked list for the TLB's emulation, which sped it up quite a bit. That's just reading a pointer for most accesses, only moving other items to the head of the queue when the entry isn't the most recent one (or isn't allocated at all). It will still slow down a bit in protected mode, but that's mainly due to the memory access overhead for its structures (descriptor fetching (8 bytes) and writeback (1 byte), plus the 4-byte paging entries).

The main issue is the way it tries to keep the prefetch queue fully filled for each instruction (in IPS clocking mode, which uses Dosbox-style clocking and effectively disables the cycle-accurate BIU, performing it all in one go when handling the physical memory/hardware access, in up to 4 (dword) separate byte accesses, to allow for hardware mapped at any aligned or unaligned address), mainly to allow for SMC (self-modifying code).

Also, the prefetching process during IPS clocking mode tries to reduce its memory checks as much as possible: the first byte (at CS:EIP or the next instruction), the last byte (if different) within segment limits, and perhaps the page before it if that fails (not mapped in the TLB).


Reply 4 of 22, by superfury

Rank: l33t++

I know that for each instruction executed, it will prefetch the instruction length's worth of bytes from memory when starting the next instruction, to keep the prefetch buffer filled.

So at worst, a 12-byte instruction will make it fetch 12 bytes from memory, filling the PIQ.
The worst case scenario happens with jumps, calls and interrupts, which flush the entire queue, making it fetch the entire queue's worth of data (16 bytes on the 80386, 32 on the 80486, and a whopping 64 on the Pentium!).


Reply 5 of 22, by superfury

Rank: l33t++

Just optimized the prefetching process to not fetch more than the maximum instruction length (as supported by the protection mechanism) when handling IPS clocking mode on an 80286+.

So an 80286 will only fetch until it reaches 10 bytes filled in the PIQ, and an 80386 and up until 15 bytes (instead of 16 (80386), 32 (80486) or 64 (Pentium!)).

That should improve speed a little bit (quite a bit on the 80486 and Pentium!).
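A sketch of that clamp in C (the `CPU_*` constants are illustrative, not UniPCemu's actual enums; the limits of 10 and 15 bytes come from the post above):

```c
/* Hypothetical CPU generation constants, ordered by generation. */
enum { CPU_80286 = 2, CPU_80386 = 3, CPU_80486 = 4, CPU_PENTIUM = 5 };

/* In IPS clocking mode, refill the PIQ only up to the CPU's maximum
   instruction length instead of the PIQ's full size: the 80286 faults
   past 10 bytes, the 80386 and up past 15 bytes. */
static unsigned prefetch_limit(int cpu, unsigned piq_size)
{
    unsigned max_insn = (cpu >= CPU_80386) ? 15u : 10u;
    return (piq_size < max_insn) ? piq_size : max_insn;
}
```

On a Pentium-sized 64-byte queue this cuts the per-instruction refill ceiling from 64 to 15 bytes.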

Edit: That only sped it up a little. At 3 MIPS it now runs at around 19-20% speed running Windows 95 (Intel i7-4790K @ 4.0GHz), although sometimes it dips a bit, going no lower than 15% (depending on hard disk accesses and possibly other factors). It still tops out at 36%.

Tried playing some videos from the Windows 95 CD-ROM (at 32K colors on the ET4000). That ran at about the same speed.

POST runs at 30%, dipping to 25% during quick display updates until 9.9MB is checked, then 34-36% until all 16MB RAM is checked, then 29-32% during the FDC checks. It jumps to ~48% when counting down the Plop BIOS ROM timer (for pressing D to disable it; it's probably using HLT or something like that to wait for interrupts?). Then 27-29% waiting for the F1 key to resume booting (1782-Disk Controller Failure due to the XT-IDE 80386 ROM, which is a known issue of that ROM), then 27% until booting Windows 95.


Reply 6 of 22, by Stenzek

Rank: Newbie
superfury wrote on 2020-03-18, 10:01:

I'm already using a doubly-linked list for the TLB's emulation, which sped it up quite a bit. That's just reading a pointer for most accesses, only moving other items to the head of the queue when the entry isn't the most recent one (or isn't allocated at all). It will still slow down a bit in protected mode, but that's mainly due to the memory access overhead for its structures (descriptor fetching (8 bytes) and writeback (1 byte), plus the 4-byte paging entries).

Could you use an array/lookup table here instead? While you do get temporal locality when you hit the same page a bunch of times, as soon as it's anything else, scanning the list is a lot of work compared to a table lookup.
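For reference, a minimal sketch of what such a table could look like: a direct-mapped software TLB indexed by linear page number. All names here are illustrative, not from either emulator.

```c
#include <stdint.h>

#define TLB_ENTRIES 64u /* must be a power of two */

typedef struct {
    uint32_t tag;  /* linear page number; UINT32_MAX = invalid */
    uint32_t phys; /* physical page number */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Invalidate everything (e.g. on a CR3 reload). */
static void tlb_flush(void)
{
    for (unsigned i = 0; i < TLB_ENTRIES; ++i) tlb[i].tag = UINT32_MAX;
}

/* Install a translation after a page-table walk. */
static void tlb_fill(uint32_t lin_page, uint32_t phys_page)
{
    tlb[lin_page & (TLB_ENTRIES - 1)] = (tlb_entry){ lin_page, phys_page };
}

/* O(1) lookup: on a hit, *phys receives the translated address;
   on a miss the caller walks the page tables and calls tlb_fill. */
static int tlb_lookup(uint32_t lin, uint32_t *phys)
{
    uint32_t page = lin >> 12;
    tlb_entry *e = &tlb[page & (TLB_ENTRIES - 1)];
    if (e->tag != page) return 0;
    *phys = (e->phys << 12) | (lin & 0xFFF);
    return 1;
}
```

The trade-off versus the linked list: no LRU information, but every lookup is a single index plus compare, regardless of which page was hit last.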

Reply 7 of 22, by superfury

Rank: l33t++
Stenzek wrote on 2020-03-20, 11:18:
superfury wrote on 2020-03-18, 10:01:

I'm already using a doubly-linked list for the TLB's emulation, which sped it up quite a bit. That's just reading a pointer for most accesses, only moving other items to the head of the queue when the entry isn't the most recent one (or isn't allocated at all). It will still slow down a bit in protected mode, but that's mainly due to the memory access overhead for its structures (descriptor fetching (8 bytes) and writeback (1 byte), plus the 4-byte paging entries).

Could you use an array/lookup table here instead? While yeah, you get the temporal locality when you hit the same page a bunch of times, as soon as it's anything else that's a lot of work to scan the list when compared to a table lookup.

Won't a full array of memory pages be inaccurate compared to a real CPU? UniPCemu currently has about 8 entries for each memory area (depending on the high bits of the linear memory address).

Most of the page allocation/deallocation/LRU is just a call to this function:

//Move a TLB entry index from an old list to a new list!
OPTINLINE void Paging_moveListItem(TLB_ptr *listitem, TLB_ptr **newlist_head, TLB_ptr **newlist_tail, TLB_ptr **oldlist_head, TLB_ptr **oldlist_tail)
{
if (likely(*newlist_head == listitem)) return; //Don't do anything when it's already at the correct spot!

//First, remove us from the old head list!
if (listitem->prev) //Do we have anything before us?
{
((TLB_ptr *)listitem->prev)->next = listitem->next; //Remove us from the previous item of the list!
}
else //We're the head, so remove us from the list!
{
*oldlist_head = listitem->next; //Remove us from the head of the list and assign the new head!
}

if (listitem->next) //Did we have a next item?
{
((TLB_ptr *)listitem->next)->prev = listitem->prev; //Remove us from the next item of the list!
}
else //We're the tail?
{
*oldlist_tail = listitem->prev; //Remove us from the tail of the list and assign the new tail!
}

listitem->next = NULL; //We don't have a next!
listitem->prev = NULL; //We don't have a previous!

/* Now, we're removed from the old list and a newly unmapped item! */

//Now, insert us into the start of the new list!
if (*newlist_head) //Anything in the new list already?
{
(*newlist_head)->prev = listitem; //We're at the start of the new list, so point the head to us, us to the head and make us the new head!
listitem->next = *newlist_head; //Our next is the old head!
*newlist_head = listitem; //We're the new head!
}
else //We're the new list?
{
*newlist_head = listitem; //We're the new head!
*newlist_tail = listitem; //We're the new tail!
}
}

It then manages them using the following functions:

OPTINLINE TLB_ptr *allocTLB(sbyte set) //Allocate a TLB entry!
{
TLB_ptr *result;
if (CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]) //Anything available?
{
result = CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]; //What item are we allocating, take it from the free list!
//Now take the item from the pool and move it to the used list!
CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]->allocated = 1; //We're allocated now!
Paging_moveListItem(result, //What item to take!
&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set], //destination head
&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set], //destination tail
&CPU[activeCPU].Paging_TLB.TLB_freelist_head[set], //source head
&CPU[activeCPU].Paging_TLB.TLB_freelist_tail[set]); //source tail
return result; //Give the result!
}
return NULL; //Nothing to allocate!
}

OPTINLINE void freeTLB(sbyte set, TLB_ptr *listitem) //Make an entry available again!
{
if (listitem->allocated) //Are we allocated at all?
{
listitem->allocated = 0; //Mark us as freed!
Paging_moveListItem(listitem, //What item to take!
&CPU[activeCPU].Paging_TLB.TLB_freelist_head[set], //destination head
&CPU[activeCPU].Paging_TLB.TLB_freelist_tail[set], //destination tail
&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set], //source head
&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]); //source tail
}
}

OPTINLINE void Paging_setNewestTLB(sbyte set, TLB_ptr *listitem) //Tick an TLB entry for making it the most recently used!
{
if (listitem->allocated) //Are we allocated at all?
{
Paging_moveListItem(listitem,
&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set],
&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set],
&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set],
&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]); //Move us to the start of the TLB used list to mark us as the most recently accessed!
}
else //We're not allocated, but marked as newest? Allocate us!
{
//Now take the item from the pool and move it to the used list!
listitem->allocated = 1; //We're allocated now!
Paging_moveListItem(listitem, //What item to take!
&CPU[activeCPU].Paging_TLB.TLB_usedlist_head[set], //destination head
&CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set], //destination tail
&CPU[activeCPU].Paging_TLB.TLB_freelist_head[set], //source head
&CPU[activeCPU].Paging_TLB.TLB_freelist_tail[set]); //source tail
}
}

OPTINLINE TLBEntry *Paging_oldestTLB(sbyte set) //Find a TLB to be used/overwritten!
{
TLB_ptr *listentry;
if (CPU[activeCPU].Paging_TLB.TLB_freelist_head[set]) //Anything not allocated yet?
{
if ((listentry = allocTLB(set))) //Allocated from the free list?
{
return listentry->entry; //Give the index of the resulting entry that's been allocated!
}
}
else //Allocate from the tail(LRU), since everything is allocated!
{
if (CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]) //Gotten a tail? We're used, so take the LRU!
{
listentry = CPU[activeCPU].Paging_TLB.TLB_usedlist_tail[set]; //What entry to take: the LRU!
Paging_setNewestTLB(set, listentry); //This is the newest TLB now!
return listentry->entry; //What index is the LRU!
}
}
byte indexsize, whichentry;
indexsize = /*(8>>((EMULATED_CPU>=CPU_80486)?1:0))*/ 4;
whichentry = (set*indexsize)+3; //Which one?

return &CPU[activeCPU].Paging_TLB.TLB[whichentry]; //Safety: return the final entry! Shouldn't happen under normal circumstances.
}

Then, reading the TLB for a given memory address is the following:

//RWDirtyMask: mask for ignoring set bits in the tag, use them otherwise!
byte Paging_readTLB(byte *TLB_way, uint_32 logicaladdress, byte W, byte U, byte D, byte A, byte S, uint_32 WDMask, uint_32 *result, byte updateAges)
{
INLINEREGISTER uint_32 TAG, TAGMask;
INLINEREGISTER TLB_ptr *curentry;
sbyte TLB_set;
TLB_set = Paging_TLBSet(logicaladdress,S); //Auto set?
curentry = CPU[activeCPU].Paging_TLB.TLB_usedlist_head[TLB_set]; //What TLB entry to apply?
if (likely(curentry)) //Valid entries to search?
{
TAGMask = ~WDMask; //Store for fast usage to mask the tag bits unused off!
TAGMask &= (curentry->entry->addrmask | 0xFFF); //The full search mask, with the address width(KB vs MB) applied!
TAG = Paging_generateTAG(logicaladdress, W, U, D, A, S); //Generate a TAG!
TAG &= TAGMask; //Premask the search tag for faster comparison!

for (; curentry;) //Check all entries that are allocated!
{
if (likely((curentry->entry->TAG&TAGMask) == TAG)) //Found and allocated?
{
*result = curentry->entry->data; //Give the stored data!
Paging_setNewestTLB(TLB_set, curentry); //Set us as the newest TLB!
if (unlikely(TLB_way)) //Requested way?
{
*TLB_way = curentry->index; //What way was found!
}
return 1; //Found!
}
curentry = (TLB_ptr *)(curentry->next); //Next entry!
}
}
return 0; //Not found!
}

That's most of the active work on the paging mechanism while using already-loaded TLB memory. It's just a call to Paging_readTLB, usually hitting the most recent entry, or up to 7 less recent ones (when the set is fully filled), or 8 when it's not in the TLB. During most memory accesses the found case is reached on the first entry; otherwise up to 7 more entries are processed (8 if it's not found while the set is full, and 0 when it's not found and the TLB set is empty).


Reply 8 of 22, by superfury

Rank: l33t++

Also, what I'm currently testing on is mostly the BIOS POST (which doesn't enable paging at all). I already see 36.58% being spent on executing the virtual CPU, of which 16.99% on reading opcodes and prefixes (and 12.37% on executing the instructions themselves). Of that, 12.79% is fetching the opcode bytes themselves, mostly 4.37% on memory access checking (including 1.99% for non-paging memory, 0.48% for the prefetch queue, 0.08% for keeping the current opcode and 0.13% for determining the instruction's default operand/address sizes). Of said opcode byte fetching, 9.17% is spent purely on fetching the opcodes from memory into the PIQ.

Of said function (BIU_dosboxTick, the 9.17%): 3.52% is its own code, 2.33% the bus access for memory (anything read from RAM/ROM/MMIO), and 1.22% checking the paging unit's address (checkMMUaccess(), which calls the MMU's checkDirectMMUaccess(); that does nothing for this code, since paging is disabled it simply returns 0. Said function would usually call CPU_Paging_checkPage, which in turn calls readTLB up to 2 times: once for 4MB pages (Pentium only) and once for 4KB pages, finishing on the very first one found). Then there's 0.94% writing the PIQ data to the queue, 0.47% checking the PIQ's free size and finally 0.51% calculating the linear memory address from the segment:offset pair.

Then there's the bus access (2.33%), which becomes 3.12% including normal instruction memory reads, of which 1.44% is reading the BIOS ROM in memory. That's a single function: BIOS_readhandler. Of that, 0.10% is determining it's the ROM, 0.08% checking for a linear (custom) ROM, 0.12% determining the ROMs used (U18/19, U13/15, U27/47 or U34/35), 0.12% checking for the doubled-ROM subtraction, 0.08% determining the ROM number (13 or 15) and checking for out-of-range accesses to the allocated ROM, and finally (the heaviest part) 0.27% reading the byte from the ROM and storing it in the result, plus 0.47% returning success (a simple "return 1;") so the MMU doesn't map the access to RAM or NULL (which is the case for the ROM area) instead.

Edit: Just saw a little optimization (just a little one): the check for the PIQ's free size is done twice (once at the start, once further down). Those can easily be combined into a single call (it doesn't change in the meantime anyway).
Edit: Said optimization brings checking the PIQ free size down to 0.27% (previously 0.47%).
The function itself is now down to 2.83% for its own contents and 7.6% for the entire function (previously 3.52% and 9.17% respectively).

Still, 7.6% is almost a tenth of the entire execution time, which is quite a lot?


Reply 9 of 22, by superfury

Rank: l33t++

Another small PIQ improvement: optimized 8-bit FIFO buffer reads (~0.2% less) and writes (~0.6% less) by simplifying the free-size check needed to read/write a byte from the buffer. Instead of the full multi-byte check against the exact number of bytes, it's now a simple buffer full/empty check (since 1+ bytes used/free is enough to read/write a single byte at the buffer head (read) or tail (write)). The full/empty case (head==tail) was also optimized to check the precalculated free-space count (which indicates the amount of free bytes in that case) against 0, instead of against the buffer size (one less read from RAM).
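A rough C sketch of that idea (a ring buffer with a precalculated free-byte count, so the single-byte paths reduce to one scalar compare; names are illustrative, not UniPCemu's actual FIFO code):

```c
#include <stdint.h>

#define FIFO_SIZE 16u

typedef struct {
    uint8_t data[FIFO_SIZE];
    unsigned head, tail;
    unsigned freebytes; /* precalculated free space, kept in sync on every op */
} fifo8;

/* Write one byte: "full" is just freebytes == 0, no multi-byte math. */
static int fifo8_writeb(fifo8 *f, uint8_t b)
{
    if (f->freebytes == 0) return 0; /* full */
    f->data[f->tail] = b;
    f->tail = (f->tail + 1) % FIFO_SIZE;
    --f->freebytes;
    return 1;
}

/* Read one byte: "empty" is just freebytes == FIFO_SIZE. */
static int fifo8_readb(fifo8 *f, uint8_t *b)
{
    if (f->freebytes == FIFO_SIZE) return 0; /* empty */
    *b = f->data[f->head];
    f->head = (f->head + 1) % FIFO_SIZE;
    ++f->freebytes;
    return 1;
}
```

The count also disambiguates head==tail (full when 0 free, empty when all free), which otherwise needs an extra flag.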


Reply 10 of 22, by superfury

Rank: l33t++

Hmmmm... I can't seem to get it past 37% speed (at 3 MIPS in IPS clocking mode). I've pretty much optimized everything that has to do with the PIQ reading memory (as well as the PIQ function fetching opcodes into the FIFO).

The heaviest part still seems to be just the prefetching process reading the bus (bus -> MMIO handlers/RAM handler). In the case of the BIOS, most time is spent simply reading the ROM array for the specified ROM and returning the byte value of 1 (2 lines of code). Those take almost 1% (0.41% for the read, 0.57% for the return) of CPU time combined (in the total picture).

In total, it spends 8.43% on reading the opcodes from memory (that's 10% of total emulation processing!). Then 15.62% in total on all instruction reading plus 12.03% on opcode handling itself, for 27.65% total on opcode reading/execution. Add some small supporting work (state committing, basic housekeeping) for a total of 33.43% on the entire CPU emulation core.
Then 16.09% on the VGA, 5% on DMA, 4% on the UART, 2.8% on the Sound Blaster, 2.3% on the Game Blaster, 1.84% on the PIT, 1.83% on ATA/ATAPI, 1.67% on generic audio, 0.98% on the 8042, with the remainder all below that (modem, Covox, CMOS, AdLib). 83.85% accounted for in total.

Edit: Inlining the PIQ filling routine after optimizing it bumps the maximum speed up a bit again: 38%(28% in real mode).

Windows 95 is now running at 20-22% speed (dipping as low as 7% when the HDD/CD-ROM LEDs are blinking, with screen updates taking a heavy toll). Sometimes it drops to 19% while doing nothing on the desktop.


Reply 11 of 22, by danoon

Rank: Member

I feel your pain with this; I spent a couple of months optimizing just memory reads and writes for my program. I don't have any numbers off the top of my head and it may not apply to modern processors, but in general I avoid indirect jumps as much as possible, so I avoid things like function pointers and C++ virtual functions in performance-critical areas. I follow Dosbox for my memory reads and writes: the happy path, which is also the most likely path, has only 2 if statements and no indirect jumps. When I first started Boxedwine, every read/write would map to a page, and each page could be a different type (subclass), so an indirect jump would occur. Even though I liked the look of that code (it was easy to follow and debug), I eventually replaced it with how Dosbox does it.


Reply 12 of 22, by superfury

Rank: l33t++
danoon wrote on 2020-03-25, 01:54:

I feel your pain with this; I spent a couple of months optimizing just memory reads and writes for my program. I don't have any numbers off the top of my head and it may not apply to modern processors, but in general I avoid indirect jumps as much as possible, so I avoid things like function pointers and C++ virtual functions in performance-critical areas. I follow Dosbox for my memory reads and writes: the happy path, which is also the most likely path, has only 2 if statements and no indirect jumps. When I first started Boxedwine, every read/write would map to a page, and each page could be a different type (subclass), so an indirect jump would occur. Even though I liked the look of that code (it was easy to follow and debug), I eventually replaced it with how Dosbox does it.

I just modified the 3 memory-mapped I/O handlers so that 2 of them (BIOS ROM and VGA VRAM) are called directly, while the BIOS ROM calls the option ROM handler itself (instead of using a separately registered handler).

I notice that this increases real mode speed by 1% at most; nothing else noteworthy improves with that optimization (the XT machine still uses one registered handler, the EMS memory handler, which is only registered in certain cases, unlike the VGA and BIOS ROM, which are pretty much required to run). Protected mode doesn't seem to show any improvement (still 38%).

2 if statements is pretty much impossible with my configuration: multiple option ROMs can be dynamically installed, so it requires a loop to scan the areas used.
The BIOS ROM does have a few if statements, but they amount to pretty much nothing, according to the profiler. The only thing that actually takes time for the BIOS ROM (more than 0.01%) is the actual read from the dynamically allocated ROM (which is the following):

*value = BIOS_ROMS[segment][tempoffset]; //Give the value!
return 1;

The return value means that the memory is mapped by the function (and not RAM), and value is a pointer to a buffer for the result of the read.
Both statements take about 0.5%.

Edit: I've just changed those *value statements to use a global byte variable instead. It looks like no speedup at all (a slowdown, if anything)?


Reply 13 of 22, by superfury

Rank: l33t++

The TLB reads have been optimized for faster access, caching the MRU item.

Also optimized the PIQ fetching from memory to use a faster ordering when handling its variables. It's now at up to 38-39% at 3 MIPS.
Windows 95 is now running at ~22% on average, sometimes dipping to 19% and sometimes shooting up to 29% (still 10% overhead on paging lookups for every byte of a block?).

See cpu/paging.c's readTLB function and cpu/biu.c's CPU_fillPIQ function.

Although, in the end, most of the time is still spent in the loop and memory protection checks of BIU_dosboxTick's byte fetch loop (which does nothing more than discard the memory access timings from the PIQ fetching routine (for cycle-accurate compatibility) and check that all memory accesses that don't fault (e.g. up to a page fault or another fault point being reached) are handled).

And the other heaviest point seems to be the TLB fetching itself. To be exact, the generation of the TLB tag from the different bits (parameters as separate bytes) used to look up the specified TLB entry.

Edit: Managed to optimize TLB tag generation slightly by moving it to the caller instead of doing it during the TLB read itself (which perhaps improves speed because it's compiled together with the surrounding code, letting the compiler optimize it better).


Reply 14 of 22, by superfury

Rank: l33t++

Tried to optimize it further by caching all possible aligned 16-bit and 32-bit memory addresses (except on uncachable memory-mapped devices (EMS memory and VGA VRAM)). Now normal RAM and all the different ROMs are cached in the BIU, allowing theoretically faster memory reads.

So reading any 8-bit memory location causes the BIU to cache it in a 16-bit or 32-bit cache, depending on alignment (32-bit alignment is cached whenever possible (when the address MOD 4 == 0), otherwise 16-bit caching (when MOD 2 == 0), or no caching otherwise).

So now, reading any of those cacheable memory addresses (e.g. a MOD 4 RAM address) causes the BIU to latch the data into its own buffer. When it tries to read that address again (unless the MMU or BIOS ROM invalidates it because it's changed by a memory write from the CPU or DMA), it reads from said cache instead of going through the whole memory mapping again.
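The latching scheme described above can be sketched roughly like this in C (all names are illustrative; `backing` stands in for the real bus/memory map, and real invalidation would also cover DMA and MMIO):

```c
#include <stdint.h>

static uint8_t backing[0x100];

typedef struct {
    uint32_t base;  /* aligned base address of the latched data */
    uint32_t data;
    unsigned width; /* 0 = nothing latched, else 2 or 4 bytes */
} biu_latch;

/* Underlying little-endian bus read of 'width' bytes. */
static uint32_t bus_read(uint32_t addr, unsigned width)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < width; ++i)
        v |= (uint32_t)backing[addr + i] << (8 * i);
    return v;
}

/* Byte read through the BIU: serve from the latch when possible,
   otherwise latch the containing dword (MOD 4 == 0) or word (MOD 2 == 0). */
static uint8_t biu_readb(biu_latch *l, uint32_t addr)
{
    if (l->width && addr >= l->base && addr < l->base + l->width)
        return (uint8_t)(l->data >> ((addr - l->base) * 8)); /* latch hit */
    if ((addr & 3) == 0)      { l->base = addr; l->width = 4; }
    else if ((addr & 1) == 0) { l->base = addr; l->width = 2; }
    else { l->width = 0; return backing[addr]; } /* unaligned: no caching */
    l->data = bus_read(l->base, l->width);
    return (uint8_t)l->data;
}

/* Writes must invalidate any latch covering the written address. */
static void biu_writeb(biu_latch *l, uint32_t addr, uint8_t b)
{
    if (l->width && addr >= l->base && addr < l->base + l->width)
        l->width = 0;
    backing[addr] = b;
}
```

The win only materializes when consecutive reads actually land inside the latched word/dword; otherwise the extra checks are pure overhead, which may explain a slowdown.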

Although the extra checks make the ROM read routines and memory read/write routines a bit heavier to use now.

Theoretically, the CPU should be running at a much higher speed now (assuming memory reads mostly use MOD 4 or MOD 2 addresses, with the contents accessed afterwards being within the same word/dword).

But when I try checking the speed, it seems to have gotten even slower, now reporting 27-28% at the end of the POST, and not much past 30% at most during the POST?

Does anyone know why this is happening? Shouldn't it be much faster because of this optimization of RAM/ROM accesses (logically implemented based on basic Bochs documentation on it)?


Reply 15 of 22, by BloodyCactus

Rank: Oldbie
superfury wrote on 2020-03-17, 00:52:

which can run Dosbox at the same speed(3MIPS) with only a few percent CPU usage, while UniPCemu takes a whopping 15-20% CPU time for it to work!

Dosbox as your measuring stick is a poor choice.

Dosbox does a lot of things in the name of speed. For example, it lets you write to your own CS in protected mode, which should be a GPF, but Dosbox doesn't check for that.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 16 of 22, by superfury

Rank: l33t++

With the latest optimization (BIU word/dword caching depending on data alignment), the speed is a bit better again. It's now up to 33% during boot, while running Windows 95 RTM it's at 21-23% most of the time (with some drops down to 14% when doing things like loading an application, which is probably mostly due to page faults being thrown (or interrupt processing in general) and display updates).

Real mode is of course faster due to simpler segment loading (not using lots of protected-mode-only checks), simpler interrupt handling (although in a cycle-accurate way, so a bit of separated-step overhead there) and not using the paging unit to translate memory. That can still go up to 37%, it seems (at least when the text that it's safe to turn off your computer is displayed).

So the performance is a bit better with the new word/dword caching on the BIU. That mostly affects prefetching, though, especially in IPS clocking mode, where it reads linear blocks of RAM/ROM into the PIQ a lot, during every instruction: either the length of the current instruction (when executing the next instruction) or, on any jump, an amount limited to the maximum instruction length. So in the optimal case (dword-aligned addresses to prefetch on each instruction), it fetches 4 bytes in one swoop, then parses them in the BIU one byte at a time. The same happens with word/dword reads from memory, but the cache for data reads is shared with the cache for instruction reads. So any instruction reading data from memory invalidates the cache and loads it with the data from that memory read instead (e.g. data pushed to/popped from the stack, or some ModR/M data read).
So that still slows it down (it doesn't differentiate between instruction data being read and instructions being prefetched), but the slowdown should be less than it used to be.

So it's kind of in-between byte fetching and Bochs' dword-when-possible fetching method: it fetches dwords and parses them as bytes when performing an aligned read (for prefetching or for normal instructions), does words when not dword-aligned, and only falls back to single-byte reads when the address isn't word-aligned either.


Reply 17 of 22, by superfury

Rank: l33t++

Just optimized reading even further. It now tries to read a full dword in all cases, discarding the part that's not requested by shifting it off. This allows it to read 1, 2, 3 or 4 bytes at a time from any RAM, depending on the address modulo 4: when it's 0 it reads a full dword, when it's 1 it reads 3 bytes at once, when it's 2 it reads 2 bytes at once, and when it's 3 it reads only 1 byte (as that's the last byte of said dword).

This is now done for all memory reads(except (S)VGA VRAM, which is hardcoded for 1 byte at a time due to the shift registers etc.).
So now, all memory and ROM reads cache the entire dword and only read it once during a byte/word/dword read operation (unless it's overwritten by a memory (or flash) write to the cached addresses).

The BIU in turn will now (when reading any memory address) check that dword cache. If it's cached and the MMU still reports the last dword as cached (not invalidated by memory (or flash) writes), it will use its local cache instead, simply shifting the requested byte out of the cache. And of course, this includes repeated reads from the same cached address block (so reading the same dword multiple times without modifying it doesn't traverse all the way to actual RAM/ROM, using the BIU cache instead).
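The two core calculations described above, how many bytes are left in the containing dword and how a byte is shifted out of the latched value, can be sketched in a few lines of C (names are illustrative):

```c
#include <stdint.h>

/* Bytes readable in one go from the dword containing 'addr',
   i.e. up to the next 4-byte boundary: 4 - (addr MOD 4). */
static unsigned dword_bytes_available(uint32_t addr)
{
    return 4u - (addr & 3u);
}

/* Extract one byte from a little-endian dword that was latched
   starting at aligned address 'base'. */
static uint8_t latched_byte(uint32_t dword, uint32_t base, uint32_t addr)
{
    return (uint8_t)(dword >> ((addr - base) * 8));
}
```

So a read at offset MOD 4 == 1 can still pull 3 bytes in one backing access, and only the MOD 4 == 3 case degrades to a single byte.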


Reply 18 of 22, by superfury

Rank: l33t++

Just added another little modification based on danoon's comment. It now performs the paging check only for the low memory byte (not the higher one) of a word/dword when both x and x+3 refer to the same page (same linear address divided by 4096). It might not be optimal that way with 4MB pages, though: when the access falls in a different 4KB page within a 4MB (Pentium and up) page, it will still perform 2 checks instead of 1.
The normal segmentation check still always performs 2 checks, because it must take the segment limit into account, which requires checking the low byte below the limit and the high byte above the limit to fault properly (e.g. to catch limit x with word offset x, and limit x with dword offset x-(0 through 2)).
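The same-page test for a dword access at linear address x, as described above, boils down to a one-line predicate (a minimal sketch; the name is illustrative):

```c
#include <stdint.h>

/* A dword access at linear address x needs only one paging check when
   x and x+3 fall within the same 4KB page: XOR leaves page-number bits
   set only if the two addresses are on different pages. */
static int one_paging_check(uint32_t x)
{
    return ((x ^ (x + 3)) & ~(uint32_t)0xFFF) == 0;
}
```

Only accesses starting in the last 3 bytes of a page fail the test and need the second check.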


Reply 19 of 22, by superfury

Rank: l33t++

After some more optimization (mainly the DMA controller a bit), I can now see it reaching 39% at most (during the slowed-down 10-16MB POST RAM test)! That's almost as fast as the 8MHz 80286 used to be (which ran at about 40%).
