Also, what I'm currently testing on is mostly the BIOS POST (which doesn't enable paging at all). I already see 36.58% being spent on executing the virtual CPU, of which 16.99% goes to reading opcodes and prefixes (and 12.37% to executing the instructions themselves). Of that, 12.79% is fetching the opcode bytes themselves, with the rest mostly being 4.37% on memory access checking (including 1.99% for non-paging memory), the prefetch queue (0.48%), keeping the current opcode (0.08%) and determining the current default operand/address sizes for the instruction (0.13%). Of said opcode byte fetching, 9.17% is spent purely on fetching the opcodes from memory into the PIQ.
Of said function (BIU_dosboxTick, at 9.17%), 3.52% is its own code, 2.33% is the bus access for memory (anything read from RAM/ROM/MMIO) and 1.22% is checking the paging unit's address (checkMMUaccess(), which calls the MMU's checkDirectMMUaccess(), which does nothing for this code: paging is disabled, so it simply returns 0). With paging enabled, said function will usually call CPU_Paging_checkPage, which will in turn call readTLB up to 2 times (once for 4MB pages (Pentium only) and once for 4KB pages, finishing on the very first one found). Then there's 0.94% for writing the PIQ data to the queue, 0.47% for checking the PIQ's free size and finally 0.51% for calculating the linear memory address from the segment:offset pair.
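To make the paging check described above a bit more concrete, here's a rough sketch of that lookup order in C. This is not the emulator's actual code: the names (sketch_check_page, sketch_read_tlb, mmu_t, tlb_entry_t), the TLB size and the return convention (0 = access can proceed, 1 = TLB miss) are all made up for illustration. It only shows the idea: return immediately when paging is off, otherwise try the 4MB entry first and the 4KB entry second, stopping at the first hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 8u /* hypothetical size, not taken from the emulator */

typedef struct {
    bool valid;
    bool large;     /* true: 4MB page, false: 4KB page */
    uint32_t tag;   /* linear address with the page offset masked off */
    uint32_t frame; /* physical base address of the page */
} tlb_entry_t;

typedef struct {
    bool paging_enabled;  /* stands in for CR0.PG */
    bool large_pages;     /* stands in for CR4.PSE (4MB pages) */
    tlb_entry_t tlb[TLB_ENTRIES];
    unsigned tlb_lookups; /* counter, only here to make the cost visible */
} mmu_t;

/* Hypothetical TLB lookup for one page size; true on a hit. */
static bool sketch_read_tlb(mmu_t *mmu, uint32_t linear, bool large,
                            uint32_t *physical)
{
    const uint32_t offset_mask = large ? 0x003FFFFFu : 0x00000FFFu;
    const uint32_t tag = linear & ~offset_mask;

    assert(mmu != NULL && physical != NULL);
    ++mmu->tlb_lookups;
    for (size_t i = 0; i < TLB_ENTRIES; ++i) {
        const tlb_entry_t *e = &mmu->tlb[i];
        if (e->valid && e->large == large && e->tag == tag) {
            *physical = e->frame | (linear & offset_mask);
            return true;
        }
    }
    return false;
}

/* Hypothetical page check: 0 = access can proceed, 1 = TLB miss. */
static int sketch_check_page(mmu_t *mmu, uint32_t linear, uint32_t *physical)
{
    if (!mmu->paging_enabled) {
        *physical = linear; /* paging off: linear == physical, no lookups */
        return 0;
    }
    /* Try the 4MB entry first, then the 4KB one; stop at the first hit. */
    if (mmu->large_pages && sketch_read_tlb(mmu, linear, true, physical))
        return 0;
    if (sketch_read_tlb(mmu, linear, false, physical))
        return 0;
    return 1; /* a real MMU would walk the page tables or fault here */
}
```

With paging disabled (the BIOS POST case) this does no TLB lookups at all, so in a sketch like this whatever cost remains is just the call and the flag test.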
Then, of the bus access (2.33%), it becomes 3.12% including normal instruction memory reads, of which 1.44% is reading the BIOS ROM in memory. That's only one function executing: BIOS_readhandler. Of that, 0.10% is determining it's the ROM, 0.08% checking for a linear ROM (custom ROM), 0.12% determining the ROMs used (U18/19, U13/15, U27/47 or U34/35), 0.12% checking for the doubled ROM subtraction, 0.08% determining the ROM number (13 or 15) and checking for out-of-range of the allocated ROM, and finally (the heaviest part) 0.27% reading the byte from the ROM and storing it in the result, plus 0.47% returning success (which is simply a "return 1;") so the MMU doesn't map to RAM or NULL (which is the case with the ROM area) instead.
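For reference, the general shape of such a ROM read handler, heavily simplified, looks something like the sketch below. This is not BIOS_readhandler itself: rom_t and sketch_rom_read are hypothetical names, and it assumes a single flat ROM image, so the linear/custom ROM check and the selection between the different ROM chips are left out entirely. It just shows the range check, the subtraction to get the ROM offset, the byte read and the "return 1" that tells the caller the ROM handled the access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one ROM image mapped into memory. */
typedef struct {
    const uint8_t *data; /* ROM contents */
    uint32_t size;       /* size of the image in bytes */
    uint32_t base;       /* first physical address the ROM answers to */
} rom_t;

/*
 * Hypothetical read handler: returns 1 and stores the byte in *result when
 * the address falls inside the ROM, 0 otherwise so the caller can fall back
 * to RAM (or unmapped memory).
 */
static int sketch_rom_read(const rom_t *rom, uint32_t address, uint8_t *result)
{
    uint32_t offset;

    assert(rom != NULL && result != NULL);
    if (address < rom->base)
        return 0;                /* below the ROM area: not ours */
    offset = address - rom->base; /* translate to an offset into the image */
    if (offset >= rom->size)
        return 0;                /* past the end of the allocated ROM */
    *result = rom->data[offset];  /* the actual read */
    return 1;                    /* handled: don't map to RAM instead */
}
```

Even in this stripped-down form, most of the work is the checks around the read rather than the read itself, which fits the percentages above.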
Edit: Just saw a little optimization (just a little one): the check for the PIQ's free size is done twice (once at the start, once further down). Those can easily be combined into a single call (the value doesn't change in between anyway).
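The idea, as a minimal sketch (not the real BIU_dosboxTick; piq_t, piq_freesize, piq_write and both tick functions are hypothetical names, and the queue size is made up): query the free size once, keep it in a local, and reuse that local for the later check. The call counter exists only to show the difference.

```c
#include <assert.h>
#include <stdint.h>

#define PIQ_SIZE 16u /* hypothetical queue size, not taken from the emulator */

typedef struct {
    uint8_t data[PIQ_SIZE];
    uint32_t head;  /* next write position */
    uint32_t count; /* bytes currently queued */
} piq_t;

static unsigned piq_freesize_calls = 0; /* only here to make the cost visible */

static uint32_t piq_freesize(const piq_t *q)
{
    ++piq_freesize_calls;
    return PIQ_SIZE - q->count;
}

static void piq_write(piq_t *q, uint8_t value)
{
    assert(q->count < PIQ_SIZE);
    q->data[q->head] = value;
    q->head = (q->head + 1u) % PIQ_SIZE;
    ++q->count;
}

/* Before: the free size is queried at the start and again further down. */
static int prefetch_tick_twice(piq_t *q, uint8_t fetched)
{
    if (piq_freesize(q) == 0u)
        return 0; /* queue full, nothing to do */
    /* ... address calculation and access checks would go here ... */
    if (piq_freesize(q) < 1u)
        return 0; /* same question asked again, same answer */
    piq_write(q, fetched);
    return 1;
}

/* After: one query, cached in a local, since nothing changes in between. */
static int prefetch_tick_once(piq_t *q, uint8_t fetched)
{
    const uint32_t freesize = piq_freesize(q);
    if (freesize < 1u)
        return 0; /* queue full, nothing to do */
    /* ... address calculation and access checks would go here ... */
    piq_write(q, fetched);
    return 1;
}
```

This only holds as long as nothing between the two checks can write to or flush the queue; if something could, the cached value would be stale.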
Edit: Said optimization brings checking the PIQ free size down to 0.27% (previously 0.47%).
The function itself is now down to 2.83% for its own contents and 7.6% for the entire function (previously 3.52% for its contents and 9.17% for the entire function).
Still, 7.6% is getting close to 1/10th of the entire execution time, which is quite a lot?