keenmaster486 wrote on 2025-05-08, 01:18:
I've been reading about the 8088's prefetch queue. So far I've been unable to achieve any performance improvements that I can definitively trace to keeping the prefetch queue full, though. Not really sure what I'm doing there.
The issue is likely that you can't keep the queue full. The 8088 has an undersized bus interface: the execution unit was designed for the 8086, which has twice the bandwidth because it can fetch two bytes at once. Looking at your MartyPC screenshot, I see "dec di" taking 2 cycles, yet it took the processor 4 cycles to fetch it (one byte). I see "mov al, 4" listed at 4 cycles, yet it took the processor 8 cycles to fetch that instruction (2 bytes). Looking at CB69, it took 8 cycles to execute "mov al,1", which seems to indicate that the queue had run completely empty at that point.
I'm surprised the "out dx,al" instruction is that slow (14 cycles). The 8088 timing reference I have at hand lists that instruction as "8 cycles". These tables always assume that the prefetch queue is "sufficiently filled", which it clearly is not after a "mov al,1" that consumed 8 cycles. So you need 4 extra cycles to fetch the OUT instruction, which gets you to 12 cycles (4 for fetching, then 8 for executing). My guess for the remaining 2 cycles is that the prefetch unit had already started fetching a second byte before the OUT instruction could claim the bus, so OUT has to wait for the FSB to become idle. The subsequent XCHG instruction is most likely slowed down primarily by the EGA memory wait states.
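To make the fetch arithmetic concrete, here is a rough Python sketch. It assumes 4 clocks per fetched byte, uses the execution times quoted above, and ignores wait states, DMA refresh and any overlap between prefetch and execution. The function and variable names are made up for illustration.

```python
CLOCKS_PER_BYTE = 4  # assumption: one 8088 bus cycle fetches one byte in 4 clocks

# (instruction, size in bytes, execution cycles as quoted in this post)
INSTRUCTIONS = [
    ("dec di", 1, 2),
    ("mov al, 4", 2, 4),
    ("out dx, al", 1, 8),
]

def fetch_clocks(size_bytes):
    """Clocks the bus needs just to fetch an instruction of this size."""
    return size_bytes * CLOCKS_PER_BYTE

def report(instructions):
    """Return (name, fetch clocks, exec cycles, fetch slower than exec?) per entry."""
    rows = []
    for name, size, exec_cycles in instructions:
        fetch = fetch_clocks(size)
        rows.append((name, fetch, exec_cycles, fetch > exec_cycles))
    return rows

for name, fetch, exec_cycles, starved in report(INSTRUCTIONS):
    print(f"{name:12} fetch={fetch:2} exec={exec_cycles:2} fetch-bound={starved}")

# Starting from an empty queue, OUT must be fetched before it can execute.
out_from_empty_queue = fetch_clocks(1) + 8
print("out dx, al from an empty queue:", out_from_empty_queue)
```

Any instruction whose fetch time exceeds its execution time drains the queue, which is why the short, fast instructions are the ones that end up waiting.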
Since your code, like most 8088 code, is mostly bottlenecked by the FSB, it is really unfortunate that the IBM XT mainboard is unable to do write posting to ISA cards, which would be a big win here. If the EGA card is currently unable to accept a byte (video RAM is busy for display purposes), it blocks the bus until it is ready. That not only stalls the execution unit, it also prevents the prefetch unit from fetching further instructions. If the EGA card on the ISA bus were "further away" from the processor than the RAM (like it is on 486 computers), the chipset could continue a write to EGA memory in the background on the ISA bus while the processor accesses memory on its frontside bus. Alas, the XT mainboard is far too simple for anything like this, and having all RAM appear on the only bus the system has (the "ISA" bus) is a design feature, so it is impossible to separate EGA memory writes from RAM access.
I don't know MartyPC (except from reading VOGONS threads about the progress of emulating Area 5150), so I don't know whether you can get further details on the timing, such as how many clocks the execution unit had to wait for a data access because the prefetch unit was in the middle of a fetch, how many wait states you had on the ISA bus, or whether you lost some cycles because the ISA bus was handed over to the DMA controller for memory refresh. Seeing those details would greatly help you understand why the same instruction shows different execution times.
keenmaster486 wrote on 2025-05-08, 01:18:
I have made a breakthrough on the masked tiles. Here's the secret: store the tiles in memory in packed plane format, i.e. 4-byte chunks representing the four planes for each byte in video memory. In my case, though, it's 5-byte chunks: the first byte is the mask. So I'm able to load both the mask value and the blue plane value with a single lodsw, and then I can use xchg to both load the latches and write the blue plane to the screen. Subsequent plane writes are done with movsb. The tradeoff is that I have to execute dec di for the green and red planes.
Have you compared the xchg approach (3 bytes) to "lodsb; mov es:[di], ah" (4 bytes) or "lodsb; mov al,ah; stosb; dec di" (5 bytes)? If you are bottlenecked on prefetching, the xchg instruction is the best approach, but if delays in the execution unit are also significant, you might get an improvement from the 4-byte approach. I guess the 5-byte instruction sequence will be the worst of all.
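As a back-of-the-envelope comparison, this Python sketch ranks the three variants purely by how long the bus needs to fetch them. It assumes 4 clocks per code byte and uses the sizes given above. It deliberately ignores execution time, data accesses and EGA wait states, so it only gives a lower bound for the fetch-bound case. The names are made up.

```python
CLOCKS_PER_BYTE = 4  # assumption: 4 clocks per fetched code byte

# Code sizes in bytes, as stated above
CANDIDATES = {
    "xchg approach": 3,
    "lodsb; mov es:[di], ah": 4,
    "lodsb; mov al, ah; stosb; dec di": 5,
}

def fetch_floor(size_bytes):
    """Minimum clocks spent fetching the code bytes, nothing else."""
    return size_bytes * CLOCKS_PER_BYTE

floors = {seq: fetch_floor(size) for seq, size in CANDIDATES.items()}
for seq, clocks in sorted(floors.items(), key=lambda item: item[1]):
    print(f"{clocks:2} clocks minimum to fetch: {seq}")
```

Only a measurement in MartyPC or on real hardware can tell whether the execution unit delays outweigh these fetch differences.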
By the way, are you aware that the Microsoft macro assembler is able to expand macros and generate repeats? Using the repeat feature might make maintaining your code easier, unless you already generate it using some external tool (e.g. a Python script).
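In case you go the external-tool route, a generator can be as small as the Python sketch below. The repeated body here is only a placeholder, not your actual routine, and the function name and label are made up. Inside MASM itself, a REPT ... ENDM block achieves the same kind of unrolling.

```python
def unroll(body, count, label="unrolled"):
    """Return assembly source text with `body` repeated `count` times."""
    lines = [f"{label}:"]
    for i in range(count):
        lines.append(f"    ; iteration {i}")
        lines.extend(f"    {insn}" for insn in body)
    return "\n".join(lines) + "\n"

# Placeholder body for illustration only
BODY = ["movsb", "dec di"]

print(unroll(BODY, 3))
```

Redirect the output into an include file and the assembler source only needs a single include line, so changing the unroll count means rerunning the script instead of editing repeated code by hand.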
keenmaster486 wrote on 2025-05-08, 01:18:
I also realized that I only have to update dl when I change which port I'm writing to, which makes port writes a bit faster, and sped up the unmasked routine slightly to 1169 tiles/sec in MartyPC.
That is a clear indication that you are indeed bottlenecked by instruction fetches. The execution unit spends the same time on "MOV DX, 3CF" and "MOV DL, CF", but fetching the instruction that updates the full DX register takes an extra bus cycle (i.e. four clocks on the bus).
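The size difference is easy to see from the encodings. This Python sketch builds the machine code for both forms (opcode B8+reg followed by a 16-bit immediate, and opcode B0+reg followed by an 8-bit immediate) and converts the one-byte difference into bus clocks, assuming 4 clocks per fetched byte. The helper names are made up.

```python
def mov_r16_imm16(reg, imm):
    """Encode MOV r16, imm16: opcode B8+reg, then the immediate, low byte first."""
    return bytes([0xB8 + reg, imm & 0xFF, (imm >> 8) & 0xFF])

def mov_r8_imm8(reg, imm):
    """Encode MOV r8, imm8: opcode B0+reg, then the immediate byte."""
    return bytes([0xB0 + reg, imm & 0xFF])

DX = 2  # register number of DX among the 16-bit registers
DL = 2  # register number of DL among the 8-bit registers

full = mov_r16_imm16(DX, 0x3CF)  # MOV DX, 3CF
low = mov_r8_imm8(DL, 0xCF)      # MOV DL, CF

extra_clocks = (len(full) - len(low)) * 4  # assumption: 4 clocks per byte
print(full.hex(), low.hex(), extra_clocks)
```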
keenmaster486 wrote on 2025-05-08, 01:18:
My masked routine now looks like this, and scores 538 tiles/sec in MartyPC:
I took a look at it. I see no glaringly obvious way to improve it.