Pulling my hair out with EGA programming.

Reply 40 of 61, by keenmaster486

Posted on 2025-05-05, 23:50

keenmaster486 Offline

Rank l33t

Rank: l33t
Posts: 2951
Joined: 2016-02-16, 02:04
Location: Gnosticus IV

riplin wrote on 2025-05-05, 23:33:

Sorry, I just blundered into this thread and saw that IO operation sitting outside your line loop. Is it not possible to do this at the sprite level or even scene level? Maybe it’s diminishing returns.

Have you calculated SOL (speed of light)? What’s the fastest possible you can achieve on this hardware? How far away are you from that?

Hmm good question. Not sure how I would calculate that. I did find this post from Jim Leonard though:

REP STOS is *faster* than REP MOVS on my 4.77MHz 8088 (REP STOSW
fills main memory @ 639KB/s; REP MOVSW copies main memory @ 352KB/s)

https://comp.lang.asm.x86.narkive.com/E5CeEnj … -improved#post2

Each tile is 128 bytes. But this is main memory and not video memory, so maybe that doesn’t help to calculate theoretical max tiles per second.
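
As a rough way to turn those figures into a ceiling, here is a small Python sketch (Python is only a calculator here). It assumes "KB" means 1024 bytes and that the quoted main-memory rates apply unchanged, which they won't for video memory, so treat the results as an optimistic bound rather than a target. The names are made up for illustration.

```python
# Optimistic tiles/sec ceiling from the main-memory rates quoted above.
# Assumption: 1 KB = 1024 bytes; EGA memory will be slower than this.
TILE_BYTES = 128  # size of one unmasked tile, per the post


def tiles_per_second(kb_per_second, tile_bytes=TILE_BYTES):
    """Convert a transfer rate in KB/s into whole-tile transfers per second."""
    return kb_per_second * 1024 / tile_bytes


stosw_limit = tiles_per_second(639)  # REP STOSW fill rate
movsw_limit = tiles_per_second(352)  # REP MOVSW copy rate

print(f"fill: {stosw_limit:.0f} tiles/sec, copy: {movsw_limit:.0f} tiles/sec")
```

The copy figure is the more relevant one for tile blits, since each byte has to be read from main memory before it can be written.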

World's foremost 486 enjoyer.

Reply 41 of 61, by keenmaster486

Posted on 2025-05-06, 00:56

Okay, rewriting using MOVSW results in a speed increase. 614 tiles/sec with loops, and 834 tiles/sec unrolled.

The inner loop now simply consists of:

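movsw           ; copy one word of the tile row; SI and DI advance by 2
add di, 42      ; move DI to the same column on the next row (44 - 2)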

Not sure how much simpler I can get than this. The trouble really is the fact that I have to increment the vmem pointer by 44 every time (movsw already increments it by 2 automatically, hence the 42).

Reply 42 of 61, by keenmaster486

Posted on 2025-05-06, 02:47

Rewrote the masked routine in asm as well now. A small performance improvement of 8% with MOVSB (329 tiles/sec vs 303 for the pointer arithmetic routine).

It was slower when I tried to use mkarcher's XCHG trick (289 tiles/sec). Not sure why yet.

I did write it with the plane loop on the outside though, to avoid doing too much arithmetic in the address calculations.

Reply 43 of 61, by keenmaster486

Posted on 2025-05-06, 03:36

Moving the plane loop to the inside gets me 321 tiles/second. So the winner so far, if not by much, is MOVSB with an outer plane loop and inner line loop, but fully unrolled.

The story thus far (on emulated 8088 4.77 MHz in 86Box, haven't tried on real hardware yet but I do have a 5160 with EGA):

My best high level routines with pointer arithmetic:
Unmasked: 609 tiles/sec
Masked: 303 tiles/sec

My best routines so far with assembly:
Unmasked: 834 tiles/sec
Masked: 329 tiles/sec

So far this feels like diminishing returns.

Reply 44 of 61, by keenmaster486

Posted on 2025-05-06, 04:55

This is what my masked routine currently looks like:

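mov si, dataOffset      ; DS:SI -> tile pixel data
mov bx, si
add bx, 128             ; DS:BX -> mask data, 128 bytes past the tile data
mov ax, 0A000h
mov es, ax              ; ES -> EGA video memory
mov di, memoffset       ; ES:DI -> destination in video memory

mov dx, 3CEh
mov al, 08h
out dx, al              ; Graphics Controller index 8: Bit Mask

mov dx, 3C4h
mov al, 02h
out dx, al              ; Sequencer index 2: Map Mask


; NOTE: repeat this block for each plane
  mov dx, 3C5h
  mov al, 01h           ; plane bit: 01h here, 02h/04h/08h for the other planes
  out dx, al
  mov dx, 3CFh          ; Bit Mask data port
  ; NOTE: repeat the following block of code 16 times
  mov al, [bx]          ; mask byte for the left half of the row
  out dx, al
  mov al, es:[di]       ; dummy read to load the latches
  movsb                 ; write the pixel byte through the bit mask
  inc bx
  mov al, [bx]          ; mask byte for the right half of the row
  out dx, al
  mov al, es:[di]
  movsb
  inc bx
  add di, 42            ; next row (44-byte stride minus the 2 bytes written)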

The masked drawing is where the pain point really is. I'll also be using masked drawing for sprites once I get to that point, and since I won't know the dimensions of the image beforehand, I'll have to use jump instructions there, which apparently are expensive.

Reply 45 of 61, by keenmaster486

Posted on 2025-05-06, 15:18

Tried replacing the two instances of mov al, [bx] with one mov ax, [bx] and an xchg ah, al - and it was actually a tiny bit slower than doing two separate memory reads. Interesting.

Reply 46 of 61, by mkarcher

Posted on 2025-05-06, 17:52

mkarcher Offline

Rank l33t

Rank: l33t
Posts: 3302
Joined: 2019-01-19, 16:29
Location: Germany

To further improve on the unmasked tiles, you likely need to use screen-to-screen copies using write mode 1. I understand that you can't reserve fixed tile memory because you want Keen-like "infinite scrolling", thus the frame buffer moves all over the 256KB of EGA video memory. Nevertheless, you can have a tile set in video memory that "jumps around" by evicting tiles that are about to be hit by the frame buffer to a space quite far away from the current frame buffer. You can't do write mode 1 tricks for masked writes, as you cannot have both the "background" and the "foreground" in the 32-bit latches, and you cannot perform a 32-bit operation between the latches and video memory contents (which could allow a masked update of the latches), but only 32-bit operations between expanded CPU data and latch contents.

For the 16-bit optimization, i.e. using MOV AX, [BX] to load the mask, do not use XCHG. That instruction is slow. Just use "MOV AL,AH" instead of "MOV AL,[BX]" for the second pixel.

When you unroll, you can use the "MOV AL,[bx+1]" form to load data for the second pixel, "MOV AL,[bx+2]" for the third one and so on. Up to +127, this will enlarge the mov instruction by one byte, which you get back in return by omitting the INC instruction. 8088 address calculation is likely slower with the offset, but OTOH you avoid the execution time of the INC instruction. I don't know offhand whether this will be a net win, but it is worth a shot.

When you said my xchg suggestion was slower, is this what you tried?

lodsb               ; load pixel byte
xchg es:[di], al    ; load latches, merge pixel
inc di

Your target system seems to be a 5160 with an original EGA card. For that card, you should be aware of its low overall performance. In 320-pixel graphics mode, you get one chance to do an 8-bit read or write per unit of 16 pixels displayed. A unit of 16 pixels takes 32 clocks of the 14.318 MHz crystal. A processor clock takes three clocks of the same crystal, so you get a chance to read or write a byte around every 10 processor clocks. This might actually be the reason why the xchg trick doesn't help. After finishing the read performed by XCHG, the write cannot be finished earlier than 10 processor clocks later. As in the optimal case (no idea whether the EGA can hit it) a bus cycle takes 4 clocks, you have at least 6 clocks wasted if you try to perform back-to-back cycles to EGA RAM (whether read or write).
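
The clock arithmetic in that paragraph can be checked with a few lines of Python. This is only a sketch of the numbers stated above (32 crystal clocks per 16-pixel unit, a divide-by-3 CPU clock, a best-case 4-clock bus cycle); the constant names are invented for illustration.

```python
# Timing of one EGA access slot, expressed in 8088 clocks.
CRYSTAL_MHZ = 14.318          # master crystal frequency, per the post
CPU_DIVIDER = 3               # CPU clock = crystal / 3
CRYSTAL_CLOCKS_PER_SLOT = 32  # one 16-pixel unit in 320-pixel mode
BUS_CYCLE_CPU_CLOCKS = 4      # best-case 8088 bus cycle, no wait states

cpu_mhz = CRYSTAL_MHZ / CPU_DIVIDER
cpu_clocks_per_slot = CRYSTAL_CLOCKS_PER_SLOT / CPU_DIVIDER
idle_clocks_per_slot = cpu_clocks_per_slot - BUS_CYCLE_CPU_CLOCKS

print(f"CPU clock: {cpu_mhz:.2f} MHz")
print(f"CPU clocks per access slot: {cpu_clocks_per_slot:.2f}")
print(f"Clocks left over after one bus cycle: {idle_clocks_per_slot:.2f}")
```

The exact slot length comes out a little above 10 clocks, so the "at least 6 clocks wasted" figure is on the conservative side.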

Reply 47 of 61, by keenmaster486

Posted on 2025-05-06, 20:41

Thanks for the info, mkarcher. Using it, I made the following improvements:

- Used mov al, ah instead of xchg for word size mov from mask data (333 tiles/sec)
- Used address addition rather than incrementing bx (343 tiles/sec)

I tried the lodsb method for the xchg trick, but although it was faster than what I was doing before (mov al, [si]), it was only 328 tiles/sec, so still a little slower than using movsb.

One more thing I tried: reordering the instructions so that reads and/or writes to video memory are spaced out with some instructions in between. Unfortunately this made no difference at all, and I am stuck at 343 tiles/sec.

I worry that keeping tiles in video memory and having to "evict" them when the buffer collides with them will eat up more precious render time than I would gain by using write mode 1, especially since the pain point here is not so much unmasked writes, but masked writes, and if I can't improve masked writes enough, they'll bottleneck the whole thing no matter how fast the unmasked routines are. Besides that, I have ideas of making the buffers as large as possible and using idle time to fill them with tiles so that the scroll routines have a significant number of tiles on either side ready to go.

Reply 48 of 61, by keenmaster486

Posted on 2025-05-06, 21:40

mkarcher wrote on 2025-05-06, 17:52:

Your target system seems to be a 5160 with an original EGA card. For that card, you should be aware of its low overall performance. In 320-pixel graphics mode, you get one chance to do an 8-bit read or write per unit of 16 pixels displayed. A unit of 16 pixels takes 32 clocks of the 14.318 MHz crystal. A processor clock takes three clocks of the same crystal, so you get a chance to read or write a byte around every 10 processor clocks. This might actually be the reason why the xchg trick doesn't help. After finishing the read performed by XCHG, the write cannot be finished earlier than 10 processor clocks later. As in the optimal case (no idea whether the EGA can hit it) a bus cycle takes 4 clocks, you have at least 6 clocks wasted if you try to perform back-to-back cycles to EGA RAM (whether read or write).

So this is making me think - does that mean I can calculate the theoretical maximum number of unmasked tiles that can be displayed per frame? If I'm targeting 30fps, i.e. half the EGA's vertical refresh rate, and I can write a maximum of 20*200=4000 bytes per vertical refresh which makes 8000 bytes per frame, and each tile is 128 bytes, then the theoretical maximum for the EGA card itself would be 62.5 tiles per frame.
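
The estimate above, spelled out as a quick Python check. The constant names are made up for illustration; the figures are the ones in the paragraph (20 byte slots per scanline, 200 visible lines, two vertical refreshes per 30fps frame), and blanking time is deliberately not counted.

```python
# First-pass EGA write budget per 30fps frame, active display only.
SLOTS_PER_VISIBLE_LINE = 20   # one byte slot per 16 pixels, 320 pixels wide
VISIBLE_LINES = 200
REFRESHES_PER_FRAME = 2       # 30fps on a 60Hz display
TILE_BYTES = 128

bytes_per_refresh = SLOTS_PER_VISIBLE_LINE * VISIBLE_LINES
bytes_per_frame = bytes_per_refresh * REFRESHES_PER_FRAME
tiles_per_frame = bytes_per_frame / TILE_BYTES

print(bytes_per_refresh, bytes_per_frame, tiles_per_frame)
```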

This certainly doesn't play out in 86Box, since when I set the machine type to 486DX2/66 with an "IBM EGA" card as the display adapter type, I get 7488 unmasked tiles/sec, which comes out to about 249 tiles/frame at EGA 30fps.

Reply 49 of 61, by keenmaster486

Posted on 2025-05-06, 23:25

MartyPC gives me some better numbers. 1167 unmasked/sec and 417 masked/sec. Maybe 86Box is not so accurate as I thought.

Once I get my 5160 set up, I can get some accurate numbers on real hardware. I have a real IBM EGA card as well.

Reply 50 of 61, by mkarcher

Posted on 2025-05-06, 23:46

keenmaster486 wrote on 2025-05-06, 21:40:

mkarcher wrote on 2025-05-06, 17:52:

Your target system seems to be a 5160 with an original EGA card. For that card, you should be aware of its low overall performance. In 320-pixel graphics mode, you get one chance to do an 8-bit read or write per unit of 16 pixels displayed. A unit of 16 pixels takes 32 clocks of the 14.318 MHz crystal. A processor clock takes three clocks of the same crystal, so you get a chance to read or write a byte around every 10 processor clocks. This might actually be the reason why the xchg trick doesn't help. After finishing the read performed by XCHG, the write cannot be finished earlier than 10 processor clocks later. As in the optimal case (no idea whether the EGA can hit it) a bus cycle takes 4 clocks, you have at least 6 clocks wasted if you try to perform back-to-back cycles to EGA RAM (whether read or write).

So this is making me think - does that mean I can calculate the theoretical maximum number of unmasked tiles that can be displayed per frame?

Yes, exactly. Maybe I mixed up some stuff though when I claimed one read or write per 16 pixels while in fact it might be one write per 16 clocks (not 32).

keenmaster486 wrote on 2025-05-06, 21:40:

I can write a maximum of 20*200=4000 bytes per vertical refresh which makes 8000 bytes per frame

This calculation is wrong because it omits the blanking periods. You don't have 20, but 28.5 "units of 16 pixels" per scanline. Furthermore, you don't have 200, but 262 lines per refresh. This already increases the available bandwidth to nearly 15000 bytes per 30Hz frame. Furthermore, during blanking, the EGA card might actually be able to allocate more bandwidth to the bus. Internally, it does 3 RAM cycles per 16 pixels, two of which are needed for display purposes and the third one is reserved for the ISA bus. During blanking, all three cycles can be used for the bus. But as I already calculated, it takes 10 processor clocks for a 16-pixel interval, and each bus cycle of the processor takes 4 processor clocks. You can't fit 12 processor clocks (3 bus cycles) into 10 processor clocks (the time for a 16-pixel unit), so you will surely miss every other chance to access video memory during blanking, going down to 1.5 times the rate available during video display. So we get 8000 bytes during active display, and just short of 7000*1.5 during blanking, which is nearly 18500 bytes per frame, assuming I didn't mess up the initial assumption.
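
Worked through in Python, the budget looks like this. It is a sketch under the assumptions stated in the paragraph above: 28.5 slots per scanline, 262 scanlines per refresh, two refreshes per 30Hz frame, and blanking slots worth 1.5 bytes each on the XT. The function and parameter names are hypothetical.

```python
# EGA write budget per 30Hz frame, counting blanking time.
TILE_BYTES = 128


def bytes_per_frame(blanking_factor, slots_per_line=28.5, lines=262,
                    active_slots_per_line=20, active_lines=200,
                    refreshes_per_frame=2):
    """One byte per active-display slot, blanking_factor bytes per blanking slot."""
    total_slots = slots_per_line * lines * refreshes_per_frame
    active_slots = active_slots_per_line * active_lines * refreshes_per_frame
    blanking_slots = total_slots - active_slots
    return active_slots + blanking_slots * blanking_factor


flat_bytes = bytes_per_frame(blanking_factor=1.0)  # same rate everywhere
xt_bytes = bytes_per_frame(blanking_factor=1.5)    # XT can use 1.5 slots in blanking
xt_tiles = xt_bytes / TILE_BYTES

print(flat_bytes, xt_bytes, round(xt_tiles, 1))
```

With these inputs the flat figure is 14934 bytes ("nearly 15000") and the XT figure is 18401 bytes, a little under 144 tiles.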

keenmaster486 wrote on 2025-05-06, 21:40:

and each tile is 128 bytes, then the theoretical maximum for the EGA card itself would be 62.5 tiles per frame.

With 18401 bytes per frame, this will be 144 tiles per frame.

keenmaster486 wrote on 2025-05-06, 21:40:

This certainly doesn't play out in 86Box, since when I set the machine type to 486DX2/66 with an "IBM EGA" card as the display adapter type, I get 7488 unmasked tiles/sec

On a 486, you might be able to hit all the bus slots during blanking (if the bus controller isn't adding wait states for compatibility purposes), resulting in 28800 bytes/frame or 225 tiles per frame (this would be 864KB/s). You surely can't reach that in practice, because 5% to 10% of the ISA bus bandwidth is wasted by refresh cycles on the bus. Those cycles are there because the ISA bus was meant to have DRAM cards expanding the 64KB that fits on the original 5150 mainboard, without having a refresh controller on each RAM card. 800KB/s is observed with old 8-bit VGA cards, but it's way higher than what you can expect from an EGA card in an XT (possibly REP STOSW on a V20 could get to that rate, but nothing else).
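
The 486 figures follow from the same slot counts if all three RAM cycles per blanking slot go to the bus. A minimal Python check, assuming the same 28.5 x 262 slot grid and applying the 5% to 10% refresh loss as a flat reduction (variable names are illustrative only):

```python
# Upper bound for a fast CPU that can use every bus slot during blanking.
TILE_BYTES = 128
FRAMES_PER_SECOND = 30

active_bytes = 20 * 200 * 2                       # one byte per active slot
blanking_slots = 28.5 * 262 * 2 - active_bytes    # slots outside active display
bytes_per_frame_486 = active_bytes + 3 * blanking_slots
tiles_per_frame_486 = bytes_per_frame_486 / TILE_BYTES
bytes_per_second_486 = bytes_per_frame_486 * FRAMES_PER_SECOND

# Refresh cycles take 5% to 10% of the ISA bus, per the post.
tiles_best = tiles_per_frame_486 * (1 - 0.05)
tiles_worst = tiles_per_frame_486 * (1 - 0.10)

print(bytes_per_frame_486, round(tiles_per_frame_486), bytes_per_second_486)
print(round(tiles_worst), round(tiles_best))
```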

keenmaster486 wrote on 2025-05-06, 21:40:

which comes out to about 249 tiles/frame at EGA 30fps.

This indeed looks quite high, especially as the estimates assumed all time is spent on writing to the EGA card. Even if everything is optimal, about 210 tiles/frame is the upper limit (dropped from 225 to 210 due to RAM refresh), and since you likely won't keep the bus 100% busy with EGA writes, I estimate 170-200 tiles per frame as the highest you could achieve on a real EGA card in a real 486 system with sufficiently low 8-bit memory wait states. 250 tiles/frame makes me question whether the emulator is faithfully emulating all the bottlenecks that occur in that system.

Reply 51 of 61, by mkarcher

Posted on 2025-05-06, 23:55

keenmaster486 wrote on 2025-05-06, 23:25:

MartyPC gives me some better numbers. 1167 unmasked/sec and 417 masked/sec. Maybe 86Box is not so accurate as I thought.

Once I get my 5160 set up, I can get some accurate numbers on real hardware. I have a real IBM EGA card as well.

MartyPC is clearly the emulator you want to use for cycle-accurate measurement of what to expect from a 5150/5160 system. The "up to 210 unmasked/frame", i.e. 6300 unmasked/sec, in the 486 calculation assumed that basically everything is blocked on the EGA writes. That might be justified on a 486 running everything else from its L1 cache, but it is completely off for the 5160, as that system needs to fetch all the opcode bytes and the tile data bytes over the ISA bus as well (even the mainboard RAM behaves performance-wise as if it were connected to the ISA bus). Furthermore, I already established that on a 5160 you shouldn't assume 225 minus refresh, but just 144 minus refresh, so something like 135 unmasked/frame (which is 4050/sec). So currently the performance of "just saturating the ISA bus with EGA memory writes" is around 3.5 times the speed you get when drawing tiles. This sounds plausible for the 5160.
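
For reference, the ratio quoted above comes from these numbers. A short Python check, using the 135 tiles/frame estimate from this post and the 1167 tiles/sec MartyPC measurement quoted at the top of it (names are illustrative):

```python
# Bus-saturation ceiling on the 5160 versus the measured unmasked rate.
FRAMES_PER_SECOND = 30
bus_ceiling_per_frame = 135        # 144 minus refresh, estimate from the post
measured_per_second = 1167         # MartyPC unmasked result

bus_ceiling_per_second = bus_ceiling_per_frame * FRAMES_PER_SECOND
ratio = bus_ceiling_per_second / measured_per_second

print(bus_ceiling_per_second, round(ratio, 2))
```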

Reply 52 of 61, by keenmaster486

Posted on 2025-05-08, 01:18

mkarcher wrote on 2025-05-06, 23:46:

This calculation is wrong because it omits the blanking periods. You don't have 20, but 28.5 "units of 16 pixels" per scanline. Furthermore, you don't have 200, but 262 lines per refresh. This already increases the available bandwidth to nearly 15000 bytes per 30Hz frame. Furthermore, during blanking, the EGA card might actually be able to allocate more bandwidth to the bus. Internally, it does 3 RAM cycles per 16 pixels, two of which are needed for display purposes and the third one is reserved for the ISA bus. During blanking, all three cycles can be used for the bus. But as I already calculated, it takes 10 processor clocks for a 16-pixel interval, and each bus cycle of the processor takes 4 processor clocks. You can't fit 12 processor clocks (3 bus cycles) into 10 processor clocks (the time for a 16-pixel unit), so you will surely miss every other chance to access video memory during blanking, going down to 1.5 times the rate available during video display. So we get 8000 bytes during active display, and just short of 7000*1.5 during blanking, which is nearly 18500 bytes per frame, assuming I didn't mess up the initial assumption.

Yeah I didn't think about the blanking periods.

18500 bytes per frame does sound about right.

mkarcher wrote on 2025-05-06, 23:55:
for the 5160, as that system needs to fetch all the opcode bytes and the tile data bytes over the ISA bus as well (even the mainboard RAM behaves performance-wise as if it were connected to the ISA bus).

I've been reading about the 8088's prefetch queue. So far I've been unable to achieve any performance improvements that I can definitively trace to keeping the prefetch queue full, though. Not really sure what I'm doing there.

I have made a breakthrough on the masked tiles. Here's the secret: store the tiles in memory in packed plane format, i.e. 4-byte chunks representing the four planes for each byte in video memory. In my case, though, it's 5-byte chunks: the first byte is the mask. So I'm able to load both the mask value and the blue plane value with a single lodsw, and then I can use xchg to both load the latches and write the blue plane to the screen. Subsequent plane writes are done with movsb. The tradeoff is that I have to execute dec di for the green and red planes.
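
To illustrate the layout, here is a small Python sketch of a converter that could run at build time. It is not the actual tool used for this project; the function name, the argument layout (a 32-byte mask plus four 32-byte planes in blue, green, red, intensity order) and the sample values are all assumptions made for the example.

```python
# Interleave a planar masked tile into 5-byte chunks:
# mask, blue, green, red, intensity for each screen byte.
def interleave_masked_tile(mask, planes):
    """mask: bytes for one plane's worth of positions; planes: four equal-length sequences."""
    if len(planes) != 4 or any(len(p) != len(mask) for p in planes):
        raise ValueError("expected a mask and four planes of equal length")
    packed = bytearray()
    for i in range(len(mask)):
        packed.append(mask[i])       # AL after lodsw
        for plane in planes:
            packed.append(plane[i])  # blue lands in AH, the rest go out via movsb
    return bytes(packed)


# Dummy 16x16 tile: 2 bytes per row, 16 rows, so 32 bytes per plane.
mask = bytes([0xFF] * 32)
planes = [bytes([0x10 * (p + 1)] * 32) for p in range(4)]
packed = interleave_masked_tile(mask, planes)

print(len(packed), packed[:5].hex())
```

A masked 16x16 tile comes out at 160 bytes in this layout: the 128 pixel bytes plus 32 mask bytes, read strictly front to back by SI.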

I also realized that I only have to update dl when I change which port I'm writing to, which makes port writes a bit faster, and sped up the unmasked routine slightly to 1169 tiles/sec in MartyPC.

My masked routine now looks like this, and scores 538 tiles/sec in MartyPC:

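mov si, dataOffset      ; DS:SI -> packed tile data (mask + 4 planes per byte)
mov di, memoffset       ; ES:DI -> destination in video memory
mov ax, 0A000h
mov es, ax

mov dx, 3CEh
mov al, 08h
out dx, al              ; Graphics Controller index 8: Bit Mask

mov dl, 0C4h            ; DX = 3C4h, only the low byte changes
mov al, 02h
out dx, al              ; Sequencer index 2: Map Mask

; NOTE: repeat the following block 16 times
mov dl, 0CFh            ; DX = 3CFh, Bit Mask data
lodsw                   ; AL = mask, AH = blue plane byte
out dx, al              ; set the bit mask
mov dl, 0C5h            ; DX = 3C5h, Map Mask data
mov al, 01h
out dx, al              ; select plane 0 (blue)
xchg ah, es:[di]        ; read loads the latches, write stores blue
mov al, 02h
out dx, al              ; select plane 1 (green)
movsb
dec di                  ; undo the increment, same screen byte again
mov al, 04h
out dx, al              ; select plane 2 (red)
movsb
dec di
mov al, 08h
out dx, al              ; select plane 3 (intensity)
movsb                   ; DI now points at the next screen byte
mov dl, 0CFh            ; second byte of the row, same sequence
lodsw
out dx, al
mov dl, 0C5h
mov al, 01h
out dx, al
xchg ah, es:[di]
mov al, 02h
out dx, al
movsb
dec di
mov al, 04h
out dx, al
movsb
dec di
mov al, 08h
out dx, al
movsb
add di, 42              ; next row (44-byte stride minus the 2 bytes written)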

Reply 53 of 61, by keenmaster486

Posted on 2025-05-08, 02:47

Here's what the debug output of MartyPC looks like for the first byte of a row and most of the second byte for this masked routine:

The attachment Screenshot from 2025-05-07 20-45-40.png is no longer available

Interesting that there is so much variation in the cycle count for identical instructions. I'm assuming this has to do with the prefetch queue and wait states.

Reply 54 of 61, by mkarcher

Posted on 2025-05-08, 21:19

keenmaster486 wrote on 2025-05-08, 01:18:

I've been reading about the 8088's prefetch queue. So far I've been unable to achieve any performance improvements that I can definitively trace to keeping the prefetch queue full, though. Not really sure what I'm doing there.

The issue is likely that you can't keep the queue full. The 8088 has an undersized bus interface. The design of the execution unit is made for the 8086 which has twice the bandwidth, as it can fetch two bytes at once. Looking at your MartyPC screenshot, I see "dec di" taking 2 cycles. Yet it took the processor 4 cycles to fetch it (one byte). And I see "mov al, 4" listed at 4 cycles. Yet it took the processor 8 cycles to fetch that instruction (2 bytes). Looking at CB69, it took 8 cycles to execute "mov al,1", which seems to indicate that at this point the queue ran completely empty. I'm surprised the "out dx,al" instruction is that slow (14 cycles). The 8088 timing reference I have at hand lists that instruction as "8 cycles". These tables always assume that the prefetch queue is "sufficiently filled", which it is clearly not after you had a "mov al,1" instruction consuming 8 cycles. So you need 4 extra cycles to fetch the out instruction, which would arrive at 12 cycles (4 for fetching, then 8 for executing). I guess the reason is that the prefetching unit already started fetching a second byte before the OUT instruction could claim the bus, so the OUT instruction has to wait for the FSB to become idle. The subsequent XCHG instruction is most likely slowed down primarily by the EGA memory wait states.
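
The fetch-versus-execute comparison above can be written as a crude per-instruction estimate. The Python sketch below is a deliberate simplification, not a model of the real BIU: it assumes 4 clocks per opcode byte with no wait states, ignores data accesses and whatever the queue already holds, and takes the execution counts from the paragraph above. The function name is hypothetical.

```python
# Crude estimate: an instruction cannot retire faster than it can be fetched.
FETCH_CLOCKS_PER_BYTE = 4  # one 8088 bus cycle per opcode byte, no wait states


def fetch_bound_cycles(exec_cycles, length_bytes):
    """Larger of the listed execution time and the time to fetch the bytes."""
    return max(exec_cycles, FETCH_CLOCKS_PER_BYTE * length_bytes)


# (instruction, listed execution cycles, encoded length in bytes)
examples = [("dec di", 2, 1), ("mov al, 4", 4, 2)]
estimates = {name: fetch_bound_cycles(cycles, size)
             for name, cycles, size in examples}

for name, cycles in estimates.items():
    print(f"{name}: at least {cycles} clocks when the queue is empty")
```

Both example instructions end up limited by the fetch, which is the pattern described for most of the routine.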

As your code is (as most 8088 code) mostly bottlenecked by the FSB, it is really unfortunate that the IBM XT mainboard is unable to do write posting to ISA cards, which would be a big win here. If the EGA card is currently unable to accept a byte (video RAM is busy for display purposes), it blocks the bus until it is ready, which will not only prevent further progress in the execution unit, but will also prevent the prefetch unit from fetching further instructions. If the EGA card on the ISA bus was "further away" from the processor than the RAM (like it is on 486 computers), you could have the chipset continue a write to EGA memory in the background on the ISA bus, while the processor is able to access memory on its frontside bus. Alas, the complexity of the XT mainboard is way too low for stuff like this, and having all RAM appear on the only bus the system has (the "ISA" bus) is a design feature, so it is impossible to separate EGA memory writes from RAM access.

I don't know MartyPC (except from reading VOGONs threads about the progress of emulating Area 5150), so I don't know whether you are able to get further details on the timing (like how many clocks the execution unit had to wait to perform a data access, as the prefetch unit was currently in progress of prefetching; like how many wait states you had on the ISA bus; like whether you lost some cycles because the ISA bus was handed over to the DMA controller for memory refresh). Seeing those details would greatly help you to understand why the same instruction is observed at different execution times.

keenmaster486 wrote on 2025-05-08, 01:18:

I have made a breakthrough on the masked tiles. Here's the secret: store the tiles in memory in packed plane format, i.e. 4-byte chunks representing the four planes for each byte in video memory. In my case, though, it's 5-byte chunks: the first byte is the mask. So I'm able to load both the mask value and the blue plane value with a single lodsw, and then I can use xchg to both load the latches and write the blue plane to the screen. Subsequent plane writes are done with movsb. The tradeoff is that I have to execute dec di for the green and red planes.

Have you compared the xchg approach (3 bytes) to "lodsb; mov es:[di], ah" (4 bytes) or "lodsb; mov al,ah; stosb; dec di" (5 bytes)? If you are bottlenecked on prefetching, the xchg instruction is the best approach, but if delays in the execution unit are also significant, you might get an improvement using the 4-byte approach. I guess the 5 byte instruction sequence will be the worst of all.

By the way, are you aware that the Microsoft macro assembler is able to expand macros and generate repeats? Using the repeat feature might make maintaining your code easier, unless you already generate it using some external tool (e.g. a python script).

keenmaster486 wrote on 2025-05-08, 01:18:

I also realized that I only have to update dl when I change which port I'm writing to, which makes port writes a bit faster, and sped up the unmasked routine slightly to 1169 tiles/sec in MartyPC.

Which is a clear indication that you are indeed bottlenecked by instruction fetches. The time spent by the execution unit on "MOV DX, 3CFh" and "MOV DL, 0CFh" is identical. But you need an extra bus cycle (i.e. four clocks on the bus) to fetch the instruction that updates the full DX register.

keenmaster486 wrote on 2025-05-08, 01:18:

My masked routine now looks like this, and scores 538 tiles/sec in MartyPC:

I took a look at it. I see no glaringly obvious way to improve it.

Reply 55 of 61, by GloriousCow

Posted on 2025-05-21, 03:01

GloriousCow Offline

Rank Member

Rank: Member
Posts: 487
Joined: 2022-09-12, 20:00

keenmaster486 wrote on 2025-05-08, 01:18:

I've been reading about the 8088's prefetch queue. So far I've been unable to achieve any performance improvements that I can definitively trace to keeping the prefetch queue full, though. Not really sure what I'm doing there.

This is very hard to do. I've heard from a few people who were excited about MartyPC trying to give it a go, and they didn't really have any luck. The queue is teensy and the bus is the main bottleneck. I did some experiments with changing the queue size and leaving everything else about the BIU alone, and there was negligible impact from reducing the queue size to 3 or increasing it to anything larger; I only noticed about a 10% dip in performance with a queue size of 2.

So yeah. It's just ... there, but it's far more effectively utilized on the 8086.

mkarcher wrote on 2025-05-08, 21:19:

The issue is likely that you can't keep the queue full. The 8088 has an undersized bus interface. The design of the execution unit is made for the 8086 which has twice the bandwidth, as it can fetch two bytes at once. Looking at your MartyPC screenshot, I see "dec di" taking 2 cycles. Yet it took the processor 4 cycles to fetch it (one byte). And I see "mov al, 4" listed at 4 cycles. Yet it took the processor 8 cycles to fetch that instruction (2 bytes). Looking at CB69, it took 8 cycles to execute "mov al,1", which seems to indicate that at this point the queue ran completely empty. I'm surprised the "out dx,al" instruction is that slow (14 cycles). The 8088 timing reference I have at hand lists that instruction as "8 cycles". These tables always assume that the prefetch queue is "sufficiently filled", which it is clearly not after you had a "mov al,1" instruction consuming 8 cycles. So you need 4 extra cycles to fetch the out instruction, which would arrive at 12 cycles (4 for fetching, then 8 for executing). I guess the reason is that the prefetching unit already started fetching a second byte before the OUT instruction could claim the bus, so the OUT instruction has to wait for the FSB to become idle. The subsequent XCHG instruction is most likely slowed down primarily by the EGA memory wait states.

There are interesting delays afoot in the BIU fetching logic, such as stalls at a queue length of 3. Plus I assume that DRAM refresh DMA is going on, which will randomly inject 1-6 wait states into anything every 72 cycles. I have been asked many times by folks wondering if MartyPC executed the wrong number of cycles, and every time we've looked into it, it's just the 8088 doin' 8088 stuff.

As you have noticed, the published cycle timings are all "best case scenarios" you rarely ever hit in practice, and so a lot of code people have carefully counted out from reference material has very different actual execution times.

mkarcher wrote on 2025-05-08, 21:19:

I don't know MartyPC (except from reading VOGONs threads about the progress of emulating Area 5150), so I don't know whether you are able to get further details on the timing (like how many clocks the execution unit had to wait to perform a data access, as the prefetch unit was currently in progress of prefetching; like how many wait states you had on the ISA bus; like whether you lost some cycles because the ISA bus was handed over to the DMA controller for memory refresh). Seeing those details would greatly help you to understand why the same instruction is observed at different execution times.

You can toggle on full cycle-level tracing at any time, but I can't say anyone will understand the trace log format except me. I have yet to really write that documentation, but there is a video https://www.youtube.com/watch?v=cE8MihFf6OI&t=1s

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc

Reply 56 of 61, by keenmaster486

Posted on 2025-06-04, 20:34

Does anyone know how to execute a near jmp to a label in WASM (not a short jmp)? I cannot find any information about this anywhere on the internet.

Reply 57 of 61, by BloodyCactus

Posted on 2025-06-04, 21:10

BloodyCactus Offline

Rank Oldbie

Rank: Oldbie
Posts: 1529
Joined: 2016-02-03, 13:34
Location: Lexington VA

A short jump is a near jump. In 16-bit mode it's +/- 127 bytes. That's a short jump.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 58 of 61, by mkarcher

Posted on 2025-06-04, 21:29

keenmaster486 wrote on 2025-06-04, 20:34:

Does anyone know how to execute a near jmp to a label in WASM (not a short jmp)? I cannot find any information about this anywhere on the internet.

WASM (the Watcom assembler) is MASM (Microsoft Macro Assembler) compatible. The required syntax is "JMP NEAR label".

Reply 59 of 61, by keenmaster486

Posted on 2025-06-04, 21:33

I tried using "jmp near <label>" (actually I'm using jnz, so "jnz near") but it would not compile - said "operator is expected".

In any case what I need is a jmp instruction to a label that is not limited to +/- 127.
