pan069 wrote on 2025-05-03, 05:18:
Looking over your code the past few posts I noticed you're selecting the plane in the most inner part of your loop. Selecting the plane needs to write to a port (in assembler this is "out" instruction). Writing to ports is super slow so you want to do it the least amount of times.
If keenmaster486 is running the code on a typical 486 machine, this is likely correct. Also, this is one of the reasons the 16-color EGA programming model aged like milk. On the 8088 in the PC/XT, an I/O cycle and a memory cycle both took 4 FSB clocks (that covers only the cycle itself, not the FSB clocks to fetch the instruction that performs the cycle or any delays introduced by the execution unit), so the base speed of I/O and memory is similar. On the EGA card, I/O cycles can be performed without extra wait states beyond those 4 FSB clocks, i.e. in around 840 ns on a 4.77 MHz XT. On the other hand, video memory reads need to wait for the start of the next DRAM cycle that is not used for image refresh, which usually incurs extra wait states. So on the EGA in an XT, I/O writes are supposed to be faster than video memory reads!
Furthermore, an instruction like "mov al,es:[bx]", a typical video memory read instruction, is 3 bytes, which take 12 clocks to fetch. The execution of that instruction takes 8 clocks plus the time to evaluate ES:BX as address, which is 5 extra clocks. The segment override adds another two clocks, so the execution time of that instruction is 15 clocks! In contrast, the instruction "out dx,al" (assuming DX already contains 3CFh, and the index register already points to the bit mask register) takes just one byte (4 clocks to fetch), and has an execution time of 8 clocks! Note that the 8088 overlaps execution with pre-fetching instructions, so you must not simply add the "fetch cycles" and the "execute cycles" together. Often, the 8088 is limited by its bus performance, so the execution clocks don't matter at all, and the sum of the fetch clocks plus the clocks spent on data transactions determines the execution speed. So again, the I/O instruction is a win here! The EGA programming model was very well suited to the performance characteristics of an XT.
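To make the bus-bound arithmetic above concrete, here is a rough sketch in C. The helper names are made up for illustration, and the model is only the simple "fetch clocks plus data clocks" estimate described above: 4 clocks per byte-wide bus cycle, with any video memory wait states passed in by the caller, since their exact number depends on the card.

```c
#include <assert.h>

/* Clocks per 8088 bus cycle without wait states (figure taken from the post). */
enum { CLOCKS_PER_BUS_CYCLE = 4 };

/*
 * Hypothetical helper: bus-bound estimate for one instruction.
 * Every instruction byte and every data byte costs one bus cycle on the
 * 8-bit 8088 bus; wait_states is an assumption supplied by the caller.
 */
unsigned bus_bound_clocks(unsigned instr_bytes, unsigned data_cycles,
                          unsigned wait_states)
{
    return (instr_bytes + data_cycles) * CLOCKS_PER_BUS_CYCLE + wait_states;
}

/* Convert a clock count to nanoseconds for a CPU clock given in Hz. */
double clocks_to_ns(unsigned clocks, double clock_hz)
{
    assert(clock_hz > 0.0);
    return clocks * 1e9 / clock_hz;
}
```

With these, bus_bound_clocks(3, 1, 0) gives 16 clocks for "mov al,es:[bx]" before any video memory wait states, while bus_bound_clocks(1, 1, 0) gives 8 clocks for "out dx,al". clocks_to_ns(4, 4772727.0) comes out at roughly 838 ns, which is the "around 840 ns" bus cycle mentioned earlier.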
With the 286, I/O was still quite fast. I/O-mapped cards like the NE2000 card or the IDE hard disk interface that could operate using "REP INSW" / "REP OUTSW" were not significantly slower than memory-mapped hardware - unless the memory-mapped hardware used the 0WS line to perform "really fast" 286 bus cycles. "I/O is slow" only became true on newer systems, which kept I/O slow for compatibility with old cards but still tried to make memory on the ISA bus reasonably fast.
If you are not yet disgusted by the discussion of the 8088 "performance", there are some tricks efficient EGA software can pull off. For example, if you need to fill the latches to do a masked operation and then write the new value, you can use a single machine instruction to perform both! Use "XCHG AL, ES:[BX]". This will first read the old value at ES:[BX], store it in a temporary microcode register, then write AL to ES:[BX] and finally write the contents of the microcode register to AL, so it causes both a read and a write cycle without paying the 5 clocks for address calculation twice. If you want to write 0FFh and need to fill the latches first, just use "OR BYTE PTR ES:[BX], 0FFh". If you need to write 0, use "AND BYTE PTR ES:[BX], 0". If you don't care about what is read and written (e.g. everything is set up using the mask register and the set/reset registers), you can use "INC BYTE PTR ES:[BX]".
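For anyone who wants to see why the read has to come before the write, here is a minimal C model of a single EGA byte position. This is a sketch under assumptions, not real hardware access: it assumes write mode 0 with all four planes enabled, set/reset disabled, no rotate, the "replace" logical function, and read map select 0. The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { EGA_PLANES = 4 };

/* One byte position of EGA memory plus the state that affects a write. */
typedef struct {
    uint8_t plane[EGA_PLANES]; /* the byte stored in each of the 4 planes */
    uint8_t latch[EGA_PLANES]; /* latches, reloaded by every CPU read */
    uint8_t bit_mask;          /* Graphics Controller bit mask register */
} EgaByte;

/* CPU read: fills all four latches and returns plane 0 (assumed read map). */
uint8_t ega_read(EgaByte *e)
{
    assert(e != 0);
    for (int p = 0; p < EGA_PLANES; p++)
        e->latch[p] = e->plane[p];
    return e->latch[0];
}

/* CPU write: bits set in bit_mask come from the CPU, the rest from the latches. */
void ega_write(EgaByte *e, uint8_t value)
{
    assert(e != 0);
    uint8_t keep = (uint8_t)~e->bit_mask;
    for (int p = 0; p < EGA_PLANES; p++)
        e->plane[p] = (uint8_t)((value & e->bit_mask) | (e->latch[p] & keep));
}

/* Models XCHG AL, ES:[BX]: one read cycle, then one write cycle. */
uint8_t ega_xchg(EgaByte *e, uint8_t al)
{
    uint8_t old = ega_read(e); /* the read fills the latches */
    ega_write(e, al);          /* the masked write merges with the latches */
    return old;                /* the old value ends up in AL */
}
```

With bit_mask set to 0Fh and plane bytes 12h, 34h, 56h, 78h, calling ega_xchg with 0FFh leaves 1Fh, 3Fh, 5Fh, 7Fh in the planes: the upper nibbles survive because the latches were loaded by the read half of the exchange. Calling ega_write alone with stale (here zeroed) latches would wipe those upper nibbles instead.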
As you see, getting an 8088 to perform "quite well" on EGA is possible, but it requires careful choice of every assembler instruction in the inner loop. And, as long as you have an EGA card with EGA-typical memory read performance, be way more afraid of video RAM reads than of I/O writes. This applies to 8-bit EGA cards in 486 computers just as it applied to 8-bit EGA cards in an XT.