VOGONS


Pulling my hair out with EGA programming.

Reply 20 of 55, by mkarcher

Rank: l33t
keenmaster486 wrote on 2025-04-29, 23:06:

Well, I'm not selecting any read planes.

Which means you likely always read plane 0. So the masking you do is most likely pointless, because the EGA masking works as intended, and the EGA card ignores the masked pixels.

keenmaster486 wrote on 2025-04-29, 23:06:
[…]
			vmem[m + i] = (vmem[m + i] & ~this->data[t][maskDataPos + i])|this->data[t][dataPos + offs + i];

This obviously works by reading the 4 planes to the latch(es), then adding the bits you want for the selected plane, and writing just the selected plane back to EGA memory. I guess your issue was that you did not actually read from EGA memory to fill the latches, because in your other attempt, the compiler noticed that you don't use the value you read from EGA memory, and optimized the read access away. You can likely prevent the compiler from "optimizing" your EGA memory access by declaring the vmem pointer as "pointer-to-volatile", like "volatile unsigned char far* vmem". Omit far if you are not in a 16-bit program.
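
For example, a minimal sketch for a 16-bit DOS compiler (MK_FP is the Turbo/Borland-style macro from dos.h):

#include <dos.h>   /* for MK_FP */

/* volatile: every access through vmem is a side effect the compiler
   must keep, so reads that fill the EGA latches are not optimized away */
volatile unsigned char far *vmem =
    (volatile unsigned char far *)MK_FP(0xA000, 0);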

Reply 21 of 55, by keenmaster486

Rank: l33t
mkarcher wrote on 2025-04-29, 23:14:

This obviously works by reading the 4 planes to the latch(es), then adding the bits you want for the selected plane, and writing just the selected plane back to EGA memory. I guess your issue was that you did not actually read from EGA memory to fill the latches, because in your other attempt, the compiler noticed that you don't use the value you read from EGA memory, and optimized the read access away. You can likely prevent the compiler from "optimizing" your EGA memory access by declaring the vmem pointer as "pointer-to-volatile", like "volatile unsigned char far* vmem". Omit far if you are not in a 16-bit program.

Sure enough, this is exactly what was happening. Thanks for the tip. Changing the vmem pointer to volatile fixed the bizarre behavior.

World's foremost 486 enjoyer.

Reply 22 of 55, by keenmaster486

Rank: l33t

Okay so scratch that - I still have to perform those bitwise operations in order for the masking to work. I didn't notice it because I was so excited that it finally drew without being corrupted beyond recognition.

If I don't, the masked portions are drawn as black.

I still don't really understand why. Setting the bitmask should already accomplish the masking, and if I don't set the bitmask, the masked portions get corrupted. So I have to both set the bitmask and perform those bitwise operations on vmem - which, as mkarcher pointed out, are being performed on the first plane, the blue plane, every time - unless I'm missing something about which plane you read from memory when you've only set the write plane.

Truly bizarre behavior.

World's foremost 486 enjoyer.

Reply 23 of 55, by mkarcher

Rank: l33t
keenmaster486 wrote on 2025-05-01, 15:20:

Okay so scratch that - I still have to perform those bitwise operations in order for the masking to work. I didn't notice it because I was so excited that it finally drew without being corrupted beyond recognition.

If I don't, the masked portions are drawn as black.

This sounds like the read to fill the latches still doesn't work as intended, and the latches now happen to contain "black". I wonder how exactly you try to read - even with a volatile pointer. IIRC there is disagreement between C and C++ about what exactly causes a read of a volatile variable.

volatile unsigned char far* x = MK_FP(0xA000,0);
unsigned char y;
x[15] = 3; // obviously a volatile write
y = x[15]; // obviously a volatile read
x[15]; // what's this? Mentioning the address so we can write to? Or a dummy read?
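
The middle form is the reliable way to force the read: assign the result to a variable. Whether the bare x[15]; statement performs an access at all is left to the implementation.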

You might want to verify with a disassembler (e.g. the one integrated into a debugger) that the compiler actually generates the required read instructions to fill the latches.

Reply 24 of 55, by keenmaster486

Rank: l33t
mkarcher wrote on 2025-05-01, 21:28:

You might want to verify with a disassembler (e.g. the one integrated into a debugger) that the compiler actually generates the required read instructions to fill the latches.

I may have to do that. Here's how I'm doing the read:

unsigned char oldByte;
oldByte = vmem[m];

World's foremost 486 enjoyer.

Reply 25 of 55, by keenmaster486

Rank: l33t

Figured it out. Needed to declare oldByte as volatile as well. Now it works.

Edit: realized I can also just read vmem back into itself. That also works.
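
That is, either of these works (with oldByte declared volatile so the read survives optimization):

volatile unsigned char oldByte;
oldByte = vmem[m];   /* the read fills the EGA latches */
/* or simply: */
vmem[m] = vmem[m];   /* read-then-write through the volatile pointer */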

I'm having a different problem now, though, one that has come up recently. When I test it on an emulator set to EGA, the unmasked routine seems to be unable to draw black areas over the top of masked tiles. But this does not happen in DOSBox, even when the machine type is set to "ega".

World's foremost 486 enjoyer.

Reply 26 of 55, by keenmaster486

Rank: l33t

It's actually slightly faster (about 3%) to read vmem into a char 32 times than it is to read vmem into itself 32 times. Interesting.

World's foremost 486 enjoyer.

Reply 27 of 55, by keenmaster486

Rank: l33t

Don't laugh. But how can I optimize these drawing routines further? I'm assuming I need to make the move to assembly language, but I'm totally lost as to where to start, and I'm not even sure how much I can get away with using inline assembler and feeding it these pointers... I have visions of overwriting registers that are being used elsewhere, etc.

(note: forgot to mention I'm not able to use write mode 1, because I'm going to rely on video memory wraparound for the scrolling effect)

I wrote a simple benchmarking function and I'm getting 609 unmasked/sec and 303 masked/sec on an emulated 4.77 MHz XT. This isn't fast enough. I need to be able to push maybe 50 tiles to the screen per frame; at my target FPS of 30 - half the EGA's vertical refresh rate - that comes out to 1500 tiles/sec. Of course it may not be possible to make the 8088 do this, but I'm setting my goal high so it just ends up being "as fast as I can make it".

I say don't laugh because I unrolled the drawing routines. I can't seem to get the compiler to do that for me. It's always faster completely unrolled. Never mind that it adds like 30KB to the executable - another stupid problem.

So they're kind of hard to read, but after trying every possible way of doing this and racking my brain, I cannot find any faster way just writing high-level code - and to be honest, it is indeed very fast. In fact I'm already surprised I was able to get it this fast without writing the fabled "optimized assembler".

These are 16x16 pixel tiles stored plane by plane in row-major order, in BGRI(M) plane order. I have a 16-pixel (2-byte) buffer around the edge of the screen, so I have to increment the vmem pointer by 44 each row - or by 22 for the unmasked routine, where I can get away with using ints to copy two bytes at a time, at least from the perspective of this high-level code. It actually is faster that way, so I'm assuming it compiles to something that takes fewer cycles.
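
Spelled out, the layout the routines below assume is:

data[t][  0.. 31]   plane 0 (blue), 16 rows x 2 bytes
data[t][ 32.. 63]   plane 1 (green)
data[t][ 64.. 95]   plane 2 (red)
data[t][ 96..127]   plane 3 (intensity)
data[t][128..159]   mask bytes (masked tiles only)

with a vmem row stride of 40 visible bytes + 2 buffer bytes on each side = 44.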

I have all the optimization settings on the compiler turned all the way up, and I've messed with them ad nauseam to try to get the best results.

Unmasked routine:

unsigned int* temp = (unsigned int *)&vmem[memoffset];
unsigned int* temp2 = (unsigned int *)this->data[t];
int plane, row;

outp(0x3C4, 0x02);                         // sequencer index: map mask register
// NOTE: manually unrolled in the real code; shown rolled up here
for (plane = 1; plane <= 8; plane <<= 1) {
    outp(0x3C5, plane);                    // select write plane
    for (row = 0; row < 16; row++) {
        *temp = *temp2; temp2++;           // copy one 16-pixel row (2 bytes)
        temp += 22;                        // next row: 44-byte stride
    }
    temp -= 352;                           // back to the top of the tile (16 * 22)
}

Masked routine:

// Prepare EGA registers:
outp(0x3CE, 0x08);                         // graphics controller index: bit mask register
outp(0x3C4, 0x02);                         // sequencer index: map mask register

// Normal routine using pointer arithmetic:
volatile unsigned char *temp = &vmem[memoffset];
unsigned char *temp2 = this->data[t];
volatile unsigned char temp3;
int row;

// If I put this in a for loop I can't seem to get the compiler to unroll it.
// It always performs much better when manually unrolled; shown rolled up here.
for (row = 0; row < 16; row++) {
    // First byte of the row:
    outp(0x3CF, *(temp2 + 128));           // bit mask from the tile's mask plane
    temp3 = *temp;                         // the read fills the latches
    outp(0x3C5, 0x01); *temp = *temp2;           // blue
    outp(0x3C5, 0x02); *temp = *(temp2 + 32);    // green
    outp(0x3C5, 0x04); *temp = *(temp2 + 64);    // red
    outp(0x3C5, 0x08); *temp = *(temp2 + 96);    // intensity
    temp++; temp2++;
    // Second byte of the row:
    outp(0x3CF, *(temp2 + 128));
    temp3 = *temp;
    outp(0x3C5, 0x01); *temp = *temp2;
    outp(0x3C5, 0x02); *temp = *(temp2 + 32);
    outp(0x3C5, 0x04); *temp = *(temp2 + 64);
    outp(0x3C5, 0x08); *temp = *(temp2 + 96);
    temp += 43; temp2++;                   // 44-byte stride minus the byte already stepped
}

// Reset bitmask:
outp(0x3CF, 0xFF);

World's foremost 486 enjoyer.

Reply 28 of 55, by pan069

Rank: Oldbie
keenmaster486 wrote on 2025-05-02, 23:04:

It's actually slightly faster (about 3%) to read vmem into a char 32 times than it is to read vmem into itself 32 times. Interesting.

That's probably because writes to vram are super slow, whereas writes to main memory are pretty fast (in comparison). So: vram read -> mem write = fast; vram read -> vram write = slow.

Looking over your code the past few posts, I noticed you're selecting the plane in the innermost part of your loop. Selecting the plane requires a write to a port (in assembler, the "out" instruction). Writing to ports is super slow, so you want to do it as few times as possible. One technique is, rather than unrolling your loops, to move the plane selection to the outermost loop.

To do this, have your sprites in memory already separated out by plane. So, rather than having a buffer where the first byte contains the first two pixels, have 4 separate buffers, each containing one plane of your sprite. A nice side effect is that if you organise your colors so that a sprite only needs 2 or maybe 3 planes, that sprite takes less processing time.
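
A sketch of the idea (made-up names, and assuming the 16x16 tiles and 44-byte row stride from your earlier posts):

unsigned char spritePlanes[4][32];       /* one 32-byte buffer per plane */
int plane, row;

outp(0x3C4, 0x02);                       /* map mask index */
for (plane = 0; plane < 4; plane++) {
    outp(0x3C5, 1 << plane);             /* one plane select per plane... */
    for (row = 0; row < 16; row++) {     /* ...then blit that entire plane */
        vmem[dst + row * 44]     = spritePlanes[plane][row * 2];
        vmem[dst + row * 44 + 1] = spritePlanes[plane][row * 2 + 1];
    }
}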

If you want performance, you need to make the jump to assembler. This might initially be a bit of a learning curve, but don't be discouraged; I found assembler super easy to understand once I understood that all I'm doing is moving memory into a CPU register, adding or subtracting a value on that register, and moving it back to memory. Once you get the hang of the basics it's pretty straightforward.

Out of curiosity, are you familiar with this document at all? If not, it might be a good read:

https://cosmodoc.org/

Reply 29 of 55, by keenmaster486

Rank: l33t
pan069 wrote on 2025-05-03, 05:18:
keenmaster486 wrote on 2025-05-02, 23:04:

It's actually slightly faster (about 3%) to read vmem into a char 32 times than it is to read vmem into itself 32 times. Interesting.

That's probably because writes to vram are super slow, whereas writes to main memory are pretty fast (in comparison). So: vram read -> mem write = fast; vram read -> vram write = slow.

Makes sense. I guessed it was something of that nature.

pan069 wrote on 2025-05-03, 05:18:

Looking over your code the past few posts, I noticed you're selecting the plane in the innermost part of your loop. Selecting the plane requires a write to a port (in assembler, the "out" instruction). Writing to ports is super slow, so you want to do it as few times as possible. One technique is, rather than unrolling your loops, to move the plane selection to the outermost loop.

To do this, have your sprites in memory already separated out by plane. So, rather than having a buffer where the first byte contains the first two pixels, have 4 separate buffers, each containing one plane of your sprite. A nice side effect is that if you organise your colors so that a sprite only needs 2 or maybe 3 planes, that sprite takes less processing time.

If you want performance, you need to make the jump to assembler. This might initially be a bit of a learning curve, but don't be discouraged; I found assembler super easy to understand once I understood that all I'm doing is moving memory into a CPU register, adding or subtracting a value on that register, and moving it back to memory. Once you get the hang of the basics it's pretty straightforward.

Out of curiosity, are you familiar with this document at all? If not, it might be a good read:

https://cosmodoc.org/

So yeah, all of that is exactly what I'm doing. If you look at my unmasked routine, it selects a plane and then writes all the data for that plane at once.

However, I just realized you're right about the masked routine. I kept thinking it was the same whether I set the bitmask or select the planes in the inner loop, but now that I think about it, it's 160 port writes the way I'm doing it now, and 132 if I set the bitmask in the inner loop instead.
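
(That's 32 bytes x (1 bitmask write + 4 plane selects) = 160 port writes the current way, versus 4 plane selects + 4 planes x 32 bitmask writes = 132 the other way.)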

I guess I need to take a crack at using assembly for this.

And yes I have been using that doc on Cosmo — it’s been a great resource.

World's foremost 486 enjoyer.

Reply 30 of 55, by keenmaster486

Rank: l33t

Okay, so I tried moving the plane writes to the outer loop in the masked routine, and that actually makes it slower: even though you have 32 fewer port writes, you have to do 32 more reads from video memory to load the latches.

World's foremost 486 enjoyer.

Reply 31 of 55, by wbahnassi

Rank: Oldbie
keenmaster486 wrote on 2025-05-02, 23:04:

It's actually slightly faster (about 3%) to read vmem into a char 32 times than it is to read vmem into itself 32 times. Interesting.

pan069 wrote on 2025-05-03, 05:18:

That's probably because writes to vram are super slow, whereas writes to main memory are pretty fast (in comparison). So: vram read -> mem write = fast; vram read -> vram write = slow.

Reading vmem into a char does not read data from VRAM to main memory; it reads data from VRAM into a register. Reading vmem into itself, on the other hand, is a read and a write operation to VRAM (two memory accesses). Reading into a char is basically one memory access, so it has to be faster.

Turbo XT 12MHz, 8-bit VGA, Dual 360K drives
Intel 386 DX-33, TSeng ET3000, SB 1.5, 1x CD
Intel 486 DX2-66, CL5428 VLB, SBPro 2, 2x CD
Intel Pentium 90, Matrox Millenium 2, SB16, 4x CD
HP Z400, Xeon 3.46GHz, YMF-744, Voodoo3, RTX2080Ti

Reply 32 of 55, by mkarcher

Rank: l33t
pan069 wrote on 2025-05-03, 05:18:

Looking over your code the past few posts, I noticed you're selecting the plane in the innermost part of your loop. Selecting the plane requires a write to a port (in assembler, the "out" instruction). Writing to ports is super slow, so you want to do it as few times as possible.

If keenmaster486 is running the code on a typical 486 machine, this is likely correct. Also, this is one of the reasons the 16-color EGA programming model aged like milk. On the 8088 in the PC/XT, an I/O cycle and a memory cycle both took 4 FSB clocks (just performing that cycle, this doesn't include the FSB clocks to fetch the instruction that performs the cycle or delays that are introduced by the execution unit), so the base speed of I/O and memory is similar. On the EGA card, I/O cycles can be performed without extra wait states exceeding the 4 FSB clocks, i.e. in around 840ns on a 4.77MHz XT. On the other hand, video memory reads need to wait for the start of the next DRAM cycle that is not used for image refresh, which usually incurs extra wait states. So on the EGA in an XT, I/O writes are supposed to be faster than video memory reads!

Furthermore, an instruction like "mov al,es:[bx]", a typical video memory read instruction, is 3 bytes, which take 12 clocks to fetch. The execution of that instruction takes 8 clocks plus the time to evaluate ES:BX as an address, which is 5 extra clocks. The segment override adds another two clocks, so the execution time of that instruction is 15 clocks! In contrast, the instruction "out dx,al" (assuming DX already contains 3CFh, and the index register already points to the bit mask register) is just one byte (4 clocks to fetch), and has an execution time of 8 clocks! Note that the 8088 overlaps execution with instruction prefetch, so you don't need to add the "fetch cycles" and the "execute cycles". Often, the 8088 is limited by its bus performance, so the execution clocks actually don't matter at all, and the sum of the fetch clocks plus the clocks spent on data transactions determines the execution speed. So again, the I/O instruction is a win here! The EGA programming model was very well suited to the performance characteristics of an XT.

With the 286, I/O was still quite fast. I/O-mapped cards like the NE2000 or the IDE hard disk interface, which could operate using "REP INSW" / "REP OUTSW", were not significantly slower than memory-mapped hardware - unless the memory-mapped hardware used the 0WS line to perform "really fast" 286 bus cycles. It's on the newer systems - which kept I/O slow for compatibility with old cards, but still tried to get memory on the ISA bus reasonably fast - that "I/O is slow" is actually true.

If you are not yet disgusted by the discussion of 8088 "performance", there are some tricks efficient EGA software can pull off. For example, if you need to fill the latches to do a masked operation and then write the new value, you can use a single machine instruction to perform both: "XCHG AL, ES:[BX]". This will first read the old value at ES:[BX], store it in a temporary microcode register, then write AL to ES:[BX], and finally write the contents of the microcode register to AL - so it causes both a read and a write cycle without paying the 5 clocks for address calculation twice. If you want to write 0FFh and need to fill the latches before, just use "OR ES:[BX], 0FFh". If you need to write 0, use "AND ES:[BX], 0". If you don't care about what's read and written (e.g. everything is set up using the mask register and the set/reset registers), you can use "INC ES:[BX]".
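
In inline-assembler terms, the latch-fill-plus-write core becomes something like this (a sketch; it assumes ES:BX points at the target byte in video memory, DS:SI at the sprite data, and that the bit mask and map mask registers are already set up):

__asm {
    mov  al, [si]        ; new byte for the selected plane(s)
    xchg al, es:[bx]     ; one instruction: the read fills the latches, then AL is written
}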

As you see, getting an 8088 to perform "quite well" on EGA is possible, but it requires careful choice of every assembler instruction in the inner loop. And as long as you have an EGA card with EGA-typical memory read performance, be way more afraid of video RAM reads than of I/O writes. This applies to 8-bit EGA in 486 computers just as it applied to 8-bit EGA cards in XTs.

Reply 33 of 55, by mkarcher

Rank: l33t
wbahnassi wrote on 2025-05-03, 22:19:

Reading vmem into a char does not read data from VRAM to main memory; it reads data from VRAM into a register.

As keenmaster486 had to make the target of the read-from-video-memory operation volatile to prevent the compiler from eliminating the read, I guess the compiler feels forced to perform a main memory write as well, although you are correct that a main memory write is not necessary for EGA programming.

Reply 34 of 55, by wbahnassi

Rank: Oldbie

Yes, I agree. Anyway, I feel that at this level one always needs to keep an eye on the generated instructions, especially in hot-spot areas like a blitter. Just put a breakpoint in that function and switch to disassembly to see what the compiler decided to do.

One can then play around with the code to convince the compiler to produce a certain result... or just give up and write it in inline asm.

Turbo XT 12MHz, 8-bit VGA, Dual 360K drives
Intel 386 DX-33, TSeng ET3000, SB 1.5, 1x CD
Intel 486 DX2-66, CL5428 VLB, SBPro 2, 2x CD
Intel Pentium 90, Matrox Millenium 2, SB16, 4x CD
HP Z400, Xeon 3.46GHz, YMF-744, Voodoo3, RTX2080Ti

Reply 35 of 55, by keenmaster486

Rank: l33t

I rewrote the unmasked tile draw code in assembly.

Keep in mind this is the first x86 assembly code I've ever written. I have no idea what I'm doing.

Amusingly, this routine is actually 10% slower than the high-level routine I wrote above using pointer arithmetic. So I obviously have some work to do! But I'm pleased that I was actually able to write something in assembly that works at all.

I'm looking for ways to optimize this. I do have some questions:
1. Is there a better way to get the offset of a far pointer than casting it to an unsigned int as I'm doing here?
2. Is stosw the best way to go? It does make it so I have to keep track of my plane value in cl and move it back into al before writing it out to the port.

unsigned int dataOffset = (unsigned int)this->data[t];

// We assume that DS is set to the segment of this->data[t]
__asm {
    mov ax, 0A000h
    mov es, ax              ; ES:DI -> video memory
    mov di, memoffset
    mov bx, dataOffset      ; DS:BX -> tile data

    mov dx, 3C4h            ; sequencer index port
    mov al, 02h             ; map mask register
    out dx, al
    mov dx, 3C5h            ; sequencer data port
    mov cl, 01h             ; start with plane 0 (blue)
unmaskedplaneloop:
    mov al, cl
    out dx, al              ; select write plane
    mov si, 16              ; 16 rows per tile
unmaskedlineloop:
    mov ax, [bx]            ; load one 2-byte row of tile data
    stosw                   ; write it to video memory; DI += 2
    add bx, 2
    add di, 42              ; 44-byte row stride (stosw already added 2)
    dec si
    jnz unmaskedlineloop
    sub di, 704             ; back to the top of the tile (16 * 44)
    shl cl, 1               ; next plane: 1 -> 2 -> 4 -> 8
    cmp cl, 10h
    jne unmaskedplaneloop
}

World's foremost 486 enjoyer.

Reply 36 of 55, by riplin

Rank: Newbie

Eliminate IO operations as much as possible. Try writing an entire plane and then switching to the next one, etc. Those IO operations are expensive.

Also, in case you were: don't read from video memory.

Reply 37 of 55, by keenmaster486

Rank: l33t
riplin wrote on 2025-05-05, 22:14:

Eliminate IO operations as much as possible. Try writing an entire plane and then switching to the next one, etc. Those IO operations are expensive.

Also, in case you were: don't read from video memory.

That is what I'm doing. I write the entire plane at once. And in my last post I'm working on the unmasked routine, which never reads from video memory. The masked routine does have to read once per byte to fill the latches.

Also mkarcher has some interesting info about IO port writes above.

World's foremost 486 enjoyer.

Reply 38 of 55, by keenmaster486

User metadata
Rank l33t
Rank
l33t

Alright, so I unrolled that asm routine and now it's 12% faster than the unrolled pointer arithmetic routine. Making some progress I suppose. But I'm not sure how much better I can do than

mov ax, [bx]
stosw
add bx, 2
add di, 42

in the inner loop.

World's foremost 486 enjoyer.

Reply 39 of 55, by riplin

Rank: Newbie

Sorry, I just blundered into this thread and saw that IO operation sitting outside your line loop. Is it not possible to do this at the sprite level or even scene level? Maybe it’s diminishing returns.

Have you calculated SOL (speed of light)? What’s the fastest possible you can achieve on this hardware? How far away are you from that?