@zorko Thanks for your efforts here. Sorry I have taken so long to reply, but it is sometimes quite difficult to keep up with everything.
I agree that a sprite should not have to be a multiple of 4 pixels wide. One way to deal with this is to add an additional "colour" to the sprite which does not correspond to a real colour at all, but which means the pixel is transparent. This is not easy to do efficiently.
Another way is to supply a bit mask along with the sprite and only alter pixels which are marked in the mask. Of course this means taking twice the memory for a sprite, as the mask takes up the same amount of space as the sprite itself. However, this is better than having four copies of the sprite (one for each x position modulo 4).
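To make the mask idea concrete, here is a minimal sketch in C rather than 8088 assembly. The names `blend_byte` and `blit_row_masked` are made up for illustration, and I am assuming the convention that a 1 bit in the mask means "take this bit from the sprite".

```c
#include <stdint.h>

/* Combine one byte of video memory with one byte of sprite data.
   In the 4-colour CGA modes each byte holds 4 pixels at 2 bits each.
   Assumed convention: a mask bit of 1 means "take this bit from the
   sprite", 0 means "keep what is already on screen". */
static uint8_t blend_byte(uint8_t screen, uint8_t sprite, uint8_t mask)
{
    return (uint8_t)((screen & (uint8_t)~mask) | (sprite & mask));
}

/* Draw one row of a masked sprite: width_bytes bytes of sprite data
   plus an equally sized mask, which is where the doubled memory goes. */
static void blit_row_masked(uint8_t *dst, const uint8_t *sprite,
                            const uint8_t *mask, unsigned width_bytes)
{
    for (unsigned i = 0; i < width_bytes; i++)
        dst[i] = blend_byte(dst[i], sprite[i], mask[i]);
}
```

With a mask of `0x0F`, only the two rightmost pixels of the byte are replaced and the two leftmost are left as they were on screen.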
I once read a comment on Hacker News that said it was possible to mask the edges of a sprite using just one additional AND instruction per byte, adding just 4 cycles per byte. The approach described there was to pass a 1-D array of bits as wide as the sprite and to use it as the mask for every line of the sprite.
I spent many, many hours trying to figure out how to do this with a sprite compiler, but I have not been successful. I really don't believe it is possible on the 8088/8086. (If someone knows how to do this, I would absolutely love to hear about it.)
However, we are not using a sprite compiler here. If the sprite is written one column of 4 pixels at a time then it is possible to use the same mask, corresponding to that column, over and over again without reading the mask from the 1-D array each time.
Even so, I still don't see how to do this with just one additional AND instruction. One must AND the pixels from video memory with the complement of the mask and then AND the pixels from the sprite with the mask itself. Even if the mask byte and its complement are already in registers, so that one doesn't have to compute them over and over again, there are still two additional AND instructions.
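Here is roughly what I mean, sketched in C instead of assembly. The function name `blit_column_masked` and the flat `stride` parameter are my own assumptions for illustration; the point is only that, with the mask and its complement held constant for the whole column, each byte still costs two ANDs.

```c
#include <stdint.h>

/* Hypothetical helper: draw one byte-wide column (4 pixels) of a sprite.
   stride is the distance in bytes between successive rows in dst; a flat
   buffer is assumed here, ignoring CGA's interleaved scanlines. */
static void blit_column_masked(uint8_t *dst, unsigned stride,
                               const uint8_t *sprite_col, unsigned height,
                               uint8_t mask)
{
    /* Computed once per column, like keeping it in a register. */
    const uint8_t keep = (uint8_t)~mask;

    for (unsigned y = 0; y < height; y++) {
        uint8_t background = dst[y * stride] & keep;  /* first AND  */
        uint8_t foreground = sprite_col[y] & mask;    /* second AND */
        dst[y * stride] = (uint8_t)(background | foreground);
    }
}
```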
And I don't see how one can move down a column instead of across a row without incurring additional costs due to the odd/even layout of screen lines in CGA memory.
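For anyone not familiar with the layout, this is the addressing I have in mind for the 320x200 4-colour mode, written as a small C sketch. The names `cga_offset` and `cga_next_row` are just illustrative. Stepping down one row needs a different adjustment depending on whether the current row is even or odd, and that is the extra cost.

```c
#include <stdint.h>

enum {
    CGA_BYTES_PER_ROW = 80,     /* 320 pixels / 4 pixels per byte */
    CGA_ODD_BANK      = 0x2000  /* odd scanlines start 8 KB into the buffer */
};

/* Byte offset within the CGA buffer of the byte containing pixel (x, y). */
static uint16_t cga_offset(unsigned x, unsigned y)
{
    return (uint16_t)((y & 1u) * CGA_ODD_BANK
                      + (y >> 1) * CGA_BYTES_PER_ROW
                      + (x >> 2));
}

/* Offset of the byte directly below the one at the given offset. */
static uint16_t cga_next_row(uint16_t offset)
{
    if (offset < CGA_ODD_BANK)
        return (uint16_t)(offset + CGA_ODD_BANK);   /* even row -> odd row */
    /* odd row -> next even row */
    return (uint16_t)(offset - CGA_ODD_BANK + CGA_BYTES_PER_ROW);
}
```

Moving across a row is a plain increment, whereas moving down a column alternates between adding `0x2000` and subtracting `0x2000 - 80`.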
If anyone has any ideas about this, I'd really like to hear them.