# VOGONS

## CGA Graphics library

### Reply 60 of 85, by wbhart

Rank Newbie
Rank
Newbie

Actually, I have an idea how to fix these problems. I can switch the roles of al and ch. Then al will not be used in the non-a section and can be stored to immediate in CS so that the whole of ax is available as an accumulator in the non-a section. Then al can be restored from the immediate before the end of the non-a section.

I'm not sure if this would be faster, but it's worth a try to possibly shave a couple of cycles off.

### Reply 61 of 85, by wbhart

Rank Newbie
Rank
Newbie

I finally found some more time to work on the horizontalish part of the ellipse code. Not all the same tricks worked, so I ended up using SS and ES to store regs temporarily.

With the following register assignments:

; di, di+bx offsets of points above and below axis, ax: pixels
; dl: deltax (lo8), sp: deltax (hi16), bp: deltay (hi16),
; ch: D (lo8), cl: deltay (lo8), si: D (hi16)
; es: deltay (hi16) temp, ss: yinc

here is what the code looks like:

ellipse1_h4:
mov ah, [di+bx]
mov al, [di]
and ax, 0fcfch
ellipse1_patch37:
or ax, 0303h

ellipse1_patch38:
sub dl, 012h ; dx -= s^2
ellipse1_patch39:
sbb sp, 01234h
add ch, dl ; D += dx

shr bp, 1 ; if dy/2 < D, increment y
cmp bp, si
jge ellipse1_skip_y4

mov [di+bx], ah
mov [di], al

mov bp, es
sub ch, cl ; D -= dy
sbb si, bp
ellipse1_patch40:
add cl, 012h ; dy += 2r^2
ellipse1_patch41:

mov ax, ss ; update offset
xor ax, 0ffb0h ; update offset update for odd<->even
mov ss, ax
sub bx, 80 ; decrement/increment y lines

mov ah, [di+bx]
mov al, [di]
ellipse1_skip_y4:
mov es, bp

ellipse1_patch42:
sub dl, 012h ; dx -= s^2
ellipse1_patch43:
sbb sp, 01234h
ellipse1_doneh1_check:
jl ellipse1_doneh1

ellipse1_h3:
and ax, 0f3f3h
ellipse1_patch44:
or ax, 0c0ch

ellipse1_patch45:
sub dl, 012h ; dx -= s^2
ellipse1_patch46:
sbb sp, 01234h
add ch, dl ; D += dx

shr bp, 1            ; if dy/2 < D, increment y
cmp bp, si
jge ellipse1_skip_y3

mov [di+bx], ah
mov [di], al

mov bp, es
sub ch, cl ; D -= dy
sbb si, bp
ellipse1_patch47:
add cl, 012h ; dy += 2r^2
ellipse1_patch48:

mov ax, ss ; update offset
xor ax, 0ffb0h ; update offset update for odd<->even
mov ss, ax
sub bx, 80 ; decrement/increment y lines

mov ah, [di+bx]
mov al, [di]
ellipse1_skip_y3:
mov es, bp

ellipse1_patch49:
sub dl, 012h ; dx -= s^2
ellipse1_patch50:
sbb sp, 01234h
jl ellipse1_doneh1_check

ellipse1_h2:
and ax, 0cfcfh
ellipse1_patch51:
or ax, 03030h

ellipse1_patch52:
sub dl, 012h ; dx -= s^2
ellipse1_patch53:
sbb sp, 01234h
add ch, dl ; D += dx

shr bp, 1 ; if dy/2 < D, increment y
cmp bp, si
jge ellipse1_skip_y2

mov [di+bx], ah
mov [di], al

mov bp, es
sub ch, cl ; D -= dy
sbb si, bp
ellipse1_patch54:
add cl, 012h ; dy += 2r^2
ellipse1_patch55:

mov ax, ss ; update offset
xor ax, 0ffb0h ; update offset update for odd<->even
mov ss, ax
sub bx, 80 ; decrement/increment y lines

mov ah, [di+bx]
mov al, [di]
ellipse1_skip_y2:
mov es, bp

ellipse1_patch56:
sub dl, 012h ; dx -= s^2
ellipse1_patch57:
sbb sp, 01234h
jl ellipse1_doneh2

ellipse1_h1:
and ax, 03f3fh
ellipse1_patch58:
or ax, 0c0c0h

ellipse1_patch59:
sub dl, 012h ; dx -= s^2
ellipse1_patch60:
sbb sp, 01234h
add ch, dl ; D += dx

mov [di+bx], ah
mov [di], al
dec di

shr bp, 1 ; if dy/2 < D, increment y
cmp bp, si
jge ellipse1_skip_y1

inc di
mov bp, es
sub ch, cl ; D -= dy
sbb si, bp
ellipse1_patch61:
add cl, 012h ; dy += 2r^2
ellipse1_patch62:

mov ax, ss ; update offset
xor ax, 0ffb0h ; update offset update for odd<->even
mov ss, ax
sub bx, 80 ; decrement/increment y lines

ellipse1_skip_y1:
dec di
mov es, bp

ellipse1_patch63:
sub dl, 012h ; dx -= s^2
ellipse1_patch64:
sbb sp, 01234h
jl ellipse1_doneh2_skip ; skip extra byte

jmp ellipse1_h4

Each section is around 80 bytes, which means one of the short jumps has to jump to one of the others, but that only happens once, if at all and so is a negligible cost.

I now need to write the code to set all the millions of patches and to set up the initial registers and jump into the first loop, then the code to switch the regs around between the verticalish and horizontalish section and the code for writing the final pixel.

Hopefully I find time for that in the next few days and we can get some timings.

### Reply 62 of 85, by wbhart

Rank Newbie
Rank
Newbie

I discovered that my original Julia code had a dumb bug in it and the original algorithm doesn't quite work the way I thought.

Fortunately I've been able to adapt it to something very similar to what I had come up with that works for semiradii less than 100 except for the following:

26 94
44 39
75 96
82 29
91 57

But it requires that a parameter in the range [1..8] be chosen per ellipse, which either means constructing a table of such parameters, or finding a simple function which computes the correct parameter for a given set of semiradii. That's going to take some time to figure out.

What to do about the exceptional cases above, I don't yet know. I've tried various things, but nothing simple works so far. Probably the original ellipse code will have to be called in those cases. That will be needed anyway for ellipses with semiradius >= 100, for which I don't think just the top 16 bits in the comparisons is sufficient.

### Reply 63 of 85, by wbhart

Rank Newbie
Rank
Newbie

I discovered a way to do all semiradii up to the 160, 100 needed for a full screen ellipse. It's not perfect, with about 50 pairs of semiradii leading to a single pixel artifact, 1 leading to a two pixel artifact and one leading to a 3 pixel artifact. But I think all ellipses will still visually look ok.

I have the code almost written (it's a monster), but debugging will probably take a lot of effort, if it works at all.

The corrected Julia code is:

function ellipse2(A, r::Int32, s::Int32, n)
i = 1;
x = r;
y = 0;
r_orig = Int32(r)
c = Int32((s*s) << (n+8))
a = Int32((r*r) << (n+8))
D = Int32(0)
xdelta = Int32(2)*c*r_orig
ydelta = Int32(a)
while (reinterpret(UInt32, xdelta) >> 16) >= (reinterpret(Int32, ydelta) >> 16)
if i > 320; return 320; end
A[i] = (x, y); i += 1;
D += ydelta; ydelta += Int32(2)*a; y += 1;
if (D >> 16) >= (reinterpret(Int32, reinterpret(UInt32, xdelta) >> 17))
xdelta -= c; D -= xdelta; xdelta -= c; x -= 1;
end
end
D = -D
while (xdelta >> 16) >= 0
A[i] = (x, y); i += 1;
xdelta -= c; D += xdelta; xdelta -= c; x -= 1;
if (D >> 16) > (reinterpret(Int32, reinterpret(UInt32, ydelta) >> 17))
D -= ydelta; ydelta += Int32(2)*a; y += 1
end
end
return i - 1
end

where the parameter n can be computed as max(1, (r + s + 1)/32) for minimum artifacts. (This code writes the coordinates into an array A of pairs of Int's).

I'm simulating 24 bits for each of xdelta, ydelta and D here by using Int32's and shifting everything left by 8 bits. The right shifts by 16 bits simulate just taking the top 16 bits of the data as we discussed earlier.

I'd post the assembly code as well, but it's over 600 lines for half an ellipse! I should finish the code by tomorrow night, but it might take me until Sunday night to debug it.

### Reply 64 of 85, by wbhart

Rank Newbie
Rank
Newbie

I finally have the ellipse code working, at least for the right hand half of the ellipse.

So I can now give timings on the 8088 @ 4.77MHz. It's taking 156 cycles per pixel.

On the 8086 @ 8 MHz it takes 130 cycles per pixel.

I actually think that it could be faster to split the algorithm into two parts, the first of which computes an array that stores the increments (as single bits) and a second part that draws the points. The reason this might be faster is that it can take advantage of 4 way symmetry, rather than the two way symmetry I'm exploiting. I speculate it could be done in 100-120 cycles per pixel with this method, though obviously that is just a plain guess.

The current code is in the function cga_draw_ellipse1 in the file cga5.asm in my GitHub repository, linked at the beginning of this thread. It's exactly 700 lines of code for the right hand half of the ellipse. The left hand side will be about the same.

Last edited by wbhart on 2019-11-18, 00:59. Edited 3 times in total.

### Reply 65 of 85, by wbhart

Rank Newbie
Rank
Newbie

The previous fastest code I had for general ellipses on the 8088, prior to this, was 342 cycles per pixel, so the new code is more than twice as fast.

Update: the code for the full ellipse is now in the repository. It's 1400 lines of assembly code!

### Reply 67 of 85, by wbhart

Rank Newbie
Rank
Newbie

I wrote some code for drawing precomputed ellipses. It starts with an array of bits specifying which pixels of the verticalish and horizontalish parts should move horizontally/vertically respectively from their predecessors.

The result is pretty fast at 91 cycles per pixel, which is about a 50% improvement over computing the whole thing in real time.

I haven't written assembly code for doing the precomputation yet, so I just do that part in Julia. The code has a single ellipse with semiradii 100, 84 hard coded in an array in CS (called _ellipse_data). However, the ellipse can be placed anywhere on screen in any colour.

It would be possible to do the horizontalish part faster by precomputing entire bytes, but this would be messy due to the 4 pixel per byte CGA layout, which would mean you'd need much more space for precomputation if you wanted to be able to put an ellipse anywhere on the screen. Reading the full bytes from memory might end up being slower than just reading one bit per pixel, as I currently do. So it's not a guaranteed win.

I also didn't bother trying to use 4 way symmetry, for much the same reason. I don't think I will try this, though it could probably be slightly faster (for one hell of a lot of extra work).

The code is in my GitHub repo, linked at the beginning of the thread. I've now separated the assembly graphics routines into three files: line.asm, circle.asm, ellipse.asm, with the new code being in the latter in _cga_draw_ellipse_precomp1 for the right hand side of the ellipse, and _cga_draw_ellipse_precomp2 for the left hand side.

Writing code for precomputed lines is also possible, though I doubt it will be much of an improvement on just drawing the lines. One could of course use the trick of starting in the middle of the line and drawing two pixels at once, moving out from the centre (this is not new, it was done in very early ZX spectrum or Amstrad CPC games or demos, I forget which). Of course, there are two cases, depending whether the centre point of the line is a pixel or a pixel border.

I haven't decided whether I will code up this trick or not. It looks ok in games/demos that use it, so it might be worthwhile.

Of course I can also write routines for xoring ellipses, blanking ellipses and drawing them in (binary) colour 00 and 11. There could be some additional speedups possible here, as I think one ends up with an extra register being available.

All of the routines I've written so far turn off the interrupts. That is not so nice for games or demos with music or tricks involving precise timing.

In case anyone wants the Julia code for precomputing the ellipses, here it is:

function ellipse(A, B, r::Int, s::Int)
i = 1;
x = r;
y = 0;
r_orig = r
c = s*s
a = r*r
D = 0
xdelta = 2*c*r_orig
ydelta = a
while xdelta >= ydelta
A[i] = (x, y); i += 1; #println("(", x, ", ", y, ")")
D += ydelta; ydelta += 2a; y += 1;
if D >= div(xdelta, 2)
xdelta -= c; D -= xdelta; xdelta -= c; x -= 1;
end
end
D = -D
j = 1
while xdelta >= 0
B[j] = (x, y); j += 1; #println("(", x, ", ", y, ")")
xdelta -= c; D += xdelta; xdelta -= c; x -= 1;
if D > div(ydelta, 2)
D -= ydelta; ydelta += 2a; y += 1
end
end
return i - 1, j - 1
end

function ellipse_precomp(C, A, m::Int, B, n::Int)
x = A[1][1]
r = (m - 1) % 8
if r == 0
r = 8
q = div(m - 9, 8)
else
q = div(m - r - 1, 8)
end
C[1] = q + 1
C[2] = r
bj = UInt8(1)
b = UInt8(0)
for j = 1:r
if A[j + 1][1] != x
b += bj
end
bj <<= 1
x = A[j + 1][1]
end
C[3] = b
for i = 1:q
bj = UInt8(1)
b = UInt8(0)
for j = 1:8
if A[(i - 1)*8 + j + r + 1][1] != x
b += bj
end
bj <<= 1
x = A[(i - 1)*8 + j + r + 1][1]
end
C[i + 3] = b
end
v = q + 3
y = A[m][2]
r = n % 8
if r == 0
r = 8
q = div(n - 8, 8)
else
q = div(n - r, 8)
end
C[v + 1] = q + 1
C[v + 2] = r
bj = UInt8(1)
b = UInt8(0)
for j = 1:r
if B[j][2] == y
b += bj
end
bj <<= 1
y = B[j][2]
end
C[v + 3] = b
for i = 1:q
bj = UInt8(1)
b = UInt8(0)
for j = 1:8
if B[(i - 1)*8 + j + r][2] == y
b += bj
end
bj <<= 1
y = B[(i - 1)*8 + j + r][2]
end
C[v + i + 3] = b
end
return v + q + 3
end

It would be invoked as follows, for an ellipse with semiradii 100, 84:

julia> A = Array{Tuple{Int, Int}}(undef, 320);

julia> B = Array{Tuple{Int, Int}}(undef, 320);

julia> C = Array{UInt8}(undef, 40);

julia> m, n = ellipse(A, B, 100, 84)
(54, 78)

julia> q = ellipse_precomp(C, A, m, B, n)
21

julia> for i = 1:q
print(C[i], ", ")
end
7, 5, 0, 8, 34, 74, 85, 237, 190, 10, 6, 8, 16, 41, 85, 173, 109, 119, 223, 223, 255,

### Reply 68 of 85, by wbhart

Rank Newbie
Rank
Newbie

I've now written assembly code to precompute ellipses. It is actually now slightly faster to first precompute the ellipse then draw it from the precomputed information. This is not terribly surprising given that the precomputed data is used for both halves of the ellipse and no longer needs to be computed twice during ellipse drawing.

So the total time (including precomputation) for drawing an ellipse is now 148 cycles per pixel on a 4.77MHz 8088.

### Reply 69 of 85, by wbhart

Rank Newbie
Rank
Newbie

I have made some significant progress on the CGA graphics library I'm writing, after weeks of effort.

I now have pixel perfect ellipse code that runs in 140 cycles per pixel on the 8088 @4.77MHz without turning off interrupts.

On my 10MHz 8088 it's 147 cycles per pixel (I previously called this an 8086 machine, but it transpires that the markings on the chip are just really hard to read), and on the Amstrad PC1512 with 8086 CPU it's 115 cycles per pixel.

There is an additional 10000 cycles set up cost (8000 for the 8086), but this is currently double counted as I do the same set up twice, once for each half of the ellipse. So that setup time will nearly halve when I make a single function out of it. Also there are a couple of multiplications which can be omitted in half the cases and some loops that don't need so many iterations, so I anticipate something more like 4000 cycles set up cost.

The timings are not quite comparable to the previous ones I gave, since I was amortising the setup cost with the per pixel cost in those timings.

There's no memory accesses, no pushes or pops, just short jumps for all the loops. It's as optimal as it can be, essentially, though I do have to save one general purpose register in ES temporarily at one point, due to no longer using the SP register.

The trick to getting all this to work is a set of three small lookup tables to store corrections. The first correction is the number of bits to shift the starting values by so that all decisions can be made on 16 bit comparisons. The other two corrections are for self modifying the code in certain cases so that different inequalities are used (e.g. < instead of <=). This allows the code to produce pixel perfect ellipses, which the previous code did not. I am really, really surprised this was possible without turning off interrupts, especially since I wrote very many versions of the code before I hit on the correct ideas.

Another thing I've been playing with in the graphics library is using an effective resolution of 160x200 by drawing pairs of pixels in the horizontal direction. This has some advantages, e.g. counts now fit in a byte.

The cool thing about the ellipse code is that it should be possible to use 4-way symmetry instead of 2-way symmetry in this resolution, effectively halving the cost to draw ellipses. Of course that will rely on using SP, but still, it's too big an improvement to ignore.

I have coded up lines in this resolution and the horizontalish ones are about 35% faster as well (the verticalish ones are only about 5% faster).

Another cool trick I found is an approximate line drawing algorithm which saves a register. It's not Bresenham any more, but depends on a fixed point approximation. It's faster, but may not be pixel perfect (not that you'd notice). I also believe a similar trick can be done for ellipses, using a pair of fixed point approximations, but I haven't prototyped this yet.

The other thing I've done is write a 16 colour rotozoom in text mode. There are three different versions for various resolutions (one of which will have snow on original hardware of course). The frame rates are really quite high in all of these. I'll be making a video about that next week.

### Reply 70 of 85, by VileR

Rank l33t
Rank
l33t

Appreciate your ongoing work on this. I haven't had much to say thus far, but this is fantastic. Keep it up!

I remember seeing an effective "160x200" resolution in some CGA games. I always assumed this was simply to save space, or because the graphics were converted from something like low-res PCjr or C64... hadn't considered the speed advantage of making the coordinates fit in a byte.

I guess a related trick is what Mike Abrash did in his early games (fully addressable 320x200 resolution, but the real action was kept to the leftmost 256 pixels of each scanline - the right side of the screen was reserved for the status display).

[ WEB ] - [ BLOG ] - [ TUBE ] - [ CODE ]

### Reply 71 of 85, by wbhart

Rank Newbie
Rank
Newbie

There's an even more important advantage than making the coordinates fit into a byte, which I didn't mention. You have less computation to do as you move across the screen horizontally.

In the case of drawing a straight line, for example, something like 3/4 of the time is spent computing which pixels to draw, rather than the time spent putting the data into CGA memory. So if you have half as many pixels to "compute", the time goes down substantially.

Another important benefit is fewer cases. You only have two cases per "pixel". It's either in the left half or the right half of the byte. So if your algorithm depends on such cases you end up with half as many cases to deal with, which can mean you fit more cases between jumps. Sometimes you can use short jumps instead of long ones as a result.

So the benefits are really manifold. And it doesn't look too terrible.

Last edited by wbhart on 2020-07-16, 17:43. Edited 1 time in total.

### Reply 72 of 85, by wbhart

Rank Newbie
Rank
Newbie

For anyone interested, the new ellipse code is here:

https://github.com/wbhart/CGAGraphics/blob/ma … ast/ellipse.asm

Unfortunately it didn't come down to 4000 cycles for the setup, but the total cost for an ellipse is around 8700 + 140 cycles per pixel. That 8700 seems high, but it's only about 16 pixels worth per quadrant, which means that for medium to large ellipses it is not significant.

I could do some more work to reduce the size of the lookup tables, but other than that, I consider this code finished.

### Reply 74 of 85, by wbhart

Rank Newbie
Rank
Newbie

At the risk of posting EVERY SECOND VIDEO from my channel....

I have finally achieved 3D rotation (of a tetrahedron) in CGA graphics mode, which is a bit of a milestone for the CGA Graphics library.

If you want to see the video itself, it's here:

https://youtu.be/3DQ7HfGN60s

### Reply 75 of 85, by Zorko

Rank Newbie
Rank
Newbie

Dear wbhart,

I'm very impressed with your achievements at the quick drawing of pixel graphic and sprites, especially clipping off the outside of the screen! Especially on such a slow machine.

And I thought: wbhart is the person who can only help me.

I want to port the game from ZX Spectrum to DOS/CGA. The screen there consists of 32x24 colour characters (8x8). If we project this to a CGA resolution, we get a grid of the same 32x24 tiles with a resolution of 10x8 each tile. So I need a sprite output routine that outputs a sprite (with pixel-by-pixel precision) of any height and any width (specified in pixels). I started looking for ready-made solutions on the Internet and found that in fact I need to emulate the standard BGI graphics function "putimage" from Turbo C:

void far putimage(int left, Int top, void far *bitmap, int op);

Image format:

word width (in pixels)
word height (in pixels)
void *data of image

There is no alternating even and odd strings. Although undesirable, it is acceptable.

The list of wishes for the subroutine follows:

1. Not need the RLE compression or pre-compilation of sprites. In this case, this is unnecessary.
2. Need to be able to output images with a width that is not a multiple of 4 (for example, with a width of 10), i.e. the width of image is set in pixels, not in bytes.
3. Logical operations are also necessary.
4. It should also be interesting to have a version of this routine with out-of-screen output (as in your sprite engine). BGI function can't do it.

I've seen a lot of CGA graphics code, but it usually only outputs sprites that are multiples of 4 in width. So I rather need a subroutine for images than for a full-fledged sprite engine. And I think only you can develop such a procedure. Or just give me some good advice.

Thanks!

### Reply 76 of 85, by Zorko

Rank Newbie
Rank
Newbie

I came up with two tricks for the analog of "putimage" subroutine:

1. In the game Battle City for CGA, I saw a subroutine that is approximately similar to what I need. But that subroutine uses an internal buffer for bit-shifting the line.

http://www.balagurov.com/software/tank

How to do without a buffer? To do this, we need to read two bytes of the sprite line at once, shift the word in a register to the desired position, and output only one resulting byte. After that, we increase the line address by 1 to get one new, and one the same byte of a sprite, and shift it again.

So we will seriously speed up the work due to the lack of a buffer.

2. There is no need to store even and odd lines separately. Just one "go to the next line" operation will be enough, regardless of its parity.

To go from an even line to an odd one, need to increase the screen address by 1FFCh, and to go from an odd line to an even line - decrease by 1FB4h.
We can turn a subtraction operation into an addition operation by adjusting the number (-1FB4h = 0E04Ch).

Thus, we calculate the first number to go to the next line, regardless of its parity. And after adding, we turn one number to add to another. We need to turn the number 1FFCh into 0E04Ch, and 0E04Ch into 1FFCh.

N-1FFCh = 0E04Ch (word)
N-0E04C = 1FFCh (word)
N = 48h

Thus, after the increase to go to the next line, we simply adjust the increase value using the operation:

N = 48h - N

Subtraction and addition are quick operations. This way, both code compactness and relative speed will be achieved.

### Reply 77 of 85, by Zorko

Rank Newbie
Rank
Newbie

We can make it even easier:

If the first line is odd, GOTO 1\$
0\$:
Draw an even line
If there are no more lines, exit
Switch to the next odd line
1\$:
Draw an odd line
If there are no more lines, exit
Switch to the next even line
GOTO 0\$

This is a fast and good algorithm. This way, there is really no need to store even and odd lines in separate data blocks.

### Reply 78 of 85, by wbhart

Rank Newbie
Rank
Newbie

Hi zorko,

I think that a routine to put an image onscreen very fast is a great idea for the CGA library I'm writing. And I agree that sometimes one doesn't want the data format to be too complicated.

I am planning on making a routine for doing text, e.g. for a text scroller, and such a routine would be a good starting point.

I can't promise to do anything very soon, but I'll definitely put this on the list of ideas and work on it as soon as I find some time.

The one problem I see with your idea is that shifting is quite slow on the 8088/8086. It may actually be faster to use a buffer, but this will have to be investigated.

Your idea to use 48h - N is clever. If you have a spare register to hold the 48h then this would be quite quick. If not, one can do -(N - 48h) which is just two instructions, though there's quite a lot of bytes in the "subtraction of immediate value" instruction.

### Reply 79 of 85, by Zorko

Rank Newbie
Rank
Newbie

Dear wbhart!
Thanks for your reply. We can't do without a shift anyway, since we want to output positions that are not multiples of four.
I developed the subroutine that I roughly wanted. And it can be a starting point for learning. Also I will certainly be very happy with your additions and criticism.

Develop a subroutine that outputs a sprite (with pixel accuracy) of any height specified in pixels and any width specified in bytes
So far, there is an implementation only for output with logical operations. It will be necessary to modify the overlay "along the edges".

void PutSpr (int x, int y, void *spr)

Sprite format

Width (in bytes)
Height (in pixels)
Sprite data

1. Calculate the screen address of the first upper byte for output from the x and y coordinates
And the offset for the shift {0..3}.
If the offset =0, the byte is output unchanged, if 1, with a shift of 1 pixel to the right, and so on.
Convert (*2) offset to the number of bits to shift: 0=>0; 1=>2; 2=>4; 3=>6

2. Get from the sprite address the length and height of the sprite, save in registers

3. Draw the sprite line height times. Take into account the specifics of the CGA with its two screen memory planes, using the algorithm described above.

Implementation for CGA (Turbo C):

void GrApp_PutSpr (unsigned int x, unsigned int y, void *spr) { /* Draw CGA sprite */
asm MOV BX, x
asm MOV CL, BL
asm AND CL, 3
asm SHL CL, 1 /* CL - число бит для сдвига */
asm MOV AX, y
asm XCHG AH, AL
asm SHR AX, 1
asm XOR AL, AL
asm SHR AX, 1
asm SHR AX, 1
asm SHR BX, 1
asm SHR BX, 1 /* BX - смещение байта */
asm MOV AX, 0B800H
asm MOV ES, AX /* ES:BX = screen address */
asm MOV SI, spr /* DS:SI = sprite address */
asm CLD
asm LODSW /* len */
asm XCHG DX, AX /* DL = len; DH = hgt */
asm MOV DI, BX
asm CMP BH, 20H
asm JNC ODD
EVEN:
asm CALL @DRAWLINE /* Draw even line */
asm JZ EXIT
ODD:
asm CALL @DRAWLINE /* Draw odd line */
asm JZ EXIT
asm SUB DI, 1FB0H
asm JMP SHORT EVEN

The sprite line output:

Remember the screen address and length of the sprite on the stack
Get a byte of sprite data. Save it (in the conditional register AL)
Copy of the register AL shift to the desired position {0..3} and output it on the screen
Decrease the width of the sprite (in bytes)
If width = 0 GOTO Last_byte
DRAWBYTE:
Get a byte of sprite data
Construct from the previously stored and new byte the new word for the next shift
Remember the new byte instead of the old one
Shift a two-byte word to the desired position {0..3} and output it on the screen
Decrease the width of the sprite (in bytes)
If the width is # 0 GOTO DRAWBYTE
Last_byte:
If shift = 0 (the entire byte is already output by the first output, and it doesn't need to be output anymore),
then GOTO Exit
Construct a new word from the stored byte and zero (the last word in the line)
Output it on the screen
Exit:
Restore the length of the sprite and the screen address from the stack, decrease the height of the sprite
RETURN
asm @DRAWLINE:
asm PUSH DI
asm PUSH DX
asm LODSB
asm MOV BL, AL
asm SHR BL, CL /* 011100.10 => XX.011100 */
asm OR ES:[DI], BL
asm INC DI
asm DEC DL
asm JZ DRAWLAST
DRAWBYTE:
asm MOV AH, AL
asm LODSB
asm MOV BX, AX
asm SHR BX, CL /* 011100.10 11111111 => 10.111111 */
asm OR ES:[DI], BL
asm INC DI
asm DEC DL
asm JNZ DRAWBYTE
DRAWLAST:
asm MOV AH, AL
asm XOR AL, AL
asm CMP AL, CL
asm JZ ZEROLAST
asm SHR AX, CL /* 111111.11 => 10.111111 */
asm OR ES:[DI], AL
ZEROLAST:
asm POP DX
asm POP DI
asm DEC DH
asm RET
EXIT:;
} /*GrApp_PutSpr*/

Loop through output lines
Output the left byte (for PUT - with clipping, for AND/OR/XOR - with overlay)
Output all middle bytes (there can be 0)
Output the right byte (for PUT - with clipping, for AND/OR/XOR - with overlay)