VOGONS

Common searches


CGA Graphics library

Topic actions

Reply 40 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

To illustrate my point about instruction times, here is two pixels annotated with instruction times as per the 8086 book:

   mov al, [bx+di]      ; 8 + 5 (Base rel. index. EA)
and al, 0cfh ; 4
or al, 030h ; 4
mov [bx+di], al ; 9 + 5 (Base rel. index. EA)
add dx, bp ; 3
jg line2_incx31 ; taken: 16, not taken 4
add dx, si ; 3

mov al, [di] ; 8 + 5 (Base rel. index. EA)
and al, 0cfh ; 4
or al, 030h ; 4
stosb ; 11
add dx, bp ; 3
jg line2_incx32 ; taken: 16, not taken 4
add dx, si ; 3
add di, 79 ; 4

loop line2_loop2 ; taken 17 (usually the case), not taken 5

Assuming every second jump is taken on average, I get 117 cycles. Ok I see what I did wrong. I took the total for two pixels, not one.

So the instruction timings only account for about 58.5 cycles per pixel. Add 6 for DRAM refresh and an average of 12 for CGA wait states. Hmm, that's only 76.5. Well this still doesn't make any sense. Is the prefetch not prefetching instructions whilst the CPU is executing instructions? I guess there could be 8 cycles of stalls each time a branch is taken (if that isn't already factored into the instruction timings). I'm struggling to see where the rest of the time goes.

I count 31 bytes of instructions, which should be 124 cycles for the prefetch hardware. That's 62 per pixel.

Nope. I don't understand the bus cycle business.

YouTube Channel - PCRetroTech

Reply 41 of 85, by pan069

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

I'm struggling to see where the rest of the time goes.

I count 31 bytes of instructions, which should be 124 cycles for the prefetch hardware. That's 62 per pixel.

Nope. I don't understand the bus cycle business.

I have only been following this thread superficially so I might repeat or suggest things that aren't relevant, however, if I'm not mistaken, you didn't opt to use Abrash's ZenTimer (or a similar approach), correct? I think you're stuck in "counting reference manual clock cycles mode" to make sense of execution bottlenecks.

I think the way to optimise code is to lay down an approach you believe is fast, then you measure that approach based on time. If you're not happy with that time, you tweak or overhaul the approach and rise and repeat until you have exhausted ideas.

As you mentioned; "I don't understand the bus cycle business". I don't think anyone does and/or it isn't measurable in scientific way, more an act of God or something. The only way to get insight into performance of code is by timing, not by counting.

Take the clock cycles from the manual as an indication and be aware of the various cycle-eaters (as Abrash calls them).

Chapter 4 of Programming Black Book:

• The 8-bit bus cycle-eater causes each access to a word-sized operand to be 4 cycles longer than an equivalent access to a byt […]
Show full quote

• The 8-bit bus cycle-eater causes each access to a word-sized operand to be 4 cycles longer than an equivalent access to a byte-sized operand.
• The prefetch queue cycle-eater can cause instruction execution times to be as much as four times longer than the officially documented cycle times.
• The DRAM refresh cycle-eater slows most PC code, with performance reductions ranging as high as 8.33 percent.
• The display adapter cycle-eater typically doubles and can more than triple the length of the standard 4-cycle access to display memory, with intensive display memory access suffering most.

https://www.phatcode.net/res/224/files/html/c … .html#Heading24

Reply 42 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

Sure, which is why it totally shouldn't be used to describe mov mem/reg, mem/reg. It had me believing that reg would be on the left if the direction flag was 0 and on the right if 1. But that is not what they mean at all. There is a bit in the instruction encoding itself that they are calling the direction flag. It actually says, "d is the direction flag. If d = 0...."

Oh, I see what they mean! Yes, bit 1 in the opcode distinguishes between "mov reg,mem" and "mov mem,reg". But yeah, calling that bit the direction flag is very confusing because that's usually used to mean something completely different.

wbhart wrote:

I note that the number of bus accesses is probably much lower than the number of cpu cycles required for the instructions in this case. I'm also not sure I understand when I should add bus access times to the instruction timings. Is that only if there is a holdup due to too many bus accesses? Or does the CPU always incur these costs regardless? If the latter, then I am having trouble figuring out the timings, as it seems to be entirely accounted for by CPU instruction timings! We are using some pretty hefty instructions.

The CPU can be thought of as two units operating asynchronously, for the most part. The execution unit (EU) is what actually executes the instructions once they are in the CPU, and the bus interface unit (BIU) handles reads and writes and fetches instruction bytes into the prefetch queue when neither is going on (and when the queue is not full). The published instruction timings are (roughly) EU-only timings, which is to say that they are the best-case timings when the prefetch queue is full and there are no bus holds or wait states. How long the EU has to wait for the BIU to do a read or a write depends on where it is in the prefetch bus cycle when the request is made. There are some counter-intuitive behaviours, like sometimes a instruction starting later will result in it finishing earlier.

wbhart wrote:

Logic dictates it must be towards the end of the instruction, as EA needs to be computed and so on. But it does seem like writes can take a cycle less than reads from memory, if the mov mem/reg, mem/reg timings are anything to go by.

Yes, the EU can fire off a write to the BIU and then immediately start the next instruction. Whereas for a read it must wait for the read value to come back from the BIU and do something with it (like put it in a register).

wbhart wrote:

I was thinking that a 5/5/6 cycle cadence would be optimal for CGA accesses based on the wait state information in your blog. That seems to imply that if the code is timed to request data from the CGA memory after 5, 10 or 15 cycles, that would be optimal, given that we cannot insert an extra cycle every 16 cycles (and DMA refresh probably ruins optimality anyway).

Yes, that sounds about right. The wait states introduced by the CGA are 3 cycles minimum, 8 cycles maximum. So if you are getting the worst case behaviour there is the potential to move an instruction of up to 5 cycles before that CGA access and have it execute for free. But in practice the wait states you see probably aren't going to be consistent from iteration to iteration, because the prefetch queue and bus state are also not going to be consistent. You never know without trying, though!

wbhart wrote:

By the way, I've seen people talking about turning off DMA refresh. But how does one's program code and data get refreshed if you do this? Or is it possible to selectively turn off DMA refresh for certain segments? (Not that this is a good piece of code to do this in. I am just curious.)

It is not possible to selectively turn it off. You can turn it off without endangering memory contents for a short period if you turn up the refresh rate to a high rate for a while first (to ensure all RAM has been recently refreshed) and do the same when you turn it back on again (to ensure that no row goes too long without a refresh). It's useful for doing precise and consistent timing measurements but is not really helpful for normal code. I used to think it would be possible to refresh by ensuring that each row was accessed by the running code (either by reads or prefetches) but that turns out only to refresh the bank being accessed, whereas the proper mechanism refreshes RAM in all the banks at once.

wbhart wrote:

I count 31 bytes of instructions, which should be 124 cycles for the prefetch hardware. That's 62 per pixel.

Nope. I don't understand the bus cycle business.

The EU and BIU largely run in parallel, so you can't just add together the BIU cycles and the EU cycles without knowing how many overlap. There will be quite a few cycles where the EU is waiting for the BIU (either because the next instruction hasn't been fetched yet or because the instruction needs to do a bus access and must wait for the current prefetch cycle to complete). So without knowing what is happening on each cycle of each instruction, you can't really count cycles accurately the way you can on simpler CPUs.

Reply 43 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I managed to shave a few cycles off the verticalish lines. The times are now down to 17.95s or ~106.5 cycles/pixel.

I also dug an enormous amount of bugs and corner cases out of the code and wrote a verticalish line blanking routine.

This is enough to get my original demo working again that started this whole thing. I made a video showing it off here: https://youtu.be/hxNBbWojhaQ

The video starts with about a minute of footage of a retro hard drive, but it quickly gets on with the demo. There's a lot of flicker due to the camera, but towards the end of the video I mostly sort that out too. The result looks quite good I think.

YouTube Channel - PCRetroTech

Reply 44 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

So I decided to have another go at this cycle counting business. I feel like I have more confidence doing this with AMD CPU's of 2008 era than the original 8086 because one can often prove optimality by counting retirement of macro ops.

Anyhow, in the following I give two timings for each instruction. The first column is the number of clocks before the instruction is available to be executed due to not being in the prefetch queue. The second column will be the instruction execution timing as per the 8086 book. I'll assume a 4 byte prefetch queue which I will arbitrarily take to be empty at the beginning of the loop. I will assume there are 6 cycles of wait states on average on the bus with every CGA memory access, as we don't know what the numbers should really be. I'll also add 4 cycles on the bus for every byte of CGA memory read/written. I won't explicitly annotate those, but will take them into account when working out how many cycles the prefetch hardware has available to access the bus. Of course I will use a figure of 4 cycles per bus access, and I will assume each byte of instruction code requires a bus access. I'll also assume the prefetch queue is empty after a jump taken. At each conditional jump I'll split into two columns, one for taken and the other for not taken, and I will merge again if the times start to coincide at some point. At the end I'll take the average over all paths. I'll assume the loop is always taken.

The total of all the numbers I wrote down over an average path should approximate the total real world time after adding an extra 1/12 for DMA refresh.

   mov al, [bx+di]      ; 8, 13
and al, 0cfh ; 5, 4
or al, 030h ; 4, 4
mov [bx+di], al ; 4, 14
add dx, bp ; 4, 3
jg line2_incx31 ; 5, 16 5, 4
add dx, si ; 4, 3

mov al, [di] ; 8, 13 5, 13
and al, 0cfh ; 5, 4
or al, 030h ; 4, 4
stosb ; 0, 11
add dx, bp ; 7, 3
jg line2_incx32 ; 5, 16 5, 4
add dx, si ; 4, 3
add di, 79 ; 12, 4 11, 4

loop line2_loop2 ; 4, 17

The totals for the four paths through the code are 201, 193, 195, 187. The average of these is 194. Add 1/12 for DRAM refresh gives 210 cycles, or 105 cycles per pixel, which is close enough to the 108 I measured to be about right, especially given that the lead in and out code contributes a few cycles per pixel on average.

Of course I cheated a little by knowing what answer I was expecting.

YouTube Channel - PCRetroTech

Reply 45 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I've now added a circle drawing routine (with adjustment for a 1.2:1 CGA pixel aspect ratio) in cga_draw_circle1() in cga5.asm. It currently only draws the right hand side of the circle, but the other side will be analogous.

After some timings, I get about 111 cycles/pixel on the 8088 @ 4.77MHz, which is about the same as for verticalish lines. Of course it's only so low because I'm exploiting symmetry to draw two pixels at once.

I can probably shave a couple of cycles off the verticalish part of the code, but I don't see how to speed up the horizontalish part because I ran out of registers.

I'll hopefully fill in the other half of the circle tomorrow some time.

YouTube Channel - PCRetroTech

Reply 46 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

The full circle routine is working now and the timings are 109 cycles/pixel on the 8088 @ 4.77MHz,

Next I will do the xor and blanking version and if I get motivated enough, maybe the special colour 0 and 3 versions.

YouTube Channel - PCRetroTech

Reply 47 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I've now added special versions of circle drawing for colours 0 and 3 (which just use AND and OR), an XOR version and a circle blanking routine. The timings (on a 4.77MHz 8088) are now:

full : 109 cycles/pixel
or/and/xor : 98.5 cycles/pixel
blanking : 78.5 cycles/pixel

I thought about the ellipse code to see if it is possible to do a version of that for any size ellipse which uses only registers and only (short) conditional jumps in the main loops. But so far I have not come up with a way to do it, at least not in a meaningful way. There just aren't enough registers, it seems.

YouTube Channel - PCRetroTech

Reply 48 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

I thought about the ellipse code to see if it is possible to do a version of that for any size ellipse which uses only registers and only (short) conditional jumps in the main loops. But so far I have not come up with a way to do it, at least not in a meaningful way. There just aren't enough registers, it seems.

Not sure I'll have time to look at this very soon, but I do like a challenge! Can I see what you've got so far? It seems like it ought to be possible. The technique that springs to mind is using an immediate operand in the actual code as a "register". Slower than an actual register, but faster than spilling it completely to memory.

patch:
mov ax,9999 ; This value is patched
add ax,9999 ; This value is also patched
mov [cs:patch+1],ax
add bx,ax

Reply 49 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

@reenigne The problem is that one needs 32 bit values. As I see it, there are at least three required. That's already 6 registers.

Here is some Julia code that shows the algorithm:

function ellipse(r::Int, s::Int)
x = r;
y = 0;
r_orig = r
c = s*s
a = r*r
D = 0
xdelta = 2*c*r_orig
ydelta = a
while xdelta >= ydelta
println("(", x, ", ", y, ")")
D += ydelta; ydelta += 2a; y += 1;
if D >= div(xdelta, 2)
xdelta -= c; D -= xdelta; xdelta -= c; x -= 1;
end
end
D = -D
while xdelta >= 0
println("(", x, ", ", y, ")")
xdelta -= c; D += xdelta; xdelta -= c; x -= 1;
if D > div(ydelta, 2)
D -= ydelta; ydelta += 2a; y += 1
end
end
end

Naturally c and a can go in the code in the way you suggest. But xdelta, ydelta and D need much more than 16 bits. There are some tricks to squeeze an extra bit or two out here and there, but I think there's no way around needing two registers for each. Then you need AX for an accumulator (both halves), DI for the offset, at least one for the odd/even line thing, another to exploit symmetry in the ellipse.

Perhaps xdelta, ydelta and D can each be done in 24 bits. Then it might be possible, but only if you drop the symmetry trick.

The best code to look at is cga_draw_circle1/2 in CGA5.asm. It's precisely the algorithm in the Julia code above where r and s are set to 6 and 5 (you can always replace semiradii r, s with r/d, s/d where d = gcd(r, s), though r_orig in the code above cannot be divided by d. Of course if your ellipse has semiradii 158 and 99 then the gcd is 1, so this trick only saves you if you have a large d).

Last edited by wbhart on 2019-10-04, 15:09. Edited 2 times in total.

YouTube Channel - PCRetroTech

Reply 50 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Yeah, 24 bits seems to be enough, just. So if I didn't miscalculate, it is possible in terms of the number of registers if you drop the exploitation of symmetry. But this isn't a slam dunk if you want to only use short jumps. The code for updating 24 bit values is not cheap, especially as you have to do both signed and unsigned comparisons (the code for circles was already a stretch).

Even if it's possible, it's totally not clear to me if a cleaned up version of my original ellipse code (in CGA.asm) would be faster with spilling but exploitation of symmetry, or whether a version with no spilling and no symmetry would be better. As usual, one has to write both to figure that out.

YouTube Channel - PCRetroTech

Reply 52 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Yes, I see what you mean. Then I only have to (explicitly) write them and not read them from memory.

I'm not sure how possible that would be since both are accessed multiple times in each iteration. If one of those were not inside a conditional block it might work, since you'd know where it would be accessed next.

It's even worse if you consider breaking up into 4 or 8 cases to handle the video side of things, as you have one of two places where you will access the value.

YouTube Channel - PCRetroTech

Reply 53 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie

It looks like each iteration of the two inner loops separates fairly nicely into an "xdelta live" section and a "ydelta live" section, except for the fact that in the first loop you need to compare xdelta against ydelta. I wonder if you can get away with just comparing the most significant words of those, or if that causes visible glitches.

Reply 54 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I will check, but all attempts to lose any bits have so far failed for me. I've tried all sorts of things. It really seems like all bits are critical for the algorithm.

But I now see what you mean about keeping xdelta live, etc.

Of course it may be possible to replace the first comparison with a number of iterations. I'm not sure how expensive that would be to compute. Roughly speaking it can be expressed as some condition on the derivative, which should be just a linear condition, so it probably involves one division. It's difficult to get it exactly right though.

I looked through a bunch of old papers on ellipse algorithms, but there are only ones that halve the maximum integer that appears. So they save me one whole bit. Whoopdidoo!

YouTube Channel - PCRetroTech

Reply 55 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

There's a relatively short list of ellipses that yield glitches when comparing the most significant words of xdelta and ydelta. Most could be handled by ellipse code that only worked for small semiradii, leaving a very short list which would have to be handled by more general code.

In fact, here is an exhaustive list of such semiradii (assuming maximum semiradii of 160, 100 of course).

8, 63
9, 80
15, 43
18, 63
26, 94
30, 19

YouTube Channel - PCRetroTech

Reply 56 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Oh, except the following *always* draws the ellipse without any glitches (up to semiradii 160, 100)!!

function ellipse2(A, r::Int, s::Int)
i = 1;
x = r;
y = 0;
r_orig = r
c = (s*s) << 8
a = (r*r) << 8
D = 0
xdelta = 2*c*r_orig
ydelta = a
while (xdelta >> 8) >= (ydelta >> 8)
A[i] = (x, y); i += 1;
D += ydelta; ydelta += 2a; y += 1;
if (D >> 8) >= (xdelta >> 9)
xdelta -= c; D -= xdelta; xdelta -= c; x -= 1;
end
end
D = -D
while (xdelta >> 8) >= 0
A[i] = (x, y); i += 1;
xdelta -= c; D += xdelta; xdelta -= c; x -= 1;
if (D >> 8) > (ydelta >> 9)
D -= ydelta; ydelta += 2a; y += 1
end
end
return i - 1
end

You might be onto something (assuming I didn't screw up somewhere here)!!

YouTube Channel - PCRetroTech

Reply 57 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Below is what each pixel of the verticalish part of the ellipse looks like (modulo errors), with the following registers:

   ; di, di+bx offsets of points above and below axis, ah:accumulator/scratch
; dx:deltax (hi16), sp:yinc, bp:deltay (hi16),
; al: D (lo8), ch: deltax (lo8), cl: deltay (lo8), si: D (hi16)

There's about 80 bytes there, which is just far too much. This would need to be well under 60 bytes in order for the conditional jumps to reach (there are four such sections, one for each value of the screen x coordinate modulo 4). Though I haven't ruled out that there is some clever rearrangement of the code that makes all the jumps reach.

If no such rearrangement exists, I think this implies that we need to do a more naive translation of the Julia code above (which I further simplified above), rather than splitting into cases for x mod 4. That of course means a higher cost for doing the mask and colour arithmetic, and also means storing at least 16 bits in cs as immediates to make space for both the mask and colour (and an extra byte for an accumulator).

The other option is to drop symmetry, but I think this still wont get us over the line.

ellipse1_jump2:
mov ah, [di+bx] ; draw pixel above axis
and ah, 0cfh
ellipse1_patch1:
or ah, 030h
mov [di+bx], ah
mov ah, [di] ; draw pixel below axis
and ah, 0cfh
ellipse1_patch2:
or ah, 030h
mov [di], ah

add di, sp ; update offset
ellipse1_patch3:
xor sp, 01234h ; update offset update for odd<->even
sub bx, 80 ; decrement/increment y lines

add ah, cl ; D += dy
adc si, bp
ellipse1_patch4:
add cl, 012h ; dy += 2r^2
ellipse1_patch5:
add bp, 01234h

shr dx, 1 ; if dx/2 <= D, decrement x
sahf
cmp dx, si
jle ellipse1_x1

lahf
rcl dx, 1
cmp dx, bp ; check if done verticalish
jae ellipse1_jump2
jmp ellipse1_donev2 ; done verticalish

ellipse1_x2:
lahf
rcl dx, 1
ellipse1_patch6:
sub ch, 012h ; dx -= s^2
ellipse1_patch7:
sbb dx, 01234h
sub al, ch ; D -= dx
sbb si, dx
ellipse1_patch8:
sub ch, 012h ; dx -= s^2
ellipse1_patch9:
sbb dx, 01234h

cmp dx, bp ; check if done verticalish
jae ellipse1_jump2
jmp ellipse1_donev2 ; done verticalish

YouTube Channel - PCRetroTech

Reply 58 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I think I may have found a reordering of the code which brings the maximum conditional jump down from over 160 to just over 110 bytes. It looks promising, but I'll have to implement it to make sure I didn't screw up. I'll probably do that tomorrow.

YouTube Channel - PCRetroTech

Reply 59 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Below is how the code works out for the verticalish part of the ellipse.

Basically, if we number code sections 1, 1a, 2, 2a, 3, 3a, 4, 4a where section n is the section that deals with pixels that have x-coord equal to n-1 mod 4, and the "a" sections are the alternative endings for those sections, then the ordering of the code below is:

2, 2a, 3, 1a, 3a, 4, 4a, 1

Section n has to jump back to the start of itself, and the end of section na has to jump there too. The end of section n also has to jump to the start of section (n-1)a.

So with this ordering, the maximum jump is from the end of 1a to the start of 1 or from the end of 3a to the start of 3, both of which are equal to two "a" sections plus one "non-a" section.

By my calculation a "non-a" section is 52 bytes and an "a" section is 28 bytes (29 in the case of 4a). That makes the longest jump from the end of 1 to the start of 1a, which is 52 + 28 + 29 + 1 + 3 = 113 bytes, where the 1 is for alignment and the 3 is because the jump isn't quite at the end of a section.

The only thing that is worse than the circle code so far (apart from the huge number of patches and the fact that the computed jump is now a nightmare) is the 24 bit arithmetic, which is probably unavoidable, and the fact that there isn't a 16 bit accumulator free to move dx into so it can non-destructively be shifted right by 1. This necessitates doing the shift in dx itself, which in turn necessitates the use of lahf/sahf combinations to "save" the bit that is shifted out, so it can be shifted back in again.

I haven't managed to find a way around this, either mathematically or in code.

I guess the other slowdown is due to having to deal with the two symmetric points entirely separately, again due to only ah being available, and having to use ah instead of al, meaning stosb is not available, all due to needing lahf/sahf.

                        ; verticalish part of circle
ALIGN 2
ellipse1_jump2:
mov ah, [di+bx] ; draw pixel above axis
and ah, 0cfh
ellipse1_patch1:
or ah, 030h
mov [di+bx], ah
mov ah, [di] ; draw pixel below axis
and ah, 0cfh
ellipse1_patch2:
or ah, 030h
mov [di], ah

add di, sp ; update offset
ellipse1_patch3:
xor sp, 01234h ; update offset update for odd<->even
sub bx, 80 ; decrement/increment y lines

add al, cl ; D += dy
adc si, bp
ellipse1_patch4:
add cl, 012h ; dy += 2r^2
ellipse1_patch5:
adc bp, 01234h

shr dx, 1 ; if dx/2 <= D, decrement x
lahf
cmp dx, si
jle ellipse1_x1

sahf
rcl dx, 1
cmp dx, bp ; check if done verticalish
jae ellipse1_jump2
jmp ellipse1_donev2 ; done verticalish

ellipse1_x2:
sahf
rcl dx, 1
ellipse1_patch6:
sub ch, 012h ; dx -= s^2
ellipse1_patch7:
sbb dx, 01234h
sub al, ch ; D -= dx
sbb si, dx
ellipse1_patch8:
sub ch, 012h ; dx -= s^2
ellipse1_patch9:
sbb dx, 01234h

cmp dx, bp ; check if done verticalish
jae ellipse1_jump2
jmp ellipse1_donev2 ; done verticalish


ALIGN 2
ellipse1_jump3:
mov ah, [di+bx] ; draw pixel above axis
and ah, 0f3h
Show last 161 lines
ellipse1_patch10:
or ah, 0ch
mov [di+bx], ah
mov ah, [di] ; draw pixel below axis
and ah, 0f3h
ellipse1_patch11:
or ah, 0ch
mov [di], ah

add di, sp ; update offset
ellipse1_patch12:
xor sp, 01234h ; update offset update for odd<->even
sub bx, 80 ; decrement/increment y lines

add al, cl ; D += dy
adc si, bp
ellipse1_patch13:
add cl, 012h ; dy += 2r^2
ellipse1_patch14:
adc bp, 01234h

shr dx, 1 ; if dx/2 <= D, decrement x
lahf
cmp dx, si
jle ellipse1_x2

sahf
rcl dx, 1
cmp dx, bp ; check if done verticalish
jae ellipse1_jump3
jmp ellipse1_donev3 ; done verticalish

ellipse1_x1:
sahf
rcl dx, 1
ellipse1_patch15:
sub ch, 012h ; dx -= s^2
ellipse1_patch16:
sbb dx, 01234h
sub al, ch ; D -= dx
sbb si, dx
ellipse1_patch17:
sub ch, 012h ; dx -= s^2
ellipse1_patch18:
sbb dx, 01234h

cmp dx, bp ; check if done verticalish
jae ellipse1_jump1
jmp ellipse1_donev1 ; done verticalish


ellipse1_x3:
sahf
rcl dx, 1
ellipse1_patch19:
sub ch, 012h ; dx -= s^2
ellipse1_patch20:
sbb dx, 01234h
sub al, ch ; D -= dx
sbb si, dx
ellipse1_patch21:
sub ch, 012h ; dx -= s^2
ellipse1_patch22:
sbb dx, 01234h

cmp dx, bp ; check if done verticalish
jae ellipse1_jump3
jmp ellipse1_donev3 ; done verticalish

ALIGN 2
ellipse1_jump4:
mov ah, [di+bx] ; draw pixel above axis
and ah, 0fch
ellipse1_patch23:
or ah, 03h
mov [di+bx], ah
mov ah, [di] ; draw pixel below axis
and ah, 0fch
ellipse1_patch24:
or ah, 03h
mov [di], ah

add di, sp ; update offset
ellipse1_patch25:
xor sp, 01234h ; update offset update for odd<->even
sub bx, 80 ; decrement/increment y lines

add al, cl ; D += dy
adc si, bp
ellipse1_patch26:
add cl, 012h ; dy += 2r^2
ellipse1_patch27:
adc bp, 01234h

shr dx, 1 ; if dx/2 <= D, decrement x
lahf
cmp dx, si
jle ellipse1_x3

sahf
rcl dx, 1
cmp dx, bp ; check if done verticalish
jae ellipse1_jump4
jmp ellipse1_donev4 ; done verticalish


ellipse1_x4:
dec di
sahf
rcl dx, 1
ellipse1_patch28:
sub ch, 012h ; dx -= s^2
ellipse1_patch29:
sbb dx, 01234h
sub al, ch ; D -= dx
sbb si, dx
ellipse1_patch30:
sub ch, 012h ; dx -= s^2
ellipse1_patch31:
sbb dx, 01234h

cmp dx, bp ; check if done verticalish
jae ellipse1_jump4
jmp ellipse1_donev4 ; done verticalish

ALIGN 2
ellipse1_jump1:
mov ah, [di+bx] ; draw pixel above axis
and ah, 03fh
ellipse1_patch32:
or ah, 0c0h
mov [di+bx], ah
mov ah, [di] ; draw pixel below axis
and ah, 03fh
ellipse1_patch33:
or ah, 0c0h
mov [di], ah

add di, sp ; update offset
ellipse1_patch34:
xor sp, 01234h ; update offset update for odd<->even
sub bx, 80 ; decrement/increment y lines

add al, cl ; D += dy
adc si, bp
ellipse1_patch35:
add cl, 012h ; dy += 2r^2
ellipse1_patch36:
adc bp, 01234h

shr dx, 1 ; if dx/2 <= D, decrement x
lahf
cmp dx, si
jle ellipse1_x4

sahf
rcl dx, 1
cmp dx, bp ; check if done verticalish
jae ellipse1_jump1
jmp ellipse1_donev1 ; done verticalish

YouTube Channel - PCRetroTech