To illustrate my point about instruction times, here are two pixels annotated with instruction times as per the 8086 book:

    mov al, [bx+di]    ; 8 + 5 (Base rel. index. EA)
    and al, 0cfh       ; 4
    or al, 030h        ; 4
    mov [bx+di], al    ; 9 + 5 (Base rel. index. EA)
    add dx, bp         ; 3
    jg line2_incx31    ; taken: 16, not taken 4
    add dx, si         ; 3

    mov al, [di]       ; 8 + 5 (Base rel. index. EA)
    and al, 0cfh       ; 4
    or al, 030h        ; 4
    stosb              ; 11
    add dx, bp         ; 3
    jg line2_incx32    ; taken: 16, not taken 4
    add dx, si         ; 3
    add di, 79         ; 4

    loop line2_loop2   ; taken 17 (usually the case), not taken 5

Assuming every second jump is taken on average, I get 117 cycles. OK, I see what I did wrong: I took the total for two pixels, not one.
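As a sanity check on the 117 figure, here is the same tally in Python (my own bookkeeping, assuming a taken jg skips the 3-cycle add that follows it):

```python
# Cycle counts from the annotated listing above; a taken jg is assumed
# (hypothetically) to jump past the 3-cycle "add dx, si" that follows it.
pixel1 = 13 + 4 + 4 + 14 + 3      # mov/and/or/mov + add dx,bp
pixel2 = 13 + 4 + 4 + 11 + 3      # mov/and/or/stosb + add dx,bp
tail = 4 + 17                     # add di,79 + loop (taken)
branch = (16 + (4 + 3)) / 2       # jg taken vs not-taken+add, averaged
total = pixel1 + pixel2 + tail + 2 * branch
print(total)  # 117.0 for the two-pixel loop body
```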

So the instruction timings only account for about 58.5 cycles per pixel. Add 6 for DRAM refresh and an average of 12 for CGA wait states. Hmm, that's only 76.5. Well this still doesn't make any sense. Is the prefetch not prefetching instructions whilst the CPU is executing instructions? I guess there could be 8 cycles of stalls each time a branch is taken (if that isn't already factored into the instruction timings). I'm struggling to see where the rest of the time goes.

I count 31 bytes of instructions, which should be 124 cycles for the prefetch hardware. That's 62 per pixel.

Nope. I don't understand the bus cycle business.

I have only been following this thread superficially, so I might repeat or suggest things that aren't relevant. However, if I'm not mistaken, you didn't opt to use Abrash's ZenTimer (or a similar approach), correct? I think you're stuck in "counting reference manual clock cycles" mode trying to make sense of execution bottlenecks.

I think the way to optimise code is to lay down an approach you believe is fast, then measure it. If you're not happy with the time, you tweak or overhaul the approach, and rinse and repeat until you have exhausted ideas.

As you mentioned: "I don't understand the bus cycle business". I don't think anyone does, and/or it isn't measurable in a scientific way; more an act of God or something. The only way to get insight into the performance of code is by timing, not by counting.

Take the clock cycles from the manual as an indication and be aware of the various cycle-eaters (as Abrash calls them).

Chapter 4 of Abrash's Graphics Programming Black Book:

• The 8-bit bus cycle-eater causes each access to a word-sized operand to be 4 cycles longer than an equivalent access to a byte-sized operand.
• The prefetch queue cycle-eater can cause instruction execution times to be as much as four times longer than the officially documented cycle times.
• The DRAM refresh cycle-eater slows most PC code, with performance reductions ranging as high as 8.33 percent.
• The display adapter cycle-eater typically doubles and can more than triple the length of the standard 4-cycle access to display memory, with intensive display memory access suffering most.

Sure, which is why it totally shouldn't be used to describe mov mem/reg, mem/reg. It had me believing that reg would be on the left if the direction flag was 0 and on the right if 1. But that is not what they mean at all. There is a bit in the instruction encoding itself that they are calling the direction flag. It actually says, "d is the direction flag. If d = 0...."

Oh, I see what they mean! Yes, bit 1 in the opcode distinguishes between "mov reg,mem" and "mov mem,reg". But yeah, calling that bit the direction flag is very confusing because that's usually used to mean something completely different.
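For what it's worth, that bit can be read straight out of the opcode byte; a tiny sketch using the standard mov opcodes:

```python
# Bit 1 of the opcode byte is the "d" (direction) bit: d = 0 means the
# reg field is the source (mov r/m, reg); d = 1 means it is the
# destination (mov reg, r/m).
def d_bit(opcode):
    return (opcode >> 1) & 1

print(d_bit(0x89))  # mov r/m16, r16 -> 0 (reg is the source)
print(d_bit(0x8B))  # mov r16, r/m16 -> 1 (reg is the destination)
```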

wbhart wrote:

I note that the number of bus accesses is probably much lower than the number of cpu cycles required for the instructions in this case. I'm also not sure I understand when I should add bus access times to the instruction timings. Is that only if there is a holdup due to too many bus accesses? Or does the CPU always incur these costs regardless? If the latter, then I am having trouble figuring out the timings, as it seems to be entirely accounted for by CPU instruction timings! We are using some pretty hefty instructions.

The CPU can be thought of as two units operating asynchronously, for the most part. The execution unit (EU) is what actually executes the instructions once they are in the CPU, and the bus interface unit (BIU) handles reads and writes, and fetches instruction bytes into the prefetch queue when neither of those is going on (and when the queue is not full). The published instruction timings are (roughly) EU-only timings, which is to say that they are the best-case timings when the prefetch queue is full and there are no bus holds or wait states. How long the EU has to wait for the BIU to do a read or a write depends on where the BIU is in the prefetch bus cycle when the request is made. There are some counter-intuitive behaviours, like an instruction that starts later sometimes finishing earlier.

wbhart wrote:

Logic dictates it must be towards the end of the instruction, as EA needs to be computed and so on. But it does seem like writes can take a cycle less than reads from memory, if the mov mem/reg, mem/reg timings are anything to go by.

Yes, the EU can fire off a write to the BIU and then immediately start the next instruction. Whereas for a read it must wait for the read value to come back from the BIU and do something with it (like put it in a register).

wbhart wrote:

I was thinking that a 5/5/6 cycle cadence would be optimal for CGA accesses based on the wait state information in your blog. That seems to imply that if the code is timed to request data from the CGA memory after 5, 10 or 15 cycles, that would be optimal, given that we cannot insert an extra cycle every 16 cycles (and DMA refresh probably ruins optimality anyway).

Yes, that sounds about right. The wait states introduced by the CGA are 3 cycles minimum, 8 cycles maximum. So if you are getting the worst case behaviour there is the potential to move an instruction of up to 5 cycles before that CGA access and have it execute for free. But in practice the wait states you see probably aren't going to be consistent from iteration to iteration, because the prefetch queue and bus state are also not going to be consistent. You never know without trying, though!

wbhart wrote:

By the way, I've seen people talking about turning off DMA refresh. But how does one's program code and data get refreshed if you do this? Or is it possible to selectively turn off DMA refresh for certain segments? (Not that this is a good piece of code to do this in. I am just curious.)

It is not possible to selectively turn it off. You can turn it off without endangering memory contents for a short period if you turn up the refresh rate to a high rate for a while first (to ensure all RAM has been recently refreshed) and do the same when you turn it back on again (to ensure that no row goes too long without a refresh). It's useful for doing precise and consistent timing measurements but is not really helpful for normal code. I used to think it would be possible to refresh by ensuring that each row was accessed by the running code (either by reads or prefetches) but that turns out only to refresh the bank being accessed, whereas the proper mechanism refreshes RAM in all the banks at once.

wbhart wrote:

I count 31 bytes of instructions, which should be 124 cycles for the prefetch hardware. That's 62 per pixel.

Nope. I don't understand the bus cycle business.

The EU and BIU largely run in parallel, so you can't just add together the BIU cycles and the EU cycles without knowing how many overlap. There will be quite a few cycles where the EU is waiting for the BIU (either because the next instruction hasn't been fetched yet or because the instruction needs to do a bus access and must wait for the current prefetch cycle to complete). So without knowing what is happening on each cycle of each instruction, you can't really count cycles accurately the way you can on simpler CPUs.
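A back-of-envelope way to see the overlap problem, using the per-pixel figures from earlier in the thread (58.5 documented EU cycles and 62 prefetch cycles): the true time has to land between the fully-overlapped and never-overlapped extremes.

```python
# Per-pixel figures from the thread: documented EU cycles, and prefetch
# bus cycles (31 bytes * 4 cycles / 2 pixels).
eu = 58.5
biu = 62.0
print(max(eu, biu))  # 62.0: lower bound, EU and BIU fully overlapped
print(eu + biu)      # 120.5: upper bound, no overlap at all
```

The measured ~108 cycles/pixel falls between the two bounds, which is what you'd expect when the EU spends some, but not all, of its time waiting on the BIU.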

I managed to shave a few cycles off the verticalish lines. The times are now down to 17.95s or ~106.5 cycles/pixel.

I also dug an enormous number of bugs and corner cases out of the code and wrote a verticalish line blanking routine.

This is enough to get my original demo working again that started this whole thing. I made a video showing it off here: https://youtu.be/hxNBbWojhaQ

The video starts with about a minute of footage of a retro hard drive, but it quickly gets on with the demo. There's a lot of flicker due to the camera, but towards the end of the video I mostly sort that out too. The result looks quite good I think.

So I decided to have another go at this cycle counting business. I feel like I have more confidence doing this with AMD CPUs of the 2008 era than with the original 8086, because there one can often prove optimality by counting retirement of macro-ops.

Anyhow, in the following I give two timings for each instruction. The first column is the number of clocks before the instruction is available to be executed, due to not being in the prefetch queue. The second column is the instruction execution timing as per the 8086 book. My assumptions:

• A 4 byte prefetch queue, which I arbitrarily take to be empty at the beginning of the loop, and again after every taken jump.
• An average of 6 cycles of wait states on the bus with every CGA memory access, as we don't know what the numbers should really be.
• An extra 4 cycles on the bus for every byte of CGA memory read/written. I won't explicitly annotate those, but will take them into account when working out how many cycles the prefetch hardware has available to access the bus.
• 4 cycles per bus access, with each byte of instruction code requiring one bus access.
• The loop is always taken.

At each conditional jump I'll split into two columns, one for taken and the other for not taken, and merge again if the times start to coincide at some point. At the end I'll take the average over all paths.
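To make this bookkeeping concrete, here is a deliberately crude Python model of the same idea. It is my own simplification, not a cycle-exact 8088 simulator: one 4-cycle bus access per prefetched byte, a 4-byte queue that refills whenever the EU leaves the bus idle, and operand accesses costing 4 cycles plus wait states each.

```python
QUEUE_CAP = 4   # assumed 4-byte prefetch queue, as in the post
FETCH = 4       # bus cycles per prefetched instruction byte

def simulate(instructions, iterations=1):
    """Very rough EU/BIU overlap model. Each instruction is a tuple
    (size_bytes, eu_cycles, mem_accesses, wait_states)."""
    t = 0                # global clock
    queue = 0            # bytes currently in the prefetch queue
    bus_busy_until = 0   # cycle at which the bus next becomes free
    for _ in range(iterations):
        for size, eu, mem, ws in instructions:
            # stall until all of the instruction's bytes have been fetched
            while queue < size:
                bus_busy_until = max(t, bus_busy_until) + FETCH
                t = bus_busy_until
                queue += 1
            queue -= size
            # EU executes; any idle bus time refills the queue
            end = t + eu
            while queue < QUEUE_CAP and max(t, bus_busy_until) + FETCH <= end:
                bus_busy_until = max(t, bus_busy_until) + FETCH
                queue += 1
            # operand reads/writes occupy the bus after the EU is ready
            for _ in range(mem):
                bus_busy_until = max(end, bus_busy_until) + FETCH + ws
                end = bus_busy_until
            t = end
    return t

# e.g. a 1-byte, 3-cycle, register-only instruction from a cold queue:
print(simulate([(1, 3, 0, 0)]))  # 4 cycles to fetch + 3 to execute = 7
```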

The total of all the numbers I wrote down over an average path should approximate the total real world time after adding an extra 1/12 for DMA refresh.

The totals for the four paths through the code are 201, 193, 195, 187. The average of these is 194. Add 1/12 for DRAM refresh gives 210 cycles, or 105 cycles per pixel, which is close enough to the 108 I measured to be about right, especially given that the lead in and out code contributes a few cycles per pixel on average.
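The averaging works out as follows:

```python
# Totals for the four paths through the code, averaged, then scaled by
# 13/12 to account for the 1-in-12 cycles lost to DRAM refresh.
paths = [201, 193, 195, 187]
avg = sum(paths) / len(paths)
print(avg)                   # 194.0
print(round(avg * 13 / 12))  # 210, i.e. 105 cycles per pixel
```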

Of course I cheated a little by knowing what answer I was expecting.

I've now added a circle drawing routine (with adjustment for a 1.2:1 CGA pixel aspect ratio) in cga_draw_circle1() in cga5.asm. It currently only draws the right hand side of the circle, but the other side will be analogous.

After some timings, I get about 111 cycles/pixel on the 8088 @ 4.77MHz, which is about the same as for verticalish lines. Of course it's only so low because I'm exploiting symmetry to draw two pixels at once.

I can probably shave a couple of cycles off the verticalish part of the code, but I don't see how to speed up the horizontalish part because I ran out of registers.

I'll hopefully fill in the other half of the circle tomorrow some time.

I've now added special versions of circle drawing for colours 0 and 3 (which just use AND and OR), an XOR version and a circle blanking routine. The timings (on a 4.77MHz 8088) are now:

I thought about the ellipse code to see if it is possible to do a version of that for any size ellipse which uses only registers and only (short) conditional jumps in the main loops. But so far I have not come up with a way to do it, at least not in a meaningful way. There just aren't enough registers, it seems.

Not sure I'll have time to look at this very soon, but I do like a challenge! Can I see what you've got so far? It seems like it ought to be possible. The technique that springs to mind is using an immediate operand in the actual code as a "register". Slower than an actual register, but faster than spilling it completely to memory.

patch:
    mov ax,9999          ; This value is patched
    add ax,9999          ; This value is also patched
    mov [cs:patch+1],ax
    add bx,ax

@reenigne The problem is that one needs 32 bit values. As I see it, there are at least three required. That's already 6 registers.

Here is some Julia code that shows the algorithm:

function ellipse(r::Int, s::Int)
    x = r
    y = 0
    r_orig = r
    c = s*s
    a = r*r
    D = 0
    xdelta = 2*c*r_orig
    ydelta = a
    while xdelta >= ydelta
        println("(", x, ", ", y, ")")
        D += ydelta; ydelta += 2a; y += 1
        if D >= div(xdelta, 2)
            xdelta -= c; D -= xdelta; xdelta -= c; x -= 1
        end
    end
    D = -D
    while xdelta >= 0
        println("(", x, ", ", y, ")")
        xdelta -= c; D += xdelta; xdelta -= c; x -= 1
        if D > div(ydelta, 2)
            D -= ydelta; ydelta += 2a; y += 1
        end
    end
end

Naturally c and a can go in the code in the way you suggest. But xdelta, ydelta and D need much more than 16 bits. There are some tricks to squeeze an extra bit or two out here and there, but I think there's no way around needing two registers for each. Then you need AX for an accumulator (both halves), DI for the offset, at least one for the odd/even line thing, another to exploit symmetry in the ellipse.

Perhaps xdelta, ydelta and D can each be done in 24 bits. Then it might be possible, but only if you drop the symmetry trick.
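That hunch can be brute-forced. Below is a Python port of the Julia routine above that tracks the largest magnitude reached by xdelta, ydelta and D instead of printing points (the port and the bound check are mine):

```python
def ellipse_max(r, s):
    """Port of the Julia ellipse routine, returning the largest
    magnitude reached by xdelta, ydelta or D over the whole run."""
    x, y = r, 0
    c, a = s * s, r * r
    D, xdelta, ydelta = 0, 2 * c * r, a
    biggest = 0
    while xdelta >= ydelta:
        D += ydelta; ydelta += 2 * a; y += 1
        if D >= xdelta // 2:
            xdelta -= c; D -= xdelta; xdelta -= c; x -= 1
        biggest = max(biggest, abs(xdelta), abs(ydelta), abs(D))
    D = -D
    while xdelta >= 0:
        xdelta -= c; D += xdelta; xdelta -= c; x -= 1
        if D > ydelta // 2:
            D -= ydelta; ydelta += 2 * a; y += 1
        biggest = max(biggest, abs(xdelta), abs(ydelta), abs(D))
    return biggest

# largest CGA-sized case: semiradii 160 x 100
print(ellipse_max(160, 100) < 2**23)  # True: fits in signed 24 bits
```

For the 160 x 100 case the maxima overflow 16 bits by a wide margin but stay comfortably inside a signed 24-bit range.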

The best code to look at is cga_draw_circle1/2 in CGA5.asm. It's precisely the algorithm in the Julia code above with r and s set to 6 and 5. (You can always replace semiradii r, s with r/d, s/d where d = gcd(r, s), though r_orig in the code above cannot be divided by d. Of course, if your ellipse has semiradii 158 and 99 then the gcd is 1, so this trick only saves you when d is large.)

Last edited by wbhart on 2019-10-04, 15:09. Edited 2 times in total.

Yeah, 24 bits seems to be enough, just. So if I didn't miscalculate, it is possible in terms of the number of registers if you drop the exploitation of symmetry. But this isn't a slam dunk if you want to only use short jumps. The code for updating 24 bit values is not cheap, especially as you have to do both signed and unsigned comparisons (the code for circles was already a stretch).

Even if it's possible, it's totally not clear to me if a cleaned up version of my original ellipse code (in CGA.asm) would be faster with spilling but exploitation of symmetry, or whether a version with no spilling and no symmetry would be better. As usual, one has to write both to figure that out.

Yes, I see what you mean. Then I only have to (explicitly) write them and not read them from memory.

I'm not sure how possible that would be since both are accessed multiple times in each iteration. If one of those were not inside a conditional block it might work, since you'd know where it would be accessed next.

It's even worse if you consider breaking up into 4 or 8 cases to handle the video side of things, as then there are two possible places where the value will be accessed next.

It looks like each iteration of the two inner loops separates fairly nicely into an "xdelta live" section and a "ydelta live" section, except for the fact that in the first loop you need to compare xdelta against ydelta. I wonder if you can get away with just comparing the most significant words of those, or if that causes visible glitches.

I will check, but all attempts to lose any bits have so far failed for me. I've tried all sorts of things. It really seems like all bits are critical for the algorithm.

But I now see what you mean about keeping xdelta live, etc.

Of course it may be possible to replace the first comparison with a number of iterations. I'm not sure how expensive that would be to compute. Roughly speaking it can be expressed as some condition on the derivative, which should be just a linear condition, so it probably involves one division. It's difficult to get it exactly right though.

I looked through a bunch of old papers on ellipse algorithms, but there are only ones that halve the maximum integer that appears. So they save me one whole bit. Whoopdidoo!

There's a relatively short list of ellipses that yield glitches when comparing the most significant words of xdelta and ydelta. Most could be handled by ellipse code that only worked for small semiradii, leaving a very short list which would have to be handled by more general code.

In fact, here is an exhaustive list of such semiradii (assuming maximum semiradii of 160, 100 of course).

Below is what each pixel of the verticalish part of the ellipse looks like (modulo errors), with the following registers:

    ; di, di+bx offsets of points above and below axis, ah: accumulator/scratch
    ; dx: deltax (hi16), sp: yinc, bp: deltay (hi16),
    ; al: D (lo8), ch: deltax (lo8), cl: deltay (lo8), si: D (hi16)

There are about 80 bytes there, which is just far too much. This would need to be well under 60 bytes for the conditional jumps to reach (there are four such sections, one for each value of the screen x coordinate modulo 4). Though I haven't ruled out that there is some clever rearrangement of the code that makes all the jumps reach.

If no such rearrangement exists, I think this implies that we need to do a more naive translation of the Julia code above (which I further simplified above), rather than splitting into cases for x mod 4. That of course means a higher cost for doing the mask and colour arithmetic, and also means storing at least 16 bits in cs as immediates to make space for both the mask and colour (and an extra byte for an accumulator).

The other option is to drop symmetry, but I think this still won't get us over the line.

ellipse1_jump2:
    mov ah, [di+bx]    ; draw pixel above axis
    and ah, 0cfh
ellipse1_patch1:
    or ah, 030h
    mov [di+bx], ah
    mov ah, [di]       ; draw pixel below axis
    and ah, 0cfh
ellipse1_patch2:
    or ah, 030h
    mov [di], ah

    add di, sp         ; update offset
ellipse1_patch3:
    xor sp, 01234h     ; update offset update for odd<->even
    sub bx, 80         ; decrement/increment y lines

    add ah, cl         ; D += dy
    adc si, bp
ellipse1_patch4:
    add cl, 012h       ; dy += 2r^2
ellipse1_patch5:
    add bp, 01234h

    shr dx, 1          ; if dx/2 <= D, decrement x
    sahf
    cmp dx, si
    jle ellipse1_x1

    lahf
    rcl dx, 1
    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump2
    jmp ellipse1_donev2 ; done verticalish

ellipse1_x2:
    lahf
    rcl dx, 1
ellipse1_patch6:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch7:
    sbb dx, 01234h
    sub al, ch         ; D -= dx
    sbb si, dx
ellipse1_patch8:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch9:
    sbb dx, 01234h

    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump2
    jmp ellipse1_donev2 ; done verticalish

I think I may have found a reordering of the code which brings the maximum conditional jump down from over 160 to just over 110 bytes. It looks promising, but I'll have to implement it to make sure I didn't screw up. I'll probably do that tomorrow.

Below is how the code works out for the verticalish part of the ellipse.

Basically, if we number code sections 1, 1a, 2, 2a, 3, 3a, 4, 4a where section n is the section that deals with pixels that have x-coord equal to n-1 mod 4, and the "a" sections are the alternative endings for those sections, then the ordering of the code below is:

2, 2a, 3, 1a, 3a, 4, 4a, 1

Section n has to jump back to the start of itself, and the end of section na has to jump there too. The end of section n also has to jump to the start of section (n-1)a.

So with this ordering, the maximum jump is from the end of 1a to the start of 1 or from the end of 3a to the start of 3, both of which are equal to two "a" sections plus one "non-a" section.

By my calculation a "non-a" section is 52 bytes and an "a" section is 28 bytes (29 in the case of 4a). That makes the longest jump the one from the end of 1a to the start of 1, which is 52 + 28 + 29 + 1 + 3 = 113 bytes, where the 1 is for alignment and the 3 is because the jump isn't quite at the end of a section.
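The layout arithmetic can be double-checked mechanically. A small Python sketch using the section sizes and jump rules stated above (the bookkeeping is mine; it ignores the alignment and end-of-section fudge, so it reports 109 rather than 113):

```python
# "non-a" sections are 52 bytes, "a" sections 28 (29 for 4a), per the post
sizes = {"1": 52, "2": 52, "3": 52, "4": 52,
         "1a": 28, "2a": 28, "3a": 28, "4a": 29}
order = ["2", "2a", "3", "1a", "3a", "4", "4a", "1"]

start, off = {}, 0
for name in order:
    start[name] = off
    off += sizes[name]

prev_a = {"1": "4a", "2": "1a", "3": "2a", "4": "3a"}  # (n-1)a, wrapping
jumps = []
for n in "1234":
    end_n = start[n] + sizes[n]
    jumps.append(abs(end_n - start[n]))          # n jumps back to itself
    jumps.append(abs(end_n - start[prev_a[n]]))  # end of n to (n-1)a
    end_na = start[n + "a"] + sizes[n + "a"]
    jumps.append(abs(end_na - start[n]))         # end of na back to n
print(max(jumps))  # 109: the 1a -> 1 (and 3a -> 3) jump
```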

The only thing that is worse than the circle code so far (apart from the huge number of patches and the fact that the computed jump is now a nightmare) is the 24 bit arithmetic, which is probably unavoidable, and the fact that there isn't a 16 bit accumulator free to move dx into so it can non-destructively be shifted right by 1. This necessitates doing the shift in dx itself, which in turn necessitates the use of lahf/sahf combinations to "save" the bit that is shifted out, so it can be shifted back in again.

I haven't managed to find a way around this, either mathematically or in code.

I guess the other slowdown is due to having to deal with the two symmetric points entirely separately, again due to only ah being available, and having to use ah instead of al, meaning stosb is not available, all due to needing lahf/sahf.

    ; verticalish part of circle
    ALIGN 2
ellipse1_jump2:
    mov ah, [di+bx]    ; draw pixel above axis
    and ah, 0cfh
ellipse1_patch1:
    or ah, 030h
    mov [di+bx], ah
    mov ah, [di]       ; draw pixel below axis
    and ah, 0cfh
ellipse1_patch2:
    or ah, 030h
    mov [di], ah

    add di, sp         ; update offset
ellipse1_patch3:
    xor sp, 01234h     ; update offset update for odd<->even
    sub bx, 80         ; decrement/increment y lines

    add al, cl         ; D += dy
    adc si, bp
ellipse1_patch4:
    add cl, 012h       ; dy += 2r^2
ellipse1_patch5:
    adc bp, 01234h

    shr dx, 1          ; if dx/2 <= D, decrement x
    lahf
    cmp dx, si
    jle ellipse1_x1

    sahf
    rcl dx, 1
    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump2
    jmp ellipse1_donev2 ; done verticalish

ellipse1_x2:
    sahf
    rcl dx, 1
ellipse1_patch6:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch7:
    sbb dx, 01234h
    sub al, ch         ; D -= dx
    sbb si, dx
ellipse1_patch8:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch9:
    sbb dx, 01234h

    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump2
    jmp ellipse1_donev2 ; done verticalish


    ALIGN 2
ellipse1_jump3:
    mov ah, [di+bx]    ; draw pixel above axis
    and ah, 0f3h
ellipse1_patch10:
    or ah, 0ch
    mov [di+bx], ah
    mov ah, [di]       ; draw pixel below axis
    and ah, 0f3h
ellipse1_patch11:
    or ah, 0ch
    mov [di], ah

    add di, sp         ; update offset
ellipse1_patch12:
    xor sp, 01234h     ; update offset update for odd<->even
    sub bx, 80         ; decrement/increment y lines

    add al, cl         ; D += dy
    adc si, bp
ellipse1_patch13:
    add cl, 012h       ; dy += 2r^2
ellipse1_patch14:
    adc bp, 01234h

    shr dx, 1          ; if dx/2 <= D, decrement x
    lahf
    cmp dx, si
    jle ellipse1_x2

    sahf
    rcl dx, 1
    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump3
    jmp ellipse1_donev3 ; done verticalish

ellipse1_x1:
    sahf
    rcl dx, 1
ellipse1_patch15:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch16:
    sbb dx, 01234h
    sub al, ch         ; D -= dx
    sbb si, dx
ellipse1_patch17:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch18:
    sbb dx, 01234h

    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump1
    jmp ellipse1_donev1 ; done verticalish


ellipse1_x3:
    sahf
    rcl dx, 1
ellipse1_patch19:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch20:
    sbb dx, 01234h
    sub al, ch         ; D -= dx
    sbb si, dx
ellipse1_patch21:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch22:
    sbb dx, 01234h

    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump3
    jmp ellipse1_donev3 ; done verticalish

    ALIGN 2
ellipse1_jump4:
    mov ah, [di+bx]    ; draw pixel above axis
    and ah, 0fch
ellipse1_patch23:
    or ah, 03h
    mov [di+bx], ah
    mov ah, [di]       ; draw pixel below axis
    and ah, 0fch
ellipse1_patch24:
    or ah, 03h
    mov [di], ah

    add di, sp         ; update offset
ellipse1_patch25:
    xor sp, 01234h     ; update offset update for odd<->even
    sub bx, 80         ; decrement/increment y lines

    add al, cl         ; D += dy
    adc si, bp
ellipse1_patch26:
    add cl, 012h       ; dy += 2r^2
ellipse1_patch27:
    adc bp, 01234h

    shr dx, 1          ; if dx/2 <= D, decrement x
    lahf
    cmp dx, si
    jle ellipse1_x3

    sahf
    rcl dx, 1
    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump4
    jmp ellipse1_donev4 ; done verticalish


ellipse1_x4:
    dec di
    sahf
    rcl dx, 1
ellipse1_patch28:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch29:
    sbb dx, 01234h
    sub al, ch         ; D -= dx
    sbb si, dx
ellipse1_patch30:
    sub ch, 012h       ; dx -= s^2
ellipse1_patch31:
    sbb dx, 01234h

    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump4
    jmp ellipse1_donev4 ; done verticalish

    ALIGN 2
ellipse1_jump1:
    mov ah, [di+bx]    ; draw pixel above axis
    and ah, 03fh
ellipse1_patch32:
    or ah, 0c0h
    mov [di+bx], ah
    mov ah, [di]       ; draw pixel below axis
    and ah, 03fh
ellipse1_patch33:
    or ah, 0c0h
    mov [di], ah

    add di, sp         ; update offset
ellipse1_patch34:
    xor sp, 01234h     ; update offset update for odd<->even
    sub bx, 80         ; decrement/increment y lines

    add al, cl         ; D += dy
    adc si, bp
ellipse1_patch35:
    add cl, 012h       ; dy += 2r^2
ellipse1_patch36:
    adc bp, 01234h

    shr dx, 1          ; if dx/2 <= D, decrement x
    lahf
    cmp dx, si
    jle ellipse1_x4

    sahf
    rcl dx, 1
    cmp dx, bp         ; check if done verticalish
    jae ellipse1_jump1
    jmp ellipse1_donev1 ; done verticalish