By the way, is self-modifying code the standard way to preserve SP if you want to use it as a general register, or is there somewhere else I can conveniently stash it?
You just have to put it somewhere where you'll be able to find it again afterwards. I usually just reserve some space in the code segment for that, and save/restore with "mov [cs:savedSP],sp" and "mov sp,[cs:savedSP]" respectively. I don't consider that self-modifying code, since "savedSP" isn't actually in the code itself (just the code segment). Though now you mention it, "mov [cs:savedSP+1],sp" and "savedSP: mov sp,9999" are slightly shorter and faster!
I just tried that second option in the "slow" code, actually. This allows me to save one mov to ds because I can use dx instead, which allows me to modify dh and dl separately without first moving to ax. But there's no change in timing, which is surprising. I assumed this was due to unaligned loop and jump targets on the 8086, but putting a nop before the loop didn't improve things. I don't think that would be the explanation on the 8088 anyway.
I also realised that I might make mouse control of a drawing program or a game more flaky if I clear the interrupts so I can do this. I have to split my code into versions that are good for small demo effects and code that is good for drawing programs and games.
That will certainly help my ellipse code a little. It won't help as much as I thought, as I forgot that I needed three extra registers there, not one, and I was already using some ugly hack to use the stack as a read only register. I will be able to remove that ugly hack though.
I thought by this you meant adjusting SS and SP so that the stack is in the same place as before (with an extra 0-15 bytes pushed) but SP has the fixed value that you need in the routine. Though that is quite a lot of up-front work to save a few cycles per iteration. Your hack is clever, but I suspect that "pop ax / push ax" will be faster than "cli / pop ax / dec sp / dec sp / sti" in practice, as the former is 6 bus cycles and the latter is 7.
I went by cycle counts in the 8086 book, but I doubt they are always correct. It was only a 1 or 2 cycle saving based on their figures, but it was probably a waste of effort.
I could probably do a circle routine entirely in registers, but general ellipses are just hard. I was completely unable to find an algorithm for that, so I just adapted the midpoint circle algorithm. Unfortunately, I have no idea why the final routine works, as I brute forced it in the end to get it to work, and was unable to prove that it was correct.
I went through a similar process writing Bresenham ellipse code many years ago (after seeing the circle algorithm in a book). I hacked on it until I got it to work but I didn't really understand why it worked. I showed it to my maths teacher and he pointed out that if you rewrite the circle equation x^2 + y^2 = r^2 as (x/r)^2 + (y/r)^2 = 1 then the ellipse equation is (x/a)^2 + (y/b)^2 = 1. From that, I was able to determine that my ellipse code was correct. Unfortunately a circle routine isn't much use on CGA as it doesn't have square pixels!
I was able to derive a version of the midpoint algorithm for ellipses ok, using precisely that observation, but the issue I had was more subtle. I found that the code was not completing the ellipse. It would stop short because it would reach the scanline where the top of the ellipse was and not continue along that scanline until the middle of the top of the ellipse. The reason for this is that I didn't want to use an extra register to store where the ellipse should stop. I wanted the iteration to just automagically stop in the right place. That required more than a day of effort trying things out until I managed to get it to work.
By the way, I think the pixels are roughly 6 to 5 in ratio. These are very small values, and that is actually what determines whether I need all the extra registers in my ellipse code. I can't handle an ellipse with a semimajor to semiminor axis ratio of 319 to 199, but a ratio of 6 to 5 should be just fine.
I can't remember what the exact requirement is, but it is something like max(a*b^2, a^2*b) < 256.
The true benefit to unrolling isn't just getting rid of jumps, it's the fact that you can simplify the code by making it less general (essentially storing more bits of the program state in the instruction pointer). For example, if you have four copies of the inner loop, one for each of the four pixel positions within a byte, then you can get rid of all the rol/ror stuff and have the mask and OR bits directly in the code.
Yes, I see now. It's the same principle as a sprite compiler essentially. On the other hand, putting an explicit mask value in the code surely takes the same number of cycles as a ror 1, which I think is only 4. And one can't hard code the colour values because they aren't known (unless you duplicate all the code for every colour).
It may also be possible to improve speed by avoiding excess VRAM writes. VRAM writes are quite slow (almost 6 cycles of wait state per byte on average on a 4.77MHz 8088 with IBM CGA). So if you can organise the line drawing routine to only write to VRAM once you're done with that entire byte then that could speed things up. Or the overhead of keeping track might outweigh the savings - it's hard to tell without writing it.
I suspect this would speed it up for lines that are quite horizontal and slow it down for ones that are less horizontal. I've got to read and comment my "fast" line drawing code to remind myself what it now does. I definitely have tried things of that nature, but I don't remember exactly which tricks I am currently using. Either way, it is clear I can do better now.
Any big savings in the "fast" code, or an unrolled version of the "slow" code would surely now come from eliminating the conditional jump somehow. At the very least, two version of the code could exist, one for where the jump is currently almost always taken, and another for where it is almost always not. I spent some time last night trying to think of a way to combine it with the loop, since one can reorder the code so that the jump actually goes directly to the loop. But in the end I just concluded this was effectively the same effect as unrolling, even if it were possible.