VOGONS

Common searches


CGA Graphics library

Topic actions

First post, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Recently I've been teaching myself how to program the 8088/86 processor and the CGA card.

I've been writing some really fast routines for drawing lines, circles and ellipses, with and without clipping. They are much faster than the Borland Graphics Interface, so I thought I'd share them in case there are others here interested in this stuff.

As I've gotten into it, I've realised I've overengineered this stuff and the code alone is taking up too much memory. So I've started trying to write shorter routines with maybe 80% of the performance. This morning I wrote a routine for drawing a line that is a third of the size of my original code, but still 80% of the performance.

You can find the code here:

https://github.com/wbhart/CGAGraphics

Most of the routines are in cga.asm and cgademo.c. The new stuff is in cga2.asm and cgademo2.c, but there's only a tiny bit there so far.

I also made a few videos about some effects that I've programmed up with this code:

https://youtu.be/ojOq6c0wIrg - CGA Demo including clipped ellipses
https://youtu.be/gJCZrNeFlY0 - Starfield Demo
https://youtu.be/WUPxpACbE9k - Fast vertical lines (code analysis)
https://youtu.be/gO8mAd1N4ks - 3D Animated Tetrahedron (lines in space)

YouTube Channel - PCRetroTech

Reply 1 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie

Very nice indeed! I always enjoy a good bit of highly optimised 8088 assembly, and CGA Bresenham line drawing routines are really fun too.

Looking at your implementation, it's pretty good already but I can spot a fair few possible improvements. Consider having separate routines for drawing and erasing lines. Erasing can be a lot faster since you don't have to read bytes from VRAM, mask them and then write them back - you can just stosb the entire byte and move horizontally 4 pixels at a time instead of 1. Though that does mean erasing and redrawing everything on the screen at once, unless you know that the stuff that you don't want to erase is in different bytes than the stuff you do.

There's probably a faster way to change the address when you move up or down. Consider moving down, the address changes by +8192, +80-8192. +8192, +80-8192. The delta just keeps alternating between those two values, so there is no need to do an AND, just an "add di,delta". And the delta for the next line can be computed by "delta = 80-delta" or "sub delta, 80 / neg delta" in assembly. Since you no longer have any local variables in the inner loop, you can keep delta in bp and then you don't need to use ax for that part at all, which also means that you don't need to mess around with ds.

There are some even more significant speedups to be had if you're prepared to sacrifice some memory to unroll things. I have some code somewhere for a line drawing routine that is 6 instructions per pixel for horizontal-ish lines and 8 instructions per pixel for vertical-ish lines. Though those use XOR plotting - pixel replacement would be a bit slower. And they also use the stack pointer as a general register so have to run with interrupts turned off which makes them unsuitable for a lot of applications. The inner loops total 6800 bytes of code.

Reply 2 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Thanks very much for taking a look at my code. Those are some really cool suggestions.

I must admit I don't actually know how many cycles per pixel the fast version is at. Roughly speaking I think I can do around 200 horizontalish lines per second on an 8MHz CPU, with each line being 320 pixels. So that's clearly quite a lot of cycles.

Your idea for stepping down the lines is particularly clever, and I really don't know why I didn't think of that. But I think you may be looking at the slow version of the code. I don't think I have any registers spare in the fast code. I do kind of unroll by 2 in the fast version, but I can't go any further with that due to the cost of long jumps. It wasn't until the 386 I think, that jumps weren't such a problem. So although this idea is good for the slow version of the code, I don't think I can use it in the fast version. But I'll think about it. Still, it would be cool if the "slow" version ended up being just as fast as the "fast" version with some additional tricks.

I guess for the rotating tetrahedron demo I could use your line erasing idea. That might be problematic around the corners of the window, though as you say, that could move by 4 pixels at a time. I didn't give myself enough time to write that, so clearing around the edges of the window if it moved 4 pixels at a time was just going to be too much code to write to get my video out this weekend. In the end, the thing doesn't look very impressive because it is rotating by a very small amount and the window is shifting by one pixel at a time, so it looks quite slow. I think it will look more dramatic if everything is moving further each iteration.

I guess I originally set out to write a fairly general purpose library for CGA graphics. I had in mind a drawing program (that could be used to make graphics for games). PC Paintbrush is quite good, but it's hard to add new tools. So I vaguely thought I might write my own. But obviously if you are writing demo effects, you can do lots of tricks that you can't do in general purpose code (most of which I don't actually know).

I think I might eventually do a followup to my original rotating tetrahedron video. I certainly wanted to show the difference when the thing was made really smooth, but I've also thought of some other things I want to try out.

YouTube Channel - PCRetroTech

Reply 3 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

By the way, I had concluded at some point that using SP as a general register was impossible. I will have to revisit that now, as my ellipse code really needed an extra register to be truly fast. And once you are short one register, you are short two. I have a super fast version of my ellipse code that only works for tiny ellipses, but possibly with the extra register I can finally sort that out for big ellipses.

It's a shame there is not more 8088/8086 optimisation info online. Yours and trixter's blog articles around 8088 mph are about the best sources of info I've encountered. At least they provided some inspiration for the sprite code I wrote a while back.

YouTube Channel - PCRetroTech

Reply 4 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

Thanks very much for taking a look at my code. Those are some really cool suggestions.

Glad you like them!

wbhart wrote:

I must admit I don't actually know how many cycles per pixel the fast version is at. Roughly speaking I think I can do around 200 horizontalish lines per second on an 8MHz CPU, with each line being 320 pixels. So that's clearly quite a lot of cycles.

That's 125 cycles per pixel then, which seems roughly right from looking at the code. You can do some timing (latch and read from PIT channel 0 in mode 2) if you want to measure it more precisely. To get a rough idea, what I usually do is add up the number of bytes of instructions executed, memory bytes read/written and IO port bytes read/written (i.e. the total number of bus cycles). That usually seems to get the right answer for "which of these two pieces of code is faster", at least on 8088.

wbhart wrote:

Your idea for stepping down the lines is particularly clever, and I really don't know why I didn't think of that.

I thought of an improvement: you can do "xor delta, -16304" instead of "sub delta, 80 / neg delta" to switch between the odd->even and even->odd deltas. If you are short of register space you can store "delta" in the code itself, as the immediate operand of the "add di,delta" instruction.

wbhart wrote:

But I think you may be looking at the slow version of the code.

I was, since it sounded like that was what you had switched your focus to.

wbhart wrote:

I don't think I have any registers spare in the fast code. I do kind of unroll by 2 in the fast version, but I can't go any further with that due to the cost of long jumps.

If you unroll by 2 but have the two copies of the routine be for even and odd scanlines respectively that might be a good optimisation.

Another improvement you could make to the fast version is permuting the registers around - if you swap bp and dx, then dx and cx you can use loop instead of dec/jnz.

wbhart wrote:

I guess I originally set out to write a fairly general purpose library for CGA graphics. I had in mind a drawing program (that could be used to make graphics for games). PC Paintbrush is quite good, but it's hard to add new tools. So I vaguely thought I might write my own. But obviously if you are writing demo effects, you can do lots of tricks that you can't do in general purpose code (most of which I don't actually know).

Yeah, for the really fast stuff it pays to write a routine customised to exactly what you're doing.

wbhart wrote:

I think I might eventually do a followup to my original rotating tetrahedron video. I certainly wanted to show the difference when the thing was made really smooth, but I've also thought of some other things I want to try out.

I look forward to seeing the results!

wbhart wrote:

It's a shame there is not more 8088/8086 optimisation info online. Yours and trixter's blog articles around 8088 mph are about the best sources of info I've encountered.

This is something that I'd like to do something about at some point! But there are so many projects and not enough time to tackle them all...

Reply 5 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie
reenigne wrote:
wbhart wrote:

I don't think I have any registers spare in the fast code. I do kind of unroll by 2 in the fast version, but I can't go any further with that due to the cost of long jumps.

If you unroll by 2 but have the two copies of the routine be for even and odd scanlines respectively that might be a good optimisation.

Another improvement you could make to the fast version is permuting the registers around - if you swap bp and dx, then dx and cx you can use loop instead of dec/jnz.

In the fast version I do handle odd and even scanlines separately in the verticalish line code, in separate loops (if I recall correctly). Permuting the registers is a good idea though. For the horizontalish code it's better to focus on trying to do two pixels horizontally in a row, I think.

However, I am not sure if we are using the words horizontalish and verticalish the same way. I'm sure you could probably do better for lines that move at least two pixels to the right for every pixel down they move. But at this point I'm definitely looking to save memory, rather than time.

reenigne wrote:

I thought of an improvement: you can do "xor delta, -16304" instead of "sub delta, 80 / neg delta" to switch between the odd->even and even->odd deltas. If you are short of register space you can store "delta" in the code itself, as the immediate operand of the "add di,delta" instruction.

Ah, that's another one I missed. Funnily I remember thinking about something like this way back, but I was focused on the value itself, not the delta.

Of course, the slow version of the code handles both positive and negative slope horizontalish lines, so this trick doesn't quite work for both cases. But of course one can just use self-modifying code. I think that might actually be the only way to do it. I couldn't think of a fast bit twiddle that would work for both cases.

By the way, is self-modifying code the standard way to preserve SP if you want to use it as a general register, or is there somewhere else I can conveniently stash it? That will certainly help my ellipse code a little. It won't help as much as I thought, as I forgot that I needed three extra registers there, not one, and I was already using some ugly hack to use the stack as a read only register. I will be able to remove that ugly hack though.

I could probably do a circle routine entirely in registers, but general ellipses are just hard. I was completely unable to find an algorithm for that, so I just adapted the midpoint circle algorithm. Unfortunately, I have no idea why the final routine works, as I brute forced it in the end to get it to work, and was unable to prove that it was correct.

YouTube Channel - PCRetroTech

Reply 6 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I tried your xor trick and the times for the "slow" and "fast" versions are now the same.

After a short conversation with myself, I convinced myself to be happy about that.

It's now clear the fast version can be sped up, e.g. by saving SP and using the same trick in that code or unrolling the "slow" version, and by breaking into separate cases based on slope: [0,0.5), [0.5-1), etc. The thing is, I'm not really sure how much time will be saved by unrolling. I loose only 8 cycles per iteration plus a little more for one less instruction, don't I? So maybe 5-6% faster? That would be much less than applying the deltay/xor trick to the original "fast" code.

Anyhow, both versions are now 200 lines of 320 pixels in 1s on the 8MHz 8086 or 2s on the 4.77MHz 8088. So that's 150 cycles per pixel on the 8088.

YouTube Channel - PCRetroTech

Reply 7 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

In the fast version I do handle odd and even scanlines separately in the verticalish line code, in separate loops (if I recall correctly).

Ah, it didn't look like you were using "add di,8192" in one case and "sub di,8192-80" in the other - unless I'm mistaken that's a relatively simple optimisation then.

wbhart wrote:

However, I am not sure if we are using the words horizontalish and verticalish the same way.

I'm using verticalish to mean that you move 1 or more pixels vertically for every one that you move horizontally, and horizontalish (more horizontal than vertical) for the other way around.

wbhart wrote:

I'm sure you could probably do better for lines that move at least two pixels to the right for every pixel down they move. But at this point I'm definitely looking to save memory, rather than time.

In that case I won't spend a lot of energy on shaving off cycles at the expense of larger code or tables! Of course if you really want to get into sizecoding then you'd want to write the whole thing in assembly.

wbhart wrote:

Of course, the slow version of the code handles both positive and negative slope horizontalish lines, so this trick doesn't quite work for both cases. But of course one can just use self-modifying code. I think that might actually be the only way to do it. I couldn't think of a fast bit twiddle that would work for both cases.

Yeah, it's just a question of setting the initial delta (+8192, -8192+80, -8192 or +8192-80) at the start of the routine and the XOR value (-80 or -16304) depending on which direction you're going.

wbhart wrote:

By the way, is self-modifying code the standard way to preserve SP if you want to use it as a general register, or is there somewhere else I can conveniently stash it?

You just have to put it somewhere where you'll be able to find it again afterwards. I usually just reserve some space in the code segment for that, and save/restore with "mov [cs:savedSP],sp" and "mov sp,[cs:savedSP]" respectively. I don't consider that self-modifying code, since "savedSP" isn't actually in the code itself (just the code segment). Though now you mention it, "mov [cs:savedSP+1],sp" and "savedSP: mov sp,9999" are slightly shorter and faster!

wbhart wrote:

That will certainly help my ellipse code a little. It won't help as much as I thought, as I forgot that I needed three extra registers there, not one, and I was already using some ugly hack to use the stack as a read only register. I will be able to remove that ugly hack though.

I thought by this you meant adjusting SS and SP so that the stack is in the same place as before (with an extra 0-15 bytes pushed) but SP has the fixed value that you need in the routine. Though that is quite a lot of up-front work to save a few cycles per iteration. Your hack is clever, but I suspect that "pop ax / push ax" will be faster than "cli / pop ax / dec sp / dec sp / sti" in practice, as the former is 6 bus cycles and the latter is 7.

wbhart wrote:

I could probably do a circle routine entirely in registers, but general ellipses are just hard. I was completely unable to find an algorithm for that, so I just adapted the midpoint circle algorithm. Unfortunately, I have no idea why the final routine works, as I brute forced it in the end to get it to work, and was unable to prove that it was correct.

I went through a similar process writing Bresenham ellipse code many years ago (after seeing the circle algorithm in a book). I hacked on it until I got it to work but I didn't really understand why it worked. I showed it to my maths teacher and he pointed out that if you rewrite the circle equation x^2 + y^2 = r^2 as (x/r)^2 + (y/r)^2 = 1 then the ellipse equation is (x/a)^2 + (y/b)^2 = 1. From that, I was able to determine that my ellipse code was correct. Unfortunately a circle routine isn't much use on CGA as it doesn't have square pixels!

wbhart wrote:

It's now clear the fast version can be sped up, e.g. by saving SP and using the same trick in that code or unrolling the "slow" version, and by breaking into separate cases based on slope: [0,0.5), [0.5-1), etc. The thing is, I'm not really sure how much time will be saved by unrolling. I loose only 8 cycles per iteration plus a little more for one less instruction, don't I? So maybe 5-6% faster? That would be much less than applying the deltay/xor trick to the original "fast" code.

The true benefit to unrolling isn't just getting rid of jumps, it's the fact that you can simplify the code by making it less general (essentially storing more bits of the program state in the instruction pointer). For example, if you have four copies of the inner loop, one for each of the four pixel positions within a byte, then you can get rid of all the rol/ror stuff and have the mask and OR bits directly in the code.

It may also be possible to improve speed by avoiding excess VRAM writes. VRAM writes are quite slow (almost 6 cycles of wait state per byte on average on a 4.77MHz 8088 with IBM CGA). So if you can organise the line drawing routine to only write to VRAM once you're done with that entire byte then that could speed things up. Or the overhead of keeping track might outweigh the savings - it's hard to tell without writing it.

Reply 8 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie
reenigne wrote:
wbhart wrote:

By the way, is self-modifying code the standard way to preserve SP if you want to use it as a general register, or is there somewhere else I can conveniently stash it?

You just have to put it somewhere where you'll be able to find it again afterwards. I usually just reserve some space in the code segment for that, and save/restore with "mov [cs:savedSP],sp" and "mov sp,[cs:savedSP]" respectively. I don't consider that self-modifying code, since "savedSP" isn't actually in the code itself (just the code segment). Though now you mention it, "mov [cs:savedSP+1],sp" and "savedSP: mov sp,9999" are slightly shorter and faster!

I just tried that second option in the "slow" code, actually. This allows me to save one mov to ds because I can use dx instead, which allows me to modify dh and dl separately without first moving to ax. But there's no change in timing, which is surprising. I assumed this was due to unaligned loop and jump targets on the 8086, but putting a nop before the loop didn't improve things. I don't think that would be the explanation on the 8088 anyway.

I also realised that I might make mouse control of a drawing program or a game more flaky if I clear the interrupts so I can do this. I have to split my code into versions that are good for small demo effects and code that is good for drawing programs and games.

reenigne wrote:
wbhart wrote:

That will certainly help my ellipse code a little. It won't help as much as I thought, as I forgot that I needed three extra registers there, not one, and I was already using some ugly hack to use the stack as a read only register. I will be able to remove that ugly hack though.

I thought by this you meant adjusting SS and SP so that the stack is in the same place as before (with an extra 0-15 bytes pushed) but SP has the fixed value that you need in the routine. Though that is quite a lot of up-front work to save a few cycles per iteration. Your hack is clever, but I suspect that "pop ax / push ax" will be faster than "cli / pop ax / dec sp / dec sp / sti" in practice, as the former is 6 bus cycles and the latter is 7.

I went by cycle counts in the 8086 book, but I doubt they are always correct. It was only a 1 or 2 cycle saving based on their figures, but it was probably a waste of effort.

reenigne wrote:
wbhart wrote:

I could probably do a circle routine entirely in registers, but general ellipses are just hard. I was completely unable to find an algorithm for that, so I just adapted the midpoint circle algorithm. Unfortunately, I have no idea why the final routine works, as I brute forced it in the end to get it to work, and was unable to prove that it was correct.

I went through a similar process writing Bresenham ellipse code many years ago (after seeing the circle algorithm in a book). I hacked on it until I got it to work but I didn't really understand why it worked. I showed it to my maths teacher and he pointed out that if you rewrite the circle equation x^2 + y^2 = r^2 as (x/r)^2 + (y/r)^2 = 1 then the ellipse equation is (x/a)^2 + (y/b)^2 = 1. From that, I was able to determine that my ellipse code was correct. Unfortunately a circle routine isn't much use on CGA as it doesn't have square pixels!

I was able to derive a version of the midpoint algorithm for ellipses ok, using precisely that observation, but the issue I had was more subtle. I found that the code was not completing the ellipse. It would stop short because it would reach the scanline where the top of the ellipse was and not continue along that scanline until the middle of the top of the ellipse. The reason for this is that I didn't want to use an extra register to store where the ellipse should stop. I wanted the iteration to just automagically stop in the right place. That required more than a day of effort trying things out until I managed to get it to work.

By the way, I think the pixels are roughly 6 to 5 in ratio. These are very small values, and that is actually what determines whether I need all the extra registers in my ellipse code. I can't handle an ellipse with a semimajor to semiminor axis ratio of 319 to 199, but a ratio of 6 to 5 should be just fine.

I can't remember what the exact requirement is, but it is something like max(a*b^2, a^2*b) < 256.

reenigne wrote:

The true benefit to unrolling isn't just getting rid of jumps, it's the fact that you can simplify the code by making it less general (essentially storing more bits of the program state in the instruction pointer). For example, if you have four copies of the inner loop, one for each of the four pixel positions within a byte, then you can get rid of all the rol/ror stuff and have the mask and OR bits directly in the code.

Yes, I see now. It's the same principle as a sprite compiler essentially. On the other hand, putting an explicit mask value in the code surely takes the same number of cycles as a ror 1, which I think is only 4. And one can't hard code the colour values because they aren't known (unless you duplicate all the code for every colour).

reenigne wrote:

It may also be possible to improve speed by avoiding excess VRAM writes. VRAM writes are quite slow (almost 6 cycles of wait state per byte on average on a 4.77MHz 8088 with IBM CGA). So if you can organise the line drawing routine to only write to VRAM once you're done with that entire byte then that could speed things up. Or the overhead of keeping track might outweigh the savings - it's hard to tell without writing it.

I suspect this would speed it up for lines that are quite horizontal and slow it down for ones that are less horizontal. I've got to read and comment my "fast" line drawing code to remind myself what it now does. I definitely have tried things of that nature, but I don't remember exactly which tricks I am currently using. Either way, it is clear I can do better now.

Any big savings in the "fast" code, or an unrolled version of the "slow" code would surely now come from eliminating the conditional jump somehow. At the very least, two version of the code could exist, one for where the jump is currently almost always taken, and another for where it is almost always not. I spent some time last night trying to think of a way to combine it with the loop, since one can reorder the code so that the jump actually goes directly to the loop. But in the end I just concluded this was effectively the same effect as unrolling, even if it were possible.

YouTube Channel - PCRetroTech

Reply 9 of 85, by pan069

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

I went by cycle counts in the 8086 book, but I doubt they are always correct. It was only a 1 or 2 cycle saving based on their figures, but it was probably a waste of effort.

Just skimming over this thread; going by cycle counts from manuals to gauge performance is usually not a good idea. Those counts normally don't take into account the time of instruction fetching and bus wait-states (e.g. writing to video memory).

Checkout Michael Abrash's Graphics Programming Black Book, it discusses these topics in-depth:

https://www.phatcode.net/res/224/files/html/index.html

Specifically have a read through the following chapter:

Chapter 4 - In the Lair of the Cycle-Eaters
https://www.phatcode.net/res/224/files/html/ch04/04-01.html

By the way, great work!

Reply 11 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

@reenigne I tried your xor trick in my "fast" code and the result was dramatic. It shaved nearly a quarter second off on the 8088 for horizontalish lines. My timing probably isn't very accurate (I'm still too lazy to hook up Abrash's zen timer to my code.) But the difference is actually really noticeable.

I would guess it's somewhere around 110-120 cycles per pixel now. But I now have a feeling that ~75 is possible for general (non-xored, any colour over any background image) lines, at least on the 8086. Of course I've realised the colour is irrelevant since you can just patch the colours in with self-modification. That should probably be the next thing I try.

Tomorrow I will try an unroll of the "slow" code along these lines. It might take me more than an evening to write that code though.

Your xor trick probably allows me to save some more cycles now, as my short jumps have some extra room. I remember having to make some compromises there.

There is an algorithm for line drawing that tries to draw multiple pixels at once, essentially picking from a bunch of patterns. But I decided that it would be too hard to implement on the CGA card. I think it is called Xiaolin Wu's algorithm.

YouTube Channel - PCRetroTech

Reply 14 of 85, by VileR

User metadata
Rank l33t
Rank
l33t

Excellent project! I might have an actual use for something like this in the foreseeable future. 😉

Speaking of Abrash, here's another article of his that may be of interest for you. It's a sprite/tile engine, but still CGA, and some of the considerations discussed are of course always relevant.
https://archive.org/details/PC_Tech_Journal_v … 4_n08/page/n127

[ WEB ] - [ BLOG ] - [ TUBE ] - [ CODE ]

Reply 15 of 85, by reenigne

User metadata
Rank Oldbie
Rank
Oldbie
wbhart wrote:

I just tried that second option in the "slow" code, actually. This allows me to save one mov to ds because I can use dx instead, which allows me to modify dh and dl separately without first moving to ax. But there's no change in timing, which is surprising. I assumed this was due to unaligned loop and jump targets on the 8086, but putting a nop before the loop didn't improve things. I don't think that would be the explanation on the 8088 anyway.

No, the 8088 doesn't have unaligned address penalties like the 8086 does. There are a few possibilities why it might not have improved anything even if you decreased the number of bus cycles. The IBM CGA's wait states, for one thing, align to 16-hdot boundaries. And sometimes DRAM refresh DMAs can have a similar effect.

wbhart wrote:

I went by cycle counts in the 8086 book, but I doubt they are always correct. It was only a 1 or 2 cycle saving based on their figures, but it was probably a waste of effort.

Yes, to elaborate on what pan069 said - the published cycle counts are best case and don't take into account interactions with the prefetcher which can make quite a bit of difference.

wbhart wrote:

By the way, I think the pixels are roughly 6 to 5 in ratio.

Yes, that is the nominal value I usually use (assumes the full 320x200 active area has a 4:3 aspect ratio).

wbhart wrote:

Yes, I see now. It's the same principle as a sprite compiler essentially. On the other hand, putting an explicit mask value in the code surely takes the same number of cycles as a ror 1, which I think is only 4.

A "ror ch,1" is a two byte instruction, so on an 8088 it's 8 cycles for the instruction fetch (which is usually the bottleneck there).

wbhart wrote:

And one can't hard code the colour values because they aren't known (unless you duplicate all the code for every colour).

Or patch them up with self-modifying code.

wbhart wrote:

I suspect this would speed it up for lines that are quite horizontal and slow it down for ones that are less horizontal.

Yes, I've fallen down this rabbit hole in the past too - in pursuit of the fastest possible code it's very easy to neglect one case for another, or end up with an explosion of routines, each optimal for a particular case.

Reply 16 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I tried unrolling the "slow" version of the code by 4. I didn't bother writing the fixup at the end since it isn't needed to get an idea of the speed. At worst the last three pixels are not drawn.

I used a computed jump into the unrolled loop and removed all the obvious bits that were no longer necessary, using constants for the colour mask and bits.

It's still necessary to decrement di as stosb increments it. But otherwise it looks quite short.

The speed is basically the same as my "fast" code now. I tried moving some instructions around in case it helped with filling the prefetch at the right times, but that made no difference.

For nearly horizontal lines the jumps are going to almost always be taken, which is slow, and for less horizontal lines the jumps won't be taken but it will instead have to increment y using the xor trick, which compensates for not having taken the jump. As far as I can see, the performance is quite smooth as a function of slope.

It's not at all clear what cycle eater is still lurking now. Each of the four parts of the loop is now just 20 bytes, though most of this is inside the conditional jump sections and not always run.

One possibility remaining is to only issue stosb inside the condition jump sections and have the bits to be written to screen memory accumulate somehow in al. But I think this won't be faster because of all the extra bit twiddling that would be needed. My fast code already bundled two writes together when it can.

I'll clean the code up and add it to the GitHub repo, hopefully tomorrow.

YouTube Channel - PCRetroTech

Reply 17 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

I've uploaded cga4.asm to my GItHub which has the unrolled "slow" version of the code. It doesn't bother with the last few pixels and it only handles horizontalish lines with positive slope (increasing y).

I think I may have an idea of how to combine reads and writes to the CGA memory for multiple pixels at once. I'll implement that and give it a try.

YouTube Channel - PCRetroTech

Reply 18 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

Success!!

The file cga5.asm in my GitHub repo (see first post) is by far the fastest on the 8088. It takes only 1s to do 200 horizontalish lines of 320 pixels. That's about 75 cycles per pixel, which matches what I predicted was possible.

I haven't tested it extensively, but it should be roughly correct. Next week some time I'll finish off the TODO's which won't affect the speed but which will make it more general.

I would be very surprised if it is possible to do better in this generality. I guess that stands as a challenge now, as I don't think I can do substantially better. Of course this is using numerous tricks by reenigne. The only thing that I can still think of is superoptimisation to try and keep the prefetch queue full. But there isn't so much flexibility there, and I would have to set up accurate timing to do this. I doubt it will make a substantial difference.

YouTube Channel - PCRetroTech

Reply 19 of 85, by wbhart

User metadata
Rank Newbie
Rank
Newbie

@VileRancour Wow! That is one hell of an article by Abrash, complete with listings! Thanks very much for pointing it out.

I'm curious what other things were in that journal now.

YouTube Channel - PCRetroTech