I really don't have any explanation for it. Nothing makes sense: I can see the output on the screen, so it isn't as if I could have been running the wrong executable. And it was the first time I'd had it working at all, so I don't see how it could have been the wrong code. It's mystifying.
Maybe the random number generator that you are using makes longer lines with some seeds than with others (the LCG generators that are normally used for random() do have problems like this sometimes).
I'm not drawing random lines, unfortunately. I am still mystified by this. I've been over all the possibilities I can think of and still don't see how any of them is possible. The best I can come up with is that I timed it at 19.7s three times in a row, noted with satisfaction that this is faster than the horizontalish lines in absolute time (which it really isn't; it only looks faster because the verticalish lines are 200 pixels long and the horizontalish ones 320), and then forgot the number I had measured three times while I walked down the hallway to type it into this forum. I can just about exclude every other possibility! I am really genuinely mystified. I mean, I checked it three times!
Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.
The direction flag controls whether DI is incremented or decremented by STOSB (and similarly for other string instructions). Normally it's left clear so that DI is incremented.
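In code it looks like this (NASM syntax; the value loaded into AL is just an arbitrary byte for the example):

    cld                 ; DF = 0: string instructions auto-increment
    mov al, 0x0F        ; arbitrary byte to store
    stosb               ; writes AL to ES:[DI], then DI = DI + 1
    std                 ; DF = 1: string instructions auto-decrement
    stosb               ; writes AL to ES:[DI], then DI = DI - 1
    cld                 ; convention is to leave DF clear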
Sure, which is why it totally shouldn't be used to describe mov mem/reg, mem/reg. It had me believing that reg would be on the left if the direction flag was 0 and on the right if 1. But that is not what they mean at all. There is a bit in the instruction encoding itself that they are calling the direction flag. It actually says, "d is the direction flag. If d = 0...." To compound things, I was unable to figure out why the number of bytes in my computed jump was off by two. I eventually decided that it must be because the assembler emits a std before the mov and a cld after it. Of course this is totally illogical, but it fit all the evidence I had at the time. I should have checked more carefully, e.g. by disassembling the output.
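For anyone else bitten by this, the bit they mean shows up directly in the opcode byte (NASM source; the encodings are from the 8086 manual's table):

    mov [bx], ax        ; assembles to 89 07 - opcode pattern 100010dw
                        ; with d=0, w=1, so the reg field (AX) is the source
    mov ax, [bx]        ; assembles to 8B 07 - same instruction with d=1,
                        ; so the reg field (AX) is the destination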
As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.
Perhaps! Though the programmer has very little control over this. About all you can do, I think, is to try rearranging things and see if it makes it faster. And because the routine is already quite highly constrained, about the only rearrangement that I can see is moving the "add dx,bp" lines above the "mov [bx+di],al" lines, the "stosb" lines, or both.
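Schematically, for one unrolled step (this is just the shape of the rearrangement, not your actual routine; I'm guessing at the register assignments):

    ; current order:
    mov [bx+di], al     ; pixel write (CGA access)
    stosb               ; pixel write in the other bank (CGA access)
    add dx, bp          ; update the error accumulator
    ; rearranged:
    add dx, bp          ; error update hoisted above the writes, which
    mov [bx+di], al     ; shifts the CGA accesses to different cycles
    stosb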
Yes, I will certainly try some superoptimisation. It didn't seem to affect the horizontalish lines at all, and aligning the loop targets to 16-bit boundaries also does nothing for either the horizontalish or the verticalish lines. Presumably the prefetch queue is already full, so this isn't a problem at this point, even with the additional bus access.
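By aligning I just mean the obvious thing, e.g. in NASM (the loop body here is schematic, not the real routine):

    align 2, db 0x90    ; pad with a NOP so the loop target starts on an even address
    lineloop:
    stosb               ; schematic loop body
    loop lineloop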
I note that the number of bus accesses is probably much lower than the number of CPU cycles required for the instructions in this case. I'm also not sure I understand when I should add bus access times to the instruction timings. Is that only if there is a holdup because of too many bus accesses? Or does the CPU always incur these costs regardless? If the latter, then I am having trouble making the timings work out, as the measured time seems to be entirely accounted for by the CPU instruction timings alone! And we are using some pretty hefty instructions.
What's not clear to me is at what point during the execution of a memory access instruction the CPU makes the request on the bus. This seems relevant because different memory access instructions take different numbers of cycles, so whether the wait state occurs at the beginning or the end of the instruction will affect the synchronization.
I can make some cycle-by-cycle traces if you like. But I'm not sure how enlightening they will be. The cycle that you consider the "start" of an instruction or the start of a bus cycle is a bit arbitrary. And the rules are annoyingly complex, and sometimes get broken by DRAM refresh DMAs anyway.
Logic dictates it must be towards the end of the instruction, as the EA needs to be computed and so on. But it does seem like writes can take a cycle more than reads from memory, if the documented mov mem/reg, mem/reg timings (9+EA for mov mem,reg versus 8+EA for mov reg,mem) are anything to go by.
I was thinking that a 5/5/6 cycle cadence would be optimal for CGA accesses, based on the wait state information in your blog. That seems to imply that if the code is timed to request data from the CGA memory 5, 10 or 16 cycles after an in-phase access, that would be optimal, given that we cannot insert the ideal fractional cycle between accesses (and DMA refresh probably ruins optimality anyway).
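If I have the arithmetic right (my own back-of-the-envelope, so treat with suspicion): the CGA accepts one CPU access per lchar, i.e. per 16 hdots of the 14.318 MHz master clock, while one CPU cycle at 4.77 MHz is 3 hdots. An access slot therefore comes around every 16/3 ≈ 5.33 CPU cycles, and the nearest whole-cycle pattern is 5 + 5 + 6 = 16 cycles for three accesses, after which the phase repeats.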
By the way, I've seen people talking about turning off DMA refresh. But how does one's program code and data get refreshed if you do this? Or is it possible to selectively turn off DMA refresh for certain segments? (Not that this is a good piece of code to do this in. I am just curious.)
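For reference, what I understand "turning it off" to mean is just masking DMA channel 0 on the 8237, since on a stock PC/XT the refresh is timer channel 1 triggering a dummy read on that channel. Something like this (untested sketch; the port and bit assignments are the standard XT ones):

    cli                 ; keep interrupt handlers off the bus meanwhile
    mov al, 0x04        ; bit 2 = set mask, bits 1-0 = channel number 0
    out 0x0A, al        ; mask DMA channel 0: DRAM refresh stops here
    ; ... timing-critical code goes here; nothing is refreshing the
    ; DRAM now except our own code fetches and data accesses ...
    mov al, 0x00        ; bit 2 = clear mask for channel 0
    out 0x0A, al        ; refresh resumes
    sti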