VOGONS

CGA Graphics library

Reply 20 of 85, by wbhart

Rank: Newbie

I did some slightly more careful timings and it seems that the loop unrolling makes far more difference than I thought. I have been having trouble with GitHub serving up old versions of files and I suspect what happened is I ran an older version when timing after I did the loop unrolling.

Anyhow, the fastest version is around 80 cycles per pixel, give or take. XOR instead of AND/OR is about 10% faster. For some reason a version that just writes the bytes without reading graphics memory doesn't seem to be any faster than that. You'd expect it to be faster, as it doesn't incur the read from CGA memory that the XOR version must. So that's a mystery at this point. I will investigate.
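
To illustrate what is being compared per byte, here is a minimal sketch (the mask and colour values are just examples for the leftmost pixel of a byte, not the exact code in cga5.asm; DS and ES are assumed to point at 0b800h as in the real routines):

; general case: read the byte, clear the target pixel, merge the colour, write back
mov al, [di]             ; read from CGA memory
and al, 03fh             ; clear bits 7:6 (leftmost pixel of the byte)
or al, 080h              ; merge in colour 2, say
stosb                    ; write back, DI advances

; XOR case: one read-modify-write instruction does the whole update,
; provided the pixels being drawn over are known to be colour 0
xor byte ptr [di], 080h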

This coming week I'll work on verticalish lines, of course. There I don't have many good ideas. I had already unrolled by two, and I can of course use reenigne's two-instruction XOR trick to speed that up, but beyond that I haven't thought of anything new. I'm not keen on writing many different cases, and anyway the slowness of jumps would prevent that from being practical. I'll see what I can come up with, though.

YouTube Channel - PCRetroTech

Reply 21 of 85, by wbhart

Rank: Newbie

For those still playing along at home, I did some timings and 60 frames of 66 lines of width 320 now takes ~21.30s, which works out to just about exactly 80 cycles per pixel. That's almost a 10% speedup over the course of the week (my previous estimate of 75 cycles before this speedup turns out not to have been very accurate).
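
For anyone who wants to check the arithmetic (assuming the stock 4.77 MHz clock, i.e. 4,772,727 Hz):

60 frames x 66 lines x 320 pixels = 1,267,200 pixels
21.30 s x 4,772,727 Hz = ~101,660,000 CPU cycles
101,660,000 / 1,267,200 = ~80.2 cycles per pixel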

I've tried many other things without speeding it up, so I think this (cga5.asm : line1) is the fastest horizontalish general line drawing I can manage.

There is one more trick possible, however. When drawing black or white lines, one can use a single AND or OR instruction, respectively, instead of the read/AND/OR/write combination. Of course this multiplies the amount of code needed, but for the sake of the fastest possible performance, I will implement it.
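
In other words, something along these lines (example masks for the leftmost pixel of a byte; the real routine would of course pick the mask from the x coordinate):

; colour 0 (black): a single AND clears the two bits for the pixel
and byte ptr [di], 03fh

; colour 3 (white): a single OR sets the two bits for the pixel
or byte ptr [di], 0c0h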

I've also now extensively cleaned up line1 in cga5.asm. Barring any bugs, this is the final version of this function.

I will also write a version that doesn't disable interrupts, obviously at some cost. I'd like that to be very short code, so loop unrolling might not be a good idea for that version. But I'm simply not sure I can get close to the same performance without it. The one disadvantage of the current line1 is that it is not very good for short lines. The lead-in code is very long and expensive. The original "slow" version I wrote has the advantage of being fast for short lines.

YouTube Channel - PCRetroTech

Reply 23 of 85, by wbhart

Rank: Newbie

XORing, ANDing or ORing the lines brings the time down to 18.7s overall, or about 70 cycles per pixel. Obviously, if AND and OR are used to draw lines of colour 0 and 3 respectively, general random-colour lines average out at 75 cycles per pixel.
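
That 75 presumably is just the two cases averaged over random colours:

colours 0 and 3, single AND/OR update: ~70 cycles/pixel
colours 1 and 2, full read/modify/write: ~80 cycles/pixel
random colours: (70 + 80) / 2 = 75 cycles/pixel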

That's a nice figure to be at, since it is around half what we started with.

Line blanking is at 13.6s, which is around 51 cycles/pixel. This makes much more sense than the previous figures I had for this.

Obviously verticalish lines are next, though I can't think of many tricks there.

YouTube Channel - PCRetroTech

Reply 24 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

XORing, ANDing or ORing the lines brings the time down to 18.7s overall, or about 70 cycles per pixel. Obviously, if AND and OR are used to draw lines of colour 0 and 3 respectively, general random-colour lines average out at 75 cycles per pixel.

Very nice indeed! I think that must be very close to optimal, if not there already.

wbhart wrote:

Obviously verticalish lines are next, though I can't think of many tricks there.

If you point me at the best you've got so far, I'll see what ideas I can come up with! I'm afraid I kind of lost track of which versions of the routine are what.

Reply 25 of 85, by wbhart

Rank: Newbie

@reenigne So far for verticalish lines there's only the original versions of the routines I wrote before posting here. These are in cga.asm in _cga_draw_line3 and _cga_draw_line4.

One of them is for slightly right moving and the other for slightly left moving verticalish lines.

cga5.asm is the only other file worth looking at, but it only contains the new implementations for horizontalish lines. The _00 version is for lines of colour 0, _11 for lines of colour 3. The xor and blank versions are self-explanatory. line1 is the fastest general line drawing for horizontalish lines. The write1 version writes the pixel bits into the byte and sets all the other bits to zero. It's not useful unless your lines are nowhere near each other or near any other feature you don't want overwritten. I'll probably remove it, as it's not very useful in practice and only slightly faster than the more general routines.

For verticalish lines, the only improvements I think I can make are your xor trick and to unroll to do two pixels at once (which I think I already do), one on an odd line and one on an even line. Technically I think I could have two versions, one starting on an odd line and another starting on an even line so that the 8192 updates can be hard coded, instead of using the xor trick. But I can also just handle that with a computed jump into a loop unrolled by two, to save having two separate versions of the code.
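
A rough sketch of the computed-jump idea (registers, labels and the colour update here are just illustrative; the Bresenham error update, x stepping and the end-of-line fix-up are all omitted):

; DI -> first byte to plot, CX = number of row pairs, DL = low byte of y0,
; AH = colour bits already shifted into position, DS = 0b800h
test dl, 1               ; does the line start on an odd scanline?
jnz enter_odd            ; if so, enter the unrolled loop at its second half

enter_even:
or [di], ah              ; plot on the even scanline
add di, 2000h            ; the odd bank is 8192 bytes further on
enter_odd:
or [di], ah              ; plot on the odd scanline
sub di, 2000h-80         ; back to the even bank, one row pair down
loop enter_even          ; count row pairs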

I don't see any sensible way of handling x-increments without breaking up into many cases, which I'm not that keen on. I guess if I could do all the cases in less than 127 bytes so that only short jumps are required, it would be worth doing.

For the horizontalish lines I tried earlier this week to write an extra version swapping the jump taken/not taken cases. This would speed up lines that are more horizontal than not. But this meant duplicating parts of the unrolled loop many times as these are not if...else statements but just if statements, so all the rest of the loop must be repeated to make use of this trick. I couldn't quite get this into 127 bytes. It was something like 145 bytes, so I gave up on that. It's just too messy and bloated, even for a superfast no-holds-barred version.

I don't know until I try, but I suspect breaking verticalish lines into 8 cases and jumping between them in a kind of state machine approach would also suffer from the same problems. I think verticalish lines are just inherently slower due to having to write a byte for every pixel in the line.

YouTube Channel - PCRetroTech

Reply 26 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

Technically I think I could have two versions, one starting on an odd line and another starting on an even line so that the 8192 updates can be hard coded, instead of using the xor trick.

Yes, I think this is worth doing.

wbhart wrote:

I don't see any sensible way of handling x-increments without breaking up into many cases, which I'm not that keen on. I guess if I could do all the cases in less than 127 bytes so that only short jumps are required, it would be worth doing.

I think you can do it all with short jumps. It's actually more than 127 bytes but if you're careful to arrange things so that you never jump from one end of the routine to the other then that doesn't matter. Here's what I came up with (untested):

  push bp
push ds
push si
push di
mov bx,80-8192-1
mov ax,0xb800
mov es,ax
mov ds,ax
; TODO: set up initial bp, dx, si, di, cx
; TODO: jump to appropriate starting routine


vline_0_0:
dec cx
jz vline_done
vline_0_0_nocheckdone:
mov al,[di]
and al,0x3f
or al,0x80
stosb
add di,8191
add si,bp
jle vline_0_1
sub si,dx

vline_1_1:
dec cx
jz vline_done
mov al,[di]
and al,0xcf
or al,0x20
stosb
add di,bx
add si,bp
jle vline_1_0
sub si,dx

vline_2_0:
dec cx
jz vline_done
mov al,[di]
and al,0xf3
or al,0x08
stosb
add di,8191
add si,bp
jle vline_2_1
sub si,dx

vline_3_1:
dec cx
jz vline_done
mov al,[di]
and al,0xfc
or al,0x02
stosb
add di,bx
add si,bp
jle vline_3_0
sub si,dx
  loop vline_0_0_nocheckdone

vline_done:
pop di
pop si
pop ds
pop bp
ret

vline_0_1:
dec cx
jz vline_done
vline_0_1_nocheckdone:
mov al,[di]
and al,0x3f
or al,0x80
stosb
add di,bx
add si,bp
jle vline_0_0
sub si,dx

vline_1_0:
dec cx
jz vline_done
mov al,[di]
and al,0xcf
or al,0x20
stosb
add di,8191
add si,bp
jle vline_1_1
sub si,dx

vline_2_1:
dec cx
jz vline_done
mov al,[di]
and al,0xf3
or al,0x08
stosb
add di,bx
add si,bp
jle vline_2_0
sub si,dx

vline_3_0:
dec cx
jz vline_done
mov al,[di]
and al,0xfc
or al,0x02
stosb
add di,8191
add si,bp
jle vline_3_1
sub si,dx
loop vline_0_1_nocheckdone
jmp vline_done

Tricks used here:

  • Unrolled 8 cases and arranged them so that they fall through as often as possible and short jump to the other diagonal when necessary.
  • Put the end of the routine in the middle so that it is a short jump from both sides.
  • Both DS and ES point to VRAM, avoiding segment overrides.
  • Used LOOP when we can (only one in every 4 pixels, but still worthwhile).

This is for rightward (\) lines - you'll need to duplicate and rearrange for leftward. Then you'll need to duplicate again for each of the 4 colours, XOR and erase (so I've only written 1/12th of the full verticalish routine here if you implement all those variations). Colours 0 and 3 will be faster, as you've discovered.

Some minor possible optimisations I didn't implement:

  • It might be better to put 8191 in DX and use self-modifying code to patch an immediate value for the "horizontal move" si adjust case. But this is only worthwhile for longer lines because there are 8 places in CS: to patch.
  • It might be better to put -8192 in bx and use "mov [di+bx],al" instead of stosb for even scanlines. The "+bx" doesn't involve any extra code bytes so is approximately free. Then you only need to adjust di every other scanline and it's with a one-byte immediate instead of a two-byte immediate.
wbhart wrote:

I think verticalish lines are just inherently slower due to having to write a byte for every pixel in the line.

Yes, I think so too.

Reply 27 of 85, by wbhart

Rank: Newbie

@reenigne Ah, you probably won't believe I came up with this independently now, but here is what I've been working on this afternoon. It uses basically exactly the same ideas you had. And yes, I think it works, with a few bytes to spare.

  PUBLIC _cga_draw_line2
_cga_draw_line2 PROC
ARG x0:WORD, y0:WORD, xdiff:WORD, ydiff:WORD, D:WORD, yend:WORD, colour:BYTE
; line from (x0, y0) - (?, yend) including endpoints
; AL: colour, BX: ydelta, CX: Loop, DX: D, SP: 2*dy, BP: 2*dx,
; SI: ??, DI: Offset, DS:B800, ES: B800
push bp
mov bp, sp
push di
push si
push ds

mov ax, 0b800h ; set ES to segment for CGA memory
mov es, ax
mov ds, ax ; reflect in DS

sub dx, bp ; compensate for first addition of 2*dx - 2*dy

line2_loop1:

mov al, [di]
and al, 03fh
line2_patch1:
or al, 0c0h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx21
add dx, sp ; D += 2*dy
line2_incx11:
add di, 8191

mov al, [di]
and al, 03fh
line2_patch2:
or al, 0c0h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx22
add dx, sp ; D += 2*dy
line2_incx12:
sub di, 8112

loop line2_loop1
jmp line2_no_iter


line2_loop4:

mov al, [di]
and al, 0fch
line2_patch3:
or al, 03h
stosb
inc di ; move to next byte, maybe?
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx11
dec di
add dx, sp ; D += 2*dy
line2_incx41:
add di, 8191

mov al, [di]
and al, 0fch
line2_patch4:
or al, 03h
stosb
inc di ; move to next byte, maybe?
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx12
dec di
add dx, sp ; D += 2*dy
line2_incx42:
sub di, 8112

loop line2_loop4
jmp line2_no_iter


line2_loop2:

mov al, [di]
and al, 0cfh
line2_patch5:
or al, 030h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx31
add dx, sp ; D += 2*dy
line2_incx21:
add di, 8191

mov al, [di]
and al, 0cfh
line2_patch6:
or al, 030h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx32
add dx, sp ; D += 2*dy
line2_incx22:
sub di, 8112

loop line2_loop2
jmp line2_no_iter


line2_loop3:

mov al, [di]
and al, 0f3h
line2_patch7:
or al, 0ch
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx41
add dx, sp ; D += 2*dy
line2_incx31:
add di, 8191

mov al, [di]
and al, 0f3h
line2_patch8:
or al, 0ch
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx42
add dx, sp ; D += 2*dy
line2_incx32:
sub di, 8112

loop line2_loop3

line2_no_iter:

pop ds
pop si
pop di
pop bp
ret
_cga_draw_line2 ENDP

It's incomplete, but the ideas are there. I'll try and clean it up tonight and compare more closely with yours, which I've just seen now.

Last edited by wbhart on 2019-08-30, 21:56. Edited 1 time in total.

YouTube Channel - PCRetroTech

Reply 28 of 85, by wbhart

Rank: Newbie

@reenigne Actually, I think your approach is slightly different. You have the 8 cases in 2 sets of 4, where I have them in 4 sets of 2, I think.

It would actually be interesting to compare the two approaches.

YouTube Channel - PCRetroTech

Reply 29 of 85, by wbhart

Rank: Newbie
reenigne wrote:
  • It might be better to put -8192 in bx and use "mov [di+bx],al" instead of stosb for even scanlines. The "+bx" doesn't involve any extra code bytes so is approximately free. Then you only need to adjust di every other scanline and it's with a one-byte immediate instead of a two-byte immediate.

That's a nice trick. There are still two registers free in my implementation (BP and SP in the new version I'm about to commit), so there's plenty of scope for improvements like this.

YouTube Channel - PCRetroTech

Reply 30 of 85, by wbhart

Rank: Newbie

The newest version (_cga_draw_line2 in cga5.asm) seems to be more or less correct, modulo some todos. I can now time it.

I do lines from 0,0 to i,199 with i in [0, 198] in increments of 3, i.e. 67 lines. I repeat 60 times. The total time is 17.3s, which works out to about 103 cycles/pixel. Edit: no it's not, it's 19.7s, which is ~117 cycles/pixel.
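
For the record, the arithmetic behind the corrected figure (again assuming a 4,772,727 Hz clock; each verticalish line is 200 pixels, one per scanline):

60 repeats x 67 lines x 200 pixels = 804,000 pixels
19.7 s x 4,772,727 Hz = ~94,020,000 CPU cycles
94,020,000 / 804,000 = ~117 cycles per pixel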

Last edited by wbhart on 2019-08-31, 12:01. Edited 1 time in total.

YouTube Channel - PCRetroTech

Reply 31 of 85, by wbhart

Rank: Newbie
wbhart wrote:
reenigne wrote:
  • It might be better to put -8192 in bx and use "mov [di+bx],al" instead of stosb for even scanlines. The "+bx" doesn't involve any extra code bytes so is approximately free. Then you only need to adjust di every other scanline and it's with a one-byte immediate instead of a two-byte immediate.

That's a nice trick. There are still two registers free in my implementation (BP and SP in the new version I'm about to commit), so there's plenty of chances to do improvements like this.

Unfortunately this doesn't quite work. The address stosb writes to isn't correct if you do this.

If you replace stosb with an explicit move you add 3 bytes (the direction flag must be set and cleared for a move in that direction). There's also an additional two cycles for effective address calculation over a stosb. This doesn't appear to balance the savings elsewhere.

YouTube Channel - PCRetroTech

Reply 32 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

Unfortunately this doesn't quite work. The address stosb writes to isn't correct if you do this.

If you replace stosb with an explicit move you add 3 bytes (the direction flag must be set and cleared for a move in that direction). There's also an additional two cycles for effective address calculation over a stosb. This doesn't appear to balance the savings elsewhere.

Hmm... this is what I'm thinking of. Have I missed something?

  push bp
push ds
push si
push di
mov bx,-8192
mov ax,0xb800
mov es,ax
mov ds,ax
; TODO: set up initial bp, dx, si, di, cx
; TODO: jump to appropriate starting routine


vline_0_0:
dec cx
jz vline_done
vline_0_0_nocheckdone:
mov al,[di+bx]
and al,0x3f
or al,0x80
mov [di+bx],al
add si,bp
jle vline_0_1
sub si,dx

vline_1_1:
dec cx
jz vline_done
mov al,[di]
and al,0xcf
or al,0x20
stosb
add di,79
add si,bp
jle vline_1_0
sub si,dx

vline_2_0:
dec cx
jz vline_done
mov al,[di+bx]
and al,0xf3
or al,0x08
mov [di+bx],al
add si,bp
jle vline_2_1
sub si,dx

vline_3_1:
dec cx
jz vline_done
mov al,[di]
and al,0xfc
or al,0x02
stosb
add di,79
add si,bp
jle vline_3_0
sub si,dx
loop vline_0_0_nocheckdone

vline_done:
pop di
pop si
pop ds
pop bp
ret

vline_0_1:
dec cx
jz vline_done
vline_0_1_nocheckdone:
mov al,[di]
and al,0x3f
or al,0x80
stosb
add di,79
add si,bp
jle vline_0_0
sub si,dx

vline_1_0:
dec cx
jz vline_done
mov al,[di+bx]
and al,0xcf
or al,0x20
mov [di+bx],al
add si,bp
jle vline_1_1
sub si,dx

vline_2_1:
dec cx
jz vline_done
mov al,[di]
and al,0xf3
or al,0x08
stosb
add di,79
add si,bp
jle vline_2_0
sub si,dx

vline_3_0:
dec cx
jz vline_done
mov al,[di+bx]
and al,0xfc
or al,0x02
mov [di+bx],al
add si,bp
jle vline_3_1
sub si,dx
loop vline_0_1_nocheckdone
jmp vline_done

DI here points to the next byte to write in the second bank (odd scanlines). The stosb happens on the odd scanlines, so then we adjust by 79 afterwards to move down to the next pair of lines. BX is 0xe000 (-8192) so we add that on to the effective address to get to the even scanlines, and there is no adjustment necessary. This saves one bus cycle per pixel on average by my reckoning, so should be a win. Though admittedly I haven't tried applying this to your version of the code - it might be more difficult to get it to work there.

Reply 33 of 85, by wbhart

Rank: Newbie

Something weird is going on. The timing that was 17.3s last night is 19.7s this morning. I do have a dodgy trimpot in my PC, and my understanding is it changes the clock frequency slightly, to alter the colours in NTSC output. But I don't know whether it could make that much difference. I'm quite puzzled, as I timed it three times last night to make sure.

Anyhow, your trick does work and it is a speedup of about 0.5s overall. I was simply being misled by the timings being different today from what I reported yesterday. I just checked out the commit from last night to test this.

But note that mov [di+bx], al actually emits three assembly instructions for a total of 4 bytes, instead of one byte for stosb. So it's more expensive than it looks.

YouTube Channel - PCRetroTech

Reply 34 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

Something weird is going on. The timing that was 17.3s last night is 19.7s this morning. I do have a dodgy trimpot in my PC, and my understanding is it changes the clock frequency slightly, to alter the colours in NTSC output. But I don't know whether it could make that much difference. I'm quite puzzled, as I timed it three times last night to make sure.

The trimpot won't make any difference at all for a 4.77MHz machine since the crystal it adjusts clocks both the CPU and the PIT that you are using for timing. Speeding both up by the same amount won't affect the reported value. But even if the CPU is clocked by a different crystal the adjustment is tiny, not 10%+. There may be variations in measurement due to the phases of different clocks but again this seems very high for that sort of effect. Could you have been compiling the wrong version of the code, or running an out of date executable?

wbhart wrote:

But note that mov [di+bx], al actually emits three assembly instructions for a total of 4 bytes, instead of one byte for stosb. So it's more expensive than it looks.

"mov [di+bx],al" should be a single two-byte instruction (88 01).

Reply 35 of 85, by wbhart

Rank: Newbie
reenigne wrote:
wbhart wrote:

Something weird is going on. The timing that was 17.3s last night is 19.7s this morning. I do have a dodgy trimpot in my PC, and my understanding is it changes the clock frequency slightly, to alter the colours in NTSC output. But I don't know whether it could make that much difference. I'm quite puzzled, as I timed it three times last night to make sure.

The trimpot won't make any difference at all for a 4.77MHz machine since the crystal it adjusts clocks both the CPU and the PIT that you are using for timing. Speeding both up by the same amount won't affect the reported value. But even if the CPU is clocked by a different crystal the adjustment is tiny, not 10%+. There may be variations in measurement due to the phases of different clocks but again this seems very high for that sort of effect. Could you have been compiling the wrong version of the code, or running an out of date executable?

Yeah it's not the trimpot, I checked. I'm not timing using the PIT, but just a stopwatch. The timings can be out by +/-0.1s this way, but not 2.4s!

I really don't have any explanation for it. Nothing makes sense, because one sees the output on the screen, so it isn't as if I could be running the wrong executable. It's the first time I had it working at all, so I don't see how it could have been the wrong code. It's mystifying.

reenigne wrote:
wbhart wrote:

But note that mov [di+bx], al actually emits three assembly instructions for a total of 4 bytes, instead of one byte for stosb. So it's more expensive than it looks.

"mov [di+bx],al" should be a single two-byte instruction (88 01).

Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.

Now I don't have a clue why I had to add 3 bytes to my computed jumps after changing from stosb to this. I must have messed something up somewhere or other.

YouTube Channel - PCRetroTech

Reply 36 of 85, by wbhart

Rank: Newbie

I implemented your [bx+di] trick again. No problems with byte counting this time. I don't know what I did wrong before.

Anyhow, it is at 18.2s now, or ~108 cycles/pixel.

I don't see any obvious improvements now. I think my version uses one less instruction per pixel on average than yours, but I didn't count bus cycles. If you think it will be faster, I'll try implementing it at some point.

YouTube Channel - PCRetroTech

Reply 37 of 85, by wbhart

Rank: Newbie

As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.

What's not clear to me is at what point during execution of a memory access instruction the CPU will make the request on the bus. It seems to be relevant because the number of cycles the memory access instructions take is not the same. Thus whether the wait state occurs at the beginning or end of the instruction will affect the synchronization.

YouTube Channel - PCRetroTech

Reply 38 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

I really don't have any explanation for it. Nothing makes sense, because one sees the output on the screen, so it isn't as if I could be running the wrong executable. It's the first time I had it working at all, so I don't see how it could have been the wrong code. It's mystifying.

Maybe the random number generator that you are using makes longer lines with some seeds than with others (the LCG generators that are normally used for random() do have problems like this sometimes).

wbhart wrote:

Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.

The direction flag controls whether DI is incremented or decremented by STOSB (and similarly for other string instructions). Normally it's left clear so that DI is incremented.
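
A minimal illustration:

cld                      ; DF = 0: string instructions auto-increment
mov al, 0c0h
stosb                    ; AL -> ES:[DI], then DI = DI + 1

std                      ; DF = 1: string instructions auto-decrement
stosb                    ; AL -> ES:[DI], then DI = DI - 1
cld                      ; restore the usual convention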

wbhart wrote:

I don't see any obvious improvements now. I think my version uses one less instruction per pixel on average than yours,

I was sceptical but your trick of only checking the iteration count every other scanline and fixing up the last pixel at the end could very well do it - impressive!

wbhart wrote:

As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.

Perhaps! Though the programmer has very little control over this. About all you can do, I think, is to try rearranging things and see if it makes it faster. And because the routine is already quite highly constrained, about the only rearrangement that I can see is moving the "add dx,bp" lines above the "mov [bx+di],al" lines, the "stosb" lines, or both.

wbhart wrote:

What's not clear to me is at what point during execution of a memory access instruction the CPU will make the request on the bus. It seems to be relevant because the number of cycles the memory access instructions take is not the same. Thus whether the wait state occurs at the beginning or end of the instruction will affect the synchronization.

I can make some cycle-by-cycle traces if you like. But I'm not sure how enlightening they will be. The cycle that you consider the "start" of an instruction or the start of a bus cycle is a bit arbitrary. And the rules are annoyingly complex, and sometimes get broken by DRAM refresh DMAs anyway.

Reply 39 of 85, by wbhart

Rank: Newbie
reenigne wrote:
wbhart wrote:

I really don't have any explanation for it. Nothing makes sense, because one sees the output on the screen, so it isn't as if I could be running the wrong executable. It's the first time I had it working at all, so I don't see how it could have been the wrong code. It's mystifying.

Maybe the random number generator that you are using makes longer lines with some seeds than with others (the LCG generators that are normally used for random() do have problems like this sometimes).

I'm not drawing random lines, unfortunately. I am still mystified by this. I've been over all the possibilities I can think of and still don't see how it's possible. The best I can come up with is that I timed it at 19.7s three times in a row, noted with satisfaction that this is faster than the horizontalish lines (in absolute time), which it isn't, but only because the verticalish lines are 200 pixels long and the horizontalish ones 320, and then forgot the number I had measured three times while I walked down the hallway to type it into this forum. I can just about exclude every other possibility! I am really genuinely mystified. I mean, I checked it three times!

reenigne wrote:
wbhart wrote:

Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.

The direction flag controls whether DI is incremented or decremented by STOSB (and similarly for other string instructions). Normally it's left clear so that DI is incremented.

Sure, which is why it totally shouldn't be used to describe mov mem/reg, mem/reg. It had me believing that reg would be on the left if the direction flag was 0 and on the right if 1. But that is not what they mean at all. There is a bit in the instruction encoding itself that they are calling the direction flag. It actually says, "d is the direction flag. If d = 0...." To compound things, I was unable to figure out why the number of bytes in my computed jump was off by two. I eventually decided that it must be because the assembler emits a std before the mov and a cld after the mov. Of course this is totally illogical, but it fit all the evidence I had at the time. I should have checked more carefully, e.g. by disassembly.

reenigne wrote:
wbhart wrote:

As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.

Perhaps! Though the programmer has very little control over this. About all you can do, I think, is to try rearranging things and see if it makes it faster. And because the routine is already quite highly constrained, about the only rearrangement that I can see is moving the "add dx,bp" lines above the "mov [bx+di],al" lines, the "stosb" lines, or both.

Yes, I will certainly try some superoptimisation. It didn't seem to affect the horizontalish lines at all and aligning the loop targets to 16 bit boundaries also does nothing for horizontalish or verticalish lines. Presumably the prefetch is already full and this isn't a problem at this point, even with the additional bus access.

I note that the number of bus accesses is probably much lower than the number of cpu cycles required for the instructions in this case. I'm also not sure I understand when I should add bus access times to the instruction timings. Is that only if there is a holdup due to too many bus accesses? Or does the CPU always incur these costs regardless? If the latter, then I am having trouble figuring out the timings, as it seems to be entirely accounted for by CPU instruction timings! We are using some pretty hefty instructions.

reenigne wrote:
wbhart wrote:

What's not clear to me is at what point during execution of a memory access instruction the CPU will make the request on the bus. It seems to be relevant because the number of cycles the memory access instructions take is not the same. Thus whether the wait state occurs at the beginning or end of the instruction will affect the synchronization.

I can make some cycle-by-cycle traces if you like. But I'm not sure how enlightening they will be. The cycle that you consider the "start" of an instruction or the start of a bus cycle is a bit arbitrary. And the rules are annoyingly complex, and sometimes get broken by DRAM refresh DMAs anyway.

Logic dictates it must be towards the end of the instruction, as EA needs to be computed and so on. But it does seem like writes can take a cycle less than reads from memory, if the mov mem/reg, mem/reg timings are anything to go by.

I was thinking that a 5/5/6 cycle cadence would be optimal for CGA accesses based on the wait state information in your blog. That seems to imply that if the code is timed to request data from the CGA memory after 5, 10 or 15 cycles, that would be optimal, given that we cannot insert an extra cycle every 16 cycles (and DMA refresh probably ruins optimality anyway).
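
If I've understood the clocks correctly, the 5/5/6 cadence falls straight out of the two clocks being derived from the same 14.31818 MHz crystal:

CPU clock = 14.31818 MHz / 3 = ~4.7727 MHz
CGA character clock (lchar) = 14.31818 MHz / 16 = ~0.8949 MHz
one lchar = 16/3 = ~5.33 CPU cycles, so three lchars = exactly 16 CPU cycles,
giving one CGA access opportunity every 5, 5, 6 CPU cycles in turn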

By the way, I've seen people talking about turning off DMA refresh. But how does one's program code and data get refreshed if you do this? Or is it possible to selectively turn off DMA refresh for certain segments? (Not that this is a good piece of code to do this in. I am just curious.)

YouTube Channel - PCRetroTech