VOGONS

CGA Graphics library

Reply 20 of 85, by wbhart

Rank: Newbie

I did some slightly more careful timings and it seems that the loop unrolling makes far more difference than I thought. I have been having trouble with GitHub serving up old versions of files and I suspect what happened is I ran an older version when timing after I did the loop unrolling.

Anyhow, the fastest version is around 80 cycles per pixel, give or take. XOR instead of AND/OR is about 10% faster. For some reason a version that just writes the bytes without reading graphics memory doesn't seem to be any faster than that. You'd expect it to be faster, as it doesn't incur the read from CGA memory that the XOR version must. So that's a mystery at this point. I will investigate.
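
To illustrate what is being compared per byte, here is a minimal sketch (the mask and colour values are just examples for the leftmost pixel of a byte, not the exact code in cga5.asm; DS and ES are assumed to point at 0b800h as in the real routines):

; general case: read the byte, clear the target pixel, merge the colour, write back
mov al, [di]             ; read from CGA memory
and al, 03fh             ; clear bits 7:6 (leftmost pixel of the byte)
or al, 080h              ; merge in colour 2, say
stosb                    ; write back, DI advances

; XOR case: one read-modify-write instruction does the whole update,
; provided the pixels being drawn over are known to be colour 0
xor byte ptr [di], 080h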

This coming week I'll work on verticalish lines, of course. There I don't have many good ideas. I had already unrolled by two, and I can of course use reenigne's two-instruction XOR trick to speed that up, but beyond that I haven't thought of anything new. I'm not keen on writing many different cases, and anyway the slowness of jumps would prevent that from being practical. I'll see what I can come up with, though.

YouTube Channel - PCRetroTech

Reply 21 of 85, by wbhart

Rank: Newbie

For those still playing along at home, I did some timings and 60 frames of 66 lines of width 320 now takes ~21.30s, which works out to just about exactly 80 cycles per pixel. That's almost a 10% speedup over the course of the week (my previous estimate of 75 cycles before this speedup turns out not to have been very accurate).
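
For anyone who wants to check the arithmetic (assuming the stock 4.77 MHz clock, i.e. 4,772,727 Hz):

60 frames x 66 lines x 320 pixels = 1,267,200 pixels
21.30 s x 4,772,727 Hz = ~101,660,000 CPU cycles
101,660,000 / 1,267,200 = ~80.2 cycles per pixel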

I've tried many other things without speeding it up, so I think this (cga5.asm : line1) is the fastest horizontalish general line drawing I can manage.

There is one more trick possible, however. When drawing black or white lines, one can use a single AND or OR instruction, respectively, instead of the read/AND/OR/write combination. Of course this multiplies the amount of code needed, but for the sake of the fastest possible performance, I will implement it.
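
In other words, something along these lines (example masks for the leftmost pixel of a byte; the real routine would of course pick the mask from the x coordinate):

; colour 0 (black): a single AND clears the two bits for the pixel
and byte ptr [di], 03fh

; colour 3 (white): a single OR sets the two bits for the pixel
or byte ptr [di], 0c0h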

I've also now extensively cleaned up line1 in cga5.asm. Barring any bugs, this is the final version of this function.

I will also write a version that doesn't disable interrupts, obviously at some cost. I'd like that to be very short code, so loop unrolling might not be a good idea for that version. But I'm simply not sure I can get close to the same performance without it. The one disadvantage of the current line1 is that it is not very good for short lines. The lead-in code is very long and expensive. The original "slow" version I wrote has the advantage of being fast for short lines.

YouTube Channel - PCRetroTech

Reply 23 of 85, by wbhart

Rank: Newbie

XORing, ANDing or ORing the lines brings the time down to 18.7s overall, or about 70 cycles per pixel. Obviously, if AND and OR are used to draw lines of colour 0 and 3 respectively, general random-colour lines average out at 75 cycles per pixel.
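
That 75 presumably is just the two cases averaged over random colours:

colours 0 and 3, single AND/OR update: ~70 cycles/pixel
colours 1 and 2, full read/modify/write: ~80 cycles/pixel
random colours: (70 + 80) / 2 = 75 cycles/pixel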

That's a nice figure to be at, since it is around half what we started with.

Line blanking is at 13.6s, which is around 51 cycles/pixel. This makes much more sense than the previous figures I had for this.

Obviously verticalish lines are next, though I can't think of many tricks there.

YouTube Channel - PCRetroTech

Reply 24 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

XORing, ANDing or ORing the lines brings the time down to 18.7s overall, or about 70 cycles per pixel. Obviously, if AND and OR are used to draw lines of colour 0 and 3 respectively, general random-colour lines average out at 75 cycles per pixel.

Very nice indeed! I think that must be very close to optimal, if not there already.

wbhart wrote:

Obviously verticalish lines are next, though I can't think of many tricks there.

If you point me at the best you've got so far, I'll see what ideas I can come up with! I'm afraid I kind of lost track of which versions of the routine are what.

Reply 25 of 85, by wbhart

Rank: Newbie

@reenigne So far for verticalish lines there's only the original versions of the routines I wrote before posting here. These are in cga.asm in _cga_draw_line3 and _cga_draw_line4.

One of them is for slightly right moving and the other for slightly left moving verticalish lines.

cga5.asm is the only other file worth looking at, but it only contains the new implementations for horizontalish lines. The _00 version is for lines of colour 0, _11 for lines of colour 3. The xor and blank versions are self-explanatory. line1 is the fastest general line drawing for horizontalish lines. The write1 version writes the pixel bits into the byte and sets all the other bits to zero. It's not useful unless your lines are nowhere near each other or near any other feature you don't want overwritten. I'll probably remove it, as it's not very useful in practice and only slightly faster than the more general routines.

For verticalish lines, the only improvements I think I can make are your xor trick and to unroll to do two pixels at once (which I think I already do), one on an odd line and one on an even line. Technically I think I could have two versions, one starting on an odd line and another starting on an even line so that the 8192 updates can be hard coded, instead of using the xor trick. But I can also just handle that with a computed jump into a loop unrolled by two, to save having two separate versions of the code.
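
A rough sketch of the computed-jump idea (registers, labels and the colour update here are just illustrative; the Bresenham error update, x stepping and the end-of-line fix-up are all omitted):

; DI -> first byte to plot, CX = number of row pairs, DL = low byte of y0,
; AH = colour bits already shifted into position, DS = 0b800h
test dl, 1               ; does the line start on an odd scanline?
jnz enter_odd            ; if so, enter the unrolled loop at its second half

enter_even:
or [di], ah              ; plot on the even scanline
add di, 2000h            ; the odd bank is 8192 bytes further on
enter_odd:
or [di], ah              ; plot on the odd scanline
sub di, 2000h-80         ; back to the even bank, one row pair down
loop enter_even          ; count row pairs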

I don't see any sensible way of handling x-increments without breaking up into many cases, which I'm not that keen on. I guess if I could do all the cases in less than 127 bytes so that only short jumps are required, it would be worth doing.

For the horizontalish lines I tried earlier this week to write an extra version swapping the jump taken/not taken cases. This would speed up lines that are more horizontal than not. But this meant duplicating parts of the unrolled loop many times as these are not if...else statements but just if statements, so all the rest of the loop must be repeated to make use of this trick. I couldn't quite get this into 127 bytes. It was something like 145 bytes, so I gave up on that. It's just too messy and bloated, even for a superfast no-holds-barred version.

I don't know until I try, but I suspect breaking verticalish lines into 8 cases and jumping between them in a kind of state machine approach would also suffer from the same problems. I think verticalish lines are just inherently slower due to having to write a byte for every pixel in the line.

YouTube Channel - PCRetroTech

Reply 26 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

Technically I think I could have two versions, one starting on an odd line and another starting on an even line so that the 8192 updates can be hard coded, instead of using the xor trick.

Yes, I think this is worth doing.

wbhart wrote:

I don't see any sensible way of handling x-increments without breaking up into many cases, which I'm not that keen on. I guess if I could do all the cases in less than 127 bytes so that only short jumps are required, it would be worth doing.

I think you can do it all with short jumps. It's actually more than 127 bytes but if you're careful to arrange things so that you never jump from one end of the routine to the other then that doesn't matter. Here's what I came up with (untested):

  push bp
push ds
push si
push di
mov bx,80-8192-1
mov ax,0xb800
mov es,ax
mov ds,ax
; TODO: set up initial bp, dx, si, di, cx
; TODO: jump to appropriate starting routine


vline_0_0:
dec cx
jz vline_done
vline_0_0_nocheckdone:
mov al,[di]
and al,0x3f
or al,0x80
stosb
add di,8191
add si,bp
jle vline_0_1
sub si,dx

vline_1_1:
dec cx
jz vline_done
mov al,[di]
and al,0xcf
or al,0x20
stosb
add di,bx
add si,bp
jle vline_1_0
sub si,dx

vline_2_0:
dec cx
jz vline_done
mov al,[di]
and al,0xf3
or al,0x08
stosb
add di,8191
add si,bp
jle vline_2_1
sub si,dx

vline_3_1:
dec cx
jz vline_done
mov al,[di]
and al,0xfc
or al,0x02
stosb
add di,bx
add si,bp
jle vline_3_0
sub si,dx
  loop vline_0_0_nocheckdone

vline_done:
pop di
pop si
pop ds
pop bp
ret

vline_0_1:
dec cx
jz vline_done
vline_0_1_nocheckdone:
mov al,[di]
and al,0x3f
or al,0x80
stosb
add di,bx
add si,bp
jle vline_0_0
sub si,dx

vline_1_0:
dec cx
jz vline_done
mov al,[di]
and al,0xcf
or al,0x20
stosb
add di,8191
add si,bp
jle vline_1_1
sub si,dx

vline_2_1:
dec cx
jz vline_done
mov al,[di]
and al,0xf3
or al,0x08
stosb
add di,bx
add si,bp
jle vline_2_0
sub si,dx

vline_3_0:
dec cx
jz vline_done
mov al,[di]
and al,0xfc
or al,0x02
stosb
add di,8191
add si,bp
jle vline_3_1
sub si,dx
loop vline_0_1_nocheckdone
jmp vline_done

Tricks used here:

  • Unrolled 8 cases and arranged them so that they fall through as often as possible and short jump to the other diagonal when necessary.
  • Put the end of the routine in the middle so that it is a short jump from both sides.
  • Both DS and ES point to VRAM, avoiding segment overrides.
  • Used LOOP when we can (only one in every 4 pixels, but still worthwhile).

This is for rightward (\) lines - you'll need to duplicate and rearrange for leftward. Then you'll need to duplicate again for each of the 4 colours, XOR and erase (so I've only written 1/12th of the full verticalish routine here if you implement all those variations). Colours 0 and 3 will be faster, as you've discovered.

Some minor possible optimisations I didn't implement:

  • It might be better to put 8191 in DX and use self-modifying code to patch an immediate value for the "horizontal move" si adjust case. But this is only worthwhile for longer lines because there are 8 places in CS: to patch.
  • It might be better to put -8192 in bx and use "mov [di+bx],al" instead of stosb for even scanlines. The "+bx" doesn't involve any extra code bytes so is approximately free. Then you only need to adjust di every other scanline and it's with a one-byte immediate instead of a two-byte immediate.
wbhart wrote:

I think verticalish lines are just inherently slower due to having to write a byte for every pixel in the line.

Yes, I think so too.

Reply 27 of 85, by wbhart

Rank: Newbie

@reenigne Ah, you probably won't believe I came up with this independently now, but here is what I've been working on this afternoon. It uses basically exactly the same ideas you had. And yes, I think it works, with a few bytes to spare.

  PUBLIC _cga_draw_line2
_cga_draw_line2 PROC
ARG x0:WORD, y0:WORD, xdiff:WORD, ydiff:WORD, D:WORD, yend:WORD, colour:BYTE
; line from (x0, y0) - (?, yend) including endpoints
; AL: colour, BX: ydelta, CX: Loop, DX: D, SP: 2*dy, BP: 2*dx,
; SI: ??, DI: Offset, DS:B800, ES: B800
push bp
mov bp, sp
push di
push si
push ds

mov ax, 0b800h ; set ES to segment for CGA memory
mov es, ax
mov ds, ax ; reflect in DS

sub dx, bp ; compensate for first addition of 2*dx - 2*dy

line2_loop1:

mov al, [di]
and al, 03fh
line2_patch1:
or al, 0c0h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx21
add dx, sp ; D += 2*dy
line2_incx11:
add di, 8191

mov al, [di]
and al, 03fh
line2_patch2:
or al, 0c0h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx22
add dx, sp ; D += 2*dy
line2_incx12:
sub di, 8112

loop line2_loop1
jmp line2_no_iter


line2_loop4:

mov al, [di]
and al, 0fch
line2_patch3:
or al, 03h
stosb
inc di ; move to next byte, maybe?
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx11
dec di
add dx, sp ; D += 2*dy
line2_incx41:
add di, 8191

mov al, [di]
and al, 0fch
line2_patch4:
or al, 03h
stosb
inc di ; move to next byte, maybe?
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx12
dec di
add dx, sp ; D += 2*dy
line2_incx42:
sub di, 8112

loop line2_loop4
jmp line2_no_iter


line2_loop2:

mov al, [di]
and al, 0cfh
line2_patch5:
or al, 030h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx31
add dx, sp ; D += 2*dy
line2_incx21:
add di, 8191

mov al, [di]
and al, 0cfh
line2_patch6:
or al, 030h
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx32
add dx, sp ; D += 2*dy
line2_incx22:
sub di, 8112

loop line2_loop2
jmp line2_no_iter


line2_loop3:

mov al, [di]
and al, 0f3h
line2_patch7:
or al, 0ch
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx41
add dx, sp ; D += 2*dy
line2_incx31:
add di, 8191

mov al, [di]
and al, 0f3h
line2_patch8:
or al, 0ch
stosb
add dx, bp ; D += 2*dx - 2*dy
jg line2_incx42
add dx, sp ; D += 2*dy
line2_incx32:
sub di, 8112

loop line2_loop3

line2_no_iter:

pop ds
pop si
pop di
pop bp
ret
_cga_draw_line2 ENDP

It's incomplete, but the ideas are there. I'll try and clean it up tonight and compare more closely with yours, which I've just seen now.

Last edited by wbhart on 2019-08-30, 21:56. Edited 1 time in total.

YouTube Channel - PCRetroTech

Reply 28 of 85, by wbhart

Rank: Newbie

@reenigne Actually, I think your approach is slightly different. You have the 8 cases in 2 sets of 4, where I have them in 4 sets of 2, I think.

It would actually be interesting to compare the two approaches.

YouTube Channel - PCRetroTech

Reply 29 of 85, by wbhart

Rank: Newbie
reenigne wrote:
  • It might be better to put -8192 in bx and use "mov [di+bx],al" instead of stosb for even scanlines. The "+bx" doesn't involve any extra code bytes so is approximately free. Then you only need to adjust di every other scanline and it's with a one-byte immediate instead of a two-byte immediate.

That's a nice trick. There are still two registers free in my implementation (BP and SP in the new version I'm about to commit), so there's plenty of scope for improvements like this.

YouTube Channel - PCRetroTech

Reply 30 of 85, by wbhart

Rank: Newbie

The newest version (_cga_draw_line2 in cga5.asm) seems to be more or less correct, modulo some todos. I can now time it.

I do lines from 0,0 to i,199 with i in [0, 198] in increments of 3, i.e. 67 lines. I repeat 60 times. The total time is 17.3s, which works out to about 103 cycles/pixel. Edit: no it's not, it's 19.7s, which is ~117 cycles/pixel.
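
For the record, the arithmetic behind the corrected figure (again assuming a 4,772,727 Hz clock; each verticalish line is 200 pixels, one per scanline):

60 repeats x 67 lines x 200 pixels = 804,000 pixels
19.7 s x 4,772,727 Hz = ~94,020,000 CPU cycles
94,020,000 / 804,000 = ~117 cycles per pixel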

Last edited by wbhart on 2019-08-31, 12:01. Edited 1 time in total.

YouTube Channel - PCRetroTech

Reply 31 of 85, by wbhart

Rank: Newbie
wbhart wrote:
reenigne wrote:
  • It might be better to put -8192 in bx and use "mov [di+bx],al" instead of stosb for even scanlines. The "+bx" doesn't involve any extra code bytes so is approximately free. Then you only need to adjust di every other scanline and it's with a one-byte immediate instead of a two-byte immediate.

That's a nice trick. There are still two registers free in my implementation (BP and SP in the new version I'm about to commit), so there's plenty of chances to do improvements like this.

Unfortunately this doesn't quite work. The address stosb writes to isn't correct if you do this.

If you replace stosb with an explicit move you add 3 bytes (the direction flag must be set and cleared for a move in that direction). There's also an additional two cycles for effective address calculation over a stosb. This doesn't appear to balance the savings elsewhere.

YouTube Channel - PCRetroTech

Reply 32 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

Unfortunately this doesn't quite work. The address stosb writes to isn't correct if you do this.

If you replace stosb with an explicit move you add 3 bytes (the direction flag must be set and cleared for a move in that direction). There's also an additional two cycles for effective address calculation over a stosb. This doesn't appear to balance the savings elsewhere.

Hmm... this is what I'm thinking of. Have I missed something?

  push bp
push ds
push si
push di
mov bx,-8192
mov ax,0xb800
mov es,ax
mov ds,ax
; TODO: set up initial bp, dx, si, di, cx
; TODO: jump to appropriate starting routine


vline_0_0:
dec cx
jz vline_done
vline_0_0_nocheckdone:
mov al,[di+bx]
and al,0x3f
or al,0x80
mov [di+bx],al
add si,bp
jle vline_0_1
sub si,dx

vline_1_1:
dec cx
jz vline_done
mov al,[di]
and al,0xcf
or al,0x20
stosb
add di,79
add si,bp
jle vline_1_0
sub si,dx

vline_2_0:
dec cx
jz vline_done
mov al,[di+bx]
and al,0xf3
or al,0x08
mov [di+bx],al
add si,bp
jle vline_2_1
sub si,dx

vline_3_1:
dec cx
jz vline_done
mov al,[di]
and al,0xfc
or al,0x02
stosb
add di,79
add si,bp
jle vline_3_0
sub si,dx
loop vline_0_0_nocheckdone

vline_done:
pop di
pop si
pop ds
pop bp
ret

vline_0_1:
dec cx
jz vline_done
vline_0_1_nocheckdone:
mov al,[di]
and al,0x3f
or al,0x80
stosb
add di,79
add si,bp
jle vline_0_0
sub si,dx

vline_1_0:
dec cx
jz vline_done
mov al,[di+bx]
and al,0xcf
or al,0x20
mov [di+bx],al
add si,bp
jle vline_1_1
sub si,dx

vline_2_1:
dec cx
jz vline_done
mov al,[di]
and al,0xf3
or al,0x08
stosb
add di,79
add si,bp
jle vline_2_0
sub si,dx

vline_3_0:
dec cx
jz vline_done
mov al,[di+bx]
and al,0xfc
or al,0x02
mov [di+bx],al
add si,bp
jle vline_3_1
sub si,dx
loop vline_0_1_nocheckdone
jmp vline_done

DI here points to the next byte to write in the second bank (odd scanlines). The stosb happens on the odd scanlines, so then we adjust by 79 afterwards to move down to the next pair of lines. BX is 0xe000 (-8192) so we add that on to the effective address to get to the even scanlines, and there is no adjustment necessary. This saves one bus cycle per pixel on average by my reckoning, so should be a win. Though admittedly I haven't tried applying this to your version of the code - it might be more difficult to get it to work there.

Reply 33 of 85, by wbhart

Rank: Newbie

Something weird is going on. The timing that was 17.3s last night is 19.7s this morning. I do have a dodgy trimpot in my PC, and my understanding is it changes the clock frequency slightly, to alter the colours in NTSC output. But I don't know whether it could make that much difference. I'm quite puzzled, as I timed it three times last night to make sure.

Anyhow, your trick does work and it is a speedup of about 0.5s overall. I was simply being misled by the timings being different today from what I reported yesterday. I just checked out the commit from last night to test this.

But note that mov [di+bx], al actually emits three assembly instructions for a total of 4 bytes, instead of one byte for stosb. So it's more expensive than it looks.

YouTube Channel - PCRetroTech

Reply 34 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

Something weird is going on. The timing that was 17.3s last night is 19.7s this morning. I do have a dodgy trimpot in my PC, and my understanding is it changes the clock frequency slightly, to alter the colours in NTSC output. But I don't know whether it could make that much difference. I'm quite puzzled, as I timed it three times last night to make sure.

The trimpot won't make any difference at all for a 4.77MHz machine since the crystal it adjusts clocks both the CPU and the PIT that you are using for timing. Speeding both up by the same amount won't affect the reported value. But even if the CPU is clocked by a different crystal the adjustment is tiny, not 10%+. There may be variations in measurement due to the phases of different clocks but again this seems very high for that sort of effect. Could you have been compiling the wrong version of the code, or running an out of date executable?

wbhart wrote:

But note that mov [di+bx], al actually emits three assembly instructions for a total of 4 bytes, instead of one byte for stosb. So it's more expensive than it looks.

"mov [di+bx],al" should be a single two-byte instruction (88 01).

Reply 35 of 85, by wbhart

Rank: Newbie
reenigne wrote:
wbhart wrote:

Something weird is going on. The timing that was 17.3s last night is 19.7s this morning. I do have a dodgy trimpot in my PC, and my understanding is it changes the clock frequency slightly, to alter the colours in NTSC output. But I don't know whether it could make that much difference. I'm quite puzzled, as I timed it three times last night to make sure.

The trimpot won't make any difference at all for a 4.77MHz machine since the crystal it adjusts clocks both the CPU and the PIT that you are using for timing. Speeding both up by the same amount won't affect the reported value. But even if the CPU is clocked by a different crystal the adjustment is tiny, not 10%+. There may be variations in measurement due to the phases of different clocks but again this seems very high for that sort of effect. Could you have been compiling the wrong version of the code, or running an out of date executable?

Yeah it's not the trimpot, I checked. I'm not timing using the PIT, but just a stopwatch. The timings can be out by +/-0.1s this way, but not 2.4s!

I really don't have any explanation for it. Nothing makes sense, because one sees the output on the screen, so it isn't as if I could be running the wrong executable. It's the first time I had it working at all, so I don't see how it could have been the wrong code. It's mystifying.

reenigne wrote:
wbhart wrote:

But note that mov [di+bx], al actually emits three assembly instructions for a total of 4 bytes, instead of one byte for stosb. So it's more expensive than it looks.

"mov [di+bx],al" should be a single two-byte instruction (88 01).

Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.

Now I don't have a clue why I had to add 3 bytes to my computed jumps after changing from stosb to this. I must have messed something up somewhere or other.

YouTube Channel - PCRetroTech

Reply 36 of 85, by wbhart

Rank: Newbie

I implemented your [bx+di] trick again. No problems with byte counting this time. I don't know what I did wrong before.

Anyhow, it is at 18.2s now, or ~108 cycles/pixel.

I don't see any obvious improvements now. I think my version uses one less instruction per pixel on average than yours, but I didn't count bus cycles. If you think it will be faster, I'll try implementing it at some point.

YouTube Channel - PCRetroTech

Reply 37 of 85, by wbhart

Rank: Newbie

As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.

What's not clear to me is at what point during execution of a memory access instruction the CPU will make the request on the bus. It seems to be relevant because the number of cycles the memory access instructions take is not the same. Thus whether the wait state occurs at the beginning or end of the instruction will affect the synchronization.

YouTube Channel - PCRetroTech

Reply 38 of 85, by reenigne

Rank: Oldbie
wbhart wrote:

I really don't have any explanation for it. Nothing makes sense, because one sees the output on the screen, so it isn't as if I could be running the wrong executable. It's the first time I had it working at all, so I don't see how it could have been the wrong code. It's mystifying.

Maybe the random number generator that you are using makes longer lines with some seeds than with others (the LCG generators that are normally used for random() do have problems like this sometimes).

wbhart wrote:

Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.

The direction flag controls whether DI is incremented or decremented by STOSB (and similarly for other string instructions). Normally it's left clear so that DI is incremented.
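
A minimal illustration:

cld                      ; DF = 0: string instructions auto-increment
mov al, 0c0h
stosb                    ; AL -> ES:[DI], then DI = DI + 1

std                      ; DF = 1: string instructions auto-decrement
stosb                    ; AL -> ES:[DI], then DI = DI - 1
cld                      ; restore the usual convention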

wbhart wrote:

I don't see any obvious improvements now. I think my version uses one less instruction per pixel on average than yours,

I was sceptical but your trick of only checking the iteration count every other scanline and fixing up the last pixel at the end could very well do it - impressive!

wbhart wrote:

As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.

Perhaps! Though the programmer has very little control over this. About all you can do, I think, is to try rearranging things and see if it makes it faster. And because the routine is already quite highly constrained, about the only rearrangement that I can see is moving the "add dx,bp" lines above the "mov [bx+di],al" lines, the "stosb" lines, or both.

wbhart wrote:

What's not clear to me is at what point during execution of a memory access instruction the CPU will make the request on the bus. It seems to be relevant because the number of cycles the memory access instructions take is not the same. Thus whether the wait state occurs at the beginning or end of the instruction will affect the synchronization.

I can make some cycle-by-cycle traces if you like. But I'm not sure how enlightening they will be. The cycle that you consider the "start" of an instruction or the start of a bus cycle is a bit arbitrary. And the rules are annoyingly complex, and sometimes get broken by DRAM refresh DMAs anyway.

Reply 39 of 85, by wbhart

Rank: Newbie
reenigne wrote:
wbhart wrote:

I really don't have any explanation for it. Nothing makes sense, because one sees the output on the screen, so it isn't as if I could be running the wrong executable. It's the first time I had it working at all, so I don't see how it could have been the wrong code. It's mystifying.

Maybe the random number generator that you are using makes longer lines with some seeds than with others (the LCG generators that are normally used for random() do have problems like this sometimes).

I'm not drawing random lines, unfortunately. I am still mystified by this. I've been over all the possibilities I can think of and still don't see how it's possible. The best I can come up with is that I timed it at 19.7s three times in a row, noted with satisfaction that this is faster than the horizontalish lines (in absolute time), which it isn't, but only because the verticalish lines are 200 pixels long and the horizontalish ones 320, and then forgot the number I had measured three times while I walked down the hallway to type it into this forum. I can just about exclude every other possibility! I am really genuinely mystified. I mean, I checked it three times!

reenigne wrote:
wbhart wrote:

Oh I'm totally wrong! The 8086 book uses the term "direction flag", which was completely misleading me.

The direction flag controls whether DI is incremented or decremented by STOSB (and similarly for other string instructions). Normally it's left clear so that DI is incremented.

Sure, which is why it totally shouldn't be used to describe mov mem/reg, mem/reg. It had me believing that reg would be on the left if the direction flag was 0 and on the right if 1. But that is not what they mean at all. There is a bit in the instruction encoding itself that they are calling the direction flag. It actually says, "d is the direction flag. If d = 0...." To compound things, I was unable to figure out why the number of bytes in my computed jump was off by two. I eventually decided that it must be because the assembler emits a std before the mov and a cld after the mov. Of course this is totally illogical, but it fit all the evidence I had at the time. I should have checked more carefully, e.g. by disassembly.

reenigne wrote:
wbhart wrote:

As the verticalish lines have to do precisely two CGA memory accesses per pixel, it seems like it could be a candidate for syncing with the CGA clock to avoid CGA wait states.

Perhaps! Though the programmer has very little control over this. About all you can do, I think, is to try rearranging things and see if it makes it faster. And because the routine is already quite highly constrained, about the only rearrangement that I can see is moving the "add dx,bp" lines above the "mov [bx+di],al" lines, the "stosb" lines, or both.

Yes, I will certainly try some superoptimisation. It didn't seem to affect the horizontalish lines at all and aligning the loop targets to 16 bit boundaries also does nothing for horizontalish or verticalish lines. Presumably the prefetch is already full and this isn't a problem at this point, even with the additional bus access.

I note that the number of bus accesses is probably much lower than the number of cpu cycles required for the instructions in this case. I'm also not sure I understand when I should add bus access times to the instruction timings. Is that only if there is a holdup due to too many bus accesses? Or does the CPU always incur these costs regardless? If the latter, then I am having trouble figuring out the timings, as it seems to be entirely accounted for by CPU instruction timings! We are using some pretty hefty instructions.

reenigne wrote:
wbhart wrote:

What's not clear to me is at what point during execution of a memory access instruction the CPU will make the request on the bus. It seems to be relevant because the number of cycles the memory access instructions take is not the same. Thus whether the wait state occurs at the beginning or end of the instruction will affect the synchronization.

I can make some cycle-by-cycle traces if you like. But I'm not sure how enlightening they will be. The cycle that you consider the "start" of an instruction or the start of a bus cycle is a bit arbitrary. And the rules are annoyingly complex, and sometimes get broken by DRAM refresh DMAs anyway.

Logic dictates it must be towards the end of the instruction, as EA needs to be computed and so on. But it does seem like writes can take a cycle less than reads from memory, if the mov mem/reg, mem/reg timings are anything to go by.

I was thinking that a 5/5/6 cycle cadence would be optimal for CGA accesses based on the wait state information in your blog. That seems to imply that if the code is timed to request data from the CGA memory after 5, 10 or 15 cycles, that would be optimal, given that we cannot insert an extra cycle every 16 cycles (and DMA refresh probably ruins optimality anyway).
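
If I've understood the clocks correctly, the 5/5/6 cadence falls straight out of the two clocks being derived from the same 14.31818 MHz crystal:

CPU clock = 14.31818 MHz / 3 = ~4.7727 MHz
CGA character clock (lchar) = 14.31818 MHz / 16 = ~0.8949 MHz
one lchar = 16/3 = ~5.33 CPU cycles, so three lchars = exactly 16 CPU cycles,
giving one CGA access opportunity every 5, 5, 6 CPU cycles in turn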

By the way, I've seen people talking about turning off DMA refresh. But how does one's program code and data get refreshed if you do this? Or is it possible to selectively turn off DMA refresh for certain segments? (Not that this is a good piece of code to do this in. I am just curious.)

YouTube Channel - PCRetroTech