VOGONS

First post, by PlaneVuki

Rank Member

Hi!

My knowledge of assembly is weak-to-medium.

This started as a simple image downscale attempt, but I will ask the question in general terms.

I am trying to copy data from one location to another, but neither the source nor the destination is consecutive. Like this:

[Attached image: asm1.png]

My code is this:

************************
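mov cx,4Dh                 ; loop count: 4Dh = 77 decimal
mov ax,destination_segment
mov es,ax                  ; ES:DI -> destination
mov di,destination_offset
mov ax,source_segment
mov ds,ax                  ; DS:BX -> source
mov bx,source_offset

dostuff:
mov al,[bx]                ; read one source byte
mov [es:di],al             ; write it to the destination
add bx,3                   ; source stride: 3 bytes
add di,5                   ; destination stride: 5 bytes
loop dostuff               ; CX = CX - 1, repeat while CX <> 0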
************************

Is this the fastest way to do the copying (not considering loop unrolling)?
What faster ways exist? What other improvements can be made?
If the destination is consecutive, I can use inc di instead of add di,5. Is any other improvement available in that case?

Thanks in advance.

Reply 1 of 12, by Disruptor

Rank Oldbie

Why not use movsb?

************************
mov cx,4Dh ; 4Dh = 77 decimal
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov si,source_offset

dostuff:
movsb
add si,2
add di,4
loop dostuff
************************

Reply 2 of 12, by Deano

Rank Newbie

As Disruptor said, use the string instructions, with a rep prefix where possible. rep movsw is effectively:
copy:
mov ax,[ds:si]
mov [es:di],ax
add si,2
add di,2
dec cx
jnz copy

The main blocker is when you need to advance an index register by something other than 1 or 2 (the b or w suffix). In that case it's better, if possible, to keep one side (source or destination) packed and add the extra amount to the other index register:
copy:
movsw
add di,3
loop copy

Last edited by Deano on 2024-01-24, 11:22. Edited 1 time in total.

Game dev since last century

Reply 3 of 12, by Deano

Rank Newbie

Even on the 8088 it's best to use words rather than bytes if you can, as it saves instruction decode time by executing half as many instructions (obviously the 8088 can only load/store 8 bits at a time, but every saving is worth it).

Game dev since last century

Reply 5 of 12, by BloodyCactus

Rank Oldbie

Just put a CLD in front of your loop if you use MOVSx, so there's no chance of the copy going backwards!

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 6 of 12, by Deano

Rank Newbie
Disruptor wrote on 2024-01-24, 11:58:

With word copying you would overwrite memory, which seems not to be intended by PlaneVuki.

Yes, of course, but he also mentioned making the destination consecutive (a single inc of di), in which case it could possibly be set up for word writes as well. If not, bytes will work, of course.

Game dev since last century

Reply 7 of 12, by Disruptor

Rank Oldbie

When he copies from a non-consecutive source to a consecutive destination, he does not need to increment di at all, because movsb already does that.
I don't know what improvement you'd get from word writes. They would most likely have unwanted side effects.
Code optimization differs for each CPU type. On an 8088 you basically prefer smaller code, because filling the prefetch queue costs so many clocks. On modern CPUs write combining will help.
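
A minimal sketch of the consecutive-destination case, for illustration only. NASM-style syntax is an assumption (the thread does not name an assembler), and the segment and offset names are the placeholders from the first post. The cld follows BloodyCactus's advice.

```asm
        cld                         ; DF = 0, so movsb increments SI and DI
        mov cx,77                   ; number of bytes to copy
        mov ax,destination_segment  ; placeholder from the first post
        mov es,ax
        mov di,destination_offset   ; placeholder from the first post
        mov ax,source_segment       ; placeholder from the first post
        mov ds,ax
        mov si,source_offset        ; placeholder from the first post

dostuff:
        movsb                       ; copy [ds:si] to [es:di], then SI+1 and DI+1
        add si,2                    ; source stride 3 = 1 (from movsb) + 2
        loop dostuff                ; CX-1, repeat until CX = 0
```

Compared with the strided-destination loop, the only change is that the add on di disappears, which also makes the loop body 3 bytes shorter.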

Reply 8 of 12, by mkarcher

Rank l33t
Disruptor wrote on 2024-01-24, 10:50:
Why not use movsb?

************************
mov cx,4Dh ; 4Dh = 77 decimal
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov si,source_offset

dostuff:
movsb
add si,2
add di,4
loop dostuff
************************

This looks like the best solution so far that meets the constraint of only writing single bytes into the output range. You can still improve it slightly. As already pointed out in this thread, the main bottleneck of the 8088 is fetching instructions. The ADD instructions in this suggestion take 3 bytes each (one opcode byte, one ModR/M byte, one byte for the sign-extended immediate). You can get down to two bytes per ADD instruction by loading 2 and 4 into registers before entering the loop. This makes the loop 2 bytes shorter, which should improve performance:

mov cx,4Dh ; 4Dh = 77 decimal
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov si,source_offset
mov ax,2
mov bx,4

dostuff:
movsb
add si,ax
add di,bx
loop dostuff

This code is obviously longer than the code proposed by Disruptor (2 bytes less in the loop body, 6 bytes extra in the loop setup), so it's only worth it if the loop is executed about 5 or more times. 77 iterations should definitely be enough to make this suggestion worthwhile if performance is more important than total code size.
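
To make the byte counting concrete, here is a sketch of both loop bodies with their encodings in the comments. The hex bytes are the standard 8086 encodings as an assembler such as NASM would typically emit them (an assumption; another assembler may pick an equivalent same-size form for the register ADDs). The label dostuff2 is made up here so both loops can sit in one listing. The counts are instruction bytes only, not cycle timings.

```asm
; Immediate version: 9 bytes fetched per iteration
dostuff:
        movsb                   ; A4          1 byte
        add si,2                ; 83 C6 02    3 bytes
        add di,4                ; 83 C7 04    3 bytes
        loop dostuff            ; E2 F7       2 bytes

; Register version: 6 extra bytes of setup, 7 bytes fetched per iteration
        mov ax,2                ; B8 02 00    3 bytes
        mov bx,4                ; BB 04 00    3 bytes
dostuff2:
        movsb                   ; A4          1 byte
        add si,ax               ; 01 C6       2 bytes
        add di,bx               ; 01 DF       2 bytes
        loop dostuff2           ; E2 F9       2 bytes
```

Counting fetched bytes alone, the two versions tie at 3 iterations (27 bytes each) and the register version is ahead from 4 iterations on. The two extra MOVs also take time to execute, so the real break-even point is likely a little later.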

Reply 9 of 12, by PlaneVuki

Rank Member

Thank you all!

A few follow-up questions:

1) What improvements can be made if I am aiming for a V20, 286 or 386SX (cacheless motherboard)? (Still real mode and no 32-bit code.)
2) Do RAM response times make a difference to which code performs fastest?

Reply 10 of 12, by Deano

Rank Newbie

Instruction prefetch becomes less of a bottleneck as you move up the CPU line.

IIRC rep movsb/w is basically the fastest across them all (and even on later CPUs). 16-bit moves become even more desirable.

I can't think of any of the newer instructions that would help in this case...
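
For the case where both source and destination are consecutive, a common idiom is sketched below. It is not taken from this thread, NASM-style syntax is assumed, and it expects DS:SI and ES:DI to be set up already as in the earlier listings. It copies words with rep movsw and then picks up the last byte if the count was odd.

```asm
        cld                     ; DF = 0: SI and DI count upward
        mov cx,77               ; byte count
        shr cx,1                ; CX = word count, CF = 1 if the byte count was odd
        rep movsw               ; copy CX words; leaves CX = 0 and the flags untouched
        adc cx,cx               ; CX = 0 + 0 + CF, so 1 if one byte is left over
        rep movsb               ; copy that last byte, or nothing when CX = 0
```

This only applies when every byte in the range is wanted. With a strided source or destination, word moves would overwrite or skip bytes, as discussed earlier in the thread.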

Game dev since last century

Reply 11 of 12, by FreddyV

Rank Oldbie
PlaneVuki wrote on 2024-01-25, 08:47:
Thank you all!

A few follow-up questions:

1) What improvements can be made if I am aiming for a V20, 286 or 386SX (cacheless motherboard)? (Still real mode and no 32-bit code.)
2) Do RAM response times make a difference to which code performs fastest?

Hi,
You can try with and without a nop before the copy loop: code alignment helps on the 8086 and will not slow down the 8088.

No V20 instruction can help there.
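
A sketch of the alignment experiment, using mkarcher's register version of the loop. It assumes NASM, whose align directive pads code with NOP by default, and it assumes the code is loaded at an even address (for example a .COM file at offset 100h); with another assembler, a hand-placed nop does the same job.

```asm
        mov ax,2                ; stride adjustments, as in the register version
        mov bx,4
        align 2                 ; emits one NOP if the next address would be odd
dostuff:
        movsb
        add si,ax
        add di,bx
        loop dostuff
```

Timing the loop with and without the align line is the simplest way to see whether it matters on a given machine.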

Reply 12 of 12, by GloriousCow

Rank Member
Deano wrote on 2024-01-24, 11:05:

Even on the 8088 it's best to use words rather than bytes if you can, as it saves instruction decode time by executing half as many instructions (obviously the 8088 can only load/store 8 bits at a time, but every saving is worth it).

Not to nitpick, but rep movsb is only decoded once (assuming no interrupts occur). The REP prefix sets an internal flag which the microcode of movs uses as a branch condition to enable looping. movsw is still faster than movsb on the 8088, by a larger margin than you'd expect given the 8-bit bus limitation, but that has more to do with reduced microcode loop overhead (decrementing cx and jumping half as often) and with improved bus pipelining of word transfers.

EDIT: I did some measurements. Ignoring DMA, movsw is about 36% faster than movsb on the 8088.

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc