8088 assembly help \ VOGONS

8088 assembly help

First post, by PlaneVuki

Posted on 2024-01-24, 10:25

Rank: Member
Posts: 160
Joined: 2020-03-28, 14:34

Hi!

My knowledge of assembly is weak-to-medium.

This started as a simple image downscale attempt, but I will ask in general.

I am trying to copy data from one location to another, but neither the source nor the destination is consecutive. Like this:

The attachment asm1.png is no longer available

My code is this:

************************
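mov cx,4Dh ;77 is 4D in hex
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov bx,source_offset

dostuff:
mov al,[bx] ;read one byte from ds:bx
mov [es:di],al ;write it to es:di
add bx,3 ;next source byte is 3 further on
add di,5 ;next destination byte is 5 further on
loop dostuff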
************************

Is this the fastest method to do the copying (not considering loop unrolling)?
What faster ways exist? What other improvements can be made?
If the destination is consecutive, I can do (inc di) instead of (add di,5). Is any other improvement available in this case?

Thanks in advance.

Reply 1 of 12, by Disruptor

Posted on 2024-01-24, 10:50

Rank: Oldbie
Posts: 1931
Joined: 2018-03-22, 18:31
Location: European Union

Why not use movsb?

************************
mov cx,4Dh ;77 is 4D in hex
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov si,source_offset

dostuff:
movsb
add si,2
add di,4
loop dostuff
************************

Reply 2 of 12, by Deano

Posted on 2024-01-24, 11:03

Rank: Newbie
Posts: 90
Joined: 2023-11-08, 15:59

As Disruptor said, use the string instructions with the rep prefix where possible. rep movsw is effectively:
copy:
mov ax,[si] ;read a word from ds:si
mov [es:di],ax ;write it to es:di
add si,2
add di,2
dec cx
jnz copy

The main blocker is when you need to advance an index register by something other than 1 or 2 (the b or w suffix). In that case it's better, if possible, to have one side (source or dest) packed and then add the extra to the other index:
next:
movsw
add di,3
loop next

Last edited by Deano on 2024-01-24, 11:22. Edited 1 time in total.

Game dev since last century

Reply 3 of 12, by Deano

Posted on 2024-01-24, 11:05

Rank: Newbie
Posts: 90
Joined: 2023-11-08, 15:59

Even on the 8088 it's best to use word versus byte if you can, as it saves instruction decode time by just doing half as many (obviously the 8088 can only load/store 8 bits at a time, but every saving is worth it).

Game dev since last century

Reply 4 of 12, by Disruptor

Posted on 2024-01-24, 11:58

Rank: Oldbie
Posts: 1931
Joined: 2018-03-22, 18:31
Location: European Union

With word copying you would overwrite memory, which seems not to be intended by PlaneVuki.

Reply 5 of 12, by BloodyCactus

Posted on 2024-01-24, 13:29

Rank: Oldbie
Posts: 1588
Joined: 2016-02-03, 13:34
Location: Lexington VA

Just put a CLD in front of your loop if you use MOVSx, to remove any chance you'd go backwards!
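
A minimal sketch of where the CLD would go, assuming NASM-style syntax and that cx, ds:si and es:di are already set up as in the earlier posts (the strides of 3 and 5 come from the first post):

```nasm
cld             ;DF=0, so movsb steps si and di upward
dostuff:
movsb           ;copy byte ds:si -> es:di, then si+1 and di+1
add si,2        ;source stride is 3 in total
add di,4        ;destination stride is 5 in total
loop dostuff
```

The CLD sits outside the loop, so it costs one byte and runs once.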

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 6 of 12, by Deano

Posted on 2024-01-24, 16:28

Rank: Newbie
Posts: 90
Joined: 2023-11-08, 15:59

Disruptor wrote on 2024-01-24, 11:58:

With word copying you would overwrite memory, which seems not to be intended by PlaneVuki.

Yes, of course, but he also mentioned making the destination consecutive (di advancing by a single inc), in which case he could possibly set it up for word writes too? If not, bytes will work of course.

Game dev since last century

Reply 7 of 12, by Disruptor

Posted on 2024-01-24, 17:05

Rank: Oldbie
Posts: 1931
Joined: 2018-03-22, 18:31
Location: European Union

When he copies a non-consecutive source to a consecutive destination, he does not need an increment on di at all, because movsb already does it.
I don't know what improvement you'd get with word writes. They will most likely have unwanted side effects.
Optimizing code differs for each CPU type. On the 8088 itself you basically prefer smaller code, because filling the prefetch queue costs so many clocks. On modern CPUs write combining will help.
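
For the consecutive-destination case, a minimal sketch (NASM-style syntax assumed; cx, ds:si and es:di set up as in the earlier posts, source stride of 3 taken from the first post):

```nasm
cld             ;make sure movsb counts upward
dostuff:
movsb           ;copy one byte; si and di each advance by 1
add si,2        ;skip 2 more source bytes (stride 3 in total)
loop dostuff    ;di needs no extra add, the output is packed
```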

Reply 8 of 12, by mkarcher

Posted on 2024-01-24, 21:25

Rank: l33t
Posts: 3814
Joined: 2019-01-19, 16:29
Location: Germany

Disruptor wrote on 2024-01-24, 10:50:

Why not use movsb?

************************
mov cx,4Dh ;77 is 4D in hex
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov si,source_offset

dostuff:
movsb
add si,2
add di,4
loop dostuff
************************

This looks like the best solution so far that meets the constraint of writing only single bytes into the output range. You can still improve it slightly. As already pointed out in this thread, the main bottleneck of the 8088 is fetching instructions. The ADD instructions in this suggestion take 3 bytes each (one opcode byte, one ModR/M byte, one byte for the sign-extended immediate). You can get down to two bytes per ADD instruction by loading 2 and 4 into registers before entering the loop. This makes the loop 2 bytes shorter, which should improve performance:

mov cx,4Dh ;77 is 4D in hex
mov ax, destination_segment
mov es,ax
mov di,destination_offset
mov ax, source_segment
mov ds,ax
mov si,source_offset
mov ax,2 ;extra source step, kept in a register
mov bx,4 ;extra destination step, kept in a register

dostuff:
movsb
add si,ax ;2-byte encoding instead of 3
add di,bx
loop dostuff

This code is obviously longer than the code proposed by Disruptor (2 bytes less in the loop body, 6 bytes extra in the loop setup), so it's only worth it if the loop is executed like 5 or more times. 77 times should definitely suffice to make this suggestion worth it if performance is more important than total code size.

Reply 9 of 12, by PlaneVuki

Posted on 2024-01-25, 08:47

Rank: Member
Posts: 160
Joined: 2020-03-28, 14:34

Thank you all!

A few follow-up questions:

1) What improvements can be made if I were aiming for a V20, 286 or 386SX (cacheless motherboard)? (Still real mode and no 32-bit code.)
2) Do RAM response times make a difference to which code performs fastest?

Reply 10 of 12, by Deano

Posted on 2024-01-25, 09:11

Rank: Newbie
Posts: 90
Joined: 2023-11-08, 15:59

The instruction prefetch becomes less important as you go up the CPU lines.

IIRC rep movsb/w is basically the fastest across them all (and even higher CPUs). 16 bit moves will become even more desirable.

Can't think of any of the extra instructions that would help in this case...

Game dev since last century

Reply 11 of 12, by FreddyV

Posted on 2024-01-25, 09:28

Rank: Oldbie
Posts: 874
Joined: 2019-04-08, 11:58

PlaneVuki wrote on 2024-01-25, 08:47:

Thank you all!

A few follow-up questions:

1) What improvements can be made if I were aiming for a V20, 286 or 386SX (cacheless motherboard)? (Still real mode and no 32-bit code.)
2) Do RAM response times make a difference to which code performs fastest?

Hi,
You can try with and without a nop before the copy loop; code alignment helps on the 8086 and will not slow down the 8088.

No V20 instruction can help there
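
A rough sketch of the alignment idea, assuming NASM, whose align directive pads with nop by default, and assuming the code is loaded at an even offset so that section-relative alignment matches the real address:

```nasm
align 2         ;insert a nop if needed so dostuff lands on an even address
dostuff:
movsb
add si,2
add di,4
loop dostuff
```

Timing it both ways, as suggested, is the only way to know whether it helps on a given machine.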

Reply 12 of 12, by GloriousCow

Posted on 2024-02-09, 22:20

Rank: Oldbie
Posts: 576
Joined: 2022-09-12, 20:00

Deano wrote on 2024-01-24, 11:05:

Even on the 8088 it's best to use word versus byte if you can, as it saves instruction decode time by just doing half as many (obviously the 8088 can only load/store 8 bits at a time, but every saving is worth it).

Not to nitpick, but rep movsb is only decoded once (assuming no interrupts occur). The REP prefix sets an internal flag which the microcode of movs uses as a branch condition to enable looping. movsw is still faster than movsb on 8088 by a margin larger than you'd expect with the 8-bit bus limitation, but the details are more to do with decreased microcode loop overhead (decrementing cx and jumping half as much) as well as improved bus pipelining of word transfers.

EDIT: did some measurements. Ignoring DMA, movsw is about 36% faster than movsb on 8088
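
For a fully consecutive block (not the strided case from the first post), a common way to get the movsw benefit with an arbitrary byte count looks roughly like this. It is a sketch that assumes NASM-style syntax, cx holding the byte count on entry, and ds:si / es:di already pointing at source and destination:

```nasm
cld             ;copy upward
shr cx,1        ;cx = number of words, CF = 1 if the byte count was odd
rep movsw       ;bulk copy as words (string moves leave the flags alone)
adc cx,cx       ;cx is 0 after rep, so cx becomes CF (0 or 1)
rep movsb       ;copy the one trailing byte, if any
```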

MartyPC: A cycle-accurate IBM PC/XT emulator | https://github.com/dbalsom/martypc
