VOGONS


SVGA/VESA programming


Reply 20 of 59, by WhiteFalcon

Falcosoft wrote on 2022-06-20, 16:17:
WhiteFalcon wrote on 2022-06-20, 11:03:

Neat! Thank you for the tips and examples, I will try it on my real PC when I get the chance again, and also here in DOSBox. As you apparently have experience with this, how do you go about using it for double buffering? I mean you probably set the start offset from 0 to 307200 and back (for 640x480), and it should be possible to use a variation of the method I mentioned in the first post, storing the background and restoring it after the flip. But how about the banks, does the second screen start on a bank boundary, i.e. 307200? Or do you need to just consider it all as one piece of memory, which will make the bank calculation for PutImage/Sprite a bit more complicated? I am somewhat afraid the latter is the case, which will again put it a bit over my head. Nevertheless it would make it completely flicker free.

Hi,
Actually selecting banks/windows is simpler than you think. You should not calculate addresses all the time. There is VESA function 05h, Display Window Control. Notice that 'Window' is a different concept from multiple display pages. It is used only in banked video modes, to overcome the 64K limit of the real mode video segment. This function has to be used in every banked mode that requires more than 64K to present the full resolution, regardless of whether virtual pages are used or not.
Back to the point: VESA function 05h needs to be used only at the lowest level, in your Putpixel (or similar) routine. Namely:

 
mov ax, 0a000h
mov es, ax ; es holds the video segment all the time
...
...
; Assumes a Window granularity of 64K, so the Window number is the high word of the linear address.
PUTPIXEL PROC ; on entry dx contains the current Window number (0 at start)
push dx
mov ax, word ptr[Y_coordinate]
mul word ptr [modeinfo.BytesPerScanLine] ; dx:ax = start of the scanline in bytes. If you use Function 06h to set a different virtual width, use the value returned in BX instead.
mov di, ax
add di, word ptr[X_coordinate]
adc dx,0 ; a carry means the pixel is already in the next Window
pop cx ; cx = Window number on entry
cmp dx,cx
je @NOWINDOWCHANGE
mov ax,4f05h ; Window number = dx, and it is already in dx
xor bx, bx ; bh=function(0) bl=Window A(0)
int 10h
@NOWINDOWCHANGE:
mov al, byte ptr[Color]
stosb ; store al at address ES:DI
ret ; dx is left holding the Window number for the next call
PUTPIXEL ENDP

The above code should work regardless of whether you set the virtual display width to 2x the screen width for double buffering or use 2x the screen height (2x scan lines).
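For readers more at home in C, the same idea can be sketched as a small simulation. This is only a sketch under assumptions: the window size and granularity are taken to be 64K, `BankedScreen`, `set_window` and `put_pixel` are made-up names, and `set_window` merely counts calls where a real program would issue INT 10h with AX=4F05h.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_BYTES 0x10000UL  /* assumption: 64K window size and granularity */

/* Hypothetical stand-in for the video card: plain memory plus a window register. */
typedef struct {
    uint8_t *vram;        /* simulated video memory */
    uint32_t vram_bytes;  /* size of the simulated memory */
    uint16_t pitch;       /* bytes per scan line (BytesPerScanLine) */
    uint16_t cur_window;  /* window currently mapped at A000h */
    unsigned switches;    /* how many times the window was changed */
} BankedScreen;

/* A real program would call VBE function 05h here; the sketch only records it. */
static void set_window(BankedScreen *s, uint16_t window)
{
    s->cur_window = window;
    s->switches++;
}

/* 8-bit putpixel: compute the linear address once, split it, switch only on change. */
void put_pixel(BankedScreen *s, uint16_t x, uint16_t y, uint8_t color)
{
    uint32_t linear = (uint32_t)y * s->pitch + x;    /* what MUL + ADD/ADC leave in DX:DI */
    uint16_t window = (uint16_t)(linear >> 16);      /* high word = window number */
    uint16_t offset = (uint16_t)(linear & 0xFFFFu);  /* low word = offset inside the window */

    assert(linear < s->vram_bytes);
    if (window != s->cur_window)
        set_window(s, window);
    s->vram[(uint32_t)s->cur_window * WINDOW_BYTES + offset] = color;
}
```

Drawing many pixels on the same or neighbouring scan lines then costs a window change only when the high word of the address actually differs from the cached one.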
BTW there is a description in the VBE 3.0 documentation of how triple buffering works. The idea is the same for double buffering, it is just simpler:


Using Hardware Triple Buffering
Hardware triple buffering is supported in the VBE/Core specification by allowing the application
to schedule a display start address change, and then providing a function that can be called to
determine the status of the last scheduled display start address change. VBE Function 4F07h is
used for this, along with subfunctions 02h and 04h. To implement hardware triple buffering you
would use the following steps:
1. Display from the first visible buffer, and render to the second hidden buffer.
2. Schedule a display start address change for the second hidden buffer that you just
finished rendering (4F07h, 02h), and start rendering immediately to the third hidden
buffer. CRT controller is currently displaying from the first buffer.
3. Before scheduling the display start address change for the third hidden buffer, wait
until the last scheduled change has occurred (4F07h, 04h returns not 0). Schedule
display start address change for the third hidden buffer and immediately begin
rendering to the first hidden buffer. CRT controller is currently displaying from the
second buffer.
4. Repeat step 3 over and over cycling through each of the buffers.
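The bookkeeping behind these steps can be sketched in C as follows. This is an illustration only: `TripleBuffer`, `tb_init`, `tb_retrace` and `tb_frame_done` are hypothetical names, and `tb_retrace` stands in for the hardware picking up the scheduled start address, which real code would detect by polling 4F07h subfunction 04h.

```c
#include <assert.h>

#define NO_FLIP (-1)

/* Hypothetical bookkeeping for three buffers numbered 0..2. */
typedef struct {
    int displayed;  /* buffer the CRT controller is scanning out */
    int scheduled;  /* buffer of a flip that is still pending, or NO_FLIP */
    int rendering;  /* hidden buffer currently being drawn */
} TripleBuffer;

void tb_init(TripleBuffer *tb)
{
    tb->displayed = 0;         /* step 1: show the first buffer ... */
    tb->scheduled = NO_FLIP;
    tb->rendering = 1;         /* ... and render to the second */
}

/* Stand-in for the hardware latching the scheduled start address. */
void tb_retrace(TripleBuffer *tb)
{
    if (tb->scheduled != NO_FLIP) {
        tb->displayed = tb->scheduled;
        tb->scheduled = NO_FLIP;
    }
}

/* Call when a frame is finished. Returns 0 if the previous flip is still
   pending (the caller has to wait and retry), 1 if the new flip was
   scheduled and rendering moved on to the next hidden buffer. */
int tb_frame_done(TripleBuffer *tb)
{
    if (tb->scheduled != NO_FLIP)
        return 0;
    tb->scheduled = tb->rendering;            /* real code: 4F07h, subfunction 02h */
    tb->rendering = (tb->rendering + 1) % 3;  /* the one buffer that is neither shown nor scheduled */
    assert(tb->rendering != tb->displayed);
    return 1;
}
```

The renderer only has to wait when it finishes two frames within one refresh, which is the advantage over double buffering.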

The point is: for 640x480 double buffering you can either define a 1280x480 virtual screen with the help of Function 06h or use a 640x960 virtual screen.
In the first case your page 1 is from x=0 and your page 2 is from x=640.
In the second case your page 1 is from y=0 and your page 2 is from y=480.
In both cases you can use Function 07h to switch between page 1 and page 2.
While your page 1 is visible you draw on page 2 and vice versa.
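A minimal C sketch of that page bookkeeping, covering both layouts. The names (`PageLayout`, `Origin`, `DoubleBuffer`, `page_origin`, `draw_page`, `flip`) are invented for the illustration; the returned origin is what would be passed to Function 07h as first displayed pixel (CX) and first displayed scan line (DX).

```c
#include <assert.h>

/* SIDE_BY_SIDE = 1280x480 virtual screen, STACKED = 640x960 virtual screen. */
typedef enum { SIDE_BY_SIDE, STACKED } PageLayout;

typedef struct { unsigned x, y; } Origin;

/* Top-left pixel of page 0 or page 1 inside the virtual screen. */
Origin page_origin(PageLayout layout, unsigned page, unsigned width, unsigned height)
{
    Origin o = { 0, 0 };
    assert(page < 2);
    if (layout == SIDE_BY_SIDE)
        o.x = page * width;   /* second page starts at x = 640 */
    else
        o.y = page * height;  /* second page starts at y = 480 */
    return o;
}

typedef struct {
    PageLayout layout;
    unsigned width, height;   /* size of one visible page */
    unsigned visible;         /* 0 or 1: page currently shown */
} DoubleBuffer;

/* The page that is safe to draw on is always the one not being shown. */
unsigned draw_page(const DoubleBuffer *db)
{
    return db->visible ^ 1u;
}

/* Make the finished draw page visible and return the display start
   that would be handed to Function 07h. */
Origin flip(DoubleBuffer *db)
{
    db->visible = draw_page(db);
    return page_origin(db->layout, db->visible, db->width, db->height);
}
```

All drawing routines simply add the origin of `draw_page()` to their coordinates; nothing is copied when the pages are swapped.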

Yes, I have been using function 05 this way for PutPixel, which is similar to yours:

void VESA_PutPixel640(long int x, long int y, unsigned char col)
{
asm {
mov eax, [y]
mov ebx, [y]
shl eax, 7
shl ebx, 9
add eax, ebx
mov ebx, [x]
add eax, ebx
mov cl, 0
}
find_bank:
asm {
cmp eax, 0xFFFF
jle found_bank
sub eax, 0x10000
inc cl
jmp find_bank
}
found_bank:
asm {
cmp cl, [vesa_cur_bank]
je same_bank

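push eax
mov [vesa_cur_bank], cl /* remember the bank being selected; a plain "inc" goes wrong when the bank jumps by more than one or moves backwards */
mov ah, 0x4F
mov al, 0x05
mov bh, 0x00
mov bl, 0x00
xor dx, dx
mov dl, cl
int 0x10
pop eax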
}
same_bank:
asm {
mov di, ax
mov ax, 0xA000
mov es, ax
mov cl, [col]
mov es:[di], cl
}
}

I have a separate PutPixel for 640x, 800x and 1024x (not that I would ever need those) just to avoid that costly "mul" and split it into "shl"s, which requires a constant number to work with.
Later I realized it actually should not be that difficult to use a larger virtual screen, I just couldn't wrap my head around the organization of the memory in that case.
PutPixel is actually not something I will use, as I need blitting routines for images/transparent sprites instead, to avoid calculating the starting coordinates of every pixel. I already have such routines: they calculate only the top-left corner of the image and then keep moving right by one, adding a skip after every line to continue on a new one, keeping a running offset. It's much faster than using a PutPixel routine.
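The running-offset approach described above could look roughly like this in C. It is a sketch, not the poster's actual routine: `Screen` and `put_image` are hypothetical names, video memory is simulated as one linear array, 64K banks are assumed, and colour 0 is arbitrarily chosen as the transparent colour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a banked 8-bit screen; vram is indexed linearly here. */
typedef struct {
    uint8_t *vram;
    uint16_t pitch;       /* bytes per scan line */
    uint16_t cur_window;  /* 64K bank currently selected */
    unsigned switches;    /* number of bank changes so far */
} Screen;

/* Copy a w x h image to (x, y). If transparent is non-zero, colour 0 is skipped. */
void put_image(Screen *s, const uint8_t *img, uint16_t w, uint16_t h,
               uint16_t x, uint16_t y, int transparent)
{
    uint32_t linear = (uint32_t)y * s->pitch + x;  /* the only multiplication */
    uint32_t skip = (uint32_t)s->pitch - w;        /* from the end of one row to the start of the next */
    uint16_t row, col;

    assert((uint32_t)x + w <= s->pitch);
    for (row = 0; row < h; row++) {
        for (col = 0; col < w; col++, linear++) {
            uint8_t c = *img++;
            uint16_t window = (uint16_t)(linear >> 16);
            if (transparent && c == 0)
                continue;                   /* see-through pixel, nothing to write */
            if (window != s->cur_window) {  /* real code: INT 10h, AX=4F05h */
                s->cur_window = window;
                s->switches++;
            }
            s->vram[linear] = c;            /* real code: write to A000:(linear & 0xFFFF) */
        }
        linear += skip;
    }
}
```

Because the offset only ever grows, the bank changes at most once per 64K crossed, no matter how many pixels the image has.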

There are two more things puzzling me:
1) the memory is linear so if the first approach you mention is used, the actual whole line is twice as long as the displayed screen, the other half hidden "to the right" until you use function 07 to switch the display to it. Meaning when you are displaying the first screen, the first line is 0-639 in vram and the second one is NOT 640-1279, but 1280-1919 (still in vram), correct? Hence the gfx card must compose the whole screen of separate 640-byte pieces of its vram. The PutImage routine would have to use modeinfo.BytesPerScanLine to count the next line.
The second approach, let's say 640x960, sounds easier to me as the scanlines are the same as the horizontal resolution (640) and my mind can grasp that easier 😀 BUT, the first screen finishes somewhere in the middle of the fifth bank, can you use function 07 to start the second screen in the middle of a bank, byte-precise? Or do you need to "sacrifice" the rest of the fifth bank and put bank six in the top-left corner? The granularity of bank changing on my S3 Trio is 64K, that much I checked.
2) my head has enough with double buffering, let alone triple buffering 😁 I have always used double buffering in the way that I had one buffer that I cleared every frame, drew everything plus sprites on top and copied it over to 0xA000. That was possible in VGA mode 0x13. This is different as you just switch, is it necessary to use that store background-draw-restore background approach? If so, it would mean I have to draw every change in the background twice, to each buffer? Uff. While sprites on top (the cursor) only once? I can imagine syncing this process could be prone to mistakes. Redrawing everything in the other, not displayed buffer would be too slow and out of the question?

Thank you for your time and for the sample program, I will read through it but probably won't be able to compile it, my BorlandC is very strong-headed when it comes to inline asm. And I don't know how to compile a pure asm file.

Olivetti M4 P75, 32MB RAM, 4GB HDD, SoundBlaster AWE 64, Gravis Ultrasound MAX, Roland SCC-1, Roland MT-32, Roland CM-64
Intel 486DX2/66Mhz, 8MB RAM, VGA Trident 512kB, 264MB HDD Quantum, SoundBlaster 16 Pro (CT2910)

Reply 21 of 59, by WhiteFalcon


Wow, I don't even know many of the asm commands you use there, I have never seen floating point maths in asm before! And the maths part of it is way over my head, but I can see how you draw to the four screens as parts of the large virtual one and then switch to them, this is helpful. Also how you can just sit down and write a program in pure asm is beyond me 😀 I am still a noob to asm, just using the most basic commands. It's very interesting, do you still enjoy programming for DOS too? Or do you just remember it all so vividly you can put out an asm program just like that?


Reply 22 of 59, by Falcosoft

WhiteFalcon wrote on 2022-06-21, 06:05:

I have a separate PutPixel for 640x, 800x and 1024x (not that I would ever need those) just to avoid that costly "mul" and split it into "shl"s, which requires a constant number to work with.

Hi,
I have the strong feeling that your bank finding algorithm is more costly than that simple mul. That mul has the advantage that it gives you the right bank number immediately in dx (in terms of y coordinate/scanline) without using really costly loops and branches.
Mul behaves like an instruction that is specially optimized for this task: you give it the scanline number and the line length in bytes, and it gives you back the exact bank number (dx) and offset (ax) of that line start. Actually in case of 1024x modes you do not have to do anything else to get the proper bank, since 1024*64 is exactly at the segment border.
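To make the claim about mul and about 1024-wide modes concrete, here is a small C sketch. It assumes 64K banks, and the names `WinAddr`, `line_start` and `line_straddles_window` are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t window;  /* what mul leaves in DX */
    uint16_t offset;  /* what mul leaves in AX */
} WinAddr;

/* 16x16 -> 32 bit multiply, split the way the CPU returns it in DX:AX. */
WinAddr line_start(uint16_t y, uint16_t pitch)
{
    uint32_t product = (uint32_t)y * pitch;
    WinAddr a;
    a.window = (uint16_t)(product >> 16);
    a.offset = (uint16_t)(product & 0xFFFFu);
    return a;
}

/* Non-zero if scan line y (pitch bytes long) runs past the end of its 64K bank,
   i.e. if some pixel on it needs the carry correction. */
int line_straddles_window(uint16_t y, uint16_t pitch)
{
    WinAddr a = line_start(y, pitch);
    assert(pitch > 0);
    return (uint32_t)a.offset + pitch > 0x10000UL;
}
```

With a pitch of 1024 every line start is a multiple of 1024, so no line ever straddles a bank and dx from the mul is final; with 1280 some lines do (line 51 is the first), which is why the putpixel routine needs its adc.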

WhiteFalcon wrote on 2022-06-21, 06:05:

....
Later I realized it actually should not be that difficult to use a larger virtual screen, I just couldn't wrap my head around the organization of the memory in that case.

The memory is organized the same way as if your virtual resolution were a real resolution. That is, you should treat it as a real 1280x480 or 640x960 resolution. It's just that simple.

There are two more things puzzling me:
1) the memory is linear so if the first approach you mention is used, the actual whole line is twice as long as the displayed screen, the other half hidden "to the right" until you use function 07 to switch the display to it. Meaning when you are displaying the first screen, the first line is 0-639 in vram and the second one is NOT 640-1279, but 1280-1919 (still in vram), correct?

As I have written above the memory organization is just the same as in case of a real resolution with the same dimensions.
So when you are displaying the first page it is from 0-639px and the second page is from 640-1279px. And the second line on your first page is also 0-639px, the only difference is that your y coordinate is 1 and not 0 as in the case of the first line. There is nothing like 1280-1919 in terms of pixels in this scenario. But it is true that the second line starts at byte offset 1280 in video memory.
Look at the vesaman example from my previous post. In the zip you can also find a compiled com file so you can test it without compiling the code.

Hence the gfx card must compose the whole screen of separate 640-byte pieces of its vram. The PutImage routine would have to use modeinfo.BytesPerScanLine to count the next line.

It's the same thing as if you displayed 2 separate 640x480 bitmaps side by side on a real 1280x480 display. You should treat the situation the exact same way.

The second approach, let's say 640x960, sounds easier to me as the scanlines are the same as the horizontal resolution (640) and my mind can grasp that easier 😀 BUT, the first screen finishes somewhere in the middle of the fifth bank, can you use function 07 to start the second screen in the middle of a bank, byte-precise? Or do you need to "sacrifice" the rest of the fifth bank and put bank six in the top-left corner? The granularity of bank changing on my S3 Trio is 64K, that much I checked.

I have the feeling that you are overcomplicating things here. It works the same way as in a real 640x960 resolution. The Putpixel routine already handles calculating the proper bank/window. The virtual area beyond the 479th scanline is a simple continuation of the previous one. You do not sacrifice anything, the next pixel should be calculated the same way. As I have written, the memory is organized just as if 640x960 were a real resolution. And yes, you can use Function 07 to start displaying 'mid-bank'. Actually you can start displaying at any pixel/offset! (e.g. x=647; y=489). Virtual scrolling works this way: it modifies the offset in one direction by only 1 pixel.
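A short C sketch makes the 'mid-bank' case tangible for the 640x960 layout. It assumes 64K banks and an 8-bit mode with 640 bytes per scan line; `WinAddr`, `split_address` and `page_pixel` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t window; uint16_t offset; } WinAddr;

/* Split a linear video memory address, assuming 64K banks. */
WinAddr split_address(uint32_t linear)
{
    WinAddr a;
    a.window = (uint16_t)(linear >> 16);
    a.offset = (uint16_t)(linear & 0xFFFFu);
    return a;
}

/* Linear address of pixel (x, y) on page 0 or 1 of a vertically stacked
   virtual screen: page 1 is simply page 0 continued page_height lines down. */
uint32_t page_pixel(unsigned page, uint16_t x, uint16_t y,
                    uint16_t pitch, uint16_t page_height)
{
    assert(y < page_height);
    return ((uint32_t)page * page_height + y) * pitch + x;
}
```

With these numbers the last pixel of page 1 (639, 479) lands at offset 45055 of bank 4 (the fifth bank) and the first pixel of page 2 at offset 45056 of the same bank, so the second page starts right behind the first with nothing skipped.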

2) my head has enough with double buffering, let alone triple buffering 😁 I have always used double buffering in the way that I had one buffer that I cleared every frame, drew everything plus sprites on top and copied it over to 0xA000. That was possible in VGA mode 0x13. This is different as you just switch, is it necessary to use that store background-draw-restore background approach? If so, it would mean I have to draw every change in the background twice, to each buffer? Uff. While sprites on top (the cursor) only once? I can imagine syncing this process could be prone to mistakes. Redrawing everything in the other, not displayed buffer would be too slow and out of the question?

The whole point of this hardware assisted double buffering is that you do not have to use expensive memory copies. That means that you should do everything the same way as previously but you do not have to copy your buffer to 0xA000 since you are already "there". More precisely with the help of Function 07 you can place any of your buffers there without copying anything, by just switching 'pointers'. You are working directly with video memory all the time. The difference is just that your 2nd/back buffer is also in video memory and not in system memory.

Thank you for your time and for the sample program, I will read through it but probably won't be able to compile it, my BorlandC is very strong-headed when it comes to inline asm. And I don't know how to compile a pure asm file.

You do not have to compile it, there is a compiled com in the zip file. Just read the code and look at what the program does. The floating-point code part is irrelevant. It's there because I have had this Mandelbrot code from the 90's so I could reuse it. But yes, I enjoy writing x86 assembly for DOS and I even write assembly code for Win32/64 where it makes any sense.

Last edited by Falcosoft on 2022-06-21, 08:48. Edited 3 times in total.

Website, Facebook, Youtube
Falcosoft Soundfont Midi Player + Munt VSTi + BassMidi VSTi
VST Midi Driver Midi Mapper

Reply 23 of 59, by root42


Nice code! Let me try to understand it:

PUTPIXEL PROC
push dx

So the main magic is in this function. It takes the pixel position in variables loop1 (Y-coord) and loop2 (X-coord). As we mangle DX we need to save it. At the moment it is not clear what is in DX to me.

mov ax,1280
mul word ptr[loop1]
mov di,ax
add di,word ptr[loop2]

Here we compute the absolute pixel address by multiplying the Y coord with the width of the virtual screen (1280), loading the result into DI and adding the X coord. This gives us a linear framebuffer coordinate on the card in DX:AX, I think?

adc dx,0

This looks rather clever and probably just "rescues" the carry from the last addition into DX...? I do not fully understand why we do this though.

pop cx
cmp dx,cx
je @NOPAGING

Ok, now we take the stored DX value and load it into CX -- it would really help to have some documentation on the parameters of this proc!!! 😁 The comparison tells us whether we need to page or not, i.e. whether the card is currently banking in the correct space of VRAM.

push bx
mov ax,4f05h ;Window number=dx
mov bx,00h ;bh=function(0) bl=window(0)
int 10h
pop bx

Easy enough, we set the window in the VRAM. So now we at least know that DX currently contains the window number, which is the linear address shr 16! Because it was the upper part of the multiplication! Hah! I knew it! 😁

@NOPAGING:
add bl,byte ptr[color]
mov es:[di],bl
ret
PUTPIXEL ENDP

The rest is straightforward just setting the pixel. However why add and not mov for bl, I do not understand...

YouTube and Bonus
80486DX@33 MHz, 16 MiB RAM, Tseng ET4000 1 MiB, SnarkBarker & GUSar Lite, PC MIDI Card+X2+SC55+MT32, OSSC

Reply 24 of 59, by Falcosoft

root42 wrote on 2022-06-21, 08:26:

So the main magic is in this function. It takes the pixel position in variables loop1 (Y-coord) and loop2 (X-coord). As we mangle DX we need to save it. At the moment it is not clear what is in DX to me.

DX register always contains the actual bank/window number.


Here we compute the absolute pixel address by multiplying the Y coord with the width of the virtual screen (1280), loading the result into DI and adding the X coord. This gives us a linear framebuffer coordinate on the card in DX:AX, I think?

adc dx,0

This looks rather clever and probably just "rescues" the carry from the last addition into DX...? I do not fully understand why we do this though.

You have to do this in 1280px wide mode since the video segment can end in the middle of a line. In case of 1024px wide modes you could omit this since 1024*64 is exactly at the segment border.
A real-world example: we are in 1280x480 mode. Let's pretend we would like to draw on line 51. Line 51 starts at address 51*1280 = 65280. Mul would tell us in dx that we are still in the 1st bank (0).
But we would like to draw the last pixel (1279) of line 51: 65280+1279 = 66559, which is outside of the 1st bank. The add di, x instruction sets the carry flag when it overflows. So it gives us the correct offset in the di register (66559 - 65536 = 1023), and the adc dx, 0 part corrects the bank number to the 2nd bank (1) by adding the carry flag alone to the dx register (which therefore always contains the correct bank number).
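The same arithmetic can be replayed in C by imitating the three instructions with 16-bit values. This is a sketch for checking the numbers only; `Regs` and `putpixel_address` are names invented here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t dx;  /* bank/window number */
    uint16_t di;  /* offset inside the 64K segment */
} Regs;

/* Imitates: mul pitch / mov di,ax / add di,x / adc dx,0 */
Regs putpixel_address(uint16_t x, uint16_t y, uint16_t pitch)
{
    Regs r;
    uint32_t product = (uint32_t)y * pitch;  /* mul: DX:AX = y * pitch */
    uint16_t ax = (uint16_t)product;
    uint32_t sum;
    unsigned carry;

    r.dx = (uint16_t)(product >> 16);
    sum = (uint32_t)ax + x;                  /* add di, x */
    r.di = (uint16_t)sum;                    /* the 16-bit register wraps around */
    carry = (sum > 0xFFFFu);                 /* carry flag set on overflow */
    r.dx = (uint16_t)(r.dx + carry);         /* adc dx, 0 */
    return r;
}
```

For x = 1279 and y = 51 this yields bank 1 and offset 1023, matching the example; for x = 255 on the same line the pixel is still the last byte of bank 0.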


Easy enough, we set the window in the VRAM. So now we at least know that DX currently contains the window number, which is the linear address shr 16! Because it was the upper part of the multiplication! Hah! I knew it! 😁

@NOPAGING:
add bl,byte ptr[color]
mov es:[di],bl
ret
PUTPIXEL ENDP

The rest is straightforward just setting the pixel. However why add and not mov for bl, I do not understand...

Yes, you solved the rest of the puzzle 😀
The add there serves just aesthetic purposes. 'Color' is a constant (-2) that represents an offset in the VGA palette. This way the background of the set becomes black.


Reply 25 of 59, by root42


Thank you for the explanations! Now it seems to be all very clear. Not that super difficult at all. And much easier than EGA programming. 😁


Reply 26 of 59, by WhiteFalcon

Falcosoft wrote on 2022-06-21, 06:42:
Hi, I have the strong feeling that your bank finding algorithm is more costly than that simple mul. That mul has the advantage t […]

I have always read that "mul" is too costly and that one should avoid using it when possible. So I went for "shl"s to compute the coordinates into one offset and then loops to get the bank number. But as you say, it is a dedicated instruction. Still, I am probably failing to explain my goal - I need to be able to put a whole image there, a bitmap, not single pixels. This means I need to compute the offset just once at the start and then use only "add" from there on (my functions are still in C as I drown in ASM quickly when it's longer). That is why I was asking about the memory addressing, I am not using a putpixel routine at all, I am using a pointer to 0xA000 with the correct bank set and plotting pixels to the right as needed. This way even large sprites don't need more than one or two bank switches and even a fullscreen image only 4 switches. And "mul" (or "shl") to get the coordinates (the total offset) is used just once for the whole bitmap.
I am getting lost in it a little but I appreciate your time to write all this info, it will take some chewing for me now to process it all 😀


Reply 27 of 59, by root42


If you need to do large blitting operations then just adding is of course much cheaper, or rather you could even rep stosb or rep stosw if you precompute the number of pixels you can blit before a line or page change is necessary. However on a 486 a mul is really cheap. On 8086 and 80286 you can accelerate the code by avoiding mul, that is true.


Reply 28 of 59, by Falcosoft

WhiteFalcon wrote on 2022-06-21, 11:17:

I have always read that "mul" is too costly and that one should avoid using it when possible. So I went for "shl"s to compute the coordinates into one offset and then loops to get the bank number.

Yes, it's not the fastest instruction (especially not on P1) but it's not as slow as e.g. a div/idiv. And I wonder how at the same time you have not heard that branches can be even more fatal to performance. There is a whole school of techniques called 'Branchless Programming' about how to avoid them because of this. And the point is that you have not used one (or even two) shl instead of a mul to determine the proper bank (that would definitely be faster) but a find_bank/found_bank loop with branches.
https://en.algorithmica.org/hpc/pipelining/branchless/
https://dev.to/jobinrjohnson/branchless-progr … lly-matter-20j4
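To illustrate the point about the loop, here is a C sketch that puts the posted subtract-and-count search next to a single shift. The function names are invented for the comparison, 64K banks are assumed, and no timing claims are made; the sketch only shows that both give the same bank.

```c
#include <assert.h>
#include <stdint.h>

/* y * 640 written as two shifts, as in VESA_PutPixel640: 640 = 128 + 512. */
uint32_t offset640(uint32_t x, uint32_t y)
{
    return (y << 7) + (y << 9) + x;
}

/* Bank search as in the posted routine: subtract 64K until the rest fits. */
uint8_t bank_by_loop(uint32_t linear)
{
    uint8_t bank = 0;
    assert(linear < 0x1000000UL);  /* keeps the bank number within 8 bits */
    while (linear > 0xFFFFu) {
        linear -= 0x10000UL;
        bank++;
    }
    return bank;
}

/* Branch-free alternative: the bank is simply the upper 16 bits of the address. */
uint8_t bank_by_shift(uint32_t linear)
{
    assert(linear < 0x1000000UL);
    return (uint8_t)(linear >> 16);
}
```

The offset inside the bank is then just the low 16 bits of the same value, which is what `mov di, ax` already takes in the posted routine.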

Still, I am probably failing to explain my goal - I need to be able to put a whole image there, a bitmap, not single pixels.
This means I need to compute the offset just once at the start and then use only "add" from there on (my functions are still in C as I drown in ASM quickly when it's longer). That is why I was asking about the memory addressing, I am not using a putpixel routine at all, I am using a pointer to 0xA000 with the correct bank set and plotting pixels to the right as needed. This way even large sprites don't need more than one or two bank switches and even a fullscreen image only 4 switches. And "mul" (or "shl") to get the coordinates (the total offset) is used just once for the whole bitmap.
I am getting lost in it a little but I appreciate your time to write all this info, it will take some chewing for me now to process it all 😀

You can consider a Putpixel routine as a 'fill' routine with a byte sized granularity. A 'FillRect' routine should follow the exact same principles about how to determine the proper bank. I do not really understand what the fundamental difference between the two would be. First I would write a fast 'FillLine' routine using dword writes where possible, and then FillRect would call FillLine for each row of the rectangle by simply adding a 1280 offset to the starting address.
The bank calculation would be implemented in FillLine.
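A possible shape for such a FillLine/FillRect pair, sketched in C. It is only an illustration: `Screen`, `select_window`, `fill_line` and `fill_rect` are hypothetical names, 64K banks are assumed, video memory is simulated as one linear array, and `memset` stands in for the rep stos style fill a real routine would use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *vram;        /* simulated video memory, indexed linearly */
    uint16_t pitch;       /* bytes per scan line, e.g. 1280 */
    uint16_t cur_window;  /* 64K bank currently selected */
    unsigned switches;    /* number of bank changes so far */
} Screen;

/* Real code would call VBE function 05h; here the change is only recorded. */
static void select_window(Screen *s, uint16_t window)
{
    if (window != s->cur_window) {
        s->cur_window = window;
        s->switches++;
    }
}

/* Fill count bytes from a linear address, splitting the run at 64K boundaries. */
void fill_line(Screen *s, uint32_t linear, uint32_t count, uint8_t color)
{
    while (count > 0) {
        uint32_t room = 0x10000UL - (linear & 0xFFFFu);  /* bytes left in this bank */
        uint32_t chunk = count < room ? count : room;
        select_window(s, (uint16_t)(linear >> 16));
        memset(s->vram + linear, color, chunk);
        linear += chunk;
        count -= chunk;
    }
}

/* One multiplication for the top-left corner, then one pitch added per row. */
void fill_rect(Screen *s, uint16_t x, uint16_t y, uint16_t w, uint16_t h, uint8_t color)
{
    uint32_t linear = (uint32_t)y * s->pitch + x;
    uint16_t row;
    assert((uint32_t)x + w <= s->pitch);
    for (row = 0; row < h; row++, linear += s->pitch)
        fill_line(s, linear, w, color);
}
```

A blit routine for images would have the same structure, with a copy from the source bitmap in place of the fill.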


Reply 29 of 59, by WhiteFalcon

root42 wrote on 2022-06-21, 11:27:
If you need to do large blitting operations then just adding is of course much cheaper, or rather you could even rep stosb or rep stosw if you precompute the number of pixels you can blit before a line or page change is necessary. However on a 486 a mul is really cheap. On 8086 and 80286 you can accelerate the code by avoiding mul, that is true.

Yes I used to use rep stosb and stosw in classic VGA, but here its complicated enough for me with the bank switching that I will be happy to just copy single bytes 😀 Perhaps the "mul fear" comes from the old Asphyxia tutorials I loved back then and they probably coded on a 386.

Olivetti M4 P75, 32MB RAM, 4GB HDD, SoundBlaster AWE 64, Gravis Ultrasound MAX, Roland SCC-1, Roland MT-32, Roland CM-64
Intel 486DX2/66Mhz, 8MB RAM, VGA Trident 512kB, 264MB HDD Quantum, SoundBlaster 16 Pro (CT2910)

Reply 30 of 59, by WhiteFalcon

Rank Newbie
Falcosoft wrote on 2022-06-21, 12:03:
WhiteFalcon wrote on 2022-06-21, 11:17:

I have always read that "mul" is too costly and that one should avoid using it when possible. So I went for "shl"s to compute the coordinates into one offset and then loops to get the bank number.

Yes, it's not the fastest instruction (especially not on P1) but it's not as slow as e.g. a div/idiv. And I wonder that at the same time you have not heard that branches can be even more fatal to performance. There is a complete school of techniques called 'Branchless Programming' about how to avoid them because of this. And the point is that you have not used one (or even two) shl instead of a mul to determine proper bank (this would be definitely faster) but a find_bank/found_bank loop with branches.
https://en.algorithmica.org/hpc/pipelining/branchless/
https://dev.to/jobinrjohnson/branchless-progr … lly-matter-20j4

Still, I am probably failing to explain my goal - I need to be able to put a whole image there, a bitmap, not single pixels.
This means I need to compute the offset just once at the start and then use only "add" from there on (my functions are still in C as I drown in ASM quickly when its longer). That is why I was asking about the memory addressing, I am not using a putpixel routine at all, I am using a pointer to 0xA000 with the correct bank set and plotting pixels to the right as needed. This way even large sprites dont need more than one or two bank switches and even a fullscreen image only 4 switches. And "mul" (or "shl") to get the coordinates (the total offset) is used just once for the whole bitmap.
I am getting lost in it a little but I appreciate your time to write all this info, it will take some chewing for me now to process it all 😀

You can consider a Putpixel routine as a 'fill' routine with a byte sized granularity. A 'FillRect' routine should follow the exact same principles about how to determine the proper bank. I do not really understand what would be the fundamental difference between the 2. First I would write a fast 'FillLine' routine using dword writes where possible and then FillRect would call FillLine for each row of the rectangle by simply adding 1280 offset(s) to the starting address.
The bank calculation would be implemented in FillLine.

I am not arguing, I just need to find a balance between effective programming and my brain still making sense of it 😁 How I would prefer the old, simple linear 64k buffer we use in VGA.. but keeping it VESA 1.2 friendly and the requirements low is still a priority.

As for the filling, I am not talking about filling really, but a normal sprite routine - drawing a bitmap, an uncompressed array of bytes, ignoring color n. 0 for transparency (and also a version of that not ignoring color 0). Maybe a sample of my current routine will explain (yes I know its unoptimized, at the moment I am just glad it works and that I can understand it, albeit it does not yet support any virtual screen).

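void VESA_DrawImageT640(long int x, long int y, char *image, long int sizex, long int sizey)
{
    long int scr_pos, img_pos, a, b, bank = 0;
    long int x1 = 0, y1 = 0, x2 = sizex, y2 = sizey;

    /* clip against the left and top edges: skip the hidden part of the image */
    if (x < 0) {
        x1 = -x;
        x = 0;
    }
    if (y < 0) {
        y1 = -y;
        y = 0;
    }
    /* clip against the right and bottom edges */
    if (x + (x2 - x1) > 640) {
        x2 = x1 + (640 - x);
    }
    if (y + (y2 - y1) > 480) {
        y2 = y1 + (480 - y);
    }

    /* linear screen offset, split into a bank number and an offset inside the 64K window */
    scr_pos = x + y * 640;
    while (scr_pos > 0xFFFF) {
        scr_pos -= 0x10000;
        bank++;
    }

    img_pos = x1 + y1 * sizex;

    VESA_SetBank(bank);

    for (b=y1; b<y2; b++) {
        for (a=x1; a<x2; a++) {
            /* ran past the end of the window: move on to the next bank */
            if (scr_pos > 0xFFFF) {
                scr_pos -= 0x10000;
                bank++;
                VESA_SetBank(bank);
            }
            /* color 0 is transparent */
            if (image[img_pos]) screen[scr_pos] = image[img_pos];
            scr_pos++;
            img_pos++;
        }
        /* jump to the start of the next row, on screen and in the image */
        scr_pos += (640 - x2) + x1;
        img_pos += (sizex - x2) + x1;
    }
}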

and this part

scr_pos += (640 - x2) + x1;

is one of the reasons I kept asking you about the memory arrangement when the width of the virtual screen is greater than the resolution width.
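
For reference, the bank search loop in the routine above (the while (scr_pos > 0xFFFF) part) can be written without any loop, along the lines of the shift idea mentioned in the quote. This is only a sketch under the assumption of a 64K window with 64K granularity and a non-negative offset; BankedPos, SplitByLoop and SplitByShift are names made up for the example.

```c
/* Result of splitting a linear screen offset for a 64K banked window. */
typedef struct {
    long bank;             /* value to pass to the bank-switch call */
    unsigned long offset;  /* position inside the window, 0..0xFFFF */
} BankedPos;

/* The loop version, kept for comparison. */
BankedPos SplitByLoop(long scr_pos)
{
    BankedPos p;
    p.bank = 0;
    while (scr_pos > 0xFFFFL) {
        scr_pos -= 0x10000L;
        p.bank++;
    }
    p.offset = (unsigned long)scr_pos;
    return p;
}

/* Same result without branches: the high 16 bits of the offset are the
   bank, the low 16 bits are the position inside the window. */
BankedPos SplitByShift(long scr_pos)
{
    BankedPos p;
    p.bank   = scr_pos >> 16;
    p.offset = (unsigned long)scr_pos & 0xFFFFUL;
    return p;
}
```

Both versions give bank 4 for the last pixel of a 640x480 screen (offset 307199), which matches the "only 4 switches" for a fullscreen image.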

Reply 31 of 59, by root42

Rank l33t
WhiteFalcon wrote on 2022-06-21, 13:30:
Yes I used to use rep stosb and stosw in classic VGA, but here its complicated enough for me with the bank switching that I will be happy to just copy single bytes 😀 Perhaps the "mul fear" comes from the old Asphyxia tutorials I loved back then and they probably coded on a 386.

Honestly, you can very simply compute in advance the value for CX when doing rep stosw (or even rep stosd on 386) for your blitting. And once you stop, you can switch the bank, and do the next chunk. If you blit rectangles, you have to compute the horizontal line spans first, which gives you the amount of repetitions for each line. Then have an outer loop to copy as many lines as would fit still in the current page. Then bank switch, rinse and repeat.
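
To make that a bit more concrete, here is a minimal C sketch of the idea. It is not tested on real hardware: set_bank, window and card_memory are made-up stand-ins (set_bank would be VESA function 05h, window() the A000h segment), memcpy stands in for rep movsw/movsd, and it assumes a 64K window with 64K granularity, a 640-byte scanline and an image that lies fully on screen.

```c
#include <string.h>

#define BANK_SIZE 0x10000L   /* assumes a 64K window with 64K granularity */
#define PITCH     640L       /* bytes per scanline in 640x480x8 */

/* Hypothetical stand-ins for the card so the sketch runs anywhere. */
static unsigned char card_memory[PITCH * 480];
static long current_bank = 0;

static void set_bank(long bank)    { current_bank = bank; }
static unsigned char *window(void) { return card_memory + current_bank * BANK_SIZE; }

/* Copy a w x h image without transparency. No clipping is done. */
void BlitOpaque(long x, long y, const unsigned char *image, long w, long h)
{
    long pos = y * PITCH + x;

    while (h > 0) {
        long off  = pos & (BANK_SIZE - 1);
        long room = BANK_SIZE - off - w;   /* >= 0 when the next row fits in this bank */

        set_bank(pos >> 16);
        if (room >= 0) {
            /* number of whole rows that fit before the bank boundary */
            long rows = room / PITCH + 1;
            if (rows > h) rows = h;
            h -= rows;
            while (rows-- > 0) {           /* no bank checks inside this loop */
                memcpy(window() + off, image, (size_t)w);
                off += PITCH;
                pos += PITCH;
                image += w;
            }
        } else {
            /* this row straddles the boundary: copy it in two pieces */
            long first = BANK_SIZE - off;
            memcpy(window() + off, image, (size_t)first);
            set_bank((pos >> 16) + 1);
            memcpy(window(), image + first, (size_t)(w - first));
            pos += PITCH;
            image += w;
            h--;
        }
    }
}
```

The bank is only touched once per run of rows, plus once for each row that crosses a boundary, so a sprite of typical size gets by with one or two switches.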

YouTube and Bonus
80486DX@33 MHz, 16 MiB RAM, Tseng ET4000 1 MiB, SnarkBarker & GUSar Lite, PC MIDI Card+X2+SC55+MT32, OSSC

Reply 32 of 59, by Falcosoft

Rank Oldbie
WhiteFalcon wrote on 2022-06-21, 13:33:

and this part

scr_pos += (640 - x2) + x1;

is one of the reasons I kept asking you about the memory arrangement when the width of the virtual screen is greater than the resolution width.

I still do not understand the problem and why you need 'memory arrangement' info.
1. Let's suppose you already set 1280x480 virtual resolution. Thus you have 2 pages/buffers: a front buffer that is always visible and a back buffer that you always use for drawing.
2. You should define a global variable that stores which page is your back buffer (let's name it 'ActualDrawingPage'). At start it is page 1 that starts at x offset 640. So let's give the 'ActualDrawingPage' variable a value of 1.
3. When you have finished drawing, you call Function 07 with the input arguments cx=640; dx=0. Now you are using page 0 as the active drawing page, so set the 'ActualDrawingPage' variable to 0.
4. When you have finished drawing again, you call Function 07 again with the input arguments cx=0; dx=0 and set the 'ActualDrawingPage' variable to 1 again.
Repeat the cycle.
And now to answer your actual question, you should insert this to correct your code (you should add the offset value earlier so it can be part of the bank value calculation):

scr_pos = (x + y * 1280) + (ActualDrawingPage * 640);

That is, if your drawing page is 1 then you should add 640 to all your x coordinates (this time page 0 is visible: 0-639), but when your drawing page is 0 you should not add anything (this time page 1 is visible: 640-1279). That's it.
BTW, if you want flicker free operation you should call function 07 with bx=80h. This way you should get v-synced page flipping.
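
Put together, the cycle could look roughly like this in C. It is a sketch only: set_display_start is a made-up stand-in for the Function 07 call (it just records cx so the code runs without a VESA BIOS), and ScreenPos and FlipPages are names invented for the example.

```c
#define VIRTUAL_WIDTH 1280L   /* bytes per scanline of the virtual screen */
#define PAGE_WIDTH    640L    /* visible width, also the x offset of page 1 */

static int  ActualDrawingPage = 1;  /* back buffer: page 1 at start */
static long display_start_x   = 0;  /* what the stand-in below recorded */

/* Hypothetical stand-in for Function 07 (cx = first pixel, dx = first line). */
static void set_display_start(long cx, long dx)
{
    (void)dx;
    display_start_x = cx;
}

/* Linear offset of pixel (x, y) on the page currently used for drawing.
   The page offset is added before any bank calculation is done. */
long ScreenPos(long x, long y)
{
    return (x + y * VIRTUAL_WIDTH) + (ActualDrawingPage * PAGE_WIDTH);
}

/* Show the page that was just drawn and start drawing on the other one. */
void FlipPages(void)
{
    set_display_start(ActualDrawingPage * PAGE_WIDTH, 0);
    ActualDrawingPage ^= 1;
}
```

For example, with page 1 as the drawing page, pixel (0, 51) is at 51 * 1280 + 640 = 65920, which is already in bank 1 with a 64K window.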

Reply 33 of 59, by BloodyCactus

Rank Oldbie

in its time, vesa was not really a thing on 286 VGA cards, so I've only ever done VESA programming in pmode using the linear framebuffer. bank switching is just a pita and slows things down. since you're not targeting a 286, not using the LFB seems like punishment 😀 I'll have to dig around, I have somewhere a VESA library I wrote for watcom c doing some dpmi calls. i remember writing a 640x480 demo and being surprised how bad some cards' implementations were 🙁

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 34 of 59, by WhiteFalcon

Rank Newbie
Falcosoft wrote on 2022-06-21, 15:44:
WhiteFalcon wrote on 2022-06-21, 13:33:

and this part

scr_pos += (640 - x2) + x1;

is one of the reasons I kept asking you about the memory arrangement when the width of the virtual screen is greater than the resolution width.

I still do not understand the problem and why you need 'memory arrangement' info.
1. Let's suppose you already set 1280x480 virtual resolution. Thus you have 2 pages/buffers: a front buffer that is always visible and a back buffer that you always use for drawing.
2. You should define a global variable that stores which page is your back buffer (let's name it 'ActualDrawingPage'). At start it is page 1 that starts at x offset 640. So let's give the 'ActualDrawingPage' variable a value of 1.
3. When you have finished drawing, you call Function 07 with the input arguments cx=640; dx=0. Now you are using page 0 as the active drawing page, so set the 'ActualDrawingPage' variable to 0.
4. When you have finished drawing again, you call Function 07 again with the input arguments cx=0; dx=0 and set the 'ActualDrawingPage' variable to 1 again.
Repeat the cycle.
And now to answer your actual question, you should insert this to correct your code (you should add the offset value earlier so it can be part of the bank value calculation):

scr_pos = (x + y * 1280) + (ActualDrawingPage * 640);

That is, if your drawing page is 1 then you should add 640 to all your x coordinates (this time page 0 is visible: 0-639), but when your drawing page is 0 you should not add anything (this time page 1 is visible: 640-1279). That's it.
BTW, if you want flicker free operation you should call function 07 with bx=80h. This way you should get v-synced page flipping.

Now your modified formula answered my question about memory arrangement 😀 As you are always adding 640 for the second page, it means that it really is mapped "across" the whole virtual area and then starts on the next line, I just needed it confirmed.

I understand the general concept of this kind of double buffering, both from DOS and SDL in Windows. But as I mentioned earlier, I have always preferred the slower copy-over method when speed did not allow to draw each frame from scratch. In SDL where speed was never a problem I did use this exact kind of double buffering (I recall a function called Flip()) as I always redrew the screen from scratch. Its much easier, you dont need to keep another buffer for every sprite to store the piece of background it is pasted over (aka dirty rectangles).
As drawing speed will probably be a problem (and you will know that better than me looking at your advanced program), I take it the dirty rectangles approach will be necessary. So when we look at the start of the program, you will perhaps see what I find confusing:
1) screen 0 is visible, black. Draw background to screen 1. Store the parts where sprites are going to be. Draw the sprites. Switch screen 0 and 1.
2) screen 1 is visible, everything present. Draw background again to screen 0. Restore sprite backgrounds. Update sprite position. Store the parts where sprites are going to be. Draw the sprites. Switch screen 0 and 1.
3) screen 0 is visible, everything present. Restore sprite backgrounds. Update sprite position. Store the parts where sprites are going to be. Draw the sprites. Switch screen 0 and 1.
4) screen 1 is visible, everything present. Restore sprite backgrounds. Update sprite position. Store the parts where sprites are going to be. Draw the sprites. Switch screen 0 and 1.
5) goto 3)

This is not a uniform main loop, the first two steps are unique. Perhaps I am overcomplicating things and should start by drawing the background to both screens and then jump to 3) for the main loop? But then when the background needs changing, it would mean suddenly drawing to both screens (interrupting the main loop) and after returning to the loop, restoring some of the previous background that is still stored.
I will be the first to acknowledge I dont have a programmer's mind 😀 But I have always liked programming no matter my lack of math understanding and conceptual skills.

Or is it okay to draw all the screen for every frame? Background, sprites and all? It did work in VGA 320x200, I think even on a 486, but that was less than 1/2 of the pixels. I am afraid it would give me like 5-10 FPS on the Pentium, if even that many.

Reply 35 of 59, by WhiteFalcon

Rank Newbie
root42 wrote on 2022-06-21, 14:53:
Honestly, you can very simply compute in advance the value for CX when doing rep stosw (or even rep stosd on 386) for your blitting. And once you stop, you can switch the bank, and do the next chunk. If you blit rectangles, you have to compute the horizontal line spans first, which gives you the amount of repetitions for each line. Then have an outer loop to copy as many lines as would fit still in the current page. Then bank switch, rinse and repeat.

This sounds as easy as it sounds in my head when I envision it, then programming starts and harsh reality kicks in 😉 But I know it is doable, even with the annoying bank switches and having to plot the odd pixels individually, out of stosw (let alone stosd).

Reply 36 of 59, by WhiteFalcon

Rank Newbie
BloodyCactus wrote on 2022-06-21, 18:08:

in its time, vesa was not really a thing on 286 VGA cards, so I've only ever done VESA programming in pmode using the linear framebuffer. bank switching is just a pita and slows things down. since you're not targeting a 286, not using the LFB seems like punishment 😀 I'll have to dig around, I have somewhere a VESA library I wrote for watcom c doing some dpmi calls. i remember writing a 640x480 demo and being surprised how bad some cards' implementations were 🙁

My sentiment exactly 😁 I am definitely not targeting the 286, this is a nostalgia trip and I never had one. My first was a 386DX/40 so I would like for it to work on that (to some extent) and as its SVGA and I remember I could not play SVGA games on my 386 (starting with only 256kB vram there), it would be fine if it worked well on a 486DX2/66 which is another kind of hallmark of the era.
I read about pmode several times and its definitely not for me, couldnt wrap my head even around the basics. Everything seems too hostile in pmode, I prefer the real one. And without pmode, bye bye LFB. So thats not an option. SVGA card compatibility is another issue, one that I would like to avoid completely by staying with the most used resolution of 640x480x8 and VESA 1.2. Not planning to be 100% compatible, just to hit the majority (80%+) of cards. S3 and Trident should cut it.
Also I should mention the need to make everything from scratch and understanding what I am doing, not using any libraries and such 😀

Reply 37 of 59, by Falcosoft

Rank Oldbie
WhiteFalcon wrote on 2022-06-22, 05:09:

Or is it okay to draw all the screen for every frame? Background, sprites and all? It did work in VGA 320x200, I think even on a 486, but that was less than 1/2 of the pixels. I am afraid it would give me like 5-10 FPS on the Pentium, if even that many.

You have to try. I always use this double buffering 'flip' concept and redraw full pages. But I must say I do not think that 20+ fps in banked 640x480 256-color mode on a 486 is a realistic expectation. VGA 320x200 has less than 1/4 of the pixels (not 1/2). I think even the linear frame buffer mode of 640x480 could have performance problems on a 486.
I do not know your exact program and what your Store_xxx and Restore_xxx routines do but one thing is sure: Never read from video memory if performance is important. Writing to video memory is typically faster than writing to system memory but reading from video memory is always much slower than reading from system memory. So make sure your store routines do not save the video buffer.
BTW you can also use 320x240 VESA mode the same way as 640x480 and with some additional work you can offer your users a low-res and a high-res mode as selectable options.
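
If you do stay with store/restore, one possible arrangement that never reads video memory is to keep the backdrop in system RAM and remember, for each page, where the sprite was drawn the last time that page was the back buffer. A small C sketch of that idea follows. The names are invented for the example, the page array only stands in for the two video pages (no banking, no clipping), and in real mode a 300 KB backdrop would of course need huge pointers or a similar scheme.

```c
#include <string.h>

#define W 640
#define H 480

static unsigned char background[H][W];  /* backdrop kept in system RAM */
static unsigned char page[2][H][W];     /* stand-in for the two video pages */

typedef struct { int x, y, w, h, valid; } Rect;
static Rect dirty[2];  /* per page: area covered by the last sprite drawn there */
static int  back = 1;  /* page currently drawn to */

/* Draw the backdrop to both pages once, so the main loop is uniform. */
void InitPages(void)
{
    memcpy(page[0], background, sizeof background);
    memcpy(page[1], background, sizeof background);
    dirty[0].valid = dirty[1].valid = 0;
    back = 1;
}

/* Repair page p from the system RAM copy; video memory is only written. */
static void RestoreBackground(int p)
{
    int row;
    if (!dirty[p].valid) return;
    for (row = 0; row < dirty[p].h; row++)
        memcpy(&page[p][dirty[p].y + row][dirty[p].x],
               &background[dirty[p].y + row][dirty[p].x],
               (size_t)dirty[p].w);
    dirty[p].valid = 0;
}

/* Draw with color 0 transparent and remember the covered area. */
static void DrawSprite(int p, int x, int y, const unsigned char *img, int w, int h)
{
    int r, c;
    for (r = 0; r < h; r++)
        for (c = 0; c < w; c++)
            if (img[r * w + c]) page[p][y + r][x + c] = img[r * w + c];
    dirty[p].x = x; dirty[p].y = y; dirty[p].w = w; dirty[p].h = h;
    dirty[p].valid = 1;
}

/* One pass of the main loop: the same steps every frame. */
void Frame(int x, int y, const unsigned char *img, int w, int h)
{
    RestoreBackground(back);
    DrawSprite(back, x, y, img, w, h);
    back ^= 1;  /* a real program would call Function 07 here to flip */
}
```

Because each page keeps its own rectangle, the page that becomes the back buffer is repaired from what was drawn on it two frames earlier, and changing the backdrop means updating the RAM copy and redrawing both pages once.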

Reply 38 of 59, by root42

Rank l33t
Falcosoft wrote on 2022-06-22, 06:29:
WhiteFalcon wrote on 2022-06-22, 05:09:

Or is it okay to draw all the screen for every frame? Background, sprites and all? It did work in VGA 320x200, I think even on a 486, but that was less than 1/2 of the pixels. I am afraid it would give me like 5-10 FPS on the Pentium, if even that many.

You have to try. I always use this double buffering 'flip' concept and redraw full pages. But I must say I do not think that 20+ fps in banked 640x480 256-color mode on a 486 is a realistic expectation. VGA 320x200 has less than 1/4 of the pixels (not 1/2). I think even the linear frame buffer mode of 640x480 could have performance problems on a 486.
I do not know your exact program and what your Store_xxx and Restore_xxx routines do but one thing is sure: Never read from video memory if performance is important. Writing to video memory is typically faster than writing to system memory but reading from video memory is always much slower than reading from system memory. So make sure your store routines do not save the video buffer.
BTW you can also use 320x240 VESA mode the same way as 640x480 and with some additional work you can offer your users a low-res and a high-res mode as selectable options.

Good idea. Or even 640x400 as I mentioned above. If you are coding an adventure game (inventory comes to mind) you don't need to do full screens, and even the 20 FPS is not that important. I think that is very well doable on a 486, if you refrain from too large sprites or full screen scrolling and redraws. Also, color cycling can animate large swaths of image... 😀

Reply 39 of 59, by WhiteFalcon

Rank Newbie
Falcosoft wrote on 2022-06-22, 06:29:
WhiteFalcon wrote on 2022-06-22, 05:09:

Or is it okay to draw all the screen for every frame? Background, sprites and all? It did work in VGA 320x200, I think even on a 486, but that was less than 1/2 of the pixels. I am afraid it would give me like 5-10 FPS on the Pentium, if even that many.

You have to try. I always use this double buffering 'flip' concept and redraw full pages. But I must say I do not think that 20+ fps in banked 640x480 256-color mode on a 486 is a realistic expectation. VGA 320x200 has less than 1/4 of the pixels (not 1/2). I think even the linear frame buffer mode of 640x480 could have performance problems on a 486.
I do not know your exact program and what your Store_xxx and Restore_xxx routines do but one thing is sure: Never read from video memory if performance is important. Writing to video memory is typically faster than writing to system memory but reading from video memory is always much slower than reading from system memory. So make sure your store routines do not save the video buffer.
BTW you can also use 320x240 VESA mode the same way as 640x480 and with some additional work you can offer your users a low-res and a high-res mode as selectable options.

Ouch, a typo, yes, it should have been less than 1/4. Probably had too high expectations forgetting how even commercial games had trouble on the 486. Games like Little Big Adventure or Transport Tycoon definitely redrew the screen for every frame and must have employed very optimised assembler routines I guess.
I plan to work with photos and tried various options: 640x480x8/16, 320x200x8/16, 320x240x8/16... and 640x480x8 still looked okay while having 1/2 of the load of the 16-bit modes with their 2-byte pixels (no typo this time 😀). Also it seems my S3 Trio does not support many modes, especially not hicolor low-res ones.
Good to know about the reads from vram, I am doing that for the store routine. Will have to try the redraw it all approach you too are using and see how fast it goes.
