I assume that by copying word by word you mean REP MOVSW? Copying double words would be REP MOVSD, which is available from the 386 on. You can use it even in 16-bit real mode; the assembler adds the 66h operand-size prefix. What do you mean by "cpu itself still executes the command in word by word"? On a 386SX (16-bit data bus) or with an ISA video card (16-bit bus) it might not make a difference, but in all other cases it should be faster.
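To make the difference concrete, here is a rough sketch of the repeat counts and opcode bytes involved. It assumes a full mode 13h frame of 320 x 200 bytes; the function and constant names are made up for illustration, and the prefix order shown is just one that assemblers commonly emit (the CPU accepts either order).

```python
# Sketch only: counts and encodings for a full-frame copy,
# assuming a 320 x 200 frame with one byte per pixel (mode 13h).
FRAME_BYTES = 320 * 200  # 64000 bytes


def rep_count(frame_bytes: int, operand_bytes: int) -> int:
    """Value to load into CX before REP MOVS to copy the whole frame."""
    if frame_bytes % operand_bytes != 0:
        raise ValueError("frame size is not a multiple of the operand size")
    return frame_bytes // operand_bytes


movsw_count = rep_count(FRAME_BYTES, 2)  # REP MOVSW moves 2 bytes per step
movsd_count = rep_count(FRAME_BYTES, 4)  # REP MOVSD moves 4 bytes per step

# Opcode bytes as they would appear in a 16-bit code segment:
# F3 = REP prefix, 66 = operand-size prefix, A5 = MOVSW/MOVSD.
REP_MOVSW_16 = bytes([0xF3, 0xA5])
REP_MOVSD_16 = bytes([0xF3, 0x66, 0xA5])

print(f"REP MOVSW: CX = {movsw_count}, bytes = {REP_MOVSW_16.hex(' ')}")
print(f"REP MOVSD: CX = {movsd_count}, bytes = {REP_MOVSD_16.hex(' ')}")
```

So the dword version needs half as many steps for the same frame, at the cost of one extra prefix byte in 16-bit code.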
I suppose you are using the standard VGA mode 13h. If you used a VESA mode with the same resolution instead, you could enable write combining through the MTRRs. 64 kB x 490 is only roughly 30 MB/s; with REP MOVSD you could probably get twice that speed. Of course MTRRs won't work (or won't do anything) in DOSBox, but REP MOVSD should help there.
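For reference, the 30 MB/s estimate is just buffer size times copies per second. A quick sketch of the arithmetic, reading the 490 figure as full-buffer copies per second (that reading and the names below are my assumptions):

```python
# Sketch only: throughput = bytes per copy * copies per second.
COPIES_PER_SECOND = 490  # figure from the post, assumed to be copies/s


def throughput_mib(buffer_bytes: int, copies_per_second: int) -> float:
    """Copy throughput in MiB/s (1 MiB = 1024 * 1024 bytes)."""
    return buffer_bytes * copies_per_second / (1024 * 1024)


full_segment = throughput_mib(64 * 1024, COPIES_PER_SECOND)  # 64 kB buffer
visible_only = throughput_mib(320 * 200, COPIES_PER_SECOND)  # 64000 bytes

print(f"64 kB buffer:  {full_segment:.1f} MiB/s")
print(f"320x200 frame: {visible_only:.1f} MiB/s")
```

Either way it comes out at about 30 MB/s, so doubling it would put you around 60 MB/s.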
Still, if your program gets ~80 fps, the data transfer doesn't seem to be much of a limit. The bigger factor may be that you are using 16-bit code and TP 7.0; 3D games like Blood and DN3D were usually built with Watcom C (32-bit code and a better optimizer), with some parts in hand-optimized assembler.