Okay, I have tested it. Word-based transfers are only 1% faster than byte-based transfers when the EMS card is 8-bit (tested in 86Box on an emulated 286 with a LoTech EMS card).
I do not have a 16-bit EMS card, nor do I know how to emulate one, so I cannot test the more optimistic scenario. If anyone has a 16-bit EMS card and is willing to test, I can provide the executables to do so.
EXMS86 0.9.5 is out with its batch of improvements:
- in "!" mode only the lower 32K of the page frame is dedicated to UMB, so it is 100% safe
- new mode "!!" enables 64K of UMB and also fiddles with interrupts during transfers (experimental)
- even-length transfers use 16-bit ops, so they should be faster on non-8088 CPUs with 16-bit EMS cards (can anyone confirm on actual hardware?)
- quiet mode ("EXMS86.COM Q")
Given that you say you are using two pages, I assume you get two chunks per 16KB to deal with alignment.
It depends on what the requested XMS offsets are, but yes - in the worst case I transfer three chunks at the start (to deal with page alignment for src and dst), and then only one chunk for every further page.
I wonder how you do a single chunk per page. Assume I want to transfer 48KB from a page-aligned address to an address that is offset by 8K. This means:
- low 8K from source page 1 to high 8K of destination page 1 (low 8K of destination page 1 is not in transfer range)
- high 8K from source page 1 to low 8K of destination page 2
- low 8K from source page 2 to high 8K of destination page 2
- high 8K from source page 2 to low 8K of destination page 3
- low 8K from source page 3 to high 8K of destination page 3
- high 8K from source page 3 to low 8K of destination page 4
Every bullet point in this list requires a different page mapping. It's easy to see that this scheme requires a re-map every 8K no matter how large the transfer is, i.e. two chunks per page.
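To spell out the bookkeeping (just a sketch of the idea, not EXMS86's actual code; the register assignments and labels are made up), each chunk is limited by whichever of the two mapped 16K pages runs out first, and by the bytes still left in the transfer:

; SI = offset inside the currently mapped source page      (0..3FFFh)
; DI = offset inside the currently mapped destination page (0..3FFFh)
; DX = bytes still left in the whole transfer (assumed < 64K here)
; result: CX = size of the next chunk
        mov  cx, 4000h
        sub  cx, si          ; bytes until the source page ends
        mov  ax, 4000h
        sub  ax, di          ; bytes until the destination page ends
        cmp  cx, ax
        jbe  got_min         ; keep the smaller of the two...
        mov  cx, ax
got_min:
        cmp  cx, dx
        jbe  got_chunk       ; ...but never more than what is left
        mov  cx, dx
got_chunk:
        ; copy CX bytes, advance the offsets, then re-map whichever
        ; page(s) just ran out

With the 48K example above, CX comes out as 2000h (8K) on every pass, so you re-map twice per 16K page.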
Speaking about performance: EXMS86 uses rep movsb to copy bytes. Using rep movsw could be faster, but again - more complexity to deal with odd addresses and odd lengths
There is a simple assembler pattern to deal with odd lengths. Admittedly, this pattern does not care about alignment. On processors with a write-through L1 cache an aligned destination is more important, but I don't see cached architectures as targets for EXMS86.
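The pattern in question (it is quoted again in the replies below):

shr cx, 1
rep movsw
adc cx, 0
rep movsb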
and the performance would not be better on an 8088 anyway
I think this is a misconception. Yes, the 8088 cannot do word transfers on the bus, but the 8086/8088 has a quite slow microcoded loop for REP MOVS. A single iteration of REP MOVS is 17 clocks if you are transferring bytes. That is 8 clocks for the data transfer (one bus cycle to read, one to write) and 9 clocks of microcode overhead. If you are transferring words (any words on the 8088, unaligned words on the 8086), each access needs a second bus cycle, adding 4 clocks per access. This means a single iteration of REP MOVSW on the 8088 takes 9 clocks overhead + 2*8 clocks data transfer = 25 clocks. Two iterations of REP MOVSB, on the other hand, take 2*9 clocks overhead + 2*8 clocks data transfer = 34 clocks. So REP MOVSW takes only about 74% of the time used by REP MOVSB on the 8088, and alignment doesn't matter at all.
I have a Turbo XT (although not at hand at the moment) that inserts a wait state at 10 MHz for ISA memory cycles (but not for onboard memory cycles), so the relative advantage of using REP MOVSW over REP MOVSB is lower. On that system (assuming the bus is not occupied by memory refresh, which it is a lot), two iterations of REP MOVSB would take 38 clocks and one iteration of REP MOVSW would take 29 clocks, so REP MOVSW takes about 76% of the time of REP MOVSB.
Nevertheless, even if you don't care about alignment, using the pattern I quoted above instead of plain REP MOVSB is consistently faster. If you do care about alignment and the memory (or at least one side of the transfer) is 16-bit capable on a system with a 16-bit wide bus, REP MOVSW gets even faster. You definitely lose some cycles if CX is very low, but I guess the few extra cycles are negligible compared to the overhead of calling the XMS driver and mapping pages.
If you don't have a physical 8088 machine at hand and need to resort to emulators for performance tests, I highly recommend using MartyPC for accurate benchmarks of 8088 code, at least if your host machine is strong enough that MartyPC runs at (or near) real time.
mkarcher wrote on Yesterday, 19:50:
I wonder how you do a single chunk per page. Assume I want to transfer 48KB from a page-aligned address to an address that is offset by 8K. This means:
- low 8K from source page 1 to high 8K of destination page 1 (low 8K of destination page 1 is not in transfer range)
- high 8K from source page 1 to low 8K of destination page 2
- low 8K from source page 2 to high 8K of destination page 2
- high 8K from source page 2 to low 8K of destination page 3
- low 8K from source page 3 to high 8K of destination page 3
- high 8K from source page 3 to low 8K of destination page 4
That is an excellent point. When I said I was doing one chunk per page I was thinking about XMS-to-RAM (or RAM-to-XMS). XMS-to-XMS transfers are trickier. I quickly looked at my code and I see that page-unaligned XMS-to-XMS transfers will likely break after the first page boundary. I am only half surprised, since XMS-to-XMS transfers are a test I have yet to add to my testing suite (along with RAM-to-RAM). So far I have been extensively testing only XMS-to-RAM and RAM-to-XMS. It's good that there are still quite a few version numbers left before 1.0 :-)
There is a simple assembler pattern to deal with odd lengths. (...)
shr cx, 1
rep movsw
adc cx, 0
rep movsb
This is brilliant! Did you come up with this yourself, or is it some kind of asm common knowledge? I am no assembly whiz so this is gold to me. In v0.9.5 I do movsw already, but only after testing cx for parity. Your solution is vastly superior as it uses movsw also for odd lengths, relying on movsb only for the last byte. Plus it has no branching.
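Roughly, the 0.9.5 logic is this (a simplified sketch rather than the literal code; label names are made up):

        test cx, 1           ; is the length odd?
        jnz  odd_len
        shr  cx, 1
        rep  movsw           ; even length: pure word copy
        jmp  copy_done
odd_len:
        rep  movsb           ; odd length: fall back to a plain byte copy
copy_done: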
Yes, the 8088 cannot do word transfers on the bus, but the 8086/8088 has a quite slow microcoded loop for REP MOVS. A single iteration of REP MOVS is 17 clocks if you are transferring bytes. That is 8 clocks for the data transfer (one bus cycle to read, one to write) and 9 clocks of microcode overhead. If you are transferring words (any words on the 8088, unaligned words on the 8086), each access needs a second bus cycle, adding 4 clocks per access. This means a single iteration of REP MOVSW on the 8088 takes 9 clocks overhead + 2*8 clocks data transfer = 25 clocks. Two iterations of REP MOVSB, on the other hand, take 2*9 clocks overhead + 2*8 clocks data transfer = 34 clocks. So REP MOVSW takes only about 74% of the time used by REP MOVSB on the 8088
That's very interesting. The difference I measured on 86Box is only about 1% in favor of movsw, though. Maybe the emulation is not perfect, or maybe (more likely, I guess) the movsb/movsw difference is simply too small to matter compared to the overall EXMS86 overhead.
If you don't have a physical 8088 machine at hand and need to resort to emulators for performance tests, I highly recommend using MartyPC for accurate benchmarks of 8088 code, at least if your host machine is strong enough that MartyPC runs at (or near) real time.
I haven't been disappointed by 86Box so far, but MartyPC is definitely on my list of things to check out. It has been suggested to me before because of its very interesting debugger.
That is an excellent point. When I said I was doing one chunk per page I was thinking about XMS-to-RAM (or RAM-to-XMS). XMS-to-XMS transfers are trickier. I quickly looked at my code and I see that page-unaligned XMS-to-XMS transfers will likely break after the first page boundary. I am only half surprised, since XMS-to-XMS transfers are a test I have yet to add to my testing suite (along with RAM-to-RAM). So far I have been extensively testing only XMS-to-RAM and RAM-to-XMS. It's good that there are still quite a few version numbers left before 1.0 😀
This also shows that XMS-to-XMS is not a common use case of the XMS API, because otherwise you would have noticed the issue already. I assumed you were already considering XMS-to-XMS, because you talked about different pages in the frame being used for "source" and "destination"; if you just did XMS-to-conventional and conventional-to-XMS, a single page (as either source or destination) would suffice.
There is a simple assembler pattern to deal with odd lengths. (...)
shr cx, 1
rep movsw
adc cx, 0
rep movsb
This is brilliant! Did you come up with this yourself, or is it some kind of asm common knowledge?
I do a lot of reverse engineering on retro stuff. I did not design that fragment myself, but I think I've encountered it repeatedly. I can't attribute it to a specific source, and I suppose that pattern was invented independently multiple times in the 1980s.
I assumed you were already considering XMS-to-XMS, because you talked about different pages in the frame being used for "source" and "destination"; if you just did XMS-to-conventional and conventional-to-XMS, a single page (as either source or destination) would suffice.
Absolutely, yes. XMS-to-XMS (and RAM-to-RAM) are two scenarios that I have kept in mind since day 0. I tried to organize EXMS86 so it allows for these operations, but I never actually tested them, as it is a very niche usage that seems rarely (if ever) needed. XMS-to-XMS and RAM-to-RAM will most probably be the focus of the next version; I just have to extend my test suite first.
I did not design that fragment myself, but I think I've encountered it repeatedly. I can't attribute it to a specific source, and I suppose that pattern was invented independently multiple times in the 1980s.
Very cool. I've added it already to the EXMS86 0.9.6 branch - works as expected, passes all my tests. Thanks for the tip!
I did not design that fragment myself, but I think I've encountered it repeatedly. I can't attribute it to a specific source, and I suppose that pattern was invented independently multiple times in the 1980s.
Very cool. I've added it already to the EXMS86 0.9.6 branch - works as expected, passes all my tests. Thanks for the tip!
I asked an AI system to find out whether this pattern is known. The AI system suggested an even better branchless variant using ADC CX, CX instead of ADC CX, 0, which saves one byte (and one bus cycle / 4 clocks of fetching instruction bytes). It's likely that I misremembered the pattern, and ADC CX, CX is the more common form.
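Written out in full, the improved branchless sequence is:

shr cx, 1     ; CF = lowest bit of the original count
rep movsw     ; copy CX words; leaves CX = 0 and the flags untouched
adc cx, cx    ; CX = 0 + 0 + CF, i.e. 1 exactly if the count was odd
rep movsb     ; copy the possible trailing byte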
The AI also suggested a branchful version, using JNC past_movsb; MOVSB; past_movsb:. Starting to discuss timing with the AI was only moderately successful. The AI claimed that REP takes two clocks plus the repeated executions, so, according to the AI, REP MOVSB should take 2 clocks if CX is zero and 20 clocks if CX is one (a standalone MOVSB takes 18 clocks). This is wrong. My third-party reference indicates REP MOVSB as 9 + 17*CX, which may not be exact either. The original Intel User's Manual quotes 2 clocks for the REP prefix and 9 + 17*CX for repeated MOVS, but gives no clear indication whether the two clocks to process the REP prefix are included in the 9 + 17*CX specification, or need to be added for a total of 11 + 17*CX. The branchful version has the advantage of being another byte shorter.
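For completeness, the branchful variant would be:

shr cx, 1        ; CF = lowest bit of the original count
rep movsw        ; copy the even part as words (flags stay untouched)
jnc past_movsb   ; even count: nothing left to do
movsb            ; odd count: copy the last byte
past_movsb: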
Generally, writing short code is important on the 8088, because there is barely sufficient bus bandwidth to fetch the instructions fast enough. Consider for example my suboptimal suggestion ADC CX, 0, which is quoted as requiring 4 clocks to execute - but at an instruction size of 3 bytes, it takes 12 clocks (plus bus delays) just to fetch that instruction. On the other hand, this makes the slow execution of REP MOVSW not that bad: as there are 9 clocks per iteration in which the bus is idle, that time can at least be used to fill the prefetch queue.
In the end, (non-busmaster) ISA DMA was only used in cases where high throughput wasn't required but background transfers were important. Sound cards fit this pattern perfectly. The programming model of ISA DMA is awful, and if EMM386 virtualizes RAM, it also needs to virtualize DMA, which is a very cumbersome operation. So nobody in the software industry liked ISA DMA. This explains why system designers were happy to get rid of the obsolete ISA DMA scheme when they designed PCI systems. They did not expect the use case of "ISA-compatible sound cards" to be that important. Later on, some standards (e.g. PC/PCI) were designed to give PCI sound cards access to the ISA DMA/IRQ system, but support for them was spotty, which is why PCI cards generally can't offer a nice Plug&Play experience including Sound Blaster compatibility.
That context is also helpful for those of us who wonder why ECP parallel ports--added at a fairly late date in the ISA bus timeline--bothered to include ISA DMA support.
The only other thing I have to add is that this OS2Museum post https://www.os2museum.com/wp/386max-and-eisa-dma/ talks about early Intel PCI chipsets having an EISA DMA controller even on systems with only ISA slots, and 386MAX tried to take advantage of that. Among other things, it meant no 16MB addressing limitation, but this support withered away by the time of Triton and other ubiquitous Intel chipsets.
That context is also helpful for those of us who wonder why ECP parallel ports--added at a fairly late date in the ISA bus timeline--bothered to include ISA DMA support.
I think ECP also includes a consideration I only touched on slightly in that post, which is multitasking and background operation. I did write that DMA was used where "background transfers were important", and while for sound this already applies to games in a single-tasking environment, the idea of ECP was to integrate the slow parallel-port transfers well into a modern desktop operating system like Windows 95, and to make that port more generic. Note that ECP also includes device addressing for daisy-chained devices and no longer focuses on "printers only". I guess the latest incarnation of the parallel-port ZIP drive did support ECP-type transfers. I would consider ECP a mostly failed attempt to introduce a general-purpose parallel bus (call it "UPB" - "universal parallel bus" if you want). If you ask why ECP did not get major hardware support, it's likely because the parallel cables were bulky and expensive. By the time ECP could have taken off seriously, we already had USB 1.1, and USB 1.1 solved all the issues ECP was intending to solve. Note how USB host controllers also include DMA support (as this is PCI, it's busmastering, because PCI no longer has a central DMA controller).
The only other thing I have to add is that this OS2Museum post https://www.os2museum.com/wp/386max-and-eisa-dma/ talks about early Intel PCI chipsets having an EISA DMA controller even on systems with only ISA slots, and 386MAX tried to take advantage of that. Among other things, it meant no 16MB addressing limitation, but this support withered away by the time of Triton and other ubiquitous Intel chipsets.
I actually do have an ISA/PCI Saturn II mainboard with an EISA-like DMA controller, the Asus PCI/I-486SP3G. Obviously, it does not support 32-bit transfers, as there are no EISA slots. It would have been a really interesting alternate history if that idea had taken off and DMA controllers like that had become standard on PC mainboards in 1994. But as Intel was the only vendor of chipsets like that, while the low-end competition continued to use their 82c206-type "AT on a chip" ISA multi-function controllers, there never was a critical mass of EISA-DMA-capable consumer PCs for development of software for that feature to make sense. This may have been different in server environments. Note how the Adaptec 1522 OS/2 driver does support DMA transfers on EISA mainboards - but the drivers for all other systems do not.
^I remember that consumer-class flatbed scanners supported both LPT and USB connections at the time (often with both ports available on the back).
Professional scanners used SCSI more often, I think.
So when I think of late ECP/EPP standards, 8-bit SCSI with a D-sub connector comes to mind.
About USB 1.x: MacOS 8.6/9 supported it rather well already, but I guess that OS support on PCs was hit and miss in the late 90s.
Windows for Workgroups, NT 3.5x/4 and Windows 95 RTM didn't support it, and neither did OS/2 Warp 4, QNX or Linux.
Well, Linux sort of did, but I vaguely remember its USB support wasn't mature yet. Linux of the 90s wanted "good hardware" scavenged from the bulky-waste pile (Sperrmüll).
Outdated 2D PCI graphics cards, industry-standard PC hardware on the motherboard, SCSI devices if possible.
Late-90s stuff such as the newest PATA drives/controllers, non-standard ACPI tables and power management in general wasn't well supported yet.
Speaking under correction here, though. It's been 25+ years now.
"Time, it seems, doesn't flow. For some it's fast, for some it's slow.
In what to one race is no time at all, another race can rise and fall..." - The Minstrel
The AI system suggested an even better branchless variant using ADC CX, CX instead of ADC CX, 0, which saves one byte
I should have thought about this. CX is guaranteed to be zero after rep, so it's an obvious opportunity to use it instead of a literal zero. I've changed it in EXMS86. It has no impact whatsoever on performance, of course, since the rest of the code has a vastly greater overhead. But still nice to be optimal in this one tiny place. :)
The AI also suggested a branchful version, using JNC past_movsb; MOVSB; past_movsb:.
This one I had already pondered. It is indeed a byte shorter, but it branches. My understanding is that branching is better avoided because a taken branch forces the 8088 to flush its prefetch queue. I guess that a one-byte saving does not make up for the prefetch-queue loss; the saving would probably need to be a dozen bytes at least for the branch to be interesting. (?)