Okay, I have tested it. Doing word-based transfers is only 1% faster than byte-based transfers when the EMS card is 8-bit (tested on 86Box with an emulated 286 with a LoTech EMS card).
I do not have a 16-bit EMS card, nor do I know how to emulate one, so I cannot test the more optimistic scenario. If anyone has a 16-bit EMS card and is willing to test, I can provide the executables to do so.
EXMS86 0.9.5 is out with its batch of improvements:
- in ! mode only the lower 32K of the page frame is dedicated to UMB, so it is 100% safe
- new mode "!!" enables 64K of UMB and also fiddles with interrupts during transfers (experimental)
- even-length transfers use 16-bit ops, so it should be faster on non-8088 CPUs with 16-bit EMS cards (can anyone confirm on actual hardware?) - a rough sketch of the idea is below this list
- quiet mode ("EXMS86.COM Q")
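For the curious, the even-length optimization is roughly the following (a simplified sketch, not the literal EXMS86 code - the labels are just for illustration):

    ; copy CX bytes from DS:SI to ES:DI, using word moves when the length is even
    test cx, 1          ; is the length odd?
    jnz  odd_length
    shr  cx, 1          ; even: copy CX/2 words instead of CX bytes
    rep  movsw
    jmp  copy_done
odd_length:
    rep  movsb          ; odd: fall back to a plain byte copy
copy_done: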
Given that you say you are using two pages, I assume you get two chunks per 16KB to deal with alignment.
It depends on what the requested XMS offsets are, but yes - in the worst case I transfer three chunks at the start (to deal with page alignment for src and dst), and then only one chunk for every further page.
I wonder how you do a single chunk per page. Assume I want to transfer 48KB from a page-aligned address to an address that is offset by 8K. This means:
low 8K from source page 1 to high 8K of destination page 1 (low 8K of destination page 1 is not in transfer range)
high 8K from source page 1 to low 8K of destination page 2
low 8K from source page 2 to high 8K of destination page 2
high 8K from source page 2 to low 8K of destination page 3
low 8K from source page 3 to high 8K of destination page 3
high 8K from source page 3 to low 8K of destination page 4
Every bullet point in this list requires a different page mapping. It's easy to see that this scheme requires a re-map every 8K, no matter how large the transfer is, which is 2 chunks per page.
Speaking about performance: EXMS86 uses rep movsb to copy bytes. Using rep movsw could be more performant, but again - more complexity to deal with odd addresses and odd lengths.
There is a simple assembler pattern to deal with odd lengths. Admittedly, this pattern does not care about alignment. On processors with a WT L1 cache an aligned destination is more important, but I don't see cached architectures as targets for EXMS86.
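The pattern:

shr cx, 1
rep movsw
adc cx, 0
rep movsb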
and the performance would not be better on an 8088 anyway
I think this is a misconception. Yes, the 8088 cannot do word transfers on the bus, but the 8086/8088 has a quite slow microcoded loop for REP MOVS. A single iteration of REP MOVS is 17 clocks if you are transferring bytes. This is 8 clocks for data transfer and 9 clocks microcode overhead. If you are transferring words (any words on the 8088, unaligned words on the 8086) you get 4 additional clocks for each second bus cycle (one extra read and one extra write, so 8 extra clocks per iteration). This means a single iteration of REP MOVSW on the 8088 takes 9 clocks overhead + 2*8 clocks data transfer = 25 clocks. Two iterations of REP MOVSB on the other hand take 2*9 clocks overhead + 2*8 clocks data transfer = 34 clocks. This means REP MOVSW takes only 74% of the time used by REP MOVSB on the 8088, and alignment doesn't matter at all.
I have a Turbo XT (although not at hand at the moment) that inserts a wait state at 10MHz for ISA memory cycles (but not for onboard memory cycles), so the relative advantage of using REP MOVSW over REP MOVSB is lower. On that system (assuming the bus is not occupied by memory refresh, which it is a lot), two iterations of REP MOVSB would take 38 clocks and one iteration of REP MOVSW would take 29 clocks, so REP MOVSW takes about 76% of the time of REP MOVSB.
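Putting the per-iteration numbers side by side (one word copied either way):

                          2x REP MOVSB    1x REP MOVSW    ratio
8088, 0 wait states       34 clocks       25 clocks       ~74%
Turbo XT, 1 WS on ISA     38 clocks       29 clocks       ~76%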
Nevertheless, even if you don't care about alignment, using the pattern I quoted above instead of plain REP MOVSB is consistently faster. If you do care about alignment and the memory (or at least one part of the transfer) is 16-bit capable on a system with a 16-bit wide bus, REP MOVSW will get even faster. You definitely lose some cycles if CX is very low, but I guess the small amount of extra cycles lost is negligible compared to the overhead of calling the XMS driver and mapping pages.
If you don't have a physical 8088 machine at hand, and need to resort to emulators for performance tests, I highly recommend using MartyPC for accurate benchmarks of 8088 code, at least if you have a host machine strong enough that MartyPC runs at (or near) real time.
mkarcher wrote on 2025-08-07, 19:50:
I wonder how you do a single chunk per page. Assume I want to transfer 48KB from a page-aligned address to an address that is offset by 8K. This means:
low 8K from source page 1 to high 8K of destination page 1 (low 8K of destination page 1 is not in transfer range)
high 8K from source page 1 to low 8K of destination page 2
low 8K from source page 2 to high 8K of destination page 2
high 8K from source page 2 to low 8K of destination page 3
low 8K from source page 3 to high 8K of destination page 3
high 8K from source page 3 to low 8K of destination page 4
That is an excellent point. When I said I was doing 1 chunk per page I was thinking about XMS-to-RAM (or RAM-to-XMS). XMS-to-XMS transfers are trickier. I quickly looked at my code and I see that page-unaligned XMS transfers will likely break after the first page barrier. I am only half surprised, since XMS-to-XMS transfers are a test that I have yet to add to my testing suite (along with RAM-to-RAM). So far I have been extensively testing only XMS-to-RAM and RAM-to-XMS. It's good that there are still quite a few version numbers available before 1.0 :-)
There is a simple assembler pattern to deal with odd lengths. (...)
shr cx, 1
rep movsw
adc cx, 0
rep movsb
This is brilliant! Did you come up with this yourself, or is it some kind of asm common knowledge? I am no assembly whiz, so this is gold to me. In v0.9.5 I do movsw already, but only after testing cx for parity. Your solution is vastly superior as it uses movsw even for odd lengths, relying on movsb only for the final byte. Plus it has no branching.
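To make sure I understand why it works, here is how I read it, instruction by instruction (comments are my own reading):

shr cx, 1    ; halve the count; the odd bit (if any) is saved in CF
rep movsw    ; copy CX words; MOVS and REP leave the flags alone, so CF survives
adc cx, 0    ; CX is 0 after REP, so this sets CX to the saved odd bit (0 or 1)
rep movsb    ; copies the final odd byte, or nothing at all for even lengths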
Yes, the 8088 cannot do word transfers on the bus, but the 8086/8088 has a quite slow microcoded loop for REP MOVS. A single iteration of REP MOVS is 17 clocks if you are transferring bytes. This is 8 clocks for data transfer and 9 clocks microcode overhead. If you are transferring words (any words on the 8088, unaligned words on the 8086) you get 4 additional clocks for each second bus cycle (one extra read and one extra write, so 8 extra clocks per iteration). This means a single iteration of REP MOVSW on the 8088 takes 9 clocks overhead + 2*8 clocks data transfer = 25 clocks. Two iterations of REP MOVSB on the other hand take 2*9 clocks overhead + 2*8 clocks data transfer = 34 clocks. This means REP MOVSW takes only 74% of the time used by REP MOVSB on the 8088
That's very interesting. The difference I measured on 86Box is only 1% in favor of movsw, though. Maybe because the emulation is not perfect, or maybe (more likely, I guess) because the movsb/movsw difference is not large enough to matter compared to the overall EXMS86 overhead.
If you don't have a physical 8088 machine at hand, and need to resort to emulators for performance tests, I highly recommend using MartyPC for accurate benchmarks of 8088 code, at least if you have a host machine strong enough that MartyPC runs at (or near) real time.
I haven't been disappointed by 86Box so far, but MartyPC is definitely on my list of things to check out. It has been suggested to me before, because of its very interesting debugger.
That is an excellent point. When I said I was doing 1 chunk per page I was thinking about XMS-to-RAM (or RAM-to-XMS). XMS-to-XMS transfers are trickier. I quickly looked at my code and I see that page-unaligned XMS transfers will likely break after the first page barrier. I am only half surprised, since XMS-to-XMS transfers are a test that I have yet to add to my testing suite (along with RAM-to-RAM). So far I have been extensively testing only XMS-to-RAM and RAM-to-XMS. It's good that there are still quite a few version numbers available before 1.0 😀
This also shows that XMS-to-XMS is not a common use case of the XMS API, because otherwise you would have noticed the issue already. I supposed you were considering XMS-to-XMS already, because you talked about different pages in the frame to be used for "source" and "destination", and if you just do XMS-to-conventional and conventional-to-XMS, a single page (either as source or destination) would suffice.
There is a simple assembler pattern to deal with odd lengths. (...)
shr cx, 1
rep movsw
adc cx, 0
rep movsb
This is brilliant! Did you come up with this yourself, or is it some kind of asm common knowledge?
I do a lot of reverse engineering on retro stuff. I did not design that fragment myself, but I think I've encountered it repeatedly. I can't attribute it to a specific source, and I suppose that pattern was invented independently multiple times in the 1980s.
I supposed you were considering XMS-to-XMS already, because you talked about different pages in the frame to be used for "source" and "destination", and if you just do XMS-to-conventional and conventional-to-XMS, a single page (either as source or destination) would suffice.
Absolutely, yes. XMS-to-XMS (and RAM-to-RAM) are two scenarios that I have kept in mind since day 0. I tried to organize EXMS86 so it allows for these operations, but never actually tested them, as it is a very niche feature that seems rarely (if ever) used. XMS-to-XMS and RAM-to-RAM will most probably be the focus of the next version, I just have to extend my test suite first.
I did not design that fragment myself, but I think I've encountered it repeatedly. I can't attribute it to a specific source, and I suppose that pattern was invented independently multiple times in the 1980s.
Very cool. I've added it already to the EXMS86 0.9.6 branch - works as expected, passes all my tests. Thanks for the tip!
I did not design that fragment myself, but I think I've encountered it repeatedly. I can't attribute it to a specific source, and I suppose that pattern was invented independently multiple times in the 1980s.
Very cool. I've added it already to the EXMS86 0.9.6 branch - works as expected, passes all my tests. Thanks for the tip!
I asked an AI system to find out whether this pattern is known. The AI system suggested an even better branchless variant using ADC CX, CX instead of ADC CX, 0, which saves one byte (and one bus cycle / 4 clocks of fetching instruction bytes). It's likely that I misremembered the pattern, and ADC CX, CX is more common.
The AI also suggested a branchful version, using JNC past_movsb; MOVSB; past_movsb:. Starting to discuss timing with the AI was only moderately successful. The AI claimed that REP takes two clocks plus the repeated executions. So, according to the AI, REP MOVSB should take 2 clocks if CX is zero and 20 clocks if CX is one (a standalone MOVSB takes 18 clocks). This is wrong. My third-party reference indicates REP MOVSB as 9 + 17*CX, which may also not be exact. The original Intel Users' Manual quotes 2 clocks for the REP prefix and 9 + 17*CX for repeated MOVS, but has no clear indication whether the two clocks to process the REP prefix are included in the 9 + 17*CX specification, or need to be added for a total time of 11 + 17*CX. The branchful version has the advantage of being another byte shorter.
Generally, writing short code is important on the 8088, because there is barely sufficient bandwidth on the bus to fetch the instructions fast enough. Consider for example my suboptimal suggestion ADC CX, 0, which is quoted as requiring 4 clocks to execute - but at an instruction size of 3 bytes, it takes 12 clocks (+bus delays) to fetch that instruction. On the other hand, this makes the slow execution of REP MOVSW not that bad. As there are 9 clocks per iteration in which the bus is idle, the time can at least be used to fill the prefetch queue.
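Spelled out, the three possible tails after SHR CX, 1 / REP MOVSW look like this (byte counts in the comments; on the 8088 every instruction byte costs roughly 4 clocks of bus time to fetch):

; branchless, as originally quoted: 5 bytes
adc cx, 0         ; 3 bytes
rep movsb         ; 2 bytes

; branchless, ADC CX, CX variant: 4 bytes (one byte saved)
adc cx, cx        ; 2 bytes
rep movsb         ; 2 bytes

; branchful variant: 3 bytes (another byte shorter)
jnc past_movsb    ; 2 bytes
movsb             ; 1 byte
past_movsb: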
In the end, (non-busmaster) ISA DMA was only used in cases where high throughput wasn't required, but background transfers were important. Sound cards fit this pattern perfectly. The programming model of ISA DMA is awful, and if EMM386 virtualizes RAM, it also needs to virtualize DMA, which is a very cumbersome operation. So nobody in the software industry liked ISA DMA. This explains why system designers were happy to get rid of the obsolete ISA DMA scheme when they designed PCI systems. They did not expect the use case of "ISA compatible sound cards" to be that important. Later on, some standards (e.g. PC/PCI) were designed to give PCI sound cards access to the ISA DMA/IRQ system, and that's why PCI cards generally can't offer a nice Plug&Play experience including Sound Blaster compatibility.
That context is also helpful for those of us who wonder why ECP parallel ports (added at a fairly late date in the ISA bus timeline) bothered to include ISA DMA support.
The only other thing I have to add is that this OS2Museum post https://www.os2museum.com/wp/386max-and-eisa-dma/ talks about early Intel PCI chipsets having an EISA DMA controller even on systems with only ISA slots, and 386MAX tried to take advantage of that. Among other things, it meant no 16MB addressing limitation, but this support withered away by the time of Triton and other ubiquitous Intel chipsets.
That context is also helpful for those of us who wonder why ECP parallel ports (added at a fairly late date in the ISA bus timeline) bothered to include ISA DMA support.
I think ECP also reflects a consideration I only touched on slightly in that post, which is multitasking and background operation. I did write that DMA was used where "background transfers were important", and while I was thinking about sound, which already applies to games in a single-tasking environment, the idea of ECP was to integrate the slow parallel port transfers well into a modern desktop operating system like Windows 95, and to make that port more generic. Note that ECP also includes device addressing for daisy-chained devices and no longer focuses on "printers only". I guess the latest incarnation of the parallel-port ZIP drive did support ECP-type transfers. I would consider ECP a mostly failed attempt to introduce a general-purpose parallel bus (call it "UPB" - "universal parallel bus" if you want). If you ask why ECP did not get major support in hardware, it's likely because parallel cables were bulky and expensive. By the time ECP could have taken off seriously, we already had USB 1.1, and USB 1.1 solved all the issues ECP was intended to solve. Note how USB host controllers also include DMA support (as this is PCI, it's bus-mastering, because PCI no longer has a central DMA controller).
The only other thing I have to add is that this OS2Museum post https://www.os2museum.com/wp/386max-and-eisa-dma/ talks about early Intel PCI chipsets having an EISA DMA controller even on systems with only ISA slots, and 386MAX tried to take advantage of that. Among other things, it meant no 16MB addressing limitation, but this support withered away by the time of Triton and other ubiquitous Intel chipsets.
I actually do have an ISA/PCI Saturn II mainboard with an EISA-like DMA controller, the Asus PCI/I-486SP3G. Obviously, it does not support 32-bit transfers, as there are no EISA slots. It would have been a really interesting alternate history if that idea had taken off and DMA controllers like that had become standard on PC mainboards in 1994. But as Intel was the only vendor of chipsets like that, while the low-end competition continued to use their 82c206-type "AT on a chip" ISA multi-function controllers, there never was a critical mass of EISA-DMA-capable consumer PCs to make development of software for that feature worthwhile. This may have been different in server environments. Note how the Adaptec 1522 OS/2 driver does support DMA transfers on EISA mainboards - but the drivers for all other systems do not.
^I remember that consumer-class flat-bed scanners supported LPT and USB connections at the time (often with both ports available on the back).
Professional scanners used SCSI more often, I think.
So when I think of the late ECP/EPP standards, 8-bit SCSI with a D-sub connector comes to mind.
About USB 1.x: MacOS 8.6/9 already supported it rather well, but I guess OS support on PCs was hit and miss in the late 90s.
Windows for Workgroups, NT 3.5x/4 and Windows 95 RTM didn't support it, and neither did OS/2 Warp 4, QNX or Linux.
Well, Linux sort of did, but I vaguely remember its USB support wasn't mature yet. Linux of the 90s wanted "good hardware" from the bulky-waste pile (Sperrmüll).
Outdated 2D PCI graphics cards, industry-standard PC hardware on the motherboard, SCSI devices if possible.
Late-90s stuff such as the newest PATA drives/controllers, non-standard ACPI tables and power management in general wasn't well supported yet.
Speaking under correction here, though. It's been 25+ years now.
"Time, it seems, doesn't flow. For some it's fast, for some it's slow.
In what to one race is no time at all, another race can rise and fall..." - The Minstrel
The AI system suggested an even better branchless variant using ADC CX, CX instead of ADC CX, 0, which saves one byte
I should have thought about this. CX is guaranteed to be zero after rep, so it's an obvious opportunity to use it instead of a literal zero. I've changed it in EXMS86. It has no impact whatsoever on performance, of course, since the rest of the code has a vastly greater overhead. But it is still nice to be optimal in this one tiny place. :)
The AI also suggested a branchful version, using JNC past_movsb; MOVSB; past_movsb:.
This one I had already pondered. It is indeed a byte shorter, but it is branching. My understanding is that branching is better avoided because it forces the 8088 to flush its prefetch queue. I guess that a 1-byte saving does not make up for the prefetch queue loss; the saving would probably need to be at least a dozen bytes for the branch to be interesting. (?)
The AI also suggested a branchful version, using JNC past_movsb; MOVSB; past_movsb:.
This one I had already pondered. It is indeed a byte shorter, but it is branching. My understanding is that branching is better avoided because it forces the 8088 to flush its prefetch queue. I guess that a 1-byte saving does not make up for the prefetch queue loss; the saving would probably need to be at least a dozen bytes for the branch to be interesting. (?)
It's not that bad. The prefetch queue of the 8088 is just 4 bytes. If you skip more than 4 bytes forward using a conditional jump, you already take pressure off the bus. A taken conditional jump takes 16 clocks (and I firmly assume that specification does not include the decoder stall due to the empty prefetch queue), so I would assume an effective time of 20 to 22 clocks for "jump taken". ADC CX,CX / REP MOVSB is 3 + 9 + 2? = 12 or 14 clocks, so yes, the branchless variant is definitely superior in the case where the branch would be taken (the likely common case: even size).
In the odd case, the branching solution takes 4 clocks for JNC (not taken), and 18 clocks for a single MOVSB, which is 22 clocks in total. The branchless solution requires 3 clocks for the ADC instruction and 2? + 9 + 17 clocks for REP MOVSB, so 29 or 31 clocks in total. A conditional jump that is not taken is obviously not very expensive, but the initial setup overhead of the repeated string instruction hits harder. The microcode overhead for the REP prefix might come in handy, though, as it ensures the prefetch queue is completely filled after executing the REP MOVSB in the odd case (2*9 cycles of overhead while you only require 4*4 cycles to fill the queue). I'm unsure whether the 8088 manages to place the prefetches optimally (likely it does not), which might include extra stalls in the execution if the BIU (bus interface unit) is still busy prefetching when the MOVSB microcode decides to require a data transfer.
So, as even size is likely way more common (you quoted the XMS specification that asks for even size), optimizing for the even case is smart, which means the branchless version is better. If you were optimizing for the odd-size case, the branching version would be preferable.
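Putting those estimates side by side (the ranges reflect the open question about the 2 clocks for the REP prefix; the jump-taken figure includes my guess for the prefetch refill):

              even length      odd length
branchless    12-14 clocks     29-31 clocks
branchful     20-22 clocks     22 clocks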
Thank you for the fascinating explanations. I did not know that it mattered that much to an 8088 whether a branch is taken or not - that's precious knowledge that is not easy to find.
Thank you for the fascinating explanations. I did not know that it mattered that much to an 8088 whether a branch is taken or not - that's precious knowledge that is not easy to find.
You might want to take a look at the iAPX 86 User's Manual. A scan of a more recent version is available at https://ardent-tool.com/CPU/docs/Intel/808x/m … /210912-001.pdf . Starting at PDF page 40 (document page 1-24), there is the official Intel timing table. That table only considers the execution time of the instruction, and assumes the instructions are already prefetched and that the bus is available immediately when the execution engine tries to execute a transfer. As that manual also includes the 80188/80186, the timing table has the 808x timing followed by the 8018x timing in parentheses. They quote an average overhead of 5 to 10 percent over the timings given in the table, which refers to processors with a 16-bit bus interface. Expect that number to be higher on the 8088, as the 8086 has twice the bandwidth when fetching instructions from 0WS 16-bit memory (which is assumed in the 5-10% estimate).
I've seen an Intel datasheet specifically for the 80186/80188, which warns you that "the prefetch queue may not be sufficiently filled most of the time on the 80188, and execution time may be significantly longer than what you get by adding up the clock cycles given in the table" due to the low bandwidth of the 8-bit bus interface unit (paraphrased from memory, emphasis added by me). I used to think that this warning was more severe in the 80188 datasheet than it was in the 8088 datasheet because the 80188 got a more modern execution unit, but looking at that table, the 8018x execution unit does not seem generally faster. The 8018x execution unit certainly shines on microcode-heavy instructions like MUL or REP MOVSB, but it is only comparably fast on simple arithmetic instructions. Basically, the 8-bit bus interface unit of the 8088 is a bottleneck in most typical PC applications, and choosing the 8088 over the 8086 is a deliberate choice of a less complex mainboard (fewer data lines, no need to translate 16-bit accesses to 8-bit accesses) over performance. IBM recognized this issue as well, and the PS/2 Model 30, which can be considered an "XT 2.0", did contain an 8086 and 16-bit main memory instead.
A final remark on the 8088/8086 distinction: people often quote that the 8088 prefetch queue is 4 bytes and the 8086 queue is 6 bytes. This is "mostly correct", but to be more precise, the 8086 queue is to be considered as "3 words" instead. While it may seem strange at first that the 8088 with its more limited bus interface has a shorter queue than the 8086 (usually, more cache helps against bus bottlenecks), this is documented to be a deliberate design choice based on measurements. You do not only have to consider stalls due to the queue being empty (which get rarer if the queue is longer), but also stalls because the prefetcher and the execution unit contend for the bus. The execution unit only gets guaranteed immediate bus access when the queue is full, which reduces execution performance while the queue is not full. This is especially wasteful if the processor over-prefetches past a conditional jump that is taken, because those prefetched bytes/words are thrown away without ever being looked at. Intel determined that on the 8088 the performance loss due to over-prefetching exceeds the performance gains due to fewer execution stalls on an empty queue if the queue gets longer than 4 bytes. And if you look at the queue length in terms of bus cycles required to fill it, the 4-byte queue actually is longer than the 3-word queue.
Did a test today with a 286/12MHz with a PicoMEM card (for EMS), using version 0.9.5. It loads okay with all options.
But after loading it, the mem command only lists an empty line (so no memory information). CheckIt Pro crashes when starting it (on the XMS part). Norton System Information 8.0 also hangs when trying to show XMS information. Other tools like Astra and NSSI report only 384KB XMS and 4MB EMS. The 384KB is probably the real memory of this PC (which has 1MB of memory and 4MB of EMS). It's the same amount as when I'm using himem.sys. Wolfenstein 3D has both lines maxed out for XMS and EMS. Wolfenstein 3D starts okay but hangs after a minute or so.
But after loading it, the mem command only lists an empty line (so no memory information). CheckIt Pro crashes when starting it (on the XMS part). Norton System Information 8.0 also hangs when trying to show XMS information.
Thanks for the feedback! 3 questions:
1. What DOS are you using?
2. What are the exact options you load EXMS86 with?
3. What is your EMS card & driver?
Also, I'm attaching the v0.9.6 beta here. I have fixed a few issues since 0.9.5; maybe some are related to what you observe, although it is hard to say. Any chance you could check whether the symptoms are the same with this version?
Other tools like Astra and NSSI report only 384KB XMS and 4MB EMS. The 384KB is probably the real memory of this PC (which has 1MB of memory and 4MB of EMS).
That sounds like you might have HIMEM loaded, an XMS driver for your "real XMS". I recommend trying without HIMEM first. It's quite possible that EXMS86 is incompatible with another XMS driver loaded at the same time. Not loading HIMEM also means you cannot put DOS into the HMA ("DOS=HIGH"), so less conventional memory will be free.
This beta works a lot better. Mem is working. CheckIt Pro and Norton System Information 8.0 don't hang or crash anymore. They both show 4MB EMS and 4MB XMS. NSSI shows 320KB XMS 2.0 and 4MB EMS. Astra shows no extra memory above 1MB.