VOGONS


First post, by migry

Rank Newbie

So the story is, I have designed and built an 8088 PC-compatible PCB using the standard I/O chipset (8259 PIC, 8253 timer, 8255 PPI), and I am currently debugging the system. I use SRAM, so there is no need for complicated refresh using DMA. For simplicity I omitted DMA, hoping that I could read the floppy using polled I/O. I then discovered that the Tandy 1000 does just that. I have used a uPD765 as the floppy controller, which after some effort now appears to be working (double density only).

I also put the 8088 into minimum mode, as this simplifies the design and removes the need for the 8288 bus controller, which is difficult to find (and expensive if found). My understanding is that the 8087 cannot be used in minimum mode. The original IBM PC design puts the 8088 into maximum mode; there the 8087 connects to RQ/GT1 of the 8088, and RQ/GT0 is tied to VCC. Since I have the 8088 in minimum mode, the two RQ/GT pins become HOLD/HLDA, which I use to float the CPU in order to load the RAM.

So it appears that the IBM PC does not use the normal CPU float mechanism (HOLD/HLDA or RQ/GT) for DMA. So exactly how is DMA achieved?

The logic associated with WAIT/RDY and DMA is quite complicated, which is one reason why I decided not to implement DMA. My best guess is that when there is a DMA request, the bus cycle is paused using RDY and tristate buffers remove the CPU from the bus. When the DMA cycle completes, I assume that the paused read/write completes? I do see a signal called "DMA_WAIT" going to the RDY1 pin of the 8284 clock generator, which is why I suspect this mechanism is used to implement a pseudo-DMA. Are there any documents with waveforms out there? Seems pretty nasty to me 😉 if so.

I am particularly interested in how floppy DMA works. The best I can figure out is that each byte from the uPD765 FDC uses this mechanism. I note that the DRQ output from the FDC passes through a 4 flop shift register to implement some kind of delay, but why? Looks like the uPD765 (based on an Intel design?) is the only FDC which is suited to the PC DMA mechanism.

I also understand that the floppy DMA uses "fly-by", where the FDC read and memory write are asserted at the same time. So I can see how this works for transfers between memory space and I/O space, but can the DMA also do memory-to-memory transfers, which must use separate read and write cycles?

On the next iteration of my PCB for the 8088 CPU, I was wondering if I could add the DMA controller and page register IC, but simplify the design by using HOLD/HLDA rather than the read/write pause mechanism (if this is actually the way DMA is done). I'm not interested in adding support for the 8087, although ironically I have three of these devices, rescued from old equipment in the '80s. I hope to sell them, although I really need to test them first.

Reply 1 of 23, by mkarcher

Rank l33t

I'm sorry that I cannot answer your main question about the ISA DMA implementation of the IBM PC. I also tried to understand the schematics and came to similar conclusions: DMA cycles interrupt CPU bus cycles and delay asserting ready to the 8284, which generates the READY signal for the CPU. At least that's how I understand it.

ISA DMA always works in a "fly-by" mode. You get proper memory read/write signalling on the ISA bus during DMA cycles. At the same time, DACK is asserted to tell the DMA'ing device that it is being serviced, but that's not all. A device-to-memory transfer not only puts the memory address on the ISA bus and asserts /MEMW and DACK, it also asserts /IOR! This is a problem if a card just decodes the address on the ISA bus together with /IOR to output some data to the bus. During a DMA cycle, /IOR is asserted while a memory address is on the bus. You must not interpret that address as an I/O address. The solution to this problem is the AEN signal: during DMA cycles, AEN is asserted to indicate that you must not respond to I/O addresses.
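To make the AEN rule concrete, here is a minimal Python sketch of the decode a card has to do. The function name, the base address 0x3F0 and the 10-bit address decode are illustrative assumptions, not taken from any particular card's schematic.

```python
def io_port_selected(address, ior_n, iow_n, aen, base=0x3F0, size=8):
    """Return True if a hypothetical card at base..base+size-1 should respond.

    ior_n / iow_n are active-low strobes (0 = asserted); aen is active-high.
    """
    # AEN high means the DMA controller owns the bus: the address lines carry
    # a memory address, so /IOR and /IOW must be ignored for port decoding.
    if aen:
        return False
    strobe_active = (ior_n == 0) or (iow_n == 0)
    # Assumption: the card decodes only A0-A9, as many PC cards do.
    in_range = base <= (address & 0x3FF) < base + size
    return strobe_active and in_range
```

With AEN low, an /IOR cycle at 0x3F5 selects the card. The same address and strobe with AEN high (a DMA cycle writing to memory location 0x003F5) does not.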

You are right that memory-to-memory DMA is implemented on the Intel 8237. But obviously, it can't put both the source and the destination address on the bus at once. The idea of the 8237 memory-to-memory DMA is that you program DMA channel 0 as the "read" channel and DMA channel 1 as the "write" channel. The data is buffered in an 8-bit latch in the 8237 DMA controller. The DMA controller first issues the DMA0 read, latches the data, and then issues the DMA1 write and puts the latched data back on the data bus. The AT introduced 16-bit channels by creatively wiring the second DMA controller to the bus, but that doesn't enable 16-bit memory-to-memory transfers, because that chip only has an 8-bit latch, and it is only connected to D0-D7. It can't latch D8-D15, which you would need for 16-bit memory-to-memory transfers.
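The two-cycle scheme described above can be sketched in Python as a behavioural model. This is only an illustration of the read, latch, write sequence; `mem_to_mem_copy` is a hypothetical name, and page registers, timing and terminal count are deliberately left out.

```python
def mem_to_mem_copy(memory, src, dst, count):
    """Copy count bytes using a read cycle, an 8-bit latch, and a write cycle."""
    ch0_addr = src   # channel 0 holds the source address
    ch1_addr = dst   # channel 1 holds the destination address
    for _ in range(count):
        # Cycle 1: channel 0 read; the byte is captured in the 8-bit latch.
        temp = memory[ch0_addr & 0xFFFF] & 0xFF
        # Cycle 2: channel 1 write; the latch drives the data bus.
        memory[ch1_addr & 0xFFFF] = temp
        # The 8237 address registers are 16 bits wide, hence the masking.
        ch0_addr += 1
        ch1_addr += 1
```

Because the latch is 8 bits wide, every byte copied costs two bus cycles, which is why this is never faster than two fly-by transfers.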

You should be fine using HOLD/HLDA to connect an 8237 to an 8088 in minimum mode.

Reply 2 of 23, by reenigne

Rank Oldbie

Not waveforms exactly, but similar... attached are some logs from my ISA bus sniffer showing the exact timings of some DMA cycles and how they interact with CPU bus accesses. There's some more information about interpreting these logs at https://www.vcfed.org/forum/forum/genres/pcs- … -sniffer-device . The DMAs here come from the DRAM refresh circuitry (I can't really capture FDC DMAs with this setup, as I need to be able to reproduce the same timings repeatedly). But as far as the DMAC and CPU are concerned, it shouldn't make a difference what device initiates the DMA. Except possibly for inserting wait states into the DMA access: I'm not sure if the FDC DMAs do this or not, but the DRAM refresh DMAs don't (unless maybe you program the page registers for CGA RAM or something). Please let me know if this is useful, or if you need more information to be able to make use of it. It's a while since I've tinkered with this stuff, but some of it might come back to me if there's a specific question you ask me to think about.

Attachments

  • waits2.txt (20.2 KiB, public domain)

Reply 3 of 23, by rasz_pl

Rank l33t
migry wrote on 2021-12-01, 13:48:

but can the DMA also do memory-to-memory transfers, which must use separate read and write cycles?

http://www.os2museum.com/wp/the-danger-of-datasheets/
>IBM PC ... 8237 supports memory-to-memory transfers using DMA channels 0 and 1 (and only those).
>inability to use separate page registers for the source and destination since memory-to-memory transfers don’t use DACK

http://www.os2museum.com/wp/more-fun-with-isa-dma/
> block transfers are the only type that can be used for memory-to-memory DMA

Reply 4 of 23, by kdr

Rank Member

As far as I'm aware you can't use the 8237's memory-to-memory feature on the IBM PC. When DACK0 is asserted, the IBM PC/XT motherboards are wired up so that this triggers a /RAS on *all* of the memory banks simultaneously and also skips the /CAS entirely. None of the memory chips output any data during this refresh cycle, so even if you reconfigure the 8237 for memory copy mode, there's nothing on the data bus for DMA channel 0 to read.

Here's a good explanation (with diagrams) of how the memory refresh circuitry works on the original IBM PC: http://www.minuszerodegrees.net/5150/ram_refr … M%20refresh.htm

Reply 5 of 23, by mkarcher

Rank l33t
kdr wrote on 2021-12-02, 20:09:

As far as I'm aware you can't use the 8237's memory-to-memory feature on the IBM PC. When DACK0 is asserted, the IBM PC/XT motherboards are wired up so that this triggers a /RAS on *all* of the memory banks simultaneously and also skips the /CAS entirely. None of the memory chips output any data during this refresh cycle, so even if you reconfigure the 8237 for memory copy mode, there's nothing on the data bus for DMA channel 0 to read.

Yeah, that's definitely right. You can't use DMA0 on the IBM PC/XT, and without "normal" DMA0 operation, you don't get memory-to-memory operation. But IIRC you can use 8-bit memory-to-memory DMA on DMA0/DMA1 on the IBM AT. It's pointless, though, as it is likely the least efficient way to do memory-to-memory copies available on the AT.

Reply 6 of 23, by migry

Rank Newbie

@mkarcher - thank you. I have always seen AEN used to decode port addresses, but never really understood what it did.

@reenigne - thank you. The waveforms are interesting, although I'm not clear what each column is tracing.

@rasz_pl - thank you. Very interesting links. Seems to confirm that memory to memory transfer via DMA channel 0 and 1 is not really possible, which the last two posters confirm.

I considered creating a 8088 system design in verilog, but I can't find a cycle accurate model of the CPU. I was going to add the DMA circuitry to help understand how it worked.
I have now bought a D8288 bus controller and another 8088. I will put together a maximum-mode system and use a logic analyser to trace the CPU waveforms...hopefully.

Reply 7 of 23, by kdr

Rank Member
migry wrote on 2021-12-03, 17:32:

I considered creating a 8088 system design in verilog, but I can't find a cycle accurate model of the CPU. I was going to add the DMA circuitry to help understand how it worked.
I have now bought a D8288 bus controller and another 8088. I will put together a maximum-mode system and use a logic analyser to trace the CPU waveforms...hopefully.

You might want to check out reenigne's 8088 bus traces:

https://www.reenigne.org/blog/isa-bus-sniffer-update/

The 8088 can use otherwise-idle bus cycles to prefetch up to 4 bytes of instructions, and DMA refresh cycles can occur practically anytime, so instruction timings aren't very deterministic. If you study the traces (reenigne captured traces of the 8088 executing its entire instruction set) you'll quickly get a feel for how the CPU performs its bus cycles and how the DMA can preempt it.

Reply 8 of 23, by migry

Rank Newbie

@kdr thank you. I followed the link and found the following (which I was previously unaware of)...

"From left to right the columns are:
CPU address/data,
CPU flags,
bus address,
bus data,
DMA requests and acks,
interrupt requests,
bus flags,
bus states for CPU and DMA accesses,
transfers,
prefetch queue status,
instruction data and
decoded instruction".

It should now be easier to interpret the data in the waits2.txt file pointed to by @reenigne.

Nevertheless I am old school and I look forward to wiring up a new 8088 board in order to view the signals on oscilloscope and logic analyser. Of course I will have to add the DMA RDY logic...

Reply 9 of 23, by migry

Rank Newbie

Just FYI, on my own minimum 8088 PCB I use the RDY inputs of the 8284 clock generator to implement single stepping. When RDY is held low, the CPU stalls in what appears to be the T3 state (I find the waveforms in the 8088 datasheet quite unclear), and the RD and WR strobes are active (low). Raising RDY allows the access to go to T4 and complete. While in the wait state I am free to look at the (latched) address bus, data bus, and strobes to see what access is happening.

The DMA RDY logic in the IBM PC 5150 decodes S0, S1, S2, LOCK and DMA HRQ from the 8237 all being high to assert a signal called HOLDA, generated by a 74LS74 flop in an unusual configuration (Qbar connects to notPR).

In maximum mode S0,S1, and S2 all high is the "Passive" state. I'm unclear exactly what this is and when it might occur.

I couldn't figure out whether the memory or I/O cycle that is paused by "DMA WAIT" gets split into two parts with the DMA access in the middle. That is: the memory or I/O access starts and is paused; the MEMRD and MEMWR strobes are taken over by the DMA chip, pulsed according to the DMA direction and de-asserted; strobe control is then handed back to the CPU (bus controller); "DMA WAIT" is de-asserted and the bus cycle completes. I'm not sure, since I have no idea what the "Passive" state is. If it refers to the idle cycles when the bus is not doing an access, then I would have assumed(!) that RDY wouldn't have any effect. OK, the datasheet says "Passive" state = "No Bus Cycle".

"STATUS: is active during clock high of T4, T1, and T2, and is returned to the passive state (1,1,1) during T3 or during Tw when READY is HIGH".

I'd really like to see some waveforms, as I am still puzzled. 😀

Reply 10 of 23, by reenigne

Rank Oldbie

From the point of view of the CPU, an IO cycle (memory or port) can be split into two parts by a DMA: it looks as if wait states are inserted between T3 and T4. From the point of view of the IO device, IO cycles are not split.
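As a rough illustration of what "wait states between T3 and T4" means from the CPU's side, here is a simplified Python model of one bus cycle. It assumes READY is checked once per clock from T3 onwards; `bus_cycle` is a hypothetical helper and the exact sampling edges of the real 8088/8284 are not modelled.

```python
def bus_cycle(ready_samples):
    """Return the T-state sequence for one bus cycle.

    ready_samples: iterable of booleans, one per READY sample point.
    The first sample decides whether T3 is followed by T4 or by a wait state.
    """
    states = ["T1", "T2", "T3"]
    for ready in ready_samples:
        if ready:
            states.append("T4")      # READY high: the cycle completes
            return states
        states.append("Tw")          # READY low: insert one more wait state
    raise ValueError("READY never went high; the cycle would stall forever")
```

A DMA that holds READY low for two clocks turns the normal four-state cycle into T1, T2, T3, Tw, Tw, T4, which is the "split" the CPU sees.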

Reply 11 of 23, by FrankieKat

Rank Newbie
migry wrote on 2021-12-03, 17:32:

@rasz_pl - thank you. Very interesting links. Seems to confirm that memory to memory transfer via DMA channel 0 and 1 is not really possible, which the last two posters confirm.

Now just to clarify, is it that memory to memory transfer via DMA 0 to 1 is not technically possible at all or that it isn't practical since it would stop DRAM memory refresh via channel 0?

From the article http://www.os2museum.com/wp/the-danger-of-datasheets/:

"You can do memory-to-memory transfers, or you can keep the DRAM refreshed, but not both"

"The new article clearly states that memory-to-memory DMA is not possible on the IBM PC because DMA channel 0 is used for DRAM refresh"

Accepting the caveats that you'll likely lose the contents of your DRAM and the page register for DMA 0 and 1 are "effectively" shared (per the article) so source and destination are limited to the same segment, I'm wondering if it's technically possible on the PC/XT architecture.

I've tried this myself and have not had success, though it's possible my code is the issue. Has anyone come across any working code examples of this on a PC?

FK

Reply 12 of 23, by reenigne

Rank Oldbie

I did (just about) get memory-to-memory copies working on my XT at one point. It can be done without losing DRAM contents by speeding up the refresh rate until all rows are recently refreshed, then turning off refresh and doing the memory-to-memory copy, then turning refresh back on (still at the fast rate) until all rows are recently refreshed again, then slowing the refresh rate back down to normal. But it's such a hassle for such a minor speedup that there really aren't any situations where it's worth it.

Reply 13 of 23, by mkarcher

Rank l33t
reenigne wrote on 2022-05-30, 16:41:

I did (just about) get memory-to-memory copies working on my XT at one point.

Looking at the 5150 schematics (original 64K board), the same page register is shared between channel 0 and 1, so memory-to-memory transfer is only possible within the same 64K block. Is that different on the XT, or is that another limitation that also affected your approach?

Reply 15 of 23, by Jo22

Rank l33t++
reenigne wrote on 2022-05-30, 16:41:

I did (just about) get memory-to-memory copies working on my XT at one point. It can be done without losing DRAM contents by speeding up the refresh rate until all rows are recently refreshed, then turning off refresh and doing the memory-to-memory copy, then turning refresh back on (still at the fast rate) until all rows are recently refreshed again, then slowing the refresh rate back down to normal. But it's such a hassle for such a minor speedup that there really aren't any situations where it's worth it.

That's clever and really fascinating, but also akin to hopping on one foot to get from A to B when the other foot is broken.
If the XT system had Static RAM or Pseudo Static RAM, the refreshing task could be omitted completely, I suppose.
Makes me wonder why people still build XT clones with both the 8088 and cumbersome DRAM. 🤷‍♂️
We're not living in ~1986 anymore, after all.
Replacing both with slightly more modern parts (V20, 8086/V30; SRAM/PSRAM) would get rid of bottlenecks without losing much of the original character. 🙂
The 8-Bit SBC (single board computer) community does make use of newer tech, too, afaik. They make functional CP/M systems with just ~5 ICs or so. 🙂

Reply 16 of 23, by reenigne

Rank Oldbie

For me, part of the fun of these old machines is exploring these weird corners as they actually are rather than as they would be in a more modern take on the system. In that respect, using a different CPU or not having DRAM refresh *would* lose the original character.

Reply 17 of 23, by mkarcher

Rank l33t
reenigne wrote on 2022-05-30, 19:58:

Yes, that's another severe limit on its usefulness!

Hmm, assuming we can deal with RAM refresh some other way for now, a possible use-case for memory-to-memory DMA could be CGA emulation on Hercules. The best-known emulation method uses 300 active lines in three banks on the hardware side, but CGA software only fills the first two banks, so every third line is black. Some CGA emulators copy the second bank to the third bank on every n'th timer tick to lessen the scanline effect. Offloading that copy to the DMA controller (compared to REP MOVSW) might actually help, and it stays within the same page for sure.

RAM refresh is every 15.6µs to get a full 8-bit refresh cycle every 4ms. The HGC graphics mode runs at a memory fetch clock of 2MHz. As CRTC/ISA memory access slots are interleaved 1:1, it also provides a theoretical memory access slot to the ISA bus controller (processor or DMA) at 2MHz. The 8088 bus protocol, with four 4.77MHz clocks per memory cycle, obviously can't make use of every bus slot, but the DMA controller should at least be able to consistently hit every second bus slot, i.e. 1MHz. A CGA bank is 8KB, and if we want to copy a CGA bank, we need to read and write 8KB, i.e. transfer 16KB over the bus. This will require 16ms, i.e. 4 full refresh cycles of the RAM, assuming it is 8-bit-refresh RAM. 7-bit-refresh RAM needs a refresh of each of its 128 rows every 2ms, so a single CGA bank copy takes 8 full refresh cycles. This tells me to stop assuming and drop the idea. Too bad. If we need to split the memory-to-memory copy into 8 parts and run 128 refresh cycles between each part, maybe the MOVSW performance is not that bad after all.

On my Turbo XT board (i.e. not an original IBM PC/XT board, but very similar), I remember getting the refresh logic to work properly at a clock divider of 2 (referenced to the timer clock, which is 1/4 of the system clock), i.e. a refresh every 1.68µs (596kHz). So 128 refresh cycles would take 256 timer clocks, i.e. 1024 processor clocks. That's just one extra processor clock per byte copied. Not too bad. If we consistently hit every other bus slot, at 1MHz, we need 2µs per byte transferred, which is around 9 processor clocks. The mem-to-mem DMA copy variant will thus clock in at around 10 processor clocks per byte. REP MOVSW is specified at 25 clocks per word, which is 12.5 processor clocks per byte. Locking into the HGC bus slots and getting the bus torn away for refresh every so often, I expect to see around 15 clocks per byte. So mem2mem DMA would cut the copy time by about 33%! Even if we add that we also need an extra refresh round at the start, a 20%-30% gain isn't that bad.
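The back-of-the-envelope numbers above can be reproduced with a short Python sketch. The timer clock of 1,193,182 Hz is my assumption (the usual 14.31818 MHz / 12); the 1 MHz bus slot rate, the 8KB bank, the 128 rows and the split into 8 parts are taken from the post. The function names are only illustrative.

```python
PIT_HZ = 1_193_182        # assumed timer clock (14.31818 MHz / 12)
CPU_HZ = 4 * PIT_HZ       # the post treats the CPU clock as 4x the timer clock

def refresh_period_us(divider):
    """Time between refresh requests for a given timer divider."""
    return divider / PIT_HZ * 1e6

def bank_copy_ms(bank_bytes=8192, slot_hz=1_000_000):
    """One read plus one write per byte, each using one bus slot."""
    return 2 * bank_bytes / slot_hz * 1e3

def refresh_clocks_per_byte(rows=128, divider=2, bytes_per_part=1024):
    """CPU clocks of one full refresh round, spread over one part of the copy."""
    cpu_clocks_per_timer_clock = CPU_HZ // PIT_HZ      # 4
    return rows * divider * cpu_clocks_per_timer_clock / bytes_per_part

def dma_clocks_per_byte(slot_hz=1_000_000):
    """Two bus slots per byte plus the amortised refresh overhead."""
    transfer_clocks = 2 * CPU_HZ / slot_hz
    return transfer_clocks + refresh_clocks_per_byte()
```

Under these assumptions `refresh_period_us(2)` comes out near 1.68, `bank_copy_ms()` near 16.4, and `dma_clocks_per_byte()` a little above 10, in line with the estimates in the post.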

Instead of using mem2mem DMA, you could also replace the 8088 by a V20, which should achieve the same speed as the DMA controller for REP MOVSW...

Reply 18 of 23, by reenigne

Rank Oldbie

When accessing CGA RAM there are also wait states (whether those accesses are initiated by the CPU or the DMA controller). Almost certainly those wait states would erase any advantage of the DMA compared to the 8088.

Reply 19 of 23, by FrankieKat

Rank Newbie
mkarcher wrote on 2022-05-30, 21:58:

Hmm, assuming we can deal with RAM refresh some other way for now, a possible use-case for memory-to-memory DMA could be CGA emulation on Hercules. The best-known emulation method uses 300 active lines in three banks on the hardware side, but CGA software only fills the first two banks, so every third line is black. Some CGA emulators copy the second bank to the third bank on every n'th timer tick to lessen the scanline effect. Offloading that copy to the DMA controller (compared to REP MOVSW) might actually help, and it stays within the same page for sure.

I did find this article (chapter 13, page 502) that discusses using DMA mem-to-mem to clear video RAM. It has code examples for doing a mem-to-mem copy (similar to REP MOVSB) and also a mem-to-mem fill (similar to REP STOSB) by using DMA 0 to read and DMA 1 to write, in the latter case with the "Channel 0 hold" bit set in the Command byte. The doc is a bit contradictory, because it describes this as a way to clear video RAM in DOS, but later says "this program is designed to function with the hardware in Figure 13–12 and will not function in the personal computer unless you have the same hardware". I'm also not convinced this code was actually tested, because it only sets the counter for channel 0, while channel 1 (write) never actually increments, so it's not clear how it would work.

Using the pseudocode example in "Danger of Datasheets" I was able to get mem-to-mem DMA write (channel 1 only) to work writing an indeterminate value to RAM using 86Box, however it did not work in PCem, pce or DOSBox (I don't have access to real hardware at the moment).

    MOV  AL, 0101B        ; mask channel 1 (bit 2 = set mask, bits 1-0 = channel)
    OUT  0AH, AL          ; port 0AH: single channel mask register
    XOR  AL, AL           ; AL = 00H (base address for DMA write is 0000H)
    OUT  0CH, AL          ; port 0CH: clear first/last flip-flop
    OUT  02H, AL          ; port 02H: channel 1 address, low byte (00H)
    OUT  02H, AL          ; port 02H: channel 1 address, high byte (00H)
    MOV  AX, 1FFH         ; count register 1FFH = 200H bytes (count + 1 transfers)
    OUT  03H, AL          ; port 03H: channel 1 count, low byte (0FFH)
    MOV  AL, AH           ; AL = high byte
    OUT  03H, AL          ; port 03H: channel 1 count, high byte (01H)
    MOV  AL, 4            ; page 4 = segment 4000H
    OUT  83H, AL          ; port 83H: page register for channel 1
    MOV  AL, 10000101B    ; channel 1, block mode, write transfer
    OUT  0BH, AL          ; port 0BH: DMA mode register
    MOV  AL, 0001B        ; unmask channel 1
    OUT  0AH, AL          ; port 0AH: single channel mask register
    MOV  AL, 0101B        ; set request bit for channel 1 (for mem-to-mem)
    OUT  09H, AL          ; port 09H: DMA request register
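For anyone checking the magic numbers in the listing above, here is a small Python sketch that decodes the bytes written to the mode, mask and request registers. The bit layouts follow my reading of the 8237A datasheet; the function names are hypothetical and the sketch does not touch any hardware.

```python
def decode_mode(value):
    """Decode a byte written to the 8237 mode register (port 0BH)."""
    transfer = ["verify", "write", "read", "illegal"][(value >> 2) & 3]
    mode = ["demand", "single", "block", "cascade"][(value >> 6) & 3]
    return {
        "channel": value & 3,
        "transfer": transfer,
        "auto_init": bool(value & 0x10),
        "decrement": bool(value & 0x20),
        "mode": mode,
    }

def decode_single_bit(value):
    """Decode a byte for the single mask (0AH) or request (09H) register."""
    # Both registers share the layout: bits 1-0 = channel, bit 2 = set/clear.
    return {"channel": value & 3, "set": bool(value & 4)}

def transfer_length(count_register):
    """The 8237 performs count + 1 transfers before terminal count."""
    return count_register + 1
```

For example, `decode_mode(0b10000101)` reports channel 1, a write transfer and block mode, and `transfer_length(0x1FF)` gives 0x200 bytes, matching the comments in the listing.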

There is a comment on that article implying that it may be technically possible to reprogram channel 0 without losing DRAM refresh:

the DRAM is refreshed if something (the CPU or DMA controller) does reads with the least significant address bits set for each combination of a certain range (0-0x7F or 0-0xFF IIRC) within the time period refreshes has to be done. So reprogramming the DMA controller to do a memory-to-memory transfer would ensure that memory is refreshed if the transfer size is large enough (128 or 256 bytes)
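The claim in the quoted comment can be checked with a few lines of Python. This is a sketch under the assumption that the DRAM row address is simply the low 7 or 8 address bits, as the comment suggests; `rows_touched` is a hypothetical helper, not anything from the article.

```python
def rows_touched(start, length, row_bits=7):
    """Return the set of row addresses hit by a sequential transfer.

    Assumption: the row address is the low row_bits bits of the byte address.
    """
    mask = (1 << row_bits) - 1
    return {(start + i) & mask for i in range(length)}

def refreshes_all_rows(start, length, row_bits=7):
    """True if a sequential transfer of this length touches every row."""
    return len(rows_touched(start, length, row_bits)) == (1 << row_bits)
```

Under that assumption any sequential transfer of at least 128 bytes (7-bit refresh) or 256 bytes (8-bit refresh) touches every row regardless of the start address, while shorter transfers leave some rows unrefreshed.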

There is one outlier use case for this that I thought of. In an XT BIOS during cold boot, one could use the DMA controller to write values to all RAM address space, just prior to starting RAM refresh. This would reset the parity bits (so you can do all RAM tests with parity enabled) while testing DMA controller at the same time and simultaneously executing other ISA chipset inits that otherwise require wait delays. A bit of a far-fetched use case to be sure, but maybe it could work!

FK