Tiido wrote on 2024-04-09, 09:55:
DOS text mode output (80x25) can be done with just 230kB/sec
Not really. On the PC, there is no "line cache", so each text line needs to be retrieved for each scan line of the character. For example, on CGA, in high-res text mode, a character takes 8 pixels at 14.318MHz pixel clock, which is 1.8 megacharacters/second. You need to read a character byte and an attribute byte per character, so the required memory bandwidth for that mode is 3.6MB/s. The CGA card has an 8-bit memory bus (8 16k x 1 chips at U50..U57) on this PCB revision, so this requires a memory cycle every 280ns. The installed RAM chips have a row access time of 120ns, and a read cycle time of 270ns, so obviously, in high-res text mode, the full memory bandwidth is required for scanout. That's the reason why you get snow when the CPU accesses memory during scanout: In that case, a memory cycle that would be required for correct scan-out is replaced by the cycle requested by the CPU, and the scan-out logic latches the CPU data instead of the image data.
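To make the CGA arithmetic above easy to re-check, here is a small Python sketch. The constants are the figures quoted in this post; the function and variable names are invented for illustration and do not come from any IBM documentation.

```python
# Back-of-the-envelope check of the CGA high-res text mode figures.
# All constants are the numbers quoted in the post; the names are made up here.

PIXEL_CLOCK_HZ = 14_318_000   # CGA dot clock in 80-column text mode
PIXELS_PER_CHAR = 8           # width of one character cell in pixels
BYTES_PER_CHAR = 2            # one character byte plus one attribute byte
BUS_WIDTH_BYTES = 1           # 8-bit video memory bus


def scanout_requirements(pixel_clock_hz, pixels_per_char, bytes_per_char, bus_width_bytes):
    """Return (chars/s, bytes/s, memory cycles/s, ns per cycle) during active scanout."""
    chars_per_sec = pixel_clock_hz / pixels_per_char
    bytes_per_sec = chars_per_sec * bytes_per_char
    cycles_per_sec = bytes_per_sec / bus_width_bytes
    ns_per_cycle = 1e9 / cycles_per_sec
    return chars_per_sec, bytes_per_sec, cycles_per_sec, ns_per_cycle


chars, bandwidth, cycles, ns_per_cycle = scanout_requirements(
    PIXEL_CLOCK_HZ, PIXELS_PER_CHAR, BYTES_PER_CHAR, BUS_WIDTH_BYTES
)
print(f"{chars / 1e6:.2f} Mchar/s, {bandwidth / 1e6:.2f} MB/s, "
      f"one cycle every {ns_per_cycle:.0f} ns")
```

The unrounded results (about 1.79 Mchar/s, 3.58 MB/s and 279 ns) sit just above the 270ns read cycle time of the installed chips, which is why no slot is left over for the CPU.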
In all other modes, the CGA card only needs a memory cycle every 560ns, and only uses every second cycle slot. Every other cycle slot would be available for CPU use - but the 8088 can't generate bus cycles fast enough to achieve the theoretical CGA bandwidth of 1.8MB/s. There are videos on YouTube of people running CGA cards on highly overclocked ISA buses, and IIRC they managed to get every second "bus slot" actually used, which is 900KB/s, which is still way faster than the XT could do.
The MDA also operates at 1.8 megacharacters per second, but with a 9 pixel character box. Yet the MDA has no snow issues. This is not just because it uses static RAM with a cycle time of 200ns, but mainly because it has an internal memory bus width of 16 bits instead of 8 bits, so only 1.8 megacycles per second are required.
With EGA, the bandwidth requirement increased slightly by operating at 2 megacharacters per second in 640x350 mode. On the other hand, the EGA now had software loadable fonts in the same video RAM. Due to very clever design, IBM managed to still get good performance for that time: The EGA card has 32-bit memory access, which (as I understand it) can be split into two independent 16-bit busses. This split is used in text modes, with one 16-bit bus serving characters and attributes, and the other 16-bit bus serving character data. So for EGA, the 2 megacharacters per second require 2 megacycles per second. If I remember correctly, high-res text mode uses a configuration with 80% of the memory bandwidth allocated to scanout and 20% to the ISA bus, so the memory obviously needs to handle 2.5 megacycles per second, which is less demanding than on the CGA.
On the VGA card, bandwidth requirement increased, because the scanout requires 3.2 megacharacters per second, which translates to 6.4MB/s, or 3.2 megacycles per second. As with the EGA, the VGA also allocates 20% of the bandwidth to the bus, so the cycles run at 4 megacycles per second, which is just slightly higher than CGA, but on the VGA, only 800KB/s can theoretically be used on the ISA bus, assuming you can issue cycles fast enough.
A reference claims that VGA allows all memory access time slots to be used by the bus during blanking, so this would allow a whopping 4MB/s - but obviously the 8-bit ISA bus is nowhere near that speed, so it cannot claim all the theoretically available access time slots.
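The slot arithmetic for EGA and VGA text modes can be sketched the same way. This is only a rough model under two assumptions: the 80/20 split between scanout and bus is as I remember it, and each bus slot moves one byte (8-bit ISA). The helper names are hypothetical.

```python
# Rough model: a fixed share of all memory cycle slots is reserved for the bus,
# the rest goes to scanout. Names and the one-byte-per-slot figure are assumptions.

BUS_SHARE = 0.20  # fraction of slots reserved for the ISA bus (from memory)


def total_cycle_rate(scanout_cycles_per_sec, bus_share):
    """Total memory cycle rate needed so that scanout still gets its cycles."""
    return scanout_cycles_per_sec / (1.0 - bus_share)


def bus_bandwidth(total_cycles_per_sec, bus_share, bytes_per_bus_cycle=1):
    """Bytes per second the bus could move if it used every slot it is offered."""
    return total_cycles_per_sec * bus_share * bytes_per_bus_cycle


ega_total = total_cycle_rate(2.0e6, BUS_SHARE)       # EGA: 2 megacycles/s for scanout
vga_total = total_cycle_rate(3.2e6, BUS_SHARE)       # VGA: 3.2 megacycles/s for scanout
vga_bus_active = bus_bandwidth(vga_total, BUS_SHARE)  # bus share during active display
vga_bus_blanking = bus_bandwidth(vga_total, 1.0)      # all slots free during blanking

print(f"EGA total: {ega_total / 1e6:.1f} Mcycles/s")
print(f"VGA total: {vga_total / 1e6:.1f} Mcycles/s")
print(f"VGA bus, active display: {vga_bus_active / 1e3:.0f} KB/s")
print(f"VGA bus, blanking: {vga_bus_blanking / 1e6:.1f} MB/s")
```

This reproduces the 2.5 and 4 megacycles per second, the 800KB/s bus ceiling during active display, and the 4MB/s figure for blanking.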
Tiido wrote on 2024-04-09, 09:55:
VGA mode13h (320x200@70Hz 8bpp) takes ~4.4MB/sec which isn't so bad and 640x480@60Hz with 4bpp taking ~9MB/sec, which is a significant portion of the rather low memory bandwidths of the hardware of the time.
You missed the fact that VGA mode 13h is double-scanned at 320x400, so the bandwidth requirement is identical to the 640x480 mode. At a dot clock of 25.175MHz and 8 pixels per 32-bit access on the VGA, this is a cycle rate of about 3.15 megacycles per second, or a scan-out rate of 12.6 MB/s. The deviation from the 9MB/s arises because I calculated the required bandwidth during the active period, while you seem to have calculated the average bandwidth including blanking. The VGA card works perfectly in a PC/XT with an on-board memory bus, and can pull off this high data rate only due to its 32-bit memory bus.
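To show where the two sets of numbers come from, here is a short sketch comparing the rate during the active period with a whole-frame average. I am assuming the ~9MB/s and ~4.4MB/s figures were obtained as visible pixels times bits per pixel times refresh rate; that is my guess at the method, and the function names are made up.

```python
# Active-period rate vs. whole-frame average for VGA scanout.
# The averaging formula is an assumption about how the quoted figures were derived.

DOT_CLOCK_HZ = 25_175_000   # VGA dot clock for 640-pixel-wide modes
PIXELS_PER_ACCESS = 8       # pixels delivered by one 32-bit access (per the post)
BYTES_PER_ACCESS = 4        # 32-bit memory bus


def active_rate(dot_clock_hz, pixels_per_access, bytes_per_access):
    """Memory cycles/s and bytes/s needed while pixels are actually being displayed."""
    cycles_per_sec = dot_clock_hz / pixels_per_access
    return cycles_per_sec, cycles_per_sec * bytes_per_access


def average_rate(width, height, bits_per_pixel, refresh_hz):
    """Bytes/s averaged over the whole frame, counting visible pixels only."""
    return width * height * bits_per_pixel / 8 * refresh_hz


cycles, active = active_rate(DOT_CLOCK_HZ, PIXELS_PER_ACCESS, BYTES_PER_ACCESS)
avg_640x480 = average_rate(640, 480, 4, 60)
avg_mode13h = average_rate(320, 200, 8, 70)

print(f"active: {cycles / 1e6:.2f} Mcycles/s, {active / 1e6:.1f} MB/s")
print(f"average 640x480x4bpp@60: {avg_640x480 / 1e6:.2f} MB/s")
print(f"average 320x200x8bpp@70: {avg_mode13h / 1e6:.2f} MB/s")
```

The memory has to sustain the active-period rate, since nothing here buffers a whole line; the frame average only tells you how much data moves per second overall.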
Tiido wrote on 2024-04-09, 09:55:
Continuous process that requires opening pages is what kills most of the achievable bandwidth, since it is the most slowest thing one can do in a DRAM system, even on modern memories opening a new page can take hundreds on ns
Exactly, opening/closing pages is an expensive operation, and that's what (fast) page mode DRAM lets you avoid. None of the classic video cards I mentioned up to here use page mode. Remember that all cards allow interleaving of CPU and scanout cycles, and while you might be able to guarantee page hits during continuous video scanout cycles, every CPU cycle and the first video cycle after that is a potential page miss. When the CGA card was designed, page mode access was not even supported by the common RAM chips of the time. While the MDA used static RAM (and discussion of page mode makes no sense regarding the MDA), the Hercules card is DRAM based and does not use page mode either. The Hercules card has a very peculiar design: Its 64KB DRAM uses an 8-bit bus just as the CGA does, but it has a 2K x 8 SRAM used as mirror RAM for attribute data. This explains why you can get multiple text pages on the CGA, but not on the Hercules, even though it has more RAM than the CGA card.
Tiido wrote on 2024-04-09, 09:55:
and it is a loss that can only be avoided by interleaved memory bank accesses, assuming the memory access patterns permit such. Not impossible, since it has been successfully, but has many gothcas and will always come with a significant performance difference.
The best way to avoid bank open/close time is to use fast page mode RAM, and to run "bursts" that mostly hit the page. That's the improvement that made the jump from ~500-800KB/s usable bus bandwidth of early VGA/SVGA cards to multiple MB/s. "Modern" SVGA cards had a FIFO for display data that gets filled using page-mode (burst) reads, and a second FIFO that buffers CPU writes and can often be drained using page-mode (burst) writes (if the CPU writes consecutive addresses). Due to the FIFOs, the hard real-time requirement on memory access is relaxed, which is the prerequisite for allowing burst writes from the bus. Reads from video memory to the CPU are not prefetched into any FIFO, and cannot be posted into a buffer, so read performance did not get the same boost as write performance on that generation of cards, which makes screen-to-screen copies via the bus annoyingly slow, so this task is best shifted to an accelerator (mentioning this brings this post back on topic. Yeah!)
A final remark: The IBM MCGA video system managed to pull off the stunt of displaying 320x200 in 256 colors on VGA monitors using the same timing as the VGA card, so it requires the same 12.6MB/s data rate. Yet, the PS/2 model 30 just has 2 memory chips, each 64K x 4, which is an 8-bit bus. This requires a scan-out cycle time of 80ns. How did IBM pull that off? Easy... by the time the MCGA was designed, dual-ported VRAM had become affordable, at least at that capacity. While most people associate VRAM with high-performance graphics systems, IBM's entry-level PS/2 graphics solution also relied on VRAM to provide sufficient memory bandwidth to display that colorful mode.
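For a sense of scale, this sketch computes the cycle time needed to deliver 12.6MB/s over buses of different widths. The data rate is the figure from this post; the function name is made up, and the 32-bit line is only there for comparison with the VGA.

```python
# Required memory cycle time for a given scanout data rate and bus width.
# 12.6 MB/s is the active-period rate quoted in the post; names are illustrative.

SCANOUT_BYTES_PER_SEC = 12.6e6


def required_cycle_time_ns(bytes_per_sec, bus_width_bits):
    """Nanoseconds available per memory cycle if each cycle moves one bus word."""
    bytes_per_cycle = bus_width_bits / 8
    cycles_per_sec = bytes_per_sec / bytes_per_cycle
    return 1e9 / cycles_per_sec


mcga_ns = required_cycle_time_ns(SCANOUT_BYTES_PER_SEC, 8)    # MCGA: 8-bit bus
vga_ns = required_cycle_time_ns(SCANOUT_BYTES_PER_SEC, 32)    # VGA: 32-bit bus

print(f"8-bit bus:  {mcga_ns:.0f} ns per cycle")
print(f"32-bit bus: {vga_ns:.0f} ns per cycle")
```

At roughly 79ns per cycle on the 8-bit bus versus roughly 317ns on the 32-bit bus, the narrow bus needs about four times the cycle rate, which is the gap the VRAM's serial port closes.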