VOGONS


First post, by spacesaver

Rank Newbie

I just got an AWE64 together with a SIMMCONN memory upgrade. I tested the Chorium sound font, which sounds great, but I can't believe it takes 61 s to upload! The uncompressed size is 28033 KiB as shown by the memory free bar. That's only 460 KiB/s. This was on a Socket 5 Pentium with a 66 MHz bus. This AWE32, tested by Phil, gets 605 KiB/s: https://youtu.be/FZnfl1PN2Fo?t=513

How come it's only getting ~3% of the ISA peak? If the bus is saturated by other traffic, that can happen, but I doubt there's anything else using the bus besides reading from memory and writing to ISA.

Reply 1 of 7, by Disruptor

Rank Oldbie

Hi, I've done the calculations here:
Re: Awe32 RAM Expansion

Reply 2 of 7, by mkarcher

Rank l33t

spacesaver wrote on 2025-02-13, 15:47:

I just got an AWE64 together with a SIMMCONN memory upgrade. I tested the Chorium sound font, which sounds great, but I can't believe it takes 61 s to upload! The uncompressed size is 28033 KiB as shown by the memory free bar. That's only 460 KiB/s. This was on a Socket 5 Pentium with a 66 MHz bus. This AWE32, tested by Phil, gets 605 KiB/s: https://youtu.be/FZnfl1PN2Fo?t=513

How come it's only getting ~3% of the ISA peak? If the bus is saturated by other traffic, that can happen, but I doubt there's anything else using the bus besides reading from memory and writing to ISA.

It's way more than 3%. 3% would mean that the bus capacity is around 30 times the rate you experience, so around 14 MB/s. The theoretical maximum for 0WS cycles is 8.3 MB/s. 0WS cycles are mainly meant for memory; for I/O you will get at least one wait state, so you are down to about 5.5 MB/s. Many Pentium boards do not care about ISA I/O speed and care more about compatibility with slow cards, so they offer wait state choices (I/O recovery time, check for that option in your Setup!) of something like 2, 4, 6, 8, or even higher like 2, 4, 8, 12. At 2WS, the theoretical maximum is about 4 MB/s, so you are at 10 to 11% of the bus capacity.
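The wait-state arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope model: the 8.33 MHz bus clock, the 2 bus clocks for a 0WS 16-bit cycle, and one extra bus clock per wait state are assumptions chosen to match the 8.3 MB/s and 4 MB/s figures, and `isa_peak_bytes_per_s` is a made-up helper name.

```python
# Rough model of 16-bit ISA peak throughput versus wait states.
# ASSUMPTIONS (not stated in the thread): 8.33 MHz bus clock, a 0WS cycle
# takes 2 bus clocks, and every wait state adds one more bus clock.
BUS_CLOCK_HZ = 25e6 / 3      # about 8.33 MHz
BYTES_PER_CYCLE = 2          # one 16-bit transfer per bus cycle
BASE_CLOCKS = 2              # bus clocks for a 0WS cycle


def isa_peak_bytes_per_s(wait_states):
    """Theoretical peak in bytes/s for a given number of wait states."""
    return BUS_CLOCK_HZ * BYTES_PER_CYCLE / (BASE_CLOCKS + wait_states)


# Observed upload rate from the first post: 28033 KiB in 61 s.
observed_bytes_per_s = 28033 * 1024 / 61

if __name__ == "__main__":
    for ws in (0, 1, 2, 4, 8):
        peak = isa_peak_bytes_per_s(ws)
        share = 100 * observed_bytes_per_s / peak
        print(f"{ws} WS: peak {peak / 1e6:.2f} MB/s, observed is {share:.1f}% of peak")
```

Under these assumptions the observed rate only approaches the bus limit at very high wait-state settings, which fits the point that raw ISA bandwidth is not the whole story.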

Indeed, the main issue is that the EMU8K chip operates on a very strict schedule. It processes 32 "voices" for every sample of the 44.1 kHz sample rate it runs at. While processing a playback voice, it will read a certain number of 16-bit samples from the RAM (to allow interpolation, likely 4 consecutive samples). Depending on the AWE32 model, some time slots ("voices") are reserved for getting samples from the FM/CQM synthesizer into the EMU8K chip, so it can apply Chorus and Reverb to them, and another voice is reserved to generate memory access patterns that refresh all rows of the DRAM often enough. For uploading sound fonts, some of the time slots get re-purposed to transfer data to the sound card RAM instead of playing back data from the sound card RAM. The Windows driver for the AWE32 uses up to 10 of the 32 available time slots. The EMU8K has a buffer for 16 bits, IIRC, and if that buffer is filled when a slot that is set to "sample upload" arrives, it puts those 2 bytes into sample RAM. The processor needs to check a status bit to know whether the transfer took place and then re-fill the buffer. This means that the theoretical limit of 4 MB/s cannot be reached on the AWE32, as there needs to be polling for readiness in between, which consumes bus bandwidth as well. Furthermore, 10 time slots within 1/44100 of a second means a maximum transfer rate of 441000 words per second, which is 882000 bytes/s, i.e. about 882 KB/s (the figure referenced by Disruptor in the previous post). This rate can only be achieved if the processor never misses a slot. Your rate seems to indicate that on your board only every second transfer time slot is actually used, so you really might want to check whether you have the I/O recovery time set so high that every other time slot is missed by the upload procedure.
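As a quick check of the slot arithmetic, here is a minimal sketch. It assumes each upload slot moves exactly one 16-bit word per output sample, as described above; `slot_limited_rate` is a hypothetical helper name.

```python
# Slot-limited upload rate of the EMU8K, assuming one 16-bit word per
# upload slot per 44.1 kHz output sample (as described in the post).
SAMPLE_RATE_HZ = 44100
BYTES_PER_WORD = 2


def slot_limited_rate(slots):
    """Upper bound in bytes/s if no upload slot is ever missed."""
    return slots * SAMPLE_RATE_HZ * BYTES_PER_WORD


limit = slot_limited_rate(10)                 # 10 slots, as the Windows driver uses
observed = 28033 * 1024 / 61                  # bytes/s from the first post
slots_used_fraction = observed / limit        # share of upload slots actually filled

if __name__ == "__main__":
    print(f"limit: {limit} bytes/s ({limit / 1024:.0f} KiB/s)")
    print(f"observed/limit: {slots_used_fraction:.2f}")
```

The ratio comes out a little above one half, which is what suggests that roughly every second slot is being missed.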

Reply 3 of 7, by spacesaver

Rank Newbie

This is pretty fascinating. I thought I knew a lot of computer architecture, but I didn't realize the ISA bus is so inefficient.

1st, I assumed the onboard DRAM would be accessed with memory mapped reads & writes, but that's not possible because the ISA bus only has a 1 MiB address space. Even if the address space were big enough, I assumed you could get close to the peak 16.6666 MB/s by doing a long burst access. Apparently, ISA doesn't have burst transfers, so those address setup and wait states aren't overlapping with data transfer.

Using voices for uploading soundfonts sounds like a very ad hoc way of sharing the memory. It sounds like only the EMU8K can drive the DRAM instead of also allowing memory requests from ISA to be a bus master.

Hi, I've done the calculations here:

That sort of makes sense. It assumes only 1 sample per voice is read for each output sample. mkarcher suggested "likely 4 consecutive samples," though one would expect the 3 older samples to be buffered, not reread. Also, that upper bound speed only considers the speed between the EMU8K and its onboard RAM. It seems the ISA transfers are the real bottleneck.

I found this guide that explains in detail how the CPU reads and writes to the onboard DRAM.
https://www.dosdays.co.uk/media/creative/emu8kpgm.pdf

It does seem pretty inefficient. You have to write an address in addition to the data. Then, poll for completion as mkarcher mentioned.
I'm going to write a program to find out exactly how long it takes.

Reply 4 of 7, by mkarcher

Rank l33t

spacesaver wrote on 2025-02-27, 18:30:

This is pretty fascinating. I thought I knew a lot of computer architecture, but I didn't realize the ISA bus is so inefficient.

The ISA bus was designed at a time when cards were built from standard 74 series TTL logic chips, with maybe one more highly integrated chip (like the 6845 CRTC on the MDA and CGA, or the NEC µPD765 on the floppy controller). The bus is not optimized to be fast; it is optimized to be easy to interface to.

spacesaver wrote on 2025-02-27, 18:30:

Apparently, ISA doesn't have burst transfers, so those address setup and wait states aren't overlapping with data transfer.

This is mostly correct, but ISA has another peculiarity up its sleeve. On the PC/XT, you are absolutely correct: address and data transfer don't overlap, and there is no pipelining. And transfers are slow if you measure them in processor clocks. That's because the ISA bus is basically what you get if you combine the 8088 processor and the 8288 bus controller. The 8088 requires 4 cycles per byte transferred (compare that to the 6502, which can transfer a byte every cycle). This makes comparing clock speed between Intel-style systems (including Z80 systems) and MOS-style systems quite difficult. OTOH, a C64 at 1MHz still has a lower bus transfer rate than a PC at 4.77MHz, so measured in throughput in actual systems, the ISA bus is not worse than the cartridge slot of a C64.

On the AT, though, the 286 processor was introduced. IBM kept the bus close to the processor (we are talking about 6MHz and 8MHz AT systems, not the later AT clones). The 286 can handle a 16-bit bus cycle in 2 clocks instead of an 8-bit bus cycle in 4 clocks (if you look at execution unit clocks, that is). OTOH, if you look at the clock signal clocks, the 286 still needs 4 clock cycles per bus transfer; it's just that the execution clock of the 286 (and 386) is half of the clock frequency you input at the clock pin, while the execution unit of the 8088 runs at exactly the clock at the clock pin. As the 8088 multiplexed address and data pins (Intel had no choice for the 8086: you can't fit 16 dedicated data pins and 16 dedicated address pins on a 40-pin DIP package on a processor as complex as the 8086), the 8088 mainboards had to latch the address (and that's how the "address latch enable" signal came to be on the ISA bus: it's generated on the board, and it might be useful for some card as well, so why not put it on the bus). With the 286 having more pins, address/data multiplexing was no longer needed, and the 286 designers decided that in the last half execution clock (i.e. the last full clock pin clock) of one bus cycle, the address pins may already contain the address of a subsequent bus cycle, so the time from the address appearing at the processor address pins to the cycle being finished is not just 2 execution clocks, but 2.5 execution clocks. You don't get the .5 extra clock on the 8-bit ISA bus, though, as ISA is defined to have a valid address over the whole bus cycle. And that's why ISA added "unlatched" high address pins on the 16-bit connectors: those pins are valid half an execution clock period earlier, but don't necessarily stay valid over the whole cycle. If you need that, you must add an extra latch on the card, and use the ALE signal to control it.
The key point of having the top 7 address bits "early" on the bus is to give more time to the address decoder, and most importantly the circuit that decides whether a target supports 16-bit transfers.

spacesaver wrote on 2025-02-27, 18:30:

Using voices for uploading soundfonts sounds like a very ad hoc way of sharing the memory. It sounds like only the EMU8K can drive the DRAM instead of also allowing memory requests from ISA to be a bus master.

That's exactly how it works. The DRAM is not shared at all; it is 100% occupied by the EMU8K, so the processor has to access the DRAM through the EMU8K, and thus you need to free up memory time by allocating "voices" to the RAM interface. You can see this in contrast to the IBM CGA card. It also had a fixed access pattern: in all video modes (except the 80-column text modes), the video card used 50% of the bus bandwidth, and the processor got to use the other 50% (yet it couldn't do so, as the duration of a single ISA cycle on the IBM PC is longer than the period between two processor-dedicated memory access time slots of the CGA, so the processor missed at least every other chance to transfer data). In high-res text mode, the required bandwidth doubled, so the CRTC required 100% of the bandwidth. Yet the CGA allows the 8088 to access the video RAM, and if it does so outside of the blanking/retrace period, the CRTC gets its memory cycle stolen and the processor cycle is executed instead, causing the well-known "CGA snow". Using a similar scheme on the EMU8K would cause static noise during sample upload.

spacesaver wrote on 2025-02-27, 18:30:

mkarcher suggested "likely 4 consecutive samples," though one would expect the 3 older samples to be buffered, not reread. Also, that upper bound speed only considers the speed between the EMU8K and its onboard RAM. It seems the ISA transfers are the real bottleneck.

I did suspect 4 samples to allow good (3rd-order) interpolation. While you can buffer 3 older samples when slowing down the sample from the RAM/ROM, this will break down when you need to speed up samples, i.e. when the sample position pointer advances more than 1 sample in RAM for one 44.1 kHz output sample.
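To illustrate why buffering stops helping at higher pitch, here is a small sketch that counts how many samples of a 4-sample interpolation window are new for each output sample. The 4-sample window is only the guess from above, and `new_samples_per_output` is a hypothetical model, not a description of how the EMU8K is known to fetch data.

```python
import math


def new_samples_per_output(pitch_step, outputs=8, taps=4):
    """Count window samples not already present in the previous window.

    pitch_step is how far the sample position advances per output sample
    (below 1.0 slows the sample down, above 1.0 speeds it up).
    ASSUMPTION: a window of `taps` consecutive samples starting at
    floor(position); the real chip may behave differently.
    """
    counts = []
    prev_start = None
    pos = 0.0
    for _ in range(outputs):
        start = math.floor(pos)
        if prev_start is None:
            fresh = taps                       # first window: everything is new
        else:
            fresh = min(taps, start - prev_start)
        counts.append(fresh)
        prev_start = start
        pos += pitch_step
    return counts


if __name__ == "__main__":
    for step in (0.5, 1.0, 2.5, 5.0):
        print(step, new_samples_per_output(step))
```

With a step of 0.5 most outputs need no new sample at all, while a step of 2.5 needs 2 or 3 new samples per output, and any step of 4 or more needs the whole window reread every time.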

spacesaver wrote on 2025-02-27, 18:30:

It does seem pretty inefficient. You have to write an address in addition to the data.

You can write the address once, so writing addresses does not contribute to the slowness:

EMU8K programming guide wrote:

If you wish to do a write transfer, simply write the data words to be transfered to sequential sound memory addresses into sound memory to SMLD (left) or SMRD (right). The address will be automatically incremented.

Reply 5 of 7, by spacesaver

Rank Newbie

XT/AT were before my time, so this is new to me. 486 was my 1st PC as a kid. You're talking about a time when memory & I/O latency was faster than the CPU clock! As far as I've ever known, that's never been true.

It sounds like with the exception of the IBM AT, there's no cycle when the address and data bus are used at the same time. Then, that begs the question why even bother having separate address & data lines?

You're saying on the AT, the address decoding was partially overlapped. But it doesn't sound like it saved any cycles. From what I'm seeing, the minimum ISA cycles for a read is 4, on XT, AT, and PC clones. It seems the overlap is only preventing the extra time to decode the extra 4 address bits from taking 5 cycles. Yet, the PC clones managed to decode 24 bits without needing overlap or an extra cycle. I'm still pretty unclear about minimum and average cycles, so I might be wrong. I got the 4 cycle minimum for PC clones from the thread "486 Motherboard with abysmal ISA performance": "Normal ISA read/write takes 4 cycles with r/w strobe lasting 1 clock cycle"

The address will be automatically incremented.

I didn't see that or the other steps to hijack voices for soundfont upload/download until now. I was only looking at the documentation for the address registers like SMALR, which don't mention auto increment.

Reply 6 of 7, by maxtherabbit

Rank l33t

mkarcher wrote on 2025-02-27, 23:11:

Using a similar scheme on the EMU8K would cause static noise during sample upload.

Except for the fact that most of the time samples are being uploaded when nothing is playing. They should have just let you max out transfers during this time and muted the output.

Reply 7 of 7, by spacesaver

Rank Newbie

Wow, the benchmark program works. Measured on my CT4380 and Pentium 200 MMX. The speed is close to the Soundfont manager speed I saw earlier, but it needs more voices than the expected 10 to reach that speed. Maybe that's because I'm only using the left DMA channel. I also attempted to measure more detailed timing, but it gave unexpected results. It seems rdtsc doesn't count when inside I/O port instructions.

channels=2 86.392940 KiB/s
channels=6 144.633994 KiB/s
channels=10 179.660873 KiB/s
channels=14 259.160086 KiB/s
channels=18 331.750292 KiB/s
channels=22 353.104502 KiB/s
channels=26 431.940201 KiB/s
channels=30 480.886667 KiB/s
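For comparison, the measurements above can be set against the slot-limited bound of one 16-bit word per upload channel per 44.1 kHz sample. This is a rough sketch: it assumes that "channels" in the benchmark means the number of voices switched to upload mode, and `efficiency` is a made-up helper name.

```python
# Measured rates from the benchmark output above, in KiB/s.
measured_kib_s = {
    2: 86.392940,
    6: 144.633994,
    10: 179.660873,
    14: 259.160086,
    18: 331.750292,
    22: 353.104502,
    26: 431.940201,
    30: 480.886667,
}

SAMPLE_RATE_HZ = 44100
BYTES_PER_WORD = 2


def efficiency(channels):
    """Measured rate divided by the slot-limited bound for that channel count.

    ASSUMPTION: each upload channel could carry one 16-bit word per sample.
    """
    limit_kib_s = channels * SAMPLE_RATE_HZ * BYTES_PER_WORD / 1024
    return measured_kib_s[channels] / limit_kib_s


if __name__ == "__main__":
    for channels in measured_kib_s:
        print(f"channels={channels:2d} uses {100 * efficiency(channels):.0f}% of its slots")
```

With 2 channels about half of the slots are filled, and the share drops to around a fifth with many channels, which is consistent with the CPU side rather than the slot schedule being the limit here.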

So it seems the minimum cost for each 2 bytes written is:
1. writing to data register, SMLD. 2 ISA writes
2. read status, SMALW. 1 ISA write, 2 ISA reads (because SMALW is 32 bits)

So if ISA is the bottleneck, the max speed possible, assuming 4 ISA cycles per read/write, is 0.83 MB/s (5 accesses × 4 cycles = 20 bus clocks per 2 bytes at an 8.33 MHz bus clock). Very inefficient indeed.
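The estimate can be written out as a tiny cost model. It is a sketch under the stated assumptions only: 5 ISA accesses per 16-bit word and 4 bus clocks per access come from the post, while the 8.33 MHz bus clock and the name `isa_limited_rate` are assumptions for illustration.

```python
# Cost model for uploading one 16-bit word through the EMU8K data port.
# ASSUMPTION: 8.33 MHz ISA bus clock (not measured on this board).
BUS_CLOCK_HZ = 25e6 / 3


def isa_limited_rate(accesses_per_word=5, clocks_per_access=4, bytes_per_word=2):
    """Bytes/s if the ISA accesses per word are the only cost."""
    clocks_per_word = accesses_per_word * clocks_per_access
    return BUS_CLOCK_HZ * bytes_per_word / clocks_per_word


best_measured = 480.886667 * 1024          # bytes/s, channels=30 result above

if __name__ == "__main__":
    rate = isa_limited_rate()
    print(f"ISA-limited: {rate / 1e6:.2f} MB/s ({rate / 1024:.0f} KiB/s)")
    print(f"best measured is {100 * best_measured / rate:.0f}% of that")
```

Changing `clocks_per_access` shows how sensitive the bound is to the I/O recovery setting: at 8 clocks per access it drops to roughly 0.42 MB/s, below the best measured rate, so this board is presumably running closer to the 4-clock case.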