jakethompson1 wrote on 2025-07-22, 02:35:
I have been messing with the UMC 418 SVGA I have been posting about recently on an MB-1433UIV, which is a UMC 498 chipset board.
When I switched out a 486DX2-66 for a UMC Green CPU and raised the bus to 40 MHz, I now have to enable "CPU ADS# Delay 1T or Not" when using the VLB SVGA (ISA SVGA of course is fine).
OK, so this shows that something requires the increased setup time of the address and/or command lines at 40MHz.
jakethompson1 wrote on 2025-07-22, 02:35:
But: cache read performance as reported by cachechk decreases from 51.6 MB/s to 47.9 MB/s with this enabled; does that mean the ADS# signal going into the cache controller is also delayed? Is that a needless delay or could the extra loading from the VLB card's presence (as opposed to the bus speed or speed of components on that card) be what is causing the problems on the bus, and the card might affect what the cache controller sees on the bus as well, so therefore its sampling of the bus should also be delayed?
The option is called "CPU ADS# Delay", not "VL ADS# Delay", so it is not surprising that ADS# going anywhere, including the cache controller, is delayed by that option. 51.6MB/s is 310ns per cache line and 47.9MB/s is 334ns per cache line, so the difference is about 24ns per cache line, which is one clock at 40MHz (25ns). As the cache line is transferred as a burst, only a single ADS# is required. Interestingly, the rates are 12.4 and 13.4 FSB cycles per cache line, which is way more than the 5 cycles (or 6 cycles with the extra ADS# delay) required by a 2-1-1-1 burst. This is not surprising, though, as it is well known that REP LODSD on the 486 processor is not optimized in microcode and does not hit the bus limit.
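For anyone who wants to redo that arithmetic, here is a minimal sketch. It assumes a 16-byte 486 cache line, the 40 MHz bus clock from the post, and that cachechk's "MB" means 10^6 bytes (if it means 2^20 bytes the absolute numbers shift a little, but the difference stays close to one clock). The function names are made up for this example.

```python
CACHE_LINE_BYTES = 16   # 486 cache line size
FSB_MHZ = 40            # bus clock used in the post

def ns_per_line(mb_per_s, line_bytes=CACHE_LINE_BYTES):
    """Time to read one cache line in ns, taking 1 MB as 10**6 bytes (assumption)."""
    return line_bytes / (mb_per_s * 1e6) * 1e9

def clocks_per_line(mb_per_s, fsb_mhz=FSB_MHZ):
    """Same time expressed in FSB clock periods."""
    return ns_per_line(mb_per_s) * fsb_mhz / 1000.0

if __name__ == "__main__":
    for rate in (51.6, 47.9):
        print(f"{rate} MB/s: {ns_per_line(rate):.0f} ns/line, "
              f"{clocks_per_line(rate):.1f} clocks/line")
    extra = clocks_per_line(47.9) - clocks_per_line(51.6)
    print(f"extra clocks per line with ADS# delay: {extra:.2f}")
```

The last line comes out very close to 1, which fits a single delayed ADS# per burst.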
jakethompson1 wrote on 2025-07-22, 02:35:
As I understand it looking over the VL-Bus spec, LDEV# has nothing to do with ADS# or the clock, and is constantly being generated by cards from the current state of the bus.
Delaying chipset sampling of LDEV# avoids an erroneous ISA cycle if VLB devices are too slow to assert it.
Wow, I didn't know the VL specification is publicly available. And now, looking for it, I found a 7-year-old VOGONS post. Yeah, LDEV# is typically generated combinationally from the address and command signals. So LDEV# is generated by a VL card for I/O address 3C0, but not for memory address 3C0. On the ISA bus, the 16-bit negotiation was also generated combinationally before the command line (/MEMR, /MEMW, /IOR, /IOW) went active, but as the cycle type was not visible before the command was active, on ISA we have separate lines for "if the address that is currently visible on SA0..SA9/SA15 is an I/O address, please perform a 16-bit cycle" and "if the address that is currently visible on LA17..LA23 is the memory address of the next cycle, please perform a 16-bit cycle".
To be clear, the LDEV# delay is not just about erroneously starting an ISA cycle: if LDEV# is sampled too early and the chipset misses the LDEV# signal of a VL card, you have two devices handling the same cycle, which may conflict on RDY# and, in the case of reads, also cause a bus conflict on the data lines.
jakethompson1 wrote on 2025-07-22, 02:35:
How frequent would the opposite problem happen--LDEV# is asserted spuriously due to some intermediate state on the bus, and the card is too slow to de-assert it before the next clock cycle, so an access to an ISA device is missed?
"Missing an access to an ISA device" is way worse than you make it sound. If the chipset sees a spurious LDEV# signal, it assumes the VL card will generate (L)RDY# and never supplies RDY# itself, so you get a hard lock-up of the front side bus, because no device is going to terminate that cycle. I expect that both problems (erroneously missing LDEV# and spuriously recognizing LDEV#) are equally likely, as both kinds of errors are caused by the setup time of the address and command lines relative to the LDEV# sample point being too short. This can cause LDEV# to be "still active", because the previous cycle did access the VL card and the card does not release LDEV# in time, and it can cause LDEV# to be "still inactive", because the previous cycle did not hit the card and the card doesn't manage to assert LDEV# in time.
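The two failure modes can be put into a toy model. This is only a sketch under a simplifying assumption: a card whose decode is slower than the sample point still shows the LDEV# state left over from the previous cycle. All names and the example delays are invented for illustration and are not taken from any chipset or card data sheet.

```python
def sampled_ldev(prev_hit, cur_hit, card_delay_ns, sample_point_ns):
    """LDEV# state the chipset sees: the new decode if the card was fast
    enough, otherwise the stale state from the previous cycle."""
    return cur_hit if card_delay_ns <= sample_point_ns else prev_hit

def outcome(prev_hit, cur_hit, card_delay_ns, sample_point_ns):
    """Classify what happens on the bus for one cycle."""
    seen = sampled_ldev(prev_hit, cur_hit, card_delay_ns, sample_point_ns)
    if seen == cur_hit:
        return "ok"
    if cur_hit:
        # Chipset starts an ISA cycle while the VL card also responds.
        return "missed"
    # Chipset waits for a VL card that never answers: the bus hangs.
    return "spurious"

if __name__ == "__main__":
    # Hypothetical numbers: card needs 22 ns, chipset samples after 20 ns.
    print(outcome(prev_hit=False, cur_hit=True, card_delay_ns=22, sample_point_ns=20))
    print(outcome(prev_hit=True, cur_hit=False, card_delay_ns=22, sample_point_ns=20))
    # Delaying the sample point by one 25 ns clock fixes both cases.
    print(outcome(prev_hit=False, cur_hit=True, card_delay_ns=22, sample_point_ns=45))
```

In this model both errors have the same cause (the card being slower than the sample point), and which one you get depends only on what the previous cycle was.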
jakethompson1 wrote on 2025-07-22, 02:35:
Motherboard cache and DRAM hits supersede LDEV# sampling, and for this reason, delaying LDEV# sampling should never slow down the cache or DRAM.
The penalty for delaying LDEV# sampling is paid on all VLB cycles and all ISA cycles.
I sincerely hope every VL chipset works that way. And in fact I did observe horribly bad ET4000 ISA performance if late LDEV# sampling is activated, even if no VL card is installed.
jakethompson1 wrote on 2025-07-22, 02:35:
VLB ID2/ID3 jumpers would have no effect and can't solve this problem, because the chipset doesn't care about those jumpers
Reading the actual specification is interesting. Before reading it, I believed that ID3 ("CPU speed") actually indicates the LDEV# sample point, with "<=33MHz" meaning "early" aka "end of first T2", and ">33MHz" meaning "late" aka "end of second T2". Obviously, this is not true, and the only specification given for the LDEV# sample point is that it should be 20ns after the address became valid. That's a full clock period at 50MHz, but depending on the speed of the mainboard, maybe sampling LDEV# early is still just possible at 50MHz. Address and command lines are valid some time before the end of T1, so the time available at the "early" sample point is the "Address/Command setup time to rising CLK with ADS# asserted" plus a whole bus clock, minus the delay in the chipset LDEV# recognition circuit (including the gate combining the totem-pole LDEV# signals from multiple slots). But I think this is enough parroting of section 3.1 of the VL 2.0 specification for this post 😉 .
The specification also contains the rationale for ID2 (allow 0WS writes). This signal is meant to forbid VL cards from driving RDY# on the first T2, as the chipset might drive RDY# during the first T2 cycle. I first wondered why only 0WS writes can be forbidden, while there is nothing about 0WS reads (which made no sense to me, because if it is about arbitrating who may drive RDY#, the requirement for VL cards to back off RDY# applies to reads as well as writes). I finally understood that 0WS reads are generally not supported, even if 0WS writes are. Section 2.2.2 of the VL 2.0 specification makes it clear: VL cards are not supposed to implement 0WS read cycles, because cache controllers may speculatively enable the data outputs of the cache chips on reads before even decoding whether an address is cacheable. The board thus claims the right to own the data lines on the first T2, no matter what address is accessed, so a 0WS read cannot be implemented, as that would require the VL card to drive the data lines on the first T2.
I assume the 0WS write stuff is similar. A board that does not allow "high speed writes" may reserve the right to speculatively enable the output driver for RDY# during the first T2 for 0WS cache hit writes. On cache misses and non-cacheable addresses (e.g. VL memory space), the driven output will stabilize to high early enough before the end of the first T2 to satisfy the 486 RDY# setup time requirement, but a VL card cannot interfere here. It makes some sense to have this feature speed dependent (most mainboard manuals ask you to allow 0WS writes at <= 33MHz, and forbid 0WS writes at > 33MHz), because the cacheability decision can start as soon as the addresses are valid (just like the LDEV# determination). If the FSB clock is low enough, the decision that an address is not in DRAM range may still happen during T1, so the chipset would know it doesn't need to drive RDY# on the first T2 (there are no non-cacheable, non-DRAM 0WS cycles). At higher FSB frequencies, the decision that the cycle clearly cannot be a mainboard 0WS cycle might happen only after the rising clock edge that starts T2, so a VL card shouldn't drive RDY#.
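The "early" sample point budget from the previous paragraph can be written down as a small calculation. This is a rough sketch, not a reading of the specification: it assumes the VL 2.0 setup figures quoted later in this post (7ns at FSB33, 5ns at FSB40/50) are the relevant setup times, uses nominal clock periods, and treats the chipset's LDEV# recognition delay as a free parameter, since I have no real figure for it.

```python
LDEV_DECODE_NS = 20.0                      # 20 ns after address valid (from the post)
PERIOD_NS = {33: 30.0, 40: 25.0, 50: 20.0} # nominal FSB clock periods
SETUP_NS = {33: 7.0, 40: 5.0, 50: 5.0}     # assumed address/command setup to CLK

def early_sample_budget(fsb_mhz, chipset_delay_ns):
    """Time the card has to produce LDEV# if the chipset samples 'early':
    setup time + one bus clock - chipset recognition delay."""
    return SETUP_NS[fsb_mhz] + PERIOD_NS[fsb_mhz] - chipset_delay_ns

def early_sampling_ok(fsb_mhz, chipset_delay_ns):
    """True if the card still gets its full 20 ns with early sampling."""
    return early_sample_budget(fsb_mhz, chipset_delay_ns) >= LDEV_DECODE_NS

if __name__ == "__main__":
    for fsb in (33, 40, 50):
        for delay in (5.0, 6.0):  # hypothetical chipset delays
            print(fsb, delay, early_sample_budget(fsb, delay),
                  early_sampling_ok(fsb, delay))
```

With these assumed numbers, early sampling at 50MHz works only if the chipset delay is no more than 5ns, which is what "just possible" means here.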
Interestingly, most VL cards just ignore the ID bits, even if some chips (e.g. the S3 Trio64) have configuration bits like "allow 0WS write cycles" and "decode addresses on ADS# or one cycle later".
jakethompson1 wrote on 2025-07-22, 02:35:
ADS# is asserted by the CPU (and then de-asserted) to indicate the start of a bus cycle
Delaying ADS# allows the address and bus cycle definition signals to stabilize for an extra entire clock cycle before VLB cards try to interpret them
I don't follow exactly what issue requires that--ADS# and the subsequent rising edge get seen by the VLB card too early, or too late, versus the address and bus cycle definition signals?
Is this caused by the motherboard layout and/or capacitive loading from other local bus devices being too high for the current clock rate, slowing down the arrival of bus signals at the SVGA VLB card? Or is it that the propagation through the logic on the VLB card is too slow once the signals have arrived? (or some combination of both)
At least in the case of the UM498, delaying ADS# slows down all cache and DRAM access by one cycle as well
The idea of asserting ADS# one cycle late is to add a full FSB clock period to the setup time of the address and command lines. Those lines are guaranteed to be valid some setup time before ADS#. The master will drive the address and command lines some clearly defined time after the end of the previous clock (so some time into T1), and the remaining time of T1 is the setup time, which gets shorter the higher the FSB clock is. You can look at 486 data sheets to find the maximum time between the previous rising clock edge and the validity of the address and command lines. The higher the 486 FSB clock specification, the less time the 486 may "waste" before having valid address and command outputs. The remaining time of the cycle is required for charging the capacitive loads on the front side bus (likely including VL devices, as VL is typically unbuffered) and for propagation along the traces on the board, and yet the signals have to arrive some time before the next clock edge at the VL target. The "some time before" (the setup time at the receiver) is meant for propagation delays through the logic on the VL card. The VL 2.0 specification guarantees 7ns setup time at FSB33 and 5ns setup time at FSB40 and FSB50. VL card designers should know this constraint and design their cards in a way that this amount of setup time is sufficient. Hmm, well... now look at the CL-GD542x data sheet, which requires at least 8ns setup time for the address, command, and UADDR# lines. UADDR# is meant to be decoded using external logic, and well, the signals may already be 1 nanosecond late at the inputs of the decoder if the VL board is at the edge of allowed timings. Now, if the inputs are 1 nanosecond too late, how the heck are you supposed to generate the output in time?! Delaying ADS# by one clock would surely help.
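Putting the numbers from this paragraph side by side, here is a small sketch of the setup margin with and without the ADS# delay. It only uses the figures quoted above (7ns/5ns guaranteed by VL 2.0, 8ns required by the CL-GD542x) plus nominal clock periods, and it ignores any external UADDR# decoder delay, which would make things worse. The function name is made up for this example.

```python
GUARANTEED_SETUP_NS = {33: 7.0, 40: 5.0, 50: 5.0}  # VL 2.0 figures quoted above
PERIOD_NS = {33: 30.0, 40: 25.0, 50: 20.0}         # nominal FSB clock periods
GD542X_REQUIRED_NS = 8.0                            # CL-GD542x requirement quoted above

def setup_margin(fsb_mhz, required_ns, ads_delayed=False):
    """Worst-case setup margin at the VL target in ns (negative = violated).
    Delaying ADS# by 1T adds one full clock period to the available setup."""
    available = GUARANTEED_SETUP_NS[fsb_mhz]
    if ads_delayed:
        available += PERIOD_NS[fsb_mhz]
    return available - required_ns

if __name__ == "__main__":
    for fsb in (33, 40, 50):
        print(fsb,
              setup_margin(fsb, GD542X_REQUIRED_NS),
              setup_margin(fsb, GD542X_REQUIRED_NS, ads_delayed=True))
```

On a board at the edge of the allowed timings, the margin is negative at every speed without the delay, and comfortably positive with it.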
jakethompson1 wrote on 2025-07-22, 03:41:
Because I had maxed out at 2-1-1-1 read and 1 WS, I tried replacing the 15ns "CE" brand Tag RAM that came with the board with a 15ns UMC one, and this seemingly has fixed the need for ADS# delay as well--as I know that often unbuffered address lines go to the Tag RAM, was the prior one loading down the address lines and slowing their change in state?
The CE tag RAM and the VL graphics card together loaded the address lines hard enough that some required setup time was not met. The UMC RAM together with the VL card does not do that. Now, this can have two reasons: either the CE tag RAM has a higher input capacitance than the UMC RAM, or (and that's what I suspect) the UMC RAM might actually be slightly faster than the CE RAM, and the critical path is not the VL card, but the 2-1-1-1 cache lookup. So with the UMC tag RAM, the address lines are still as slow as they were with the CE RAM, but as the cache lookup is faster with the UMC RAM, the tag signals arrive in time for the 2-clock leadoff cycle when you use the UMC RAM, whereas they just miss the required "tag setup time" with the slower CE RAM.
You can verify whether that hypothesis is correct by re-installing the CE tag RAM, configuring no ADS# delay, and setting the cache timing to 3-1-1-1. If my guess is right, this still works. Adding the ADS# delay slowed everything down by one FSB clock, which is quite an undirected approach. If the tag RAM access time is the limiting factor in your configuration, just slowing down the cache timing should help equally well, but keep VL performance high (although VL target performance might be limited with that UMC418 graphics card anyway).
jakethompson1 wrote on 2025-07-22, 02:35:
LRDY# is asserted by the VLB device to end the bus cycle
Assertion of LRDY# could occur before (on a read cycle) data headed from the VLB card to the CPU has stabilized, causing the CPU to see the wrong data?
VLB cards could suppress their own generation of LRDY# for one cycle (e.g., via ID3 jumper) OR the chipset could be software-programmable to delay the sampling of LRDY#, disregarding cards that assert it too quickly - the effect should be the same
A VL card is not supposed to assert LRDY# before it drives the data lines with valid data. The data lines are likely directly connected to the front-side bus, while the LRDY# signal might be buffered in the chipset. So it is extremely unlikely for LRDY# to "arrive before the data", as long as the VL card is compliant. You are right that a card that implements a ROM (e.g. a BIOS) might need an extra wait state if the FSB clock is high, so it might use the ID3 signal to decide the number of wait states for ROM reads. That's a valid approach.
You are also correct that many chipsets are able to do something that will effectively delay LRDY# by one clock. This is not meant to fix a broken VL target that asserts LRDY# before driving the data, though. The reason for that chipset feature, called "resynchronization", is to make sure that LRDY# meets the setup and hold time requirements of the 486 processor. The 486 processor requires LRDY# to not change from some time before the rising edge of the clock (the setup time) until some time after the rising edge of the clock (the hold time). If LRDY# is low all that time, the cycle is finished. If LRDY# is high all that time, a wait state is added to the cycle. If LRDY# changes during that time, undefined things might happen (most likely, it is just undefined whether LRDY# is recognized or not). VL cards are likely to output LRDY# some time after the previous clock edge, and the higher the FSB clock, the less time remains for LRDY# to propagate via the board traces and some processing in the chipset. At some FSB frequency, the LRDY# signal might be too late to meet the setup time requirement, so it might arrive during the sampling window in which LRDY# is forbidden to change. This is what "resynchronization" fixes: if the chipset "resynchronizes" LRDY#, it samples LRDY# a quite short time after the rising edge of CLK, and outputs that signal for the complete clock period, including the hold time of the next clock period. So if LRDY# is asserted later than the chipset sample point, LRDY# will not yet be seen by the processor at the end of that cycle, but at the end of the next cycle. Most importantly, though, this scheme ensures LRDY# at the processor side does not change during the setup and hold time.
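To make the difference between forwarding LRDY# directly and resynchronizing it more concrete, here is a toy timing model. It is a sketch under simplifying assumptions: time is measured in ns from clock edge 0, the processor samples at every multiple of the clock period, the chipset's own sampling is treated as ideal, and the 5ns setup, 3ns hold and 4ns sample offset are invented example values, not data sheet figures.

```python
import math

def transparent_edge(arrival_ns, period_ns, setup_ns, hold_ns):
    """Clock edge at which the CPU recognizes LRDY# when it is forwarded
    directly. Returns None if LRDY# changes inside the forbidden
    setup/hold window around an edge (result undefined)."""
    phase = arrival_ns % period_ns
    if phase > period_ns - setup_ns or phase < hold_ns:
        return None
    # First edge k with k*period - setup >= arrival.
    return math.ceil((arrival_ns + setup_ns) / period_ns)

def resync_edge(arrival_ns, period_ns, sample_offset_ns):
    """Clock edge at which the CPU recognizes LRDY# when the chipset
    samples it sample_offset_ns after each edge and holds the sampled
    value for a full period."""
    # First chipset sample point at or after the arrival of LRDY#.
    k = max(math.ceil((arrival_ns - sample_offset_ns) / period_ns), 0)
    return k + 1  # the CPU sees the held value at the following edge

if __name__ == "__main__":
    P, SETUP, HOLD, OFFSET = 25, 5, 3, 4  # hypothetical values at 40 MHz
    for arrival in (3, 18, 22):
        print(arrival,
              transparent_edge(arrival, P, SETUP, HOLD),
              resync_edge(arrival, P, OFFSET))
```

In this model an LRDY# arriving at 18ns is recognized one edge later with resynchronization than without, while an LRDY# arriving at 22ns is undefined without it and cleanly recognized at edge 2 with it.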
jakethompson1 wrote on 2025-07-22, 02:35:
The penalty for delaying LRDY# generation by a card is paid only on a VLB cycle to that particular card; the penalty for delaying LRDY# sampling is paid on all VLB cycles across all cards.
If the chipset is not in LRDY# resynchronization mode (in which it samples LRDY# and forwards it after the next clock has started), it is in transparent mode and will forward LRDY# as soon as it can (though that configurable latch/flip-flop/register will still have some propagation delay), so it is not actually valid to talk about "sampling" in that case. I expect RDY# resynchronization to apply not just to LRDY# from the VL slots, but likely also to RDY# generated by the ISA bridge, but that might depend on the chipset.
As the 486 processor reads the data when it sees RDY#, VL cards are required to keep driving data on read cycles until RDY# has arrived at the processor (or whoever initiated that cycle). Because LRDY# might be sent through a resynchronization circuit, or the reader might not be the processor at all, but an ISA bus master (including the ISA DMA circuit), the VL card cannot rely on the read being done at the next clock edge. And that's why VL includes RDYRTN#, which tells the card when the read cycle is actually over, so the card may stop driving the bus.