VOGONS


Reply 40 of 44, by mkarcher

Rank l33t

Deunan wrote on 2022-04-10, 12:08:

I'd like to add a few technical bits that are perhaps worth considering. The 386 has a NA# signal that allows for limited bus pipelining - it can drive a full set of control signals (not just the address) 1 bus cycle (2 clock cycles) earlier to allow the chipset to do things like cache tag checks or perhaps address output if 2 or more RAM banks are used. Though frankly it's rather difficult to do properly and I have not seen it used (but then again I have not probed each and every mobo I have).

I have read about NA# and its use in the "Compaq Deskpro 386/20 Technical Reference Manual". It is based on the Intel 82385 cache controller. The ISA bridge is custom logic, as I understand it.

I have seen a VLSI VL82C320 / VL82C331 (TOPCAT) aka Intel 82343/Intel 82344 based mainboard with a jumper called "pipeline", and experienced higher bus performance with pipelining enabled than with pipelining disabled.

The bus timing behaviour is quite different between the 286 and the 386DX. The 286 has the behaviour I described: The address is presented half a processor / bus clock (i.e. a full CLK2x clock) early, unconditionally. The 386 has the behaviour Deunan described: The "NA#" pin allows the chipset to voluntarily release the current address from the bus, so the next address and the associated control signals (if already known) can be presented a full processor / bus clock (i.e. two CLK2x clocks) early. This difference between the 286 and 386DX bus protocol makes it obvious that the 386SX is not just a 386 core with a 286 bus interface unit. The 386SX uses the same timing and pipelining capability as the 386DX. Its bus timing is incompatible with the 286.

A lot of late 80286 chipsets (the TOPCAT I mentioned above is one of these) support both the 80286 and the 80386(SX) bus timing, so the same chipset can be used with either processor. You could even easily build mainboards that supported both kinds of processor, if some logic detected which processor socket is populated or a jumper was used to configure 286 vs. 386(SX) bus mode. The existence of such boards reinforced the misconception that the 386SX uses the same bus protocol as the 286.

Reply 41 of 44, by Deunan

Rank Oldbie

Eh, you make it sound worse than it is. The 286 and 386SX busses are not incompatible. Somewhat different, yes, but you can bolt ISA to both CPUs without too much trouble.

It's true the 286 had this whole new "bus pipelining" as Intel described it, but in reality it has the exact same timings as 386SX. The only difference is the whole thing is shifted 1 CPU clock cycle - but since the execution unit in both CPUs is actually decoupled from bus unit, it's just a matter of how we choose to name the cycles and which CPU part is the "main" one and therefore starts the official cycle count. Looking at the problem from 386 perspective one could argue the 386 has both ALU and bus running in sync (with regard to cycle timings) and 286 has the ALU trailing bus by one clock cycle. In the end though the chips outside the CPU do not care what the ALU is doing. Frankly, the more important issue would be the CLK2 phase difference between 286 (cycle starts on falling edge) and 386 (cycle starts on rising edge) - but in those days RAM, and busses, were asynchronous.

On 286 you need an extra state decoder chip (and clock generator), but ALE (generated by 82288) is latching the address exactly 2 clock cycles after it was output to the bus. On 386 the CPU has enough pins not to need any extra logic to drive memory and I/O signals, and instead of ALE we get ADS# which serves the same function, and has the same timing. Actual data read also happens exactly 2 cycles later on both CPUs. The only real difference is the 82288 has a few extra signals that are somewhat longer or shorter and these are meant to directly drive bus transceiver chips for example. I guess by the time 386 arrived the assumption was there would be a more integrated chipset that didn't need 74 class chips for that, and it would either be fast enough, or it would introduce its own waitstates as is the case with this particular mobo. But I see it as a limitation of the chipset, not the CPU. After all once we raise the CPU clock well above ISA speeds we have to somehow decouple the two anyway. This is why VLB was invented, to work around the narrow and dumbed down ISA.

So while you cannot just replace a 286 with 386SX on a mobo, the glue logic (especially on the early, slow 16MHz 386) was mostly re-used from 286 days. The real problem with this mobo is it tried to do the same with 486 and most likely ran into further problems that are cache related.

Reply 42 of 44, by mkarcher

Rank l33t

Deunan wrote on 2022-04-11, 16:22:

The 286 and 386SX busses are [...] Somewhat different, yes, but you can bolt ISA to both CPUs without too much trouble.

With just a little quote-mongering: Full agree on that. The bolts have a slightly different shape, though. ISA is a good fit for both the 286 and the 386SX, although both of those processors would be best served by 16-bit ISA running at the full processor clock, just like IBM did it in the 5170 AT. For compatibility reasons, we do not get that at processor clocks above 12MHz, though.

Deunan wrote on 2022-04-11, 16:22:

It's true the 286 had this whole new "bus pipelining" as Intel described it, but in reality it has the exact same timings as 386SX. The only difference is the whole thing is shifted 1 CPU clock cycle - but since the execution unit in both CPUs is actually decoupled from bus unit, it's just a matter of how we choose to name the cycles and which CPU part is the "main" one and therefore starts the official cycle count.

For me, the main point is: On the 286, address pipelining is mandatory. You lose the address bits on the processor pins before the cycle is over. On the 386SX, it's the choice of the mainboard chipset when to assert NA# (or if at all), so bus pipelining is optional on the 386SX.

On the 286, we get only the address and the type (memory/IO) "in advance", but the command (S0, S1) will be presented after completion of the previous cycle. With "completion" I am talking about that point in time where read data is sampled by the processor in a read cycle. On the 386SX, we get the whole command (including direction information) "in advance" in a pipelined cycle. Also, the time from address valid to sampling read data is slightly below 2.5 processor clocks on the 286 (0WS), but slightly below 2.0 processor clocks on the 386SX (0WS, non-pipelined) or slightly below 3.0 processor clocks (0WS, pipelined).
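To put rough numbers on those windows, here is a minimal Python sketch (not from the original posts). It converts the quoted clock counts into nanoseconds for a given processor clock. The function and table names are made up for illustration, and the results are upper bounds only, since the real windows are "slightly below" these values once setup and output delays are subtracted.

```python
# Address-valid to data-sampled window at 0WS, in processor clocks,
# using the nominal figures quoted in the post above.
WINDOW_CLOCKS = {
    "286": 2.5,
    "386SX non-pipelined": 2.0,
    "386SX pipelined": 3.0,
}


def window_ns(clocks: float, cpu_mhz: float) -> float:
    """Nominal window length in nanoseconds (an upper bound, not a spec value)."""
    return clocks * 1000.0 / cpu_mhz


def window_table(cpu_mhz: float) -> dict:
    """Map each bus mode to its nominal window at the given processor clock."""
    return {name: window_ns(c, cpu_mhz) for name, c in WINDOW_CLOCKS.items()}


if __name__ == "__main__":
    for name, ns in window_table(16.0).items():
        print(f"{name}: just under {ns:.2f} ns at 16 MHz")
```

The memory and chipset access time has to fit inside that window, minus the CPU's own setup and propagation delays, which this sketch deliberately ignores.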

Harris made a paper on "why the 286 beats the 386SX", which is biased due to Harris having a license to make 286 chips, but no license to make 386 chips, and they make a big deal of the 386SX not being able to run a pipelined cycle after the bus being idle. If your board is designed to cope with 2.5 clocks address-to-data time, the 386SX is indeed a loss, because you can not run a 2.5-clock cycle after the bus being idle, but you need to add a wait state to the non-pipelined (2.0-clock) cycle, and only subsequent pipelined cycles that allow 3.0 clocks address-to-data can be served by a design that needs 2.5 clocks. On the other hand, if your design is able to cope with 2.0 clocks address-to-data cycles, or needs 3.0 clocks anyway, the 386SX bus timing might be advantageous.
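The trade-off in that paragraph can be checked with a small Python sketch (an illustration, not taken from the Harris paper). It assumes each wait state adds exactly one processor clock to the address-to-data window and uses the nominal 2.5 / 2.0 / 3.0 clock figures rather than the real "slightly below" values. All names here are hypothetical.

```python
import math

# Nominal 0WS address-to-data windows in processor clocks (figures from the thread).
BASE_CLOCKS = {
    "286": 2.5,
    "386SX non-pipelined": 2.0,  # e.g. the first cycle after the bus was idle
    "386SX pipelined": 3.0,      # subsequent cycles when NA# is used
}


def wait_states(base_clocks: float, needed_clocks: float) -> int:
    """Wait states required so the window covers what the board needs.

    Assumes one wait state lengthens the window by one processor clock.
    """
    return max(0, math.ceil(needed_clocks - base_clocks))


def wait_states_by_mode(needed_clocks: float) -> dict:
    """Wait states per bus mode for a board needing `needed_clocks` clocks."""
    return {mode: wait_states(base, needed_clocks)
            for mode, base in BASE_CLOCKS.items()}


if __name__ == "__main__":
    for need in (2.0, 2.5, 3.0):
        print(need, wait_states_by_mode(need))
```

Under these assumptions, a board that needs 2.5 clocks costs the 386SX one wait state on every non-pipelined cycle while the 286 runs at 0WS, whereas a board that needs 3.0 clocks costs the 286 a wait state on every cycle and the pipelined 386SX none.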

Reply 43 of 44, by Horun

Rank l33t++

Ohh just go ahead and state it: the chipset is not optimized for a 486 and better suited for a 386. So like that odd Opti based board that can run either 386 or 486 was not really refined for a 486 but ran OK with one but nothing to brag about ;p

Hate posting a reply and then have to edit it because it made no sense 😁 First computer was an IBM 3270 workstation with CGA monitor. Stuff: https://archive.org/details/@horun

Reply 44 of 44, by Deunan

Rank Oldbie

Yeah, the OPTi chipsets that work with both 386DX and 486 are not all that great performance-wise. Perhaps it depends on the particular chipset and implementation as well but the mobo I have pretty much adds 2 extra cycles to RAM read. If you hit mobo cache then the data will be presented to CPU at the end of the 2nd cycle (so 0WS from CPU perspective), but if not these 2 are now basically waitstates and only now RAM read happens. Worse yet, the tag check is not all that fast and it's better to have just one bank of cache populated for 386, since it cannot do bursts, so with 2 banks the tag check needs more time or way faster chips to be stable. On the other hand with 486 you want 2 banks because of possible interleave, even if you have to add an extra WS to the tag lookup. Or in other words, 386DX should preferably use 2-x-x-x (x doesn't matter on 386) cache timing settings, and on 486 3-1-1-1 is preferable to 2-2-2-2 (assuming 2-1-1-1 is not stable, and probably won't be).
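The cache timing comparison is simple arithmetic, sketched below in Python. It assumes the usual reading of the a-b-c-d notation, where each number is the clock count for one of the four transfers in a 486 burst, and that a 386DX only ever sees the first (lead-off) number because it cannot burst. The helper names are invented for this example.

```python
def lead_off_clocks(timing: str) -> int:
    """Clocks for the first transfer, the only figure that matters on a 386DX."""
    return int(timing.split("-")[0])


def burst_clocks(timing: str) -> int:
    """Total clocks for a full four-transfer burst, as a 486 would perform."""
    parts = timing.split("-")
    if len(parts) != 4:
        raise ValueError("expected a timing like '3-1-1-1'")
    return sum(int(p) for p in parts)


if __name__ == "__main__":
    for setting in ("2-1-1-1", "3-1-1-1", "2-2-2-2"):
        print(setting, "->", burst_clocks(setting), "clocks per burst")
    print("386DX lead-off for 2-x-x-x:", lead_off_clocks("2-x-x-x"), "clocks")
```

With that reading, 3-1-1-1 costs fewer clocks per burst than 2-2-2-2 despite the slower first access, which matches the preference stated above for the 486.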

There are 2 more issues with 386 vs 486 that are not specific to any chipset brand but do require different addressing logic. First is the fact that 486 expects to see the data on a different part of the bus during 8- and 16-bit transfers, and this is a major issue. It requires the chipset to re-route the signals internally to present them to CPU, which is possibly another reason to have an extra cycle thrown in there - it would serve as recovery for the bus and allow propagation and stable output on the chipset lines to CPU. And the funny thing is, Intel changed that between 386 and 486 but apparently didn't bother to add the logic into CPU to make it more sane (like, say, always expect 8-bit data on the D0-D7 pins, which would make bolting 486 to 8-bit memory and I/O trivial).
The second problem is pseudo-locks. Seems like Intel was considering multi-CPU operation with 486 and added things like pseudo-locking to make sure reads and writes of data chunks longer than 32 bits (FPU registers, segment descriptors, etc) happen atomically. With the on-chip cache this can be an issue even in single-CPU systems with DMA so the chipset should properly support it. In theory this would not really affect the timings but again it adds to the chipset logic complexity so it just might.
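The byte-lane issue (the first of the two points above) can be illustrated with a short Python sketch. It rests on an assumption that should be checked against the datasheet: that the 486 expects a byte from an 8-bit device on the lane selected by the low two address bits, so the chipset has to steer it there from D0-D7. The function names are hypothetical and this models nothing beyond lane selection.

```python
def pins_486(address: int) -> tuple:
    """(lowest, highest) data pin the 486 is assumed to sample for this byte.

    Assumption: lane = low two address bits, 8 pins per lane.
    """
    lane = address & 3
    return (8 * lane, 8 * lane + 7)


def pins_always_low(address: int) -> tuple:
    """The 'saner' scheme suggested above: 8-bit data always on D0-D7."""
    return (0, 7)


def chipset_must_steer(address: int) -> bool:
    """True if a byte arriving on D0-D7 has to be moved to another lane."""
    return pins_486(address) != pins_always_low(address)


if __name__ == "__main__":
    for addr in (0x3F8, 0x3F9, 0x3FA, 0x3FB):
        lo, hi = pins_486(addr)
        print(f"port {addr:#05x}: D{lo}-D{hi}, steer={chipset_must_steer(addr)}")
```

Under that assumption, three out of four byte addresses need steering logic in the chipset, which is the extra data-path work described above.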