VOGONS


First post, by Marco

Rank Member

Hello all,

I have quite a deep technical question. I already know a fair amount about DOS extenders and their advantages: direct memory access and bypassing the 640K barrier.

What it is about:
- With the introduction of DOS4GW, all games using this protected-mode extender came with higher hardware requirements, mainly 4MB of RAM.
- This also holds for games that are technically identical to their predecessors, which used real-mode "engines". Mainly Sierra adventures. Why is that?

System setup:
- 386SX/25 with 2MB, later 4MB RAM
- Focus on Sierra games like Quest for Glory 3 and 4, Police Quest 4, Gabriel Knight

Examples:
- Sierra's adventures using the DOS4GW extender needed at least 4MB of RAM, even though they offered the same level of graphical detail as their predecessors.
- Even better: I had a Sierra Online demo CD with a demo version of Quest for Glory 4 that ran in real mode, without DOS4GW, on my 386SX/25. Result: flawless and smooth as usual, no issues with 2MB of RAM. The final version of QfG4 used DOS4GW, was terribly slow and required 4MB of RAM. The upgrade cost me 400 DM back in the day, btw.
Also Larry 6, as an exceptional non-DOS4GW game, ran fantastically.

Question:
Why is that? The 386SX supports 32-bit addressing and instructions. Is it because it has to translate 32-bit instructions to its 16-bit bus, which could double the number of CPU cycles per instruction (wait states?)? I never had a 386DX/25 to compare against, unfortunately, but I did find some forum remarks on the internet stating that 386SX performance falls quite a lot behind its DX counterpart when using DOS4GW.

Thanks a lot

1) VLSI SCAMP 311 / 386SX25@30 / 16MB / CL-GD5434 / CT2830/ SCC-1&MT32 / Fast-SCSI AHA 1542CF + BlueSCSI v2/15k U320
2) SIS486 / 486DX/2 66(@80) / 32MB / TGUI9440 / LAPC-I

Reply 1 of 23, by Jo22

Rank l33t++

Re: What is the main bottleneck in dos programming?

Not sure if that's helpful, though.
Maybe the 386SX has to use some external logic to interface with the 286 bus (ISA).

The 386DX technically can work in 386SX "mode", too, btw.
It has the ability to transfer in 16-bit chunks like a 286 and the 386SX (which did not exist yet when the original 80386 was made).

"Time, it seems, doesn't flow. For some it's fast, for some it's slow.
In what to one race is no time at all, another race can rise and fall..." - The Minstrel

//My video channel//

Reply 2 of 23, by mkarcher

Rank l33t

DOS4GW most likely isn't the cause of the performance issue, but just a piece in the puzzle. Performance in protected mode and in real mode is not significantly different. You lose some performance when you run with paging enabled (EMM386 does it, for example), but unless EMM386 is already loaded, DOS extenders like DOS4GW don't enable paging.

DOS4GW is a DOS extender that came with the Watcom C/C++ compiler. So code using DOS4GW is most likely compiled with that C compiler, as 32-bit code. And that is the main point. In 32-bit code, all pointers take 32 bits, and integers (by default) use 32 bits, too. In standard 16-bit DOS code, most integer variables are just 16 bits wide, and depending on the memory model, pointers might also be limited to 16 bits. At the same time as going from a real-mode 16-bit game engine to a 32-bit game engine, a lot of highly optimized assembler code was probably dropped and replaced by more maintainable C code. If code uses 32-bit pointers and 32-bit integers, it issues a lot of 32-bit memory access cycles. Those are twice as fast on the 386DX as on the 386SX. So the bus is a bottleneck mostly for 32-bit code. The performance penalty of the 386SX on 32-bit code, combined with less optimized (but possibly more generic) game engine code, is the likely explanation for the low performance of the 386SX in 32-bit Sierra games.
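To make the bus argument concrete, here is a minimal sketch in Python. It is only a toy model: it assumes aligned accesses, no cache and equal cost per transfer, and the names `BUS_BITS`, `bus_transfers` and `total_transfers` are made up for this illustration. It just counts how many data-bus transfers a run of memory accesses needs on each chip.

```python
# Toy model: count data-bus transfers for a sequence of memory accesses.
# Assumptions (illustrative only): every access is aligned, there is no
# cache, and each transfer costs the same on both chips.

BUS_BITS = {"386SX": 16, "386DX": 32}  # external data bus width in bits


def bus_transfers(access_bits, bus_bits):
    """Transfers needed for one aligned access (ceiling division)."""
    return -(-access_bits // bus_bits)


def total_transfers(accesses, cpu):
    """Sum the transfers for a list of access widths given in bits."""
    return sum(bus_transfers(bits, BUS_BITS[cpu]) for bits in accesses)


code16 = [16] * 1000  # 16-bit code: mostly 16-bit integers and pointers
code32 = [32] * 1000  # 32-bit code: 32-bit integers and pointers

for label, accesses in (("16-bit accesses", code16), ("32-bit accesses", code32)):
    sx = total_transfers(accesses, "386SX")
    dx = total_transfers(accesses, "386DX")
    print(f"{label}: 386SX={sx} 386DX={dx} ratio={sx / dx:.1f}")
```

In this model the two chips tie on 16-bit accesses, while the SX needs twice as many transfers for 32-bit accesses. Real systems add wait states, misaligned accesses and instruction fetches on top of that.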

Reply 3 of 23, by Marco

Rank Member

Thanks a lot guys. That sounds like a good explanation.

Thanks again


Reply 4 of 23, by pentiumspeed

Rank l33t

You are missing the key fact: the 386*SX* has a 16-bit data path even though the processor core is 32 bits. Executing any 32-bit instruction imposes a big penalty in extra cycles, because the chip has to transfer 16 bits twice for every 32-bit access, and then twice again for the next 32 bits of data, again and again.

If you had a 386DX, which is fully 32-bit, it would handle 32-bit data and instructions without that extra latency.

PS: if there is a cache on the motherboard with the 386SX, it helps a lot indeed.

Cheers,

Great Northern aka Canada.

Reply 5 of 23, by Marco

Rank Member

Thanks as well. That's what I initially meant with:

"Is it because it has to translate 32-bit instructions to its 16-bit bus, which could double the number of CPU cycles per instruction (wait states?)?"

I'd really like to see a benchmark of an identical app/game, once with a DOS extender and once in 16-bit real mode.


Reply 6 of 23, by AlexZ

Rank Member

The 386SX was crippled, so it was sort of like a 286 with 386 instructions, but not capable of executing 32-bit code very fast. It was meant to be used mostly for 16-bit software, with the option of executing 32-bit code in theory. As far as I remember it didn't even have a cache. As described above, it suffers huge penalties with 32-bit code.

A 386DX/40 is what you need to play early DOS4GW games. But only a few are playable, as memory (can be upgraded to 8MB), slow ISA bus speed (affects video transfers, partially solvable by running ISA at 12 MHz) and the CPU usually became bottlenecks. It is true that those early games do not offer a better experience than the 16-bit real-mode ones coded for the 286.

Pentium III 900E, ECS P6BXT-A+, 384MB RAM, NVIDIA GeForce FX 5600 128MB, Voodoo 2 12MB, 80GB HDD, Yamaha SM718 ISA, 19" AOC 9GlrA
Athlon 64 3400+, MSI K8T Neo V, 1GB RAM, NVIDIA GeForce 7600GT 512MB, 250GB HDD, Sound Blaster Audigy 2 ZS

Reply 7 of 23, by bakemono

Rank Oldbie

It should also be noted that running in 32-bit protected mode is a choice of the developers, and one of the biggest reasons to make that choice is needing more than 640KB of memory. It means the developers set out to make a heavier game. So the situation is more like "higher memory requirements lead to using DOS4GW" and not "using DOS4GW leads to higher memory requirements"

If I'm not mistaken there are also instructions on 386/486 which take more cycles to execute in protected mode because of MMU overhead, so that can also slow things down a bit.

again another retro game on itch: https://90soft90.itch.io/shmup-salad

Reply 8 of 23, by Jo22

Rank l33t++

AlexZ wrote on 2022-08-18, 15:04:

The 386SX was crippled, so it was sort of like a 286 with 386 instructions, but not capable of executing 32-bit code very fast. It was meant to be used mostly for 16-bit software, with the option of executing 32-bit code in theory. As far as I remember it didn't even have a cache. As described above, it suffers huge penalties with 32-bit code.

The 386SX really was a castrated i80386.
That's what the 386 was originally called, before the DX suffix was introduced.

From what I remember, the critics of the day were really disappointed by Intel's announcement of the 386SX (words like "lazy" and "boring" were used).
That's what computer magazines from the late 80s said, at least.

Ironically, the 80386/386DX already had the ability to do 16-bit transfers.
The BS16 pin can be used for switching between 16-bit and 32-bit data size.

So the 386SX did not introduce anything new, really. It relates to the 386DX like the 8088 to the 8086.
- At least it gave new life to intelligent, mature 80286 chipsets (they used to be great; UMBs and EMS in hardware; no EMM386/V86 needed).
That was its only real reason to exist, maybe.

The only notable difference was the 386SL, a low-power notebook processor.
It introduced things like System Management Mode (SMM) and power-saving support.

However, the most castrated spin-off was perhaps the 80376.
It had both 16-bit and 32-bit registers (AX, EAX etc.) but was 32-bit protected mode only.
No V86, no paging. But the MMU's segmentation unit was still operational.

AlexZ wrote on 2022-08-18, 15:04:

A 386DX/40 is what you need to play early DOS4GW games. But only a few are playable, as memory (can be upgraded to 8MB), slow ISA bus speed (affects video transfers, partially solvable by running ISA at 12 MHz) and the CPU usually became bottlenecks.

Yes, the 386DX-40 was neat. My father used one for professional software development in the early/mid 90s.

By dividing the clock by 4 (or was it 8? 80 MHz oscillator), the ISA bus could be set to a clean 10 MHz.

Which was a bit less restrictive than 8.33 MHz.
Speaking of overclocking, 12 MHz (as you said) to 16 MHz was possible with good hardware.
The ~16.66 MHz were ideal, in theory, because they'd be exactly twice the default clock.
Programs like MOD4WIN encouraged ISA bus overclocking in their help files.
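The divider arithmetic above can be checked with a few lines of Python. This is only a sketch of the numbers mentioned in this post; the helper name `isa_clock_mhz` is made up, and real chipsets offer only whatever dividers their BIOS setup exposes.

```python
def isa_clock_mhz(source_mhz, divider):
    """ISA bus clock derived by dividing a source clock (both in MHz)."""
    return source_mhz / divider


# 40 MHz CPU clock divided by 4, or the 80 MHz oscillator divided by 8
print(isa_clock_mhz(40.0, 4))
print(isa_clock_mhz(80.0, 8))

# Twice the default 8.33 MHz ISA clock
print(round(2 * 8.33, 2))
```

Both divider choices land on the same 10 MHz, and doubling 8.33 MHz gives the ~16.66 MHz figure.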

AlexZ wrote on 2022-08-18, 15:04:

It is true that those early games do not offer a better experience than the 16-bit real-mode ones coded for the 286.

+1

I remember those LucasArts DOS4GW games like Sam&Max running on a slow 386
performed worse than the 16-bit Sierra VGA titles (Larry 1 and Space Quest 1 remakes) on a 10 MHz 286.

I do like Sam&Max Hit The Road a lot, but the engine was scumm.
Moving very slowly. As if being tarred and feathered.

That's why I never loved that 32-bit and flat-mode cult.
Real mode and 16-bit protected mode were uncomfortable to work with,
but software using them often was not that slow in practice.

And then there's the slowdown from V86 and the 80386 MMU's paging unit.
Also, the 386 didn't yet support Enhanced V86 (aka VME).
The 586 (aka Pentium) core and late 486 cores had VME.
QEMM 7 was one of the early memory managers with official support for it.
There's even a sticker on the big box that mentions special support.

Edit: Typos fixed. Sorry, working from a smartphone.

Reply 9 of 23, by Horun

Rank l33t++

Good explanations of the difference between SX and DX!

Hate posting a reply and then have to edit it because it made no sense 😁 First computer was an IBM 3270 workstation with CGA monitor. Stuff: https://archive.org/details/@horun

Reply 10 of 23, by rasz_pl

Rank l33t

pentiumspeed wrote on 2022-08-18, 14:35:

PS: if there is a cache on the motherboard with the 386SX, it helps a lot indeed.

The biggest help would be a 1KB cache on the CPU itself, as evidenced by the 486SLC running faster than a 386DX https://youtu.be/ldYQQPYlRAU?t=220

Open Source AT&T Globalyst/NCR/FIC 486-GAC-2 proprietary Cache Module reproduction

Reply 11 of 23, by jakethompson1

Rank Oldbie

Another part of this is that a DOS extender, Windows 386-enhanced mode included, is actually an operating system that picks and chooses which DOS calls it wants to implement and which it wants to pass back to DOS or the BIOS, running them in Virtual 8086 mode. The book Undocumented Windows 95 has a good explanation of this. For example, a DOS extender author would want to substitute their own code for things like memory allocation (being the whole point of the extender) while passing calls to open a file, read a file, list a directory, etc., back to DOS so as not to have to implement that functionality. So that explains some of the overhead.
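As a rough illustration of that "pick and choose" idea, here is a toy dispatcher in Python. The split between the two tables is hypothetical and not taken from DOS4GW or any real extender; the AH values are the usual INT 21h function numbers. It simply counts how many mode switches a sequence of calls would cause if every call passed down to DOS costs one switch out of protected mode and one back.

```python
# Hypothetical split of INT 21h functions (by AH value) for a toy extender.
HANDLED_BY_EXTENDER = {
    0x48: "allocate memory",
    0x49: "free memory",
    0x4A: "resize memory block",
}
PASSED_TO_DOS = {
    0x3D: "open file",
    0x3E: "close file",
    0x3F: "read file",
    0x4E: "find first file",
}


def dispatch(ah):
    """Return who services the call in this toy model."""
    if ah in HANDLED_BY_EXTENDER:
        return "extender"
    return "dos"  # everything else is reflected down to DOS


def mode_switches(calls):
    """Two switches (out of protected mode and back) per call passed to DOS."""
    return 2 * sum(1 for ah in calls if dispatch(ah) == "dos")


# Allocate a buffer, open a file, read twice, close it.
calls = [0x48, 0x3D, 0x3F, 0x3F, 0x3E]
for ah in calls:
    name = HANDLED_BY_EXTENDER.get(ah) or PASSED_TO_DOS.get(ah, "other")
    print(f"AH={ah:02X}h {name}: {dispatch(ah)}")
print("mode switches:", mode_switches(calls))
```

A program that reads its data in many small chunks would pay that switching cost on every read, which is one way the overhead described above adds up.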

As to you not feeling the increased requirements have much of a payoff... isn't that the whole history of the evolution of personal computers? A typical system now has 500 times as much memory as in the dial-up era, and say 50 times the CPU power, but the modern web certainly isn't 50 or 500 times better, because the bloat cancels out much of the benefit (with the exception of streaming video). Writing in 32-bit C for a PC in 1990 rather than hand-optimized assembly was probably viewed at the time like writing an Electron app is today...

Reply 12 of 23, by Marco

Rank Member

Thanks again, all, for contributing.

My final "mind-blowing" example here is:

LSL6 was one of the last, if not the last, Sierra games without a DOS extender. Its MS-DOS SVGA version runs (much) better than any of the subsequent VGA games with a DOS extender, such as GK, PQ4 and QfG4.

You mentioned the reasons why that is. So thanks.


Reply 13 of 23, by megatron-uk

Rank Oldbie

From a programmer's perspective, however, it is much, much neater to work in a mode where you can relatively (and I mean relatively) easily access any memory you want.

For anyone coming from a platform not constrained by the legacy of DOS/XMS/EMS/segments etc., the benefits of a 32-bit extender runtime are very appealing... and you can understand why they were willing to sacrifice some speed and accept the small amount of overhead the environment has when run on low-end systems.

My collection database and technical wiki:
https://www.target-earth.net

Reply 14 of 23, by megatron-uk

Rank Oldbie

Question:
Why is that? The 386SX supports 32-bit addressing and instructions. Is it because it has to translate 32-bit instructions to its 16-bit bus, which could double the number of CPU cycles per instruction (wait states?)? I never had a 386DX/25 to compare against, unfortunately, but I did find some forum remarks on the internet stating that 386SX performance falls quite a lot behind its DX counterpart when using DOS4GW.

Instructions don't take longer on the 386SX than on the 386DX; it is the memory access that is slower*, since the data bus is only half as wide.

*If the instruction does not involve a memory access (e.g. it is register-to-register), then the speed will/should be identical on both SX and DX. But since many (not all) instructions can involve a read from or a store to memory, that is where the SX falls down, and where even a very small cache (on chip or on board) makes a substantial difference.


Reply 15 of 23, by kingcake

Rank Oldbie

AlexZ wrote on 2022-08-18, 15:04:

The 386SX was crippled, so it was sort of like a 286 with 386 instructions, but not capable of executing 32-bit code very fast. It was meant to be used mostly for 16-bit software, with the option of executing 32-bit code in theory.

The execution in the CPU itself is not slower; it's the bus transfers that are slower. It has no problem executing 32-bit code. Its support for 32-bit code was not just theoretical.

Reply 16 of 23, by Deunan

Rank Oldbie

I've already mentioned it in some other thread, but the 32-bit instruction set brought more than just wider registers. You can now access memory through every register: base and index can be any register (except ESP for index) rather than just BX/BP + SI/DI, and you can have a 1/2/4/8 multiplier for the index too. All this allows much greater flexibility and optimal register usage, but it costs an additional byte to encode the instruction. For 16-bit code the average instruction length is about 2 bytes, exactly what you can fetch with a 16-bit-wide data bus. For 32-bit code the average length goes up a bit, closer to 3 bytes. This is not an issue for the 386DX, which fetches code in 32-bit chunks (even in 16-bit mode), but the 386SX bus is limited. Keep in mind that every 386 must fetch each instruction to execute it; there is no on-chip cache.
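The addressing rules described here can be written down as a small Python sketch. It is a simplified model, not an instruction decoder: the function names are made up for illustration, and segment overrides, operand sizes and the actual encoding bytes are ignored.

```python
REGS32 = ("EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI")
BASES16 = ("BX", "BP")
INDEXES16 = ("SI", "DI")


def valid_16bit(base=None, index=None):
    """16-bit addressing: base is BX or BP, index is SI or DI (either optional)."""
    return (base is None or base in BASES16) and (
        index is None or index in INDEXES16
    )


def valid_32bit(base=None, index=None, scale=1):
    """32-bit addressing: any register as base, any except ESP as index,
    and an index multiplier of 1, 2, 4 or 8."""
    base_ok = base is None or base in REGS32
    index_ok = index is None or (index in REGS32 and index != "ESP")
    return base_ok and index_ok and scale in (1, 2, 4, 8)


def effective_address(base_val, index_val, scale, disp):
    """base + index * scale + displacement, wrapped to 32 bits."""
    return (base_val + index_val * scale + disp) & 0xFFFFFFFF


print(valid_16bit("BX", "SI"))        # allowed in 16-bit code
print(valid_16bit("AX", "SI"))        # AX cannot be a base in 16-bit code
print(valid_32bit("EAX", "ECX", 8))   # fine in 32-bit code
print(valid_32bit("EAX", "ESP", 1))   # ESP cannot be the index
print(hex(effective_address(0x1000, 3, 4, 8)))
```

The scaled index is what lets a compiler address an array of 4-byte integers in a single instruction, which is part of why 32-bit code leans on these longer encodings.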

I have not done any conclusive experiments, but I have a theory that with 16-bit code the 386DX is limited by the execution speed of the instructions more than anything (all other things being equal, like RAM wait states), while the 386SX already suffers a bit because it can't really prefetch as much, so there will be stalls on code with a lot of memory accesses that keep the bus unit busy. But at this point the 386SX is more or less equal to a 286. With 32-bit code, though, the SX chip will stall even more: each instruction longer than 2 bytes will now take at least 2 bus cycles to fetch, so even if it could be executed faster, the CPU just can't feed itself. Therefore the penalty for 32-bit code is greater on the 386SX than on the 386DX.
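The fetch side of this theory can be put into numbers with a small Python sketch. It is deliberately crude: it assumes each instruction starts at a bus-aligned address and ignores the prefetch queue, and `BUS_BYTES` and `fetch_transfers` are names invented for this example.

```python
BUS_BYTES = {"386SX": 2, "386DX": 4}  # bytes fetched per bus transfer


def fetch_transfers(instr_bytes, cpu):
    """Bus transfers needed to fetch one instruction of the given length."""
    bus = BUS_BYTES[cpu]
    return -(-instr_bytes // bus)  # ceiling division


for length in (2, 3, 4):
    sx = fetch_transfers(length, "386SX")
    dx = fetch_transfers(length, "386DX")
    print(f"{length}-byte instruction: 386SX={sx} 386DX={dx}")
```

With about 2 bytes per instruction both chips need one transfer, but as soon as the average creeps toward 3 bytes the SX needs two where the DX still needs one.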

There might be other hidden penalties of using DOS extenders, like managing the A20 gate or enabling paging (mostly to hide the memory gap between 640K and 1M). Recently I found out that loading HIMEM makes some protected-mode code run faster. I think it has to do with the A20 gate and/or real-mode interrupt calls: the DOS extender will let HIMEM worry about that, and perhaps Microsoft optimized their code much better. This of course will be mobo dependent; HIMEM might know how to utilize fast A20 gating while the extender uses the slower KBC path, for example. The point here is that when comparing 386DX and SX you are also comparing the mobo chipset quality. Chances are the SX board is just some cheaper variant that was not optimized at all for protected mode. It will work, but perhaps slower than a good 386DX system, which slows the SX chip even more.

Reply 17 of 23, by Marco

Rank Member

I understand. In the end it will be a mix of both:
- disadvantages of the 386SX in 32-bit protected mode
- a generic slowdown compared to optimized real-mode programs, meaning the DX will also suffer performance-wise


Reply 18 of 23, by rasz_pl

Rank l33t

https://virtuallyfun.com/2024/01/29/phar-laps … d-it-for-games/ covers why games using extenders run slower: switching to protected mode and back. I didn't think one would need all that switching once everything is set up, but the author shows example code compiled with and without an extender, with a ~2x speed difference on a 386.


Reply 19 of 23, by Marco

Rank Member

Funny as I also found that page yesterday 😀
Indeed interesting one.

A pity that I don't know of more examples of games where one version was real mode and the other protected mode 😀
