VOGONS


First post, by kool kitty89


feipoa brought up the issue of direct-mapped cache performance as it relates to cache size and installed RAM here in a LuckyStar 486E thread:
Re: LuckyStar LS486E rev.C2 and Cyrix 5x86@133

and the discussion continued on the following page:
LuckyStar LS486E rev.C2 and Cyrix 5x86@133

I think he got the general concept of direct-mapped cache functionality right, but then went on to say:

256KB-WT & 8 MB, 16 MB, 32 MB, 64 MB. Note that with direct-mapped cache, 256 KB in write-through mode can cache up to 64 MB.
256KB-WT & 64 MB should provide the same results as 1024KB-WT & 256 MB, however 1024KB-WT & 8/16 MB should be faster than 256KB-WT & 8/16 MB.

I think that last bit would only be true when comparing programs that consume nearly all the system RAM and also make heavy/frequent accesses throughout the entire address range. Smaller programs running on a system with the same sized cache should perform the same regardless of how much DRAM is installed.

How efficiently the cache gets used would depend on how often the frequently-used-together blocks of code and data align with the cache mapping boundaries (and thus avoid thrashing), so larger vs. smaller programs might not always show that sort of correlation either.

In particularly bad cases, even small blocks of code and data that should fit entirely into cache might have some conflict due to fragmented or misaligned memory allocation: this probably happens more often when multitasking in protected mode and in some situations when using EMM386.
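To illustrate the worst case, here's a toy miss-counting simulation (my own sketch, with an assumed 256 kB direct-mapped cache and 16-byte lines; real chipset geometry varies): two locations exactly one cache-size apart land on the same cache line, so alternating between them never hits.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (256u * 1024u)  /* assumed cache size */
#define LINE_SIZE  16u             /* assumed line size  */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

static uint32_t tags[NUM_LINES];
static int      valid[NUM_LINES];

/* Model one lookup: return 1 on hit, 0 on miss (filling the line). */
static int access_cache(uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag   = addr / CACHE_SIZE;
    if (valid[index] && tags[index] == tag)
        return 1;
    valid[index] = 1;      /* evict whatever was here before */
    tags[index]  = tag;
    return 0;
}

int main(void)
{
    uint32_t a = 0x100000u;          /* arbitrary location at 1 MB  */
    uint32_t b = a + CACHE_SIZE;     /* exactly one cache-size away */
    int hits = 0, misses = 0;

    for (int i = 0; i < 1000; i++) {
        if (access_cache((i % 2) ? a : b)) hits++; else misses++;
    }
    printf("hits=%d misses=%d\n", hits, misses);  /* prints hits=0 misses=1000 */
    return 0;
}

A 2-way set-associative cache would let both locations stay resident at once, which is exactly the conflict-miss weakness of direct mapping.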

How direct-mapping actually works:

From what I understand, a direct-mapped cache typically just maps addresses modulo its own size: it maps its addresses directly onto the first block of RAM of that size, then starts over with the next block, and so on. So 256 kB of cache would map directly onto the first 256 kB of RAM, then onto the next 256 kB, then the next, and on and on, with each RAM location having exactly one cache location it can occupy.
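In other words, the low address bits pick a byte within a line and a line within the cache, and the remaining high bits become the tag recording which block of RAM that line currently holds. A minimal sketch of the address split (same assumed 256 kB / 16-byte-line geometry as above):

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (256u * 1024u)  /* assumed cache size */
#define LINE_SIZE  16u             /* assumed line size  */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

int main(void)
{
    uint32_t addr   = 0x0012345Cu;                    /* arbitrary physical address   */
    uint32_t offset = addr % LINE_SIZE;               /* byte within the line         */
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES; /* line: address mod cache size */
    uint32_t tag    = addr / CACHE_SIZE;              /* which 256 kB block of RAM    */

    printf("addr %08lX -> tag %lu, line %lu, byte %lu\n",
           (unsigned long)addr, (unsigned long)tag,
           (unsigned long)index, (unsigned long)offset);
    return 0;
}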

That's also why direct-mapped cache benefits a lot from increased cache size (or at least benefits proportionally more than set-associative schemes do). Likewise, having a cache that's large in proportion to the cacheable system RAM area is going to matter more for direct mapping than for associative schemes.
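Back-of-the-envelope: with 256 kB of direct-mapped cache in front of 64 MB of cacheable RAM, 64 MB / 256 kB = 256 different RAM locations compete for each cache line, while with a 1 MB cache it's only 64, so (assuming accesses spread across the address range) conflict misses should drop roughly in proportion.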

It's also simpler and faster to implement in logic than most other caching schemes, so it would be attractive in low-cost chipsets and for faster time to market. That might be one of the reasons Cyrix's SLC and DLC chips support a direct-mapped mode as well as a 2-way set-associative mode (the direct-mapped cache logic would be more likely to be good in the event that the set-associative logic was bad in early silicon revisions).

You often get options for including or excluding regions of the upper memory area from being cacheable (the BIOS region, video memory plus video BIOS, and other upper memory blocks), which should simply stop those blocks of address space from being mapped into the cache.

I'd also assume systems with hardware EMS support would explicitly exclude the page-frame region from being cacheable. (That's probably mostly limited to 386SX boards with hardware EMS support plus cache, or with 386/486SLCs installed.)

I think the page frame (or the entire UMA) being non-cacheable could actually give a performance edge to some EMS-enabled software running in real mode, as only code and data in the cacheable base memory (and upper memory area) would compete for cache space, while bank-switched EMS blocks would consistently stay uncached. (I don't think there was any standard support for software-programmable non-cacheable address ranges, so more flexible optimization of that sort in protected mode wouldn't be possible: you can only resort to leaving unused blocks of RAM around presumed cache-aligned boundaries, or even testing for cache boundaries and modifying memory allocation when the program first launches: see the sketch below.)
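A minimal sketch of that last idea (entirely hypothetical, and it assumes the program runs where linear addresses correspond to physical ones, i.e. real mode without EMM386 remapping things, plus the same assumed 256 kB direct-mapped cache): over-allocate, then place each hot buffer at a chosen offset within a cache-sized window so two hot buffers never share cache lines.

#include <stdlib.h>
#include <stdint.h>

#define CACHE_SIZE (256u * 1024u)  /* assumed direct-mapped cache size */

/* Hypothetical helper: return a pointer placed 'offset' bytes past a
   cache-size boundary.  A real version would keep the raw pointer so
   the block could be freed later; this sketch leaks it for brevity. */
static void *alloc_at_cache_offset(size_t size, size_t offset)
{
    uint8_t *raw = malloc(size + CACHE_SIZE + offset);
    if (raw == NULL)
        return NULL;
    uintptr_t next = ((uintptr_t)raw + CACHE_SIZE - 1)
                     & ~((uintptr_t)CACHE_SIZE - 1);  /* round up to a boundary */
    return (void *)(next + offset);
}

int main(void)
{
    /* Two 64 kB hot buffers placed at offsets 0 and 64 kB within the
       cache window, so their cache index ranges can never overlap. */
    void *a = alloc_at_cache_offset(64u * 1024u, 0u);
    void *b = alloc_at_cache_offset(64u * 1024u, 64u * 1024u);
    (void)a; (void)b;
    return 0;
}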

There's also the 384 kB (upper) memory relocation option that some chipsets/BIOSes support, and I'm not sure how that would impact cache either, or at least cache larger than 128 kB (as a 384 kB offset would shift the memory mapping bounds/alignment for a 256 kB or larger direct-mapped cache). Though that's also only relevant on boards that remap that region into extended memory rather than into expanded memory blocks (the latter is, I think, limited to 286- and 386SX-oriented chipsets).

Edit:
Actually, that probably doesn't impact cache functionality at all, as it's the chipset playing around with how the DRAM maps into the physical CPU/system memory map. If anything it might have some impact on DRAM performance (DRAM pages and banks not aligning as efficiently for page mode, or especially for bank interleave, which it might simply disable), but it shouldn't impact where code and data go in the memory map or how that relates to cache allocation.

And then you have some boards with options for explicit, user-selected non-cacheable blocks throughout the entire address range, as well as user-selectable cacheable memory range limits. (My OPTi 495SX 386/486 board has options for two non-cacheable regions of up to 512 kB each, along 512 kB boundaries within the 64 MB address range, plus a cacheable range of 4 to 64 MB in 4 MB increments.)

Some software might also benefit from more RAM plus a limited cacheable range (like 8 or 16 MB of RAM but with a 4 MB cache limit imposed), but that would largely be incidental, and I'm not aware of any intentional uses of it. It wasn't a standard BIOS/chipset feature, so explicitly implementing it in software and recommending users configure their systems that way would have been a pretty niche market proposal. (Much more so than the hardware EMS scenario, or simply requiring more RAM than strictly necessary in order to keep reserved, empty/unused regions around to avoid cache thrashing.)

I think something like that would've been considerably more niche than software explicitly targeting vendor-specific features (particularly non-Intel ones) like the scratchpad RAM function on several Cyrix CPU models. There might have been some motherboard chipsets that also had provisions for disabling some or all of the board-level cache for use as scratchpad RAM, but I'm not familiar with any of those.

Though that would've been a very neat and useful feature from a game developer's point of view, and for certain graphics, sound, or multimedia applications, especially since typical asynchronous cache SRAM could be used as straight scratchpad RAM at twice the speed it could be used as cache (or at least as write-back cache: I think WT mode usually works with somewhat slower SRAM, but not quite half the speed).
That, or if IBM (or a PC clone group cooperative agreement) had reserved a portion of the high memory area specifically for scratchpad use (especially given the limited register space available to 16-bit real-mode x86 processors, and even to IA-32).

Side note, but 100 ns SRAM would be fast enough for zero-wait-state 20 MHz 386 or 286 access timing, and was relatively cheap, commodity-level stuff by the end of the 1980s.
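(Rough arithmetic: at 20 MHz one clock is 50 ns, and a zero-wait-state 286/386 bus cycle takes two clocks, so the access-time budget works out to about 100 ns, less some margin for address setup and data latch timing.)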

Feipoa's comments were also specifically about cache in write-through mode, which I think avoids some or all of the performance hit to uncached memory access that many board-level write-back caching schemes seem to have.

I haven't found a way to switch between WT and WB modes on my OPTi 495SX board, but I know DRAM read times and throughput are dramatically faster with the L2 cache disabled, and still somewhat faster in non-cacheable regions when the cache is enabled (though not as fast as with the L2 disabled entirely). It's enough that some software runs faster with the L2 disabled and just the 486's on-chip cache enabled, including the PCPlayer benchmark in 640x480.

I don't think this DRAM performance issue is specifically related to direct-mapped caching schemes, though.