VOGONS


First post, by vladstamate

Rank Oldbie

For 386 processors (and up), my understanding is that how unaligned memory accesses are handled depends on the bus width. Or doesn't it? What do I mean?

Well, let's say I am doing a "LODSD" instruction from offset 1. On a 386SX processor this would first be broken up into 2 16bit accesses (as the bus of the 386SX is 16bit). Then each one would have to be further broken up into 2 8bit accesses because the offset (1) is odd. So a 386SX would end up doing 4 accesses from offset 1:

8bit from offset 1
8bit from offset 2
8bit from offset 3
8bit from offset 4

Now I expect a 386DX to do exactly the same, except via a different path: it would first try a 32bit access, then realize it is unaligned, then try 2 16bit accesses, and those would be unaligned too.

Am I correct so far?

Now let's suppose we are doing the same LODSD instruction but from offset 2, both on a 386SX and a 386DX. What happens now? I suspect the 386SX would be happy to do 2 16bit aligned accesses:

16bit from offset 2
16bit from offset 4

Is it true that, yet again, the 386DX would still revert to 2 16bit accesses, same as the SX?

The 386DX would not start to shine until it is asked to do a LODSD read from offset 4 (or 0) because then it will issue a full 32bit memory access.

Am I correct? I spent some time looking at the 386 manual and this is what I came up with.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 1 of 36, by Scali

Rank l33t

My understanding is that it is the other way around.
That is, a 386SX still 'thinks' like a 386DX in terms of memory access. So it initially generates 32-bit accesses, which are then broken up into 16-bit accesses.

Other than that, you can only have accesses of the size of your bus.
So 8-bit accesses only exist on an 8088.
CPUs with 16-bit buses will simply load an entire 16-bit word whenever they need a byte.
Alignment issues occur because words can only be accessed word-aligned.
So if you have an odd address for a word, it needs to fetch the two nearest words from even addresses, and then extract the relevant bytes from both words, to reconstruct the requested word (this is a somewhat special feature of x86, because of the legacy. Many modern CPUs simply do not let you access unaligned data in the first place. You see the same with SSE/AVX, where there are aligned load/store instructions and special (slower) instructions for unaligned access).

For 32-bit the same goes, except you now have dwords of 32-bit.
So worst-case you still need to do two accesses for an unaligned read.

This is where the '32-bit thinking' of the 386SX gets it in trouble: it will generate two 32-bit accesses worst-case, which will translate into 4 16-bit accesses down the line. However, in theory, it could have done it with just 2 or 3 aligned 16-bit accesses.
This is why a 386SX is somewhat slower than a 286. The 286 has more efficient memory access.

This was also mentioned in Abrash's Black Book:
http://www.jagregory.com/abrash-black-book/

A related cycle-eater lurks beneath the 386SX chip, which is a 32-bit processor internally with only a 16-bit path to system memory. The numbers are different, but the way the cycle-eater operates is exactly the same. AT-compatible systems have 16-bit data buses, which can access a full 16-bit word at a time. The 386SX can process 32 bits (a doubleword) at a time, however, and loses a lot of time fetching that doubleword from memory in two halves.
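
To put numbers on the two strategies discussed in this post, here is a minimal C sketch. It assumes the model proposed here (the 386SX forms aligned 32-bit requests and pays two 16-bit bus cycles for each), which is a hypothesis from this thread and not verified hardware behavior. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "dword-first" model: every aligned dword touched by the
   access costs two 16-bit bus cycles. */
static unsigned sx_cycles_dword_first(uint32_t addr, unsigned size)
{
    assert(size >= 1 && size <= 4);
    uint32_t first = addr & ~3u;               /* first aligned dword touched */
    uint32_t last  = (addr + size - 1) & ~3u;  /* last aligned dword touched */
    return ((last - first) / 4 + 1) * 2;
}

/* Lower bound: the number of aligned 16-bit words that actually contain
   at least one requested byte. */
static unsigned sx_cycles_minimal(uint32_t addr, unsigned size)
{
    assert(size >= 1 && size <= 4);
    uint32_t first = addr & ~1u;               /* first aligned word touched */
    uint32_t last  = (addr + size - 1) & ~1u;  /* last aligned word touched */
    return (last - first) / 2 + 1;
}
```

For a dword at offsets 1, 2 and 3 the dword-first model gives 4 cycles each time, while the lower bound is 3, 2 and 3 cycles respectively, which is the "2 or 3 aligned 16-bit accesses" mentioned above.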

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 2 of 36, by vladstamate

Rank Oldbie
Scali wrote:

Other than that, you can only have accesses of the size of your bus.

Thank you, Scali. I think you are right. The bit above nails it too. So, to put that in numbers, a LODSD from address 5, say, will generate the following bus transactions on a 386SX:

32bit from address 4
32bit from address 8

which the actual 16 bit bus will transform into

16bit from address 4
16bit from address 6
16bit from address 8
16bit from address 10

It will discard bytes 4,9,10,11 and keep bytes 5,6,7,8.

Now a 386DX will also generate 2 32bit requests, but the bus, being 32bit, will actually honor them, so it will read:

32bit from address 4
32bit from address 8

So the 386DX will discard bytes 4,9,10,11 and keep bytes 5,6,7,8.

Either way the SX did 4 bus transactions while the DX did only 2.

Am I understanding this better now?
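
The "read two aligned dwords, discard the bytes you do not need" step can be sketched in C as follows. This is only an illustration of the idea in this post: memory is a plain little-endian byte array, each call to the hypothetical `read_aligned_dword` stands in for one 32-bit bus cycle, and nothing here is taken from CAPE or from real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* One aligned little-endian dword read; stands in for a single 32-bit bus cycle. */
static uint32_t read_aligned_dword(const uint8_t *mem, uint32_t addr)
{
    assert((addr & 3u) == 0);
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* Rebuild a dword at any address from one or two aligned dword reads.
   *cycles receives the number of aligned reads that were needed. */
static uint32_t read_dword(const uint8_t *mem, uint32_t addr, unsigned *cycles)
{
    uint32_t base  = addr & ~3u;
    unsigned shift = (addr & 3u) * 8;   /* bits to discard from the low dword */
    uint32_t lo = read_aligned_dword(mem, base);
    *cycles = 1;
    if (shift == 0)
        return lo;                      /* aligned: one cycle, nothing discarded */
    uint32_t hi = read_aligned_dword(mem, base + 4);
    *cycles = 2;
    /* Keep the upper bytes of lo and the lower bytes of hi. */
    return (lo >> shift) | (hi << (32 - shift));
}
```

With memory filled so that byte n holds the value n, a read from address 5 returns 0x08070605 after two aligned reads (from 4 and 8), matching the "discard 4,9,10,11 and keep 5,6,7,8" description.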

Reply 3 of 36, by vladstamate

Rank Oldbie

Also, in the modern world, for a prominent GPU (I won't say which) that I wrote drivers for, if you send an unaligned 32bit load (or a load smaller than 32bit) down the bus, it will also send a mask, so there will be no reads outside the mask. This prevents things like page faults when you try to read a 16bit quantity at the end of mapped memory (a mapped page). Since the bus would be 64 or 32bit, it might otherwise decide to read all 32 bits when you really only asked for 16, therefore reading past the mapped memory and causing the fault. Which it should not.

But then again, different GPUs/CPUs behave differently.

Reply 4 of 36, by superfury

Rank l33t++

That's exactly what I thought, reading your first posts: If you do a byte read from VGA VRAM, then the resulting 32-bit access would cause the latches to read the wrong bytes (e.g. addr+3 on an aligned 32-bit address, or address+7 on an unaligned one), thus messing up applications using those. Of course, such a mask would be required for compatibility with those (and faults).

UniPCemu currently only breaks it up into 16-bit (word-aligned) or 8-bit (no alignment) accesses.
So a dword read at address 1 would read 1,2,3,4 as bytes, thus 4 bus cycles. But at address 2 it will take 2 cycles, because of word alignment.

OPTINLINE byte BIU_isfulltransfer()
{
    INLINEREGISTER byte result;
    result = 0; //Default: byte transfer!
    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Aligned 16-bit access?
    {
        if ((EMULATED_CPU>=CPU_80386) || ((EMULATED_CPU<=CPU_80286) && (CPU_databussize==0))) //16-bit+ bus available?
        {
            result = 1; //Start a full transfer this very clock!
        }
    }
    else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&3)==0)) //Aligned 32-bit access?
    {
        if ((EMULATED_CPU>=CPU_80386) && (CPU_databussize==0)) //32-bit processor with 32-bit bus?
        {
            result = 1; //Start a full transfer this very clock!
        }
        else if (EMULATED_CPU>=CPU_80386) //32-bit processor with 16-bit data bus?
        {
            result = 2; //Start a full transfer, broken in half(two 16-bit accesses)!
        }
    }
    else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Word-Aligned 32-bit access, but not 32-bit aligned? Break up into word accesses, when possible!
    {
        if (EMULATED_CPU>=CPU_80386) //32-bit processor with 16-bit data bus at least?
        {
            result = 2; //Start a full transfer, broken in half(two 16-bit accesses)!
        }
    }
    return result; //Give the result!
}

The BIU memory/bus (I/O port) access core will handle byte, word or dword cycles based on that (essentially a start-stop pattern: a byte will read and wait a cycle; a word (or half of a dword) will read a byte, read another byte and wait a cycle; a dword will read the whole thing in a single cycle).

The port I/O (BUS accesses) simulates it instead, as the memory module talks in bytes (BIU-compatible), while the I/O port handlers talk in bytes/words/dwords (the handler itself drops down to the lowest compatible size using masks, from 32-bit to 16-bit to 8-bit). So BUS accesses do a direct call to the I/O bus module, then simulate the normal memory-compatible access (same protocol as memory, just with nothing read/written from bus/RAM).

byte fulltransfer=0; //Are we to fully finish the transfer in one go?
OPTINLINE byte BIU_processRequests(byte memory_waitstates)
{
    if (BIU[activeCPU].currentrequest) //Do we have a pending request we're handling? This is used for 16-bit and 32-bit requests!
    {
        CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
        switch (BIU[activeCPU].currentrequest&REQUEST_TYPEMASK) //What kind of request?
        {
            //Memory operations!
            case REQUEST_MMUREAD:
            fulltransferMMUread:
                //MMU_generateaddress(segdesc,*CPU[activeCPU].SEGMENT_REGISTERS[segdesc],offset,0,0,is_offset16); //Generate the address on flat memory!
                BIU[activeCPU].currentresult |= (BIU_directrb((BIU[activeCPU].currentaddress),(((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)>>8))<<(BIU_access_readshift[((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)])); //Read subsequent byte!
                BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
                {
                    if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                }
                else
                {
                    BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                    if (fulltransfer) goto fulltransferMMUread;
                }
                return 1; //Handled!
                break;
            case REQUEST_MMUWRITE:
            fulltransferMMUwrite:
                BIU_directwb((BIU[activeCPU].currentaddress),(BIU[activeCPU].currentpayload[0]>>(BIU_access_writeshift[((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)])&0xFF),((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)); //Write directly to memory now!
                BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
                {
                    if (BIU_response(1)) //Result given? We're giving OK!
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                }
                else
                {
                    BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                    if (fulltransfer) goto fulltransferMMUwrite;
                }
                return 1; //Handled!
                break;
            //I/O operations!
            case REQUEST_IOREAD:
            fulltransferIOread:
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
                {
                    if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                }
                else
                {
                    BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                    if (fulltransfer) goto fulltransferIOread;
                }
                return 1; //Handled!
                break;
            case REQUEST_IOWRITE:
            fulltransferIOwrite:
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
                {
                    if (BIU_response(1)) //Result given? We're giving OK!
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                }
                else
                {
                    BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                    if (fulltransfer) goto fulltransferIOwrite;
                }
                return 1; //Handled!
                break;
            default:
            case REQUEST_NONE: //Unknown request?
                BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                break; //Ignore the entire request!
        }
    }
    else if (BIU_haveRequest()) //Do we have a request to handle first?
    {
        if (BIU_readRequest(&BIU[activeCPU].currentrequest,&BIU[activeCPU].currentpayload[0],&BIU[activeCPU].currentpayload[1])) //Read the request, if available!
        {
            fulltransfer = 0; //Init full transfer flag!
            switch (BIU[activeCPU].currentrequest&REQUEST_TYPEMASK) //What kind of request?
            {
                //Memory operations!
                case REQUEST_MMUREAD:
                    CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                    {
                        BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    }
                    BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                    BIU[activeCPU].currentresult = ((BIU_directrb((BIU[activeCPU].currentaddress),0))<<BIU_access_readshift[0]); //Read first byte!
                    if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                    {
                        if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                        {
                            BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                        }
                        else //Response failed?
                        {
                            BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                        }
                    }
                    else
                    {
                        fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                        ++BIU[activeCPU].currentaddress; //Next address!
                        if (fulltransfer) goto fulltransferMMUread; //Start Full transfer, when available?
                    }
                    return 1; //Handled!
                    break;
                case REQUEST_MMUWRITE:
                    CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                    {
                        BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    }
                    BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                    if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                    {
                        if (BIU_response(1)) //Result given? We're giving OK!
                        {
                            BIU_directwb((BIU[activeCPU].currentaddress),((BIU[activeCPU].currentpayload[0]>>BIU_access_writeshift[0])&0xFF),0); //Write directly to memory now!
                            BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                        }
                        else //Response failed? Try again!
                        {
                            BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request 8-bit half again(low byte)!
                        }
                    }
                    else //Busy request?
                    {
                        BIU_directwb((BIU[activeCPU].currentpayload[0]&0xFFFFFFFF),(byte)((BIU[activeCPU].currentpayload[0]>>BIU_access_writeshift[0])&0xFF),0); //Write directly to memory now!
                        fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                        ++BIU[activeCPU].currentaddress; //Next address!
                        if (fulltransfer) goto fulltransferMMUwrite; //Start Full transfer, when available?
                    }
                    return 1; //Handled!
                    break;
                //I/O operations!
                case REQUEST_IOREAD:
                    CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                    {
                        BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    }
                    BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                    if (BIU[activeCPU].currentrequest&REQUEST_32BIT) //32-bit?
                    {
                        BIU[activeCPU].currentresult = PORT_IN_D(BIU[activeCPU].currentaddress&0xFFFF); //Read dword!
                    }
                    else if (BIU[activeCPU].currentrequest&REQUEST_16BIT) //16-bit?
                    {
                        BIU[activeCPU].currentresult = PORT_IN_W(BIU[activeCPU].currentaddress&0xFFFF); //Read word!
                    }
                    else //8-bit?
                    {
                        BIU[activeCPU].currentresult = PORT_IN_B(BIU[activeCPU].currentaddress&0xFFFF); //Read byte!
                    }
                    if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                    {
                        if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                        {
                            BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                        }
                        else //Response failed?
                        {
                            BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                        }
                    }
                    else
                    {
                        fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                        ++BIU[activeCPU].currentaddress; //Next address!
                        if (fulltransfer) goto fulltransferIOread; //Start Full transfer, when available?
                    }
                    return 1; //Handled!
                    break;
                case REQUEST_IOWRITE:
                    CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                    {
                        BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    }
                    BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                    if (BIU[activeCPU].currentrequest&REQUEST_32BIT) //32-bit?
                    {
                        BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                        PORT_OUT_D((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(uint_32)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to the I/O port now!
                    }
                    else if (BIU[activeCPU].currentrequest&REQUEST_16BIT) //16-bit?
                    {
                        BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                        PORT_OUT_W((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(word)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to the I/O port now!
                    }
                    else //8-bit?
                    {
                        PORT_OUT_B((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(byte)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to the I/O port now!
                    }
                    if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                    {
                        if (BIU_response(1)) //Result given? We're giving OK!
                        {
                            BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                        }
                        else //Response failed?
                        {
                            BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                        }
                    }
                    else
                    {
                        fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                        ++BIU[activeCPU].currentaddress; //Next address!
                        if (fulltransfer) goto fulltransferIOwrite; //Start Full transfer, when available?
                    }
                    return 1; //Handled!
                    break;
                default:
                case REQUEST_NONE: //Unknown request?
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    break; //Ignore the entire request!
            }
        }
    }
    return 0; //No requests left!
}

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 5 of 36, by vladstamate

Rank Oldbie
superfury wrote:
else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Word-Aligned 32-bit access, but not 32-bit aligned? Break up into word accesses, when possible!

That is what I am doing too, but I think that is slightly wrong (if I understand Scali right). That is only true for a 16bit bus. So I am reworking that to do different things depending on whether it is a 32bit bus or a 16bit bus. I align the address down, then read more and apply the mask if necessary.

A 32bit read from address 6 (multiple of 2 but not 4) will still result in 2 16bit reads from 6 and 8 on a 16bit bus.

But on a 32bit bus it will result in 2 32bit reads, one from address 4 and one from address 8 (with a mask).

I am looking at an old CPU identification program written in assembly (WHICHCPU). When detecting whether it is a DX or an SX chip, it does 8k LODSDs, once from address 1 and once from address 2.

It expects the SX to be faster from address 2 but the DX to be slow in both cases. Which substantiates what I said above.
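
A timing test along those lines could be sketched in C as below. This is not WHICHCPU's actual code (which is assembly and is not shown in this thread); the function name, the enum and the 25% threshold are all assumptions chosen only to illustrate the comparison.

```c
#include <assert.h>
#include <stdint.h>

enum bus_guess { GUESS_386SX, GUESS_386DX };

/* Hypothetical heuristic: t_odd and t_word are elapsed times (any unit) for
   the same number of LODSD reads starting at offset 1 and at offset 2. */
static enum bus_guess guess_bus_width(uint32_t t_odd, uint32_t t_word)
{
    assert(t_odd > 0 && t_word > 0);
    /* Treat "offset 2 is at least 25% faster than offset 1" as the SX
       signature; the threshold is an arbitrary illustration value. */
    if ((uint64_t)t_word * 4 <= (uint64_t)t_odd * 3)
        return GUESS_386SX;
    /* Roughly equal timings: both offsets cost the same, as expected on a DX. */
    return GUESS_386DX;
}
```

With the transaction counts discussed in this thread (4 versus 2 cycles per read on the SX, 2 versus 2 on the DX), 8192 reads would give `guess_bus_width(4 * 8192, 2 * 8192)` as SX and `guess_bus_width(2 * 8192, 2 * 8192)` as DX.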

Reply 6 of 36, by superfury

Rank l33t++

Those accesses from 6/8 don't make any difference in timings? Either two 32-bit accesses or two 16-bit accesses. It's two accesses either way (with half of each discarded on 32-bit), so no difference in timings? So you might as well use 16-bit accesses in that case and avoid that complicated masking (and hardware which would need complex logic to do that, which is heavy processing)?

Essentially what my code does is: 32-bit aligned? Then 32-bit when used. Else, 16-bit aligned for 32-bit/16-bit? Then 16-bit accesses. Otherwise, perform 8-bit accesses only. So aligned accesses are fastest (when <= data width, so 32-bit with 32-bit alignment, 16-bit with 16-bit alignment, 8-bit always). Then come unaligned but word-aligned accesses (32-bit at 2, 6, 10 etc.), with the slowest being byte-aligned (32/16-bit at address 1/3/5/7 etc.). As a result, a dword at address 3 will result in 4 byte fetches instead of 2 (d)word fetches.

Edit: Since the 80386DX wasn't on a modern motherboard with those dword masks (they didn't exist back then?), is the break-up logic still valid? Either 1 dword (aligned at mod 4), 2 words (word-aligned at mod 2) or 4 bytes (not aligned)? Did that mask even exist on a 386DX?
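
The break-up rule described in this post (one dword, else words, else bytes) can be condensed into a small cycle counter. This is a simplified sketch written for illustration, not code from UniPCemu; `breakup_cycles` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Bus cycles for one access under the "dword, else words, else bytes" rule.
   size is 1, 2 or 4 bytes; bus_bytes is 2 (16-bit bus) or 4 (32-bit bus). */
static unsigned breakup_cycles(uint32_t addr, unsigned size, unsigned bus_bytes)
{
    assert(size == 1 || size == 2 || size == 4);
    assert(bus_bytes == 2 || bus_bytes == 4);
    if (size == 1)
        return 1;                          /* bytes always take a single cycle */
    if (size == 4 && (addr & 3u) == 0)
        return (bus_bytes == 4) ? 1 : 2;   /* aligned dword: whole, or two words */
    if ((addr & 1u) == 0)
        return size / 2;                   /* word-aligned: one cycle per word */
    return size;                           /* odd address: one cycle per byte */
}
```

Under this rule a dword at address 1 or 3 costs 4 cycles and a dword at address 2 costs 2 cycles, regardless of whether the bus is 16 or 32 bits wide.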

Reply 7 of 36, by vladstamate

Rank Oldbie
superfury wrote:

Essentially what my code does is: 32-bit aligned? Then 32-bit when used. Else, 16-bit aligned for 32-bit/16-bit? Then 16-bit accesses. Otherwise, perform 8-bit accesses only. So aligned accesses are fastest (when <= data width, so 32-bit with 32-bit alignment, 16-bit with 16-bit alignment, 8-bit always). Then come unaligned but word-aligned accesses (32-bit at 2, 6, 10 etc.), with the slowest being byte-aligned (32/16-bit at address 1/3/5/7 etc.). As a result, a dword at address 3 will result in 4 byte fetches instead of 2 (d)word fetches.

That is only correct for a 16bit bus, not for a 32bit one. On a 386DX a dword access at address 3 should result in 2 dword accesses. You cannot break it down into 16bit or 8bit accesses because the bus is 32bit; it can only read 32bit data. See this from the 386 manual I linked above:

When used in a configuration with a 32-bit bus, actual
transfers of data between processor and memory take place in units of
doublewords beginning at addresses evenly divisible by four; however, the
processor converts requests for misaligned words or doublewords into the
appropriate sequences of requests acceptable to the memory interface. Such
misaligned data transfers reduce performance by requiring extra memory
cycles.

The part that I underlined above is what Scali was saying: both a 16bit bus and a 32bit bus will always read more if they have to read from an unaligned address. They cannot break it into 8bit transfers and just read what they need.

In CAPE my memory code behaves somewhat similar to yours but I realize now that is wrong.

The question that I had is: can the 386DX issue byte reads on the 32bit bus? I believe it cannot. I am still not 100% sure.

Reply 8 of 36, by superfury

Rank l33t++

I'd assume it'd have to be able to do 8-bit reads/writes. Imagine reading VGA VRAM with 32-bit reads only. That would make it incompatible with all (S)VGA read/write modes. Take mode 1, for moving 4 planes at once with a byte read/write using MOVSB: the wrong bytes would be latched and written to memory (normally byte 0 (planes 0-3) is latched and written to byte 0 (planes 0-3), but a 386DX would instead cause byte 3 (planes 0-3) to be written to bytes 0-3 (planes 0-3))?

Reply 9 of 36, by vladstamate

Rank Oldbie

We have to treat the ISA bus differently. There is 8bit ISA and 16bit ISA. So by the time the request reaches the VGA card it is not in the form the CPU sent it. Those should be ok to be separated into byte accesses. I believe a 16bit ISA VGA card can understand 8bit access vs 16bit access, but I am not 100% sure.

What I am talking about is memory, as in RAM.

Reply 10 of 36, by vladstamate

Rank Oldbie

Superfury, can you configure your emulator to do either DX or SX 386? If you can, can you please run WHICHCPU? I found it in the harddisk image (which I believe you have too) from 8086tiny emulator. If you do not have it I can give it to you.

In my case, it detects only SX.

Reply 11 of 36, by Scali

Rank l33t
vladstamate wrote:

We have to treat the ISA bus differently. There is 8bit ISA and 16bit ISA. So by the time the request reaches the VGA card it is not in the form the CPU sent it. Those should be ok to be separated into byte accesses. I believe a 16bit ISA VGA card can understand 8bit access vs 16bit access, but I am not 100% sure.

The ISA bus has special lines to indicate 16-bit transfers: http://pinouts.ru/Slots/ISA_pinout.shtml
So ISA slots work as 8-bit by default, but a 16-bit card can signal on the extended part of a 16-bit ISA slot that it uses 16-bit memory or IO transfers.
See here for more info (signals SBHE, MEMCS16 and IOCS16):
http://pinouts.ru/Slots/ISA_pinout.shtml

In short, the system will pull SBHE low, and the card has to respond with MEMCS16/IOCS16 to accept a 16-bit transfer. If the card does not, the system will split up the transfer into two 8-bit transfers (for backward compatibility, 8-bit slots have no SBHE and MEMCS16/IOCS16).
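
For an emulator, that handshake could be modeled roughly as below. This is a deliberately small sketch: the `isa_card` struct and `isa_transfer_cycles` function are hypothetical, a single flag stands in for the card's MEMCS16/IOCS16 response, and the word is assumed to sit at an even address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical card model: does it answer a 16-bit request with
   MEMCS16/IOCS16? An 8-bit card (or 8-bit slot) never does. */
struct isa_card {
    bool asserts_cs16;
};

/* Number of ISA bus cycles for one transfer of `bytes` (1 or 2) to this card.
   A 2-byte transfer is assumed to start at an even address. */
static unsigned isa_transfer_cycles(const struct isa_card *card, unsigned bytes)
{
    assert(card != NULL);
    assert(bytes == 1 || bytes == 2);
    if (bytes == 1)
        return 1;                        /* byte transfers never need splitting */
    /* The system signals a 16-bit transfer (SBHE); the card must confirm it.
       Without confirmation the transfer is split into two 8-bit cycles. */
    return card->asserts_cs16 ? 1 : 2;
}
```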

Reply 12 of 36, by vladstamate

Rank Oldbie

Thank you Scali !

Reply 13 of 36, by vladstamate

Rank Oldbie

Ok, CAPE now implements 16bit and 32bit buses properly (I have separate classes for each). That is, it produces bus transactions of the size of the bus (either 16bit or 32bit) and then masks out data.

So for a 32bit load from offset 3 or 6 I see this:

Offset   386SX (16bit bus)       386DX (32bit bus)
==================================================
3        4x16bit transactions    2x32bit transactions
6        2x16bit transactions    2x32bit transactions

So now whichcpu.exe correctly determines whether CAPE is emulating a 386DX or a 386SX.

Reply 14 of 36, by Scali

Rank l33t
vladstamate wrote:

So for a 32bit load from offset 3 or 6 I see this:

Offset   386SX (16bit bus)       386DX (32bit bus)
==================================================
3        4x16bit transactions    2x32bit transactions
6        2x16bit transactions    2x32bit transactions

That's interesting though... apparently it can eliminate 2 redundant 16-bit loads at offset 6, but it cannot eliminate a single redundant 16-bit load at offset 3.
Because, worst case, you always only need to load 3 16-bit words for an unaligned 32-bit word.
So that may be the 'thinking in 32-bit'... It creates one or two 32-bit fetches depending on alignment, and an entire 32-bit fetch can be eliminated as being redundant... But the 16-bit accesses can't be eliminated.
This is an interesting thing to verify on hardware.

Reply 15 of 36, by superfury

Rank l33t++

It still seems odd to always perform 32-bit accesses or 16-bit accesses. Imagine a 32-bit access on VGA VRAM window edge case(like prefetches) at 9FFFF when reading a byte(valid on protection). It would fetch VRAM as well in your case, even when masking partly off? So 9FFFF-A0003, causing VRAM to be latched when always doing 32-bit accesses, even when further broken up afterwards with a mask(it's still being read&latched on a VGA)?

Edit: Reading http://www.phatcode.net/res/260/files/html/Sy … nizationa2.html , it seems that an odd memory address might even generate three memory accesses? One at x(byte), one at x+1(word) and one at x+3(byte)? That would happen on every odd address? But no other bytes would actually be read with 32-bit quantities only?

Last edited by superfury on 2018-01-21, 20:32. Edited 1 time in total.

Reply 16 of 36, by Scali

Rank l33t

I think the VGA latches simply respond to what address is on the bus.
So unless you actually fetch a word or dword starting at A0000h or higher, VGA won't 'see' it.

Having said that, there are indeed 'wraparound' issues when doing unaligned word/dword reads or writes at the end of a segment.
I ran into a bug on my 286 at one point, and the reason was that I used rep movsw starting on an unaligned address.
This worked fine on an 8088, but on a 286 I got unexpected side-effects. The write didn't wrap around as I expected.

Reply 17 of 36, by superfury

Rank l33t++

What about a protected mode segment pointing to base 9FFFF? What would a byte read from offset 0 do? Will it read into VRAM? Or will it just read 9FFFD?

Reply 18 of 36, by superfury

Rank l33t++

Just wanted to try running the app (whichcpu) when I noticed a bug causing the disk image to be unreadable by MS-DOS (wrongly reported default CHS values). After having fixed those, the MS-DOS 6.22 sfdimg/img disk images became unreadable (data 100MB buffer img), and after fixing the CHS formation (using bytes instead of the sectors it expects in the formula), the main hdd image (sfdimg disk) refused to boot MS-DOS 6.22 as well. Then, after creating a new disk image (to move files to later using WinImage), partitioning it and rebooting (80386 XT configuration on UniPCemu), trying to format it made the FDD emulation start to fail, it seems (either that or the HDD emulation)? There were long delays on reading port 1F7 (new hdd image) during booting, but even running "format c: /q /u /s" after the long boot seems to crap out the FDC emulation somehow (Turbo XT/XTIDE BIOS) with some drive not ready error?

Reply 19 of 36, by vladstamate

Rank Oldbie
superfury wrote:

It still seems odd to always perform 32-bit accesses or 16-bit accesses.

But it is bus-wide: either 32bit or 16bit. From the page you quoted ( http://www.phatcode.net/res/260/files/html/Sy … nizationa2.html):

The address placed on the address bus is always some multiple of four. Using various "byte enable" lines, the CPU can select which of the four bytes at that address the software wants to access.

This is how CAPE currently implements its BIU and bus operations. I still need to honor the graphics card (ISA) bus width, either 8bit or 16bit, though; I have not done that yet.
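
The byte-enable scheme from the quoted page can be sketched in C as follows. This is an illustration only and is not taken from CAPE: the struct and function names are invented, the enables are stored as an active-high bit mask for readability (the real BE0#-BE3# pins are active-low), and the order in which real hardware issues the two cycles is not modeled.

```c
#include <assert.h>
#include <stdint.h>

struct bus_cycle {
    uint32_t addr;   /* address driven on the bus, always a multiple of 4 */
    uint8_t  be;     /* bit n set = byte at addr+n is enabled (active-high here) */
};

/* Split an access of `size` bytes (1, 2 or 4) at `addr` into at most two
   32-bit bus cycles. Returns the number of cycles written to out[]. */
static unsigned split_access(uint32_t addr, unsigned size, struct bus_cycle out[2])
{
    assert(size == 1 || size == 2 || size == 4);
    unsigned offset = addr & 3u;
    uint32_t mask = ((1u << size) - 1u) << offset;  /* up to 7 bits wide */
    unsigned n = 0;
    out[n].addr = addr & ~3u;
    out[n].be   = (uint8_t)(mask & 0xFu);           /* bytes in the first dword */
    ++n;
    if (mask >> 4) {                                /* spills into the next dword */
        out[n].addr = (addr & ~3u) + 4;
        out[n].be   = (uint8_t)(mask >> 4);
        ++n;
    }
    return n;
}
```

In this model a dword at address 5 becomes a cycle at address 4 with enables 1110b and a cycle at address 8 with enables 0001b, while a byte read at 9FFFFh is a single cycle at 9FFFCh with only the top byte enabled, so no byte at A0000h or above is selected.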
