For 386 processors (and up), my understanding is that the way unaligned memory accesses are handled depends on the bus width. Or doesn't it? Let me explain what I mean.
Well, let's say I am doing a LODSD instruction from offset 1. On a 386SX processor this would first be broken up into two 16-bit accesses (as the bus of the 386SX is 16-bit). Then each one would have to be further broken up into two 8-bit accesses because the offset (1) is odd. So a 386SX would end up doing 4 accesses from offset 1:
8-bit from offset 1
8-bit from offset 2
8-bit from offset 3
8-bit from offset 4
Now I expect a 386DX to do exactly the same, except via a different path: it would first try a 32-bit access, then realize it is unaligned, then try two 16-bit accesses, and those would be unaligned too.
Am I correct until now?
Now let's suppose we are doing the same LODSD instruction, but from offset 2, on both a 386SX and a 386DX. What happens now? I suspect the 386SX would be happy to do two aligned 16-bit accesses:
16-bit from offset 2
16-bit from offset 4
Is it true that, yet again, the 386DX would still revert to two 16-bit accesses, same as the SX?
The 386DX would not start to shine until it is asked to do a LODSD read from offset 4 (or 0), because then it will issue a full 32-bit memory access.
Am I correct? I spent some time looking at the 386 manual and this is what I came up with.
My understanding is that it is the other way around.
That is, a 386SX still 'thinks' like a 386DX in terms of memory access. So it initially generates 32-bit accesses, which are then broken up into 16-bit accesses.
Other than that, you can only have accesses of the size of your bus.
So 8-bit accesses only exist on an 8088.
CPUs with 16-bit buses will simply load an entire 16-bit word whenever they need a byte.
Alignment issues occur because words can only be accessed word-aligned.
So if you have an odd address for a word, it needs to fetch the two nearest words from even addresses, and then extract the relevant bytes from both words to reconstruct the requested word. (This is a somewhat special feature of x86, for legacy reasons. Many modern CPUs simply do not let you access unaligned data in the first place. You see the same with SSE/AVX, where there are aligned load/store instructions and special (slower) instructions for unaligned access.)
For 32-bit the same goes, except you now have 32-bit dwords.
So worst-case you still need to do two accesses for an unaligned read.
This is where the '32-bit thinking' of the 386SX gets it in trouble: it will generate two 32-bit accesses worst-case, which will translate into four 16-bit accesses down the line. However, in theory, it could have done it with just two or three aligned 16-bit accesses.
This is why a 386SX is somewhat slower than a 286. The 286 has more efficient memory access.
A related cycle-eater lurks beneath the 386SX chip, which is a 32-bit processor internally with only a 16-bit path to system memory. The numbers are different, but the way the cycle-eater operates is exactly the same. AT-compatible systems have 16-bit data buses, which can access a full 16-bit word at a time. The 386SX can process 32 bits (a doubleword) at a time, however, and loses a lot of time fetching that doubleword from memory in two halves.
Other than that, you can only have accesses of the size of your bus.
Thank you Scali. I think you are right. The bit above nails it too. So to put that in numbers, a LODSD from, say, address 5 will generate the following bus transactions on a 386SX:
32-bit from address 4
32-bit from address 8
which the actual 16-bit bus will transform into
16-bit from address 4
16-bit from address 6
16-bit from address 8
16-bit from address 10
It will discard bytes 4, 9, 10, 11 and keep bytes 5, 6, 7, 8.
Now a 386DX will also generate two 32-bit requests, but the bus, being 32-bit, will actually honor them, so it will read
32-bit from address 4
32-bit from address 8
So the 386DX will discard bytes 4, 9, 10, 11 and keep bytes 5, 6, 7, 8.
Either way, the SX did 4 bus transactions while the DX did only 2.
Also, in the modern world: for a prominent GPU (I won't say which) that I wrote drivers for, if you send an unaligned 32-bit load (or one smaller than 32 bits) down the bus, it will also send a mask, so there will be no reads outside the mask. This prevents things like page faults when you try to read a 16-bit quantity at the end of mapped memory (a mapped page). Since the bus is 64- or 32-bit, it might otherwise decide to read all 32 bits when you really only asked for 16, therefore reading past the mapped memory and causing a fault, which it should not.
But then again, different GPUs/CPUs behave differently.
That's exactly what I thought, reading your first posts: if you do a byte read from VGA VRAM, then the resulting 32-bit access would cause the latches to read the wrong bytes (e.g. addr+3 on an aligned 32-bit address, or addr+7 on an unaligned one), thus messing up applications using those. Of course, such a mask would be required for compatibility with those (and faults).
UniPCemu currently only breaks it up into 16-bit (word-aligned) or 8-bit (no alignment) accesses.
So a dword read at address 1 would be performed as four byte accesses, thus 4 bus cycles. But at address 2 it will take 2 cycles, because of word alignment.
OPTINLINE byte BIU_isfulltransfer()
{
    INLINEREGISTER byte result;
    result = 0; //Default: byte transfer!
    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Aligned 16-bit access?
    {
        if ((EMULATED_CPU>=CPU_80386) || ((EMULATED_CPU<=CPU_80286) && (CPU_databussize==0))) //16-bit+ bus available?
        {
            result = 1; //Start a full transfer this very clock!
        }
    }
    else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&3)==0)) //Aligned 32-bit access?
    {
        if ((EMULATED_CPU>=CPU_80386) && (CPU_databussize==0)) //32-bit processor with 32-bit bus?
        {
            result = 1; //Start a full transfer this very clock!
        }
        else if (EMULATED_CPU>=CPU_80386) //32-bit processor with 16-bit data bus?
        {
            result = 2; //Start a full transfer, broken in half(two 16-bit accesses)!
        }
    }
    else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Word-Aligned 32-bit access, but not 32-bit aligned? Break up into word accesses, when possible!
    {
        if (EMULATED_CPU>=CPU_80386) //32-bit processor with 16-bit data bus at least?
        {
            result = 2; //Start a full transfer, broken in half(two 16-bit accesses)!
        }
    }
    return result; //Give the result!
}
The BIU memory/bus (I/O ports) access core will handle byte, word or dword cycles based on that (essentially a start-stop pattern: e.g. a byte will read and wait a cycle, etc.; a word (or half-word) will read a byte, read another byte and wait a cycle; a dword will read the whole thing in a single cycle).
The port I/O (BUS accesses) simulates it instead, as the memory module talks in bytes (BIU-compatible), while the I/O port handlers talk in bytes/words/dwords (the handler itself dropping down to the lowest compatible size using masks, from 32-bit to 16-bit to 8-bit). So BUS accesses do a direct call to the I/O bus module, then simulate the normal memory-compatible access (same protocol as memory, just nothing read/written from bus/RAM).
byte fulltransfer=0; //Are we to fully finish the transfer in one go?
OPTINLINE byte BIU_processRequests(byte memory_waitstates)
{
    if (BIU[activeCPU].currentrequest) //Do we have a pending request we're handling? This is used for 16-bit and 32-bit requests!
    {
        CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
        switch (BIU[activeCPU].currentrequest&REQUEST_TYPEMASK) //What kind of request?
        {
        //Memory operations!
        case REQUEST_MMUREAD:
            fulltransferMMUread:
            //MMU_generateaddress(segdesc,*CPU[activeCPU].SEGMENT_REGISTERS[segdesc],offset,0,0,is_offset16); //Generate the address on flat memory!
            BIU[activeCPU].currentresult |= (BIU_directrb((BIU[activeCPU].currentaddress),(((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)>>8))<<(BIU_access_readshift[((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)])); //Read subsequent byte!
            BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferMMUread;
            }
            return 1; //Handled!
            break;
        case REQUEST_MMUWRITE:
            fulltransferMMUwrite:
            BIU_directwb((BIU[activeCPU].currentaddress),(BIU[activeCPU].currentpayload[0]>>(BIU_access_writeshift[((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)])&0xFF),((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)); //Write directly to memory now!
            BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(1)) //Result given? We're giving OK!
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferMMUwrite;
            }
            return 1; //Handled!
            break;
        //I/O operations!
        case REQUEST_IOREAD:
            fulltransferIOread:
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferIOread;
            }
            return 1; //Handled!
            break;
        case REQUEST_IOWRITE:
            fulltransferIOwrite:
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(1)) //Result given? We're giving OK!
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferIOwrite;
            }
            return 1; //Handled!
            break;
        default:
        case REQUEST_NONE: //Unknown request?
            BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
            break; //Ignore the entire request!
        }
    }
    else if (BIU_haveRequest()) //Do we have a request to handle first?
    {
        if (BIU_readRequest(&BIU[activeCPU].currentrequest,&BIU[activeCPU].currentpayload[0],&BIU[activeCPU].currentpayload[1])) //Read the request, if available!
        {
            fulltransfer = 0; //Init full transfer flag!
            switch (BIU[activeCPU].currentrequest&REQUEST_TYPEMASK) //What kind of request?
            {
            //Memory operations!
            case REQUEST_MMUREAD:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                BIU[activeCPU].currentresult = ((BIU_directrb((BIU[activeCPU].currentaddress),0))<<BIU_access_readshift[0]); //Read first byte!
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed?
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                    }
                }
                else
                {
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferMMUread; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            case REQUEST_MMUWRITE:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(1)) //Result given? We're giving OK!
                    {
                        BIU_directwb((BIU[activeCPU].currentaddress),((BIU[activeCPU].currentpayload[0]>>BIU_access_writeshift[0])&0xFF),0); //Write directly to memory now!
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed? Try again!
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request 8-bit half again(low byte)!
                    }
                }
                else //Busy request?
                {
                    BIU_directwb((BIU[activeCPU].currentpayload[0]&0xFFFFFFFF),(byte)((BIU[activeCPU].currentpayload[0]>>BIU_access_writeshift[0])&0xFF),0); //Write directly to memory now!
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferMMUwrite; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            //I/O operations!
            case REQUEST_IOREAD:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                if (BIU[activeCPU].currentrequest&REQUEST_32BIT) //32-bit?
                {
                    BIU[activeCPU].currentresult = PORT_IN_D(BIU[activeCPU].currentaddress&0xFFFF); //Read dword!
                }
                else if (BIU[activeCPU].currentrequest&REQUEST_16BIT) //16-bit?
                {
                    BIU[activeCPU].currentresult = PORT_IN_W(BIU[activeCPU].currentaddress&0xFFFF); //Read word!
                }
                else //8-bit?
                {
                    BIU[activeCPU].currentresult = PORT_IN_B(BIU[activeCPU].currentaddress&0xFFFF); //Read byte!
                }
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed?
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                    }
                }
                else
                {
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferIOread; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            case REQUEST_IOWRITE:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                if (BIU[activeCPU].currentrequest&REQUEST_32BIT) //32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    PORT_OUT_D((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(uint_32)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to memory now!
                }
                else if (BIU[activeCPU].currentrequest&REQUEST_16BIT) //16-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    PORT_OUT_W((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(word)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to memory now!
                }
                else //8-bit?
                {
                    PORT_OUT_B((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(byte)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to memory now!
                }
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(1)) //Result given? We're giving OK!
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed?
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                    }
                }
                else
                {
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferIOwrite; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            default:
            case REQUEST_NONE: //Unknown request?
                BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                break; //Ignore the entire request!
            }
        }
    }
    return 0; //No requests left!
}
else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Word-Aligned 32-bit access, but not 32-bit aligned? Break up into word accesses, when possible!
That is what I am doing too, but I think that is slightly wrong (if I understand Scali correctly). That is only true for a 16-bit bus. So I am reworking it to do different things for a 32-bit bus versus a 16-bit bus. I align the address down, then read more and apply a mask if necessary.
A 32-bit read from address 6 (a multiple of 2 but not of 4) will still result in two 16-bit reads, from 6 and 8, on a 16-bit bus.
But on a 32-bit bus it will result in two 32-bit reads, one from address 4 and one from address 8 (with a mask).
I am looking at old CPU identification code written in assembly (WHICHCPU), which, when detecting whether it is a DX or an SX chip, does 8k LODSD iterations, once from address 1 and once from address 2.
It expects the SX to be faster from address 2, but the DX to be slow in both cases, which substantiates what I said above.
Don't those accesses from 6/8 make no difference in timings? Either two 32-bit accesses or two 16-bit accesses: it's two accesses either way (with half of each discarded on 32-bit), so no difference in timings? So you might as well use 16-bit accesses in that case and not need that complicated masking (and hardware that would need complex logic to do it, which is heavy processing)?
Essentially what my code does is: 32-bit aligned? Then 32-bit when used. Else, 16-bit aligned for 32-bit/16-bit? Then 16-bit accesses. Otherwise, perform 8-bit accesses only. So aligned accesses are fastest (when <= data width, so 32-bit with 32-bit alignment, 16-bit with 16-bit alignment, 8-bit always), then unaligned but word-aligned (32-bit at 2, 6, 10, etc.), slowest being byte-aligned (32/16-bit at address 1/3/5/7, etc.). Although a dword at address 3 will result in 4 byte fetches instead of 2 (d)word fetches.
Edit: Since the 80386DX wasn't on a modern motherboard with those dword masks (didn't they exist back then?), the break-up logic is still valid? Either 1 dword (aligned at mod 4), 2 words (word-aligned at mod 2) or 4 bytes (not aligned)? Did that mask even exist on a 386DX?
Essentially what my code does is: 32-bit aligned? Then 32-bit when used. Else, 16-bit aligned for 32-bit/16-bit? Then 16-bit accesses. Otherwise, perform 8-bit accesses only. So aligned accesses are fastest (when <= data width, so 32-bit with 32-bit alignment, 16-bit with 16-bit alignment, 8-bit always), then unaligned but word-aligned (32-bit at 2, 6, 10, etc.), slowest being byte-aligned (32/16-bit at address 1/3/5/7, etc.). Although a dword at address 3 will result in 4 byte fetches instead of 2 (d)word fetches.
That is only correct for a 16-bit bus, not for 32. On a 386DX a dword access at address 3 should result in 2 dword accesses. You cannot break it down into 16-bit or 8-bit accesses, because the bus is 32-bit: it can only read 32-bit data. See this from the 386 manual I linked above:
When used in a configuration with a 32-bit bus, actual transfers of data between processor and memory take place in units of doublewords beginning at addresses evenly divisible by four; however, the processor converts requests for misaligned words or doublewords into the appropriate sequences of requests acceptable to the memory interface. Such misaligned data transfers reduce performance by requiring extra memory cycles.
The part that I underlined above is what Scali was saying: both a 16-bit bus and a 32-bit bus will always read more than needed if they have to read from an unaligned address. They cannot break it into 8-bit transfers and read just what they need.
In CAPE my memory code behaves somewhat similarly to yours, but I realize now that this is wrong.
The question I had is: can the 386DX issue byte reads on the 32-bit bus? I believe it cannot, but I am still not 100% sure.
I'd assume it has to be able to do 8-bit reads/writes. Imagine reading VGA VRAM with 32-bit reads only. That would make it incompatible with all (S)VGA read/write modes. Like write mode 1 for moving 4 planes at once using a byte read/write with MOVSB, causing the wrong bytes to be latched into memory (it should result in byte 0 (planes 0-3) being latched and written to byte 0 (planes 0-3), but on a 386DX it would instead cause byte 3 (planes 0-3) to be written to bytes 0-3 (planes 0-3))?
We have to treat the ISA bus differently. There is 8-bit ISA and 16-bit ISA, so by the time the request reaches the VGA card it is not in the form the CPU sent it. Those should be OK to separate into byte accesses. I believe a 16-bit ISA VGA card can tell an 8-bit access from a 16-bit access, but I am not 100% sure.
Superfury, can you configure your emulator to emulate either a DX or an SX 386? If you can, could you please run WHICHCPU? I found it on the hard disk image (which I believe you have too) from the 8086tiny emulator. If you do not have it, I can give it to you.
We have to treat the ISA bus differently. There is 8-bit ISA and 16-bit ISA, so by the time the request reaches the VGA card it is not in the form the CPU sent it. Those should be OK to separate into byte accesses. I believe a 16-bit ISA VGA card can tell an 8-bit access from a 16-bit access, but I am not 100% sure.
The ISA bus has special lines to indicate 16-bit transfers: http://pinouts.ru/Slots/ISA_pinout.shtml
So ISA slots work as 8-bit by default, but a 16-bit card can signal on the extended part of a 16-bit ISA slot that it uses 16-bit memory or IO transfers.
See here for more info (signals SBHE, MEMCS16 and IO16): http://pinouts.ru/Slots/ISA_pinout.shtml
In short, the system will pull SBHE low, and the card has to respond with MEMCS16/IOCS16 to accept a 16-bit transfer. If the card does not, the system will split the transfer into two 8-bit transfers (for backward compatibility; 8-bit slots have no SBHE and MEMCS16/IOCS16).
OK, CAPE now implements 16-bit and 32-bit buses properly (I have separate classes for each). That is, it produces bus transactions of the size of the bus (either 16-bit or 32-bit) and then masks out data.
So for a 32bit load from offset 3 or 6 I see this:
That's interesting, though... apparently it can eliminate 2 redundant 16-bit loads at offset 6, but it cannot eliminate a single redundant 16-bit load at offset 3.
Because, worst case, you only ever need to load three 16-bit words for an unaligned 32-bit read.
So that may be the 'thinking in 32-bit'... It creates one or two 32-bit fetches depending on alignment, and an entire 32-bit fetch can be eliminated as redundant... but the 16-bit accesses can't be eliminated.
This is an interesting thing to verify on hardware.
It still seems odd to always perform 32-bit or 16-bit accesses. Imagine a 32-bit access on the VGA VRAM window edge case (like prefetches) at 9FFFF when reading a byte (valid under protection). It would fetch VRAM as well in your case, even when partly masked off? So 9FFFF-A0003, causing VRAM to be latched when always doing 32-bit accesses, even when further broken up afterwards with a mask (it's still being read and latched on a VGA)?
Edit: Reading http://www.phatcode.net/res/260/files/html/Sy … nizationa2.html , it seems that an odd memory address might even generate three memory accesses? One at x (byte), one at x+1 (word) and one at x+3 (byte)? That would happen on every odd address? But no other bytes would actually be read with 32-bit quantities only?
I think the VGA latches simply respond to what address is on the bus.
So unless you actually fetch a word or dword starting at A0000h or higher, the VGA won't 'see' it.
Having said that, there are indeed 'wraparound' issues when doing unaligned word/dword reads or writes at the end of a segment.
I ran into a bug on my 286 at one point, and the reason was that I used rep movsw starting on an unaligned address.
This worked fine on an 8088, but on a 286 I got unexpected side-effects. The write didn't wrap around as I expected.
What about a protected-mode segment pointing to base 9FFFF? What would a byte read from offset 0 do? Will it read into VRAM? Or will it just read 9FFFD?
Just wanted to try running the app (WHICHCPU) when I noticed a bug causing the disk image to be unreadable by MS-DOS (wrongly reported default CHS values). After having fixed those, the MS-DOS 6.22 sfdimg/img disk images became unreadable (data 100MB buffer img), and after fixing the CHS calculation (it was using bytes instead of the sectors the formula expects), the main HDD image (sfdimg disk) refused to boot MS-DOS 6.22 as well. Then, after creating a new disk image (to move files to later using WinImage) and partitioning it, and after rebooting (80386 XT configuration on UniPCemu), trying to format, the FDD emulation started to fail, it seems (either that or the HDD emulation)? There were long delays on reading port 1F7 (new HDD image) during booting, but even running "format c: /q /u /s" after the long boot seems to crap out the FDC emulation somehow (Turbo XT/XTIDE BIOS) with some drive-not-ready error?
The address placed on the address bus is always some multiple of four. Using various "byte enable" lines, the CPU can select which of the four bytes at that address the software wants to access.
This is how CAPE currently implements its BIU and bus operations. Although I still need to honor the graphics card's (ISA) bus width, either 8-bit or 16-bit; I have not done that yet.