For 386 processors (and up), my understanding is that the way unaligned memory accesses are handled depends on the bus width. Or doesn't it? Let me explain what I mean.
Well, let's say I am doing a LODSD instruction from offset 1. On a 386SX processor this would first be broken up into two 16-bit accesses (as the bus of the 386SX is 16-bit). Then each one would have to be further broken up into two 8-bit accesses because the offset (1) is odd. So a 386SX would end up doing 4 accesses from offset 1:
8-bit from offset 1
8-bit from offset 2
8-bit from offset 3
8-bit from offset 4
Now I expect a 386DX to do exactly the same, except via a different path: it would first try a 32-bit access, then realize it is unaligned, then try two 16-bit accesses, and those would be unaligned too.
Am I correct until now?
Now let's suppose we are doing the same LODSD instruction, but from offset 2, on both a 386SX and a 386DX. What happens now? I suspect the 386SX would be happy to do two aligned 16-bit accesses:
16-bit from offset 2
16-bit from offset 4
Is it true that, yet again, the 386DX would still revert to two 16-bit accesses, same as the SX?
The 386DX would not start to shine until it is asked to do a LODSD read from offset 4 (or 0), because then it will issue a full 32-bit memory access.
Am I correct? I spent some time looking at the 386 manual and this is what I came up with.
My understanding is that it is the other way around.
That is, a 386SX still 'thinks' like a 386DX in terms of memory access. So it initially generates 32-bit accesses, which are then broken up into 16-bit accesses.
Other than that, you can only have accesses of the size of your bus.
So 8-bit accesses only exist on an 8088.
CPUs with 16-bit buses will simply load an entire 16-bit word whenever they need a byte.
Alignment issues occur because words can only be accessed word-aligned.
So if you have an odd address for a word, it needs to fetch the two nearest words from even addresses, and then extract the relevant bytes from both words to reconstruct the requested word. (This is a somewhat special feature of x86, for legacy reasons. Many modern CPUs simply do not let you access unaligned data in the first place. You see the same with SSE/AVX, where there are aligned load/store instructions and special (slower) instructions for unaligned access.)
For 32-bit the same goes, except you now have 32-bit dwords.
So worst-case you still need to do two accesses for an unaligned read.
This is where the '32-bit thinking' of the 386SX gets it in trouble: it will generate two 32-bit accesses worst-case, which will translate into four 16-bit accesses down the line. However, in theory, it could have done it with just two or three aligned 16-bit accesses.
This is why a 386SX is somewhat slower than a 286. The 286 has more efficient memory access.
A related cycle-eater lurks beneath the 386SX chip, which is a 32-bit processor internally with only a 16-bit path to system memory. The numbers are different, but the way the cycle-eater operates is exactly the same. AT-compatible systems have 16-bit data buses, which can access a full 16-bit word at a time. The 386SX can process 32 bits (a doubleword) at a time, however, and loses a lot of time fetching that doubleword from memory in two halves.
Other than that, you can only have accesses of the size of your bus.
Thank you Scali. I think you are right. The bit above nails it too. So to put that in numbers, a LODSD from, say, address 5 will generate the following bus transactions on a 386SX:
32-bit from address 4
32-bit from address 8
which the actual 16-bit bus will transform into
16-bit from address 4
16-bit from address 6
16-bit from address 8
16-bit from address 10
It will discard bytes 4, 9, 10, 11 and keep bytes 5, 6, 7, 8.
Now a 386DX will also generate two 32-bit requests, but the bus, being 32-bit, will actually honor them, so it will read
32-bit from address 4
32-bit from address 8
So the 386DX will discard bytes 4, 9, 10, 11 and keep bytes 5, 6, 7, 8.
Either way, the SX did 4 bus transactions while the DX did only 2.
Also, in the modern world: for a prominent GPU (I won't say which) that I wrote drivers for, if you send an unaligned 32-bit load (or one smaller than 32 bits) down the bus, it will also send a mask, so there will be no reads outside the mask. This prevents things like page faults when you try to read a 16-bit quantity at the end of mapped memory (a mapped page). Since the bus is 64- or 32-bit, it might otherwise decide to read all 32 bits when you really only asked for 16, therefore reading past the mapped memory and causing a fault, which it should not.
But then again, different GPUs/CPUs behave differently.
That's exactly what I thought, reading your first posts: if you do a byte read from VGA VRAM, then the resulting 32-bit access would cause the latches to read the wrong bytes (e.g. addr+3 on an aligned 32-bit address, or addr+7 on an unaligned one), thus messing up applications using those. Of course, such a mask would be required for compatibility with those (and faults).
UniPCemu currently only breaks it up into 16-bit (word-aligned) or 8-bit (no alignment) accesses.
So a dword read at address 1 would be performed as four byte accesses, thus 4 bus cycles. But at address 2 it will take 2 cycles, because of word alignment.
OPTINLINE byte BIU_isfulltransfer()
{
    INLINEREGISTER byte result;
    result = 0; //Default: byte transfer!
    if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Aligned 16-bit access?
    {
        if ((EMULATED_CPU>=CPU_80386) || ((EMULATED_CPU<=CPU_80286) && (CPU_databussize==0))) //16-bit+ bus available?
        {
            result = 1; //Start a full transfer this very clock!
        }
    }
    else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&3)==0)) //Aligned 32-bit access?
    {
        if ((EMULATED_CPU>=CPU_80386) && (CPU_databussize==0)) //32-bit processor with 32-bit bus?
        {
            result = 1; //Start a full transfer this very clock!
        }
        else if (EMULATED_CPU>=CPU_80386) //32-bit processor with 16-bit data bus?
        {
            result = 2; //Start a full transfer, broken in half(two 16-bit accesses)!
        }
    }
    else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Word-Aligned 32-bit access, but not 32-bit aligned? Break up into word accesses, when possible!
    {
        if (EMULATED_CPU>=CPU_80386) //32-bit processor with 16-bit data bus at least?
        {
            result = 2; //Start a full transfer, broken in half(two 16-bit accesses)!
        }
    }
    return result; //Give the result!
}
The BIU memory/bus (I/O ports) access core will handle byte, word or dword cycles based on that (essentially a start-stop pattern: e.g. a byte will read and wait a cycle, etc.; a word (or half-word) will read a byte, read another byte and wait a cycle; a dword will read the whole thing in a single cycle).
The port I/O (BUS accesses) simulates it instead, as the memory module talks in bytes (BIU-compatible), while the I/O port handlers talk in bytes/words/dwords (the handler itself dropping down to the lowest compatible size using masks, from 32-bit to 16-bit to 8-bit). So BUS accesses do a direct call to the I/O bus module, then simulate the normal memory-compatible access (same protocol as memory, just nothing read/written from bus/RAM).
byte fulltransfer=0; //Are we to fully finish the transfer in one go?
OPTINLINE byte BIU_processRequests(byte memory_waitstates)
{
    if (BIU[activeCPU].currentrequest) //Do we have a pending request we're handling? This is used for 16-bit and 32-bit requests!
    {
        CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
        switch (BIU[activeCPU].currentrequest&REQUEST_TYPEMASK) //What kind of request?
        {
        //Memory operations!
        case REQUEST_MMUREAD:
            fulltransferMMUread:
            //MMU_generateaddress(segdesc,*CPU[activeCPU].SEGMENT_REGISTERS[segdesc],offset,0,0,is_offset16); //Generate the address on flat memory!
            BIU[activeCPU].currentresult |= (BIU_directrb((BIU[activeCPU].currentaddress),(((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)>>8))<<(BIU_access_readshift[((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)])); //Read subsequent byte!
            BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferMMUread;
            }
            return 1; //Handled!
            break;
        case REQUEST_MMUWRITE:
            fulltransferMMUwrite:
            BIU_directwb((BIU[activeCPU].currentaddress),(BIU[activeCPU].currentpayload[0]>>(BIU_access_writeshift[((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)])&0xFF),((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)>>REQUEST_SUBSHIFT)); //Write directly to memory now!
            BIU[activeCPU].waitstateRAMremaining += memory_waitstates; //Apply the waitstates for the fetch!
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(1)) //Result given? We're giving OK!
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferMMUwrite;
            }
            return 1; //Handled!
            break;
        //I/O operations!
        case REQUEST_IOREAD:
            fulltransferIOread:
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferIOread;
            }
            return 1; //Handled!
            break;
        case REQUEST_IOWRITE:
            fulltransferIOwrite:
            if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==((BIU[activeCPU].currentrequest&REQUEST_16BIT)?REQUEST_SUB1:REQUEST_SUB3)) //Finished the request?
            {
                if (BIU_response(1)) //Result given? We're giving OK!
                {
                    BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                }
            }
            else
            {
                BIU[activeCPU].currentrequest += REQUEST_SUB1; //Request next 8-bit half next(high byte)!
                ++BIU[activeCPU].currentaddress; //Next address!
                if ((fulltransfer==2) && ((BIU[activeCPU].currentaddress&3)==2)) return 1; //Finished 16-bit half of a split 32-bit transfer?
                if (fulltransfer) goto fulltransferIOwrite;
            }
            return 1; //Handled!
            break;
        default:
        case REQUEST_NONE: //Unknown request?
            BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
            break; //Ignore the entire request!
        }
    }
    else if (BIU_haveRequest()) //Do we have a request to handle first?
    {
        if (BIU_readRequest(&BIU[activeCPU].currentrequest,&BIU[activeCPU].currentpayload[0],&BIU[activeCPU].currentpayload[1])) //Read the request, if available!
        {
            fulltransfer = 0; //Init full transfer flag!
            switch (BIU[activeCPU].currentrequest&REQUEST_TYPEMASK) //What kind of request?
            {
            //Memory operations!
            case REQUEST_MMUREAD:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                BIU[activeCPU].currentresult = ((BIU_directrb((BIU[activeCPU].currentaddress),0))<<BIU_access_readshift[0]); //Read first byte!
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed?
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                    }
                }
                else
                {
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferMMUread; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            case REQUEST_MMUWRITE:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(1)) //Result given? We're giving OK!
                    {
                        BIU_directwb((BIU[activeCPU].currentaddress),((BIU[activeCPU].currentpayload[0]>>BIU_access_writeshift[0])&0xFF),0); //Write directly to memory now!
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed? Try again!
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request 8-bit half again(low byte)!
                    }
                }
                else //Busy request?
                {
                    BIU_directwb((BIU[activeCPU].currentpayload[0]&0xFFFFFFFF),(byte)((BIU[activeCPU].currentpayload[0]>>BIU_access_writeshift[0])&0xFF),0); //Write directly to memory now!
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferMMUwrite; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            //I/O operations!
            case REQUEST_IOREAD:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                if (BIU[activeCPU].currentrequest&REQUEST_32BIT) //32-bit?
                {
                    BIU[activeCPU].currentresult = PORT_IN_D(BIU[activeCPU].currentaddress&0xFFFF); //Read dword!
                }
                else if (BIU[activeCPU].currentrequest&REQUEST_16BIT) //16-bit?
                {
                    BIU[activeCPU].currentresult = PORT_IN_W(BIU[activeCPU].currentaddress&0xFFFF); //Read word!
                }
                else //8-bit?
                {
                    BIU[activeCPU].currentresult = PORT_IN_B(BIU[activeCPU].currentaddress&0xFFFF); //Read byte!
                }
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(BIU[activeCPU].currentresult)) //Result given?
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed?
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                    }
                }
                else
                {
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferIOread; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            case REQUEST_IOWRITE:
                CPU[activeCPU].BUSactive = 1; //Start memory or BUS cycles!
                if ((BIU[activeCPU].currentrequest&REQUEST_16BIT) || (BIU[activeCPU].currentrequest&REQUEST_32BIT)) //16/32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                }
                BIU[activeCPU].currentaddress = (BIU[activeCPU].currentpayload[0]&0xFFFFFFFF); //Address to use!
                if (BIU[activeCPU].currentrequest&REQUEST_32BIT) //32-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    PORT_OUT_D((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(uint_32)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to memory now!
                }
                else if (BIU[activeCPU].currentrequest&REQUEST_16BIT) //16-bit?
                {
                    BIU[activeCPU].currentrequest |= REQUEST_SUB1; //Request 16-bit half next(high byte)!
                    PORT_OUT_W((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(word)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to memory now!
                }
                else //8-bit?
                {
                    PORT_OUT_B((word)(BIU[activeCPU].currentpayload[0]&0xFFFF),(byte)((BIU[activeCPU].currentpayload[0]>>32)&0xFFFFFFFF)); //Write to memory now!
                }
                if ((BIU[activeCPU].currentrequest&REQUEST_SUBMASK)==REQUEST_SUB0) //Finished the request?
                {
                    if (BIU_response(1)) //Result given? We're giving OK!
                    {
                        BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                    }
                    else //Response failed?
                    {
                        BIU[activeCPU].currentrequest &= ~REQUEST_SUB1; //Request low 8-bit half again(low byte)!
                    }
                }
                else
                {
                    fulltransfer = BIU_isfulltransfer(); //Are we a full transfer?
                    ++BIU[activeCPU].currentaddress; //Next address!
                    if (fulltransfer) goto fulltransferIOwrite; //Start Full transfer, when available?
                }
                return 1; //Handled!
                break;
            default:
            case REQUEST_NONE: //Unknown request?
                BIU[activeCPU].currentrequest = REQUEST_NONE; //No request anymore! We're finished!
                break; //Ignore the entire request!
            }
        }
    }
    return 0; //No requests left!
}
else if ((BIU[activeCPU].currentrequest&REQUEST_32BIT) && ((BIU[activeCPU].currentaddress&1)==0)) //Word-Aligned 32-bit access, but not 32-bit aligned? Break up into word accesses, when possible!
That is what I am doing too, but I think that is slightly wrong (if I understand Scali correctly). That is only true for a 16-bit bus. So I am reworking it to do different things for a 32-bit bus versus a 16-bit bus. I align the address down, then read more and apply a mask if necessary.
A 32-bit read from address 6 (a multiple of 2 but not of 4) will still result in two 16-bit reads, from 6 and 8, on a 16-bit bus.
But on a 32-bit bus it will result in two 32-bit reads, one from address 4 and one from address 8 (with a mask).
I am looking at old CPU identification code written in assembly (WHICHCPU), which, when detecting whether it is a DX or an SX chip, does 8k LODSD iterations, once from address 1 and once from address 2.
It expects the SX to be faster from address 2, but the DX to be slow in both cases, which substantiates what I said above.
Don't those accesses from 6/8 make no difference in timings? Either two 32-bit accesses or two 16-bit accesses: it's two accesses either way (with half of each discarded on 32-bit), so no difference in timings? So you might as well use 16-bit accesses in that case and not need that complicated masking (and hardware that would need complex logic to do it, which is heavy processing)?
Essentially what my code does is: 32-bit aligned? Then 32-bit when used. Else, 16-bit aligned for 32-bit/16-bit? Then 16-bit accesses. Otherwise, perform 8-bit accesses only. So aligned accesses are fastest (when <= data width, so 32-bit with 32-bit alignment, 16-bit with 16-bit alignment, 8-bit always), then unaligned but word-aligned (32-bit at 2, 6, 10, etc.), slowest being byte-aligned (32/16-bit at address 1/3/5/7, etc.). Although a dword at address 3 will result in 4 byte fetches instead of 2 (d)word fetches.
Edit: Since the 80386DX wasn't on a modern motherboard with those dword masks (didn't they exist back then?), the break-up logic is still valid? Either 1 dword (aligned at mod 4), 2 words (word-aligned at mod 2) or 4 bytes (not aligned)? Did that mask even exist on a 386DX?
Essentially what my code does is: 32-bit aligned? Then 32-bit when used. Else, 16-bit aligned for 32-bit/16-bit? Then 16-bit accesses. Otherwise, perform 8-bit accesses only. So aligned accesses are fastest (when <= data width, so 32-bit with 32-bit alignment, 16-bit with 16-bit alignment, 8-bit always), then unaligned but word-aligned (32-bit at 2, 6, 10, etc.), slowest being byte-aligned (32/16-bit at address 1/3/5/7, etc.). Although a dword at address 3 will result in 4 byte fetches instead of 2 (d)word fetches.
That is only correct for a 16-bit bus, not for 32. On a 386DX a dword access at address 3 should result in 2 dword accesses. You cannot break it down into 16-bit or 8-bit accesses, because the bus is 32-bit: it can only read 32-bit data. See this from the 386 manual I linked above:
When used in a configuration with a 32-bit bus, actual transfers of data between processor and memory take place in units of doublewords beginning at addresses evenly divisible by four; however, the processor converts requests for misaligned words or doublewords into the appropriate sequences of requests acceptable to the memory interface. Such misaligned data transfers reduce performance by requiring extra memory cycles.
The part that I underlined above is what Scali was saying: both a 16-bit bus and a 32-bit bus will always read more than needed if they have to read from an unaligned address. They cannot break it into 8-bit transfers and read just what they need.
In CAPE my memory code behaves somewhat similarly to yours, but I realize now that this is wrong.
The question I had is: can the 386DX issue byte reads on the 32-bit bus? I believe it cannot, but I am still not 100% sure.
I'd assume it has to be able to do 8-bit reads/writes. Imagine reading VGA VRAM with 32-bit reads only. That would make it incompatible with all (S)VGA read/write modes. Like write mode 1 for moving 4 planes at once using a byte read/write with MOVSB, causing the wrong bytes to be latched into memory (it should result in byte 0 (planes 0-3) being latched and written to byte 0 (planes 0-3), but on a 386DX it would instead cause byte 3 (planes 0-3) to be written to bytes 0-3 (planes 0-3))?
We have to treat the ISA bus differently. There is 8-bit ISA and 16-bit ISA, so by the time the request reaches the VGA card it is not in the form the CPU sent it. Those should be OK to separate into byte accesses. I believe a 16-bit ISA VGA card can tell an 8-bit access from a 16-bit access, but I am not 100% sure.
Superfury, can you configure your emulator to emulate either a DX or an SX 386? If you can, could you please run WHICHCPU? I found it on the hard disk image (which I believe you have too) from the 8086tiny emulator. If you do not have it, I can give it to you.
We have to treat the ISA bus differently. There is 8-bit ISA and 16-bit ISA, so by the time the request reaches the VGA card it is not in the form the CPU sent it. Those should be OK to separate into byte accesses. I believe a 16-bit ISA VGA card can tell an 8-bit access from a 16-bit access, but I am not 100% sure.
The ISA bus has special lines to indicate 16-bit transfers: http://pinouts.ru/Slots/ISA_pinout.shtml
So ISA slots work as 8-bit by default, but a 16-bit card can signal on the extended part of a 16-bit ISA slot that it uses 16-bit memory or IO transfers.
See here for more info (signals SBHE, MEMCS16 and IO16): http://pinouts.ru/Slots/ISA_pinout.shtml
In short, the system will pull SBHE low, and the card has to respond with MEMCS16/IOCS16 to accept a 16-bit transfer. If the card does not, the system will split the transfer into two 8-bit transfers (for backward compatibility; 8-bit slots have no SBHE and MEMCS16/IOCS16).
OK, CAPE now implements 16-bit and 32-bit buses properly (I have separate classes for each). That is, it produces bus transactions of the size of the bus (either 16-bit or 32-bit) and then masks out data.
So for a 32bit load from offset 3 or 6 I see this:
That's interesting, though... apparently it can eliminate 2 redundant 16-bit loads at offset 6, but it cannot eliminate a single redundant 16-bit load at offset 3.
Because, worst case, you only ever need to load three 16-bit words for an unaligned 32-bit read.
So that may be the 'thinking in 32-bit'... It creates one or two 32-bit fetches depending on alignment, and an entire 32-bit fetch can be eliminated as redundant... but the 16-bit accesses can't be eliminated.
This is an interesting thing to verify on hardware.
It still seems odd to always perform 32-bit or 16-bit accesses. Imagine a 32-bit access on the VGA VRAM window edge case (like prefetches) at 9FFFF when reading a byte (valid under protection). It would fetch VRAM as well in your case, even when partly masked off? So 9FFFF-A0003, causing VRAM to be latched when always doing 32-bit accesses, even when further broken up afterwards with a mask (it's still being read and latched on a VGA)?
Edit: Reading http://www.phatcode.net/res/260/files/html/Sy … nizationa2.html , it seems that an odd memory address might even generate three memory accesses? One at x (byte), one at x+1 (word) and one at x+3 (byte)? That would happen on every odd address? But no other bytes would actually be read with 32-bit quantities only?
I think the VGA latches simply respond to what address is on the bus.
So unless you actually fetch a word or dword starting at A0000h or higher, the VGA won't 'see' it.
Having said that, there are indeed 'wraparound' issues when doing unaligned word/dword reads or writes at the end of a segment.
I ran into a bug on my 286 at one point, and the reason was that I used rep movsw starting on an unaligned address.
This worked fine on an 8088, but on a 286 I got unexpected side-effects. The write didn't wrap around as I expected.
What about a protected-mode segment pointing to base 9FFFF? What would a byte read from offset 0 do? Will it read into VRAM? Or will it just read 9FFFD?
Just wanted to try running the app (WHICHCPU) when I noticed a bug causing the disk image to be unreadable by MS-DOS (wrongly reported default CHS values). After having fixed those, the MS-DOS 6.22 sfdimg/img disk images became unreadable (data 100MB buffer img), and after fixing the CHS calculation (it was using bytes instead of the sectors the formula expects), the main HDD image (sfdimg disk) refused to boot MS-DOS 6.22 as well. Then, after creating a new disk image (to move files to later using WinImage) and partitioning it, and after rebooting (80386 XT configuration on UniPCemu), trying to format, the FDD emulation started to fail, it seems (either that or the HDD emulation)? There were long delays on reading port 1F7 (new HDD image) during booting, but even running "format c: /q /u /s" after the long boot seems to crap out the FDC emulation somehow (Turbo XT/XTIDE BIOS) with some drive-not-ready error?
The address placed on the address bus is always some multiple of four. Using various "byte enable" lines, the CPU can select which of the four bytes at that address the software wants to access.
This is how CAPE currently implements its BIU and bus operations. Although I still need to honor the graphics card's (ISA) bus width, either 8-bit or 16-bit; I have not done that yet.