VOGONS


Clocks per memory read


First post, by vladstamate

Rank: Oldbie

Say I have this instruction:

ADD [SI], AX

How many CPU cycles should this take on an 8086, 8088, 286 and so on?

According to this it should be 16 cycles (for 808x) + memory access. I am looking for the "+ memory access" part. Say I have 200 ns memory (typical of an IBM XT machine): how long does it take to read each byte? I need to read 2 bytes (at [SI]) and then write 2 bytes (again at [SI]), for a total of 4 bytes transferred. Given the memory timing, is this basically 4 cycles? Or is it 2 on the 8086 (since it has a 16-bit bus)?

I am basically looking for a relationship between memory timing and bus/CPU cycles per byte read. I would like to extend this formula to the 286 and 386, which my emulator supports.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 1 of 9, by Jepael

Rank: Oldbie

I think you want the 8086 family reference manual.

Actually, the "+ memory access" part does not mean the memory read/write itself. It is the time to calculate the effective address (EA), i.e. which memory address to read or write.

So:
- The ADD instruction itself needs 16 cycles to perform the addition.
- EA calculation for [BX] (base or index only, so the same applies to [SI]) takes 5 clock cycles. A segment override adds 2 more cycles.
- Reading the source and writing the destination means 2 memory transfers. The 8086 needs 4 extra cycles for each 16-bit word transfer at an odd address; the 8088 needs 4 extra cycles for every 16-bit word transfer. But I can't determine whether an 8086 word access at an even address, or an 8088 byte access, is already accounted for in the 16 cycles of ADD.

But that is only the minimum. The actual number of cycles depends on a lot of factors, such as what was executed before or after it: the prefetch queue may be empty or full at some point, so the CPU might use bus cycles to fetch instruction bytes, and that could delay the memory reads/writes. Also, who knows what else is happening in the background, like memory refresh cycles or DMA transfers.
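To make the arithmetic above concrete, here is a minimal sketch in C of the minimum clock count for ADD [SI], AX. The function and enum names are made up for illustration. It assumes the open question above resolves one way: that the 16 base clocks already cover the two transfers when no penalty applies (word at an even address on the 8086), so only the extra 4 clocks per penalised word transfer are added.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; only the cycle figures come from the post above. */
enum cpu_kind { CPU_8086, CPU_8088 };

/* Minimum clocks for ADD [SI], AX (word operand, base-or-index-only EA).
 * ASSUMPTION: the 16 base clocks already include the two un-penalised
 * word transfers; only the +4 penalties are added on top. */
int add_mem_reg_min_clocks(enum cpu_kind cpu, unsigned address,
                           bool segment_override)
{
    assert(cpu == CPU_8086 || cpu == CPU_8088);

    const int base_clocks = 16;    /* ADD memory, register */
    const int ea_clocks = 5;       /* [BX] or [SI]: base or index only */
    const int word_transfers = 2;  /* one read, one write */

    int clocks = base_clocks + ea_clocks;
    if (segment_override)
        clocks += 2;

    /* 8088: every word transfer costs +4. 8086: only at odd addresses. */
    bool penalised = (cpu == CPU_8088) || ((address & 1u) != 0);
    if (penalised)
        clocks += 4 * word_transfers;

    return clocks;
}
```

Under that assumption, an 8086 with SI pointing at an even address gives 21 clocks, while an 8088 (or an 8086 with an odd address) gives 29. As noted above, this is a lower bound, not what a real machine will measure.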

Reply 2 of 9, by Scali

Rank: l33t

On an XT (8088 CPU) at 4.77 MHz, every byte takes 4 CPU cycles to read or write. That means *every byte*, so including the bytes for the instruction, and possibly bytes accessed by DMA.
As you probably know, there is a prefetch buffer for instructions, so instructions can be read ahead during 'idle' bus cycles, rather than at the moment the instruction will be executed.
So you should emulate the actual state of the data bus and the prefetch buffer. You can't just solve it on a per-instruction basis, because the same instruction may take longer or shorter depending on whether it was prefetched, whether there's DMA, and whether there are additional wait states (e.g. when you access video memory).
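For a feel of the numbers, a small C sketch that converts "4 CPU cycles per byte" into wall-clock time. The function name and the wait-state parameter are assumptions added for illustration.

```c
#include <assert.h>

/* Duration of one bus cycle in nanoseconds.
 * cpu_mhz:              CPU clock in MHz (4.77 on the PC/XT)
 * clocks_per_bus_cycle: 4 on 8088/8086
 * wait_states:          extra clocks inserted by slow devices (assumed knob) */
double bus_cycle_ns(double cpu_mhz, int clocks_per_bus_cycle, int wait_states)
{
    assert(cpu_mhz > 0.0 && clocks_per_bus_cycle > 0 && wait_states >= 0);

    double clock_ns = 1000.0 / cpu_mhz;  /* one CPU clock period */
    return clock_ns * (clocks_per_bus_cycle + wait_states);
}
```

At 4.77 MHz one clock is about 210 ns, so `bus_cycle_ns(4.77, 4, 0)` comes out at roughly 839 ns per byte. In this simplified view the 200 ns rating of the memory chips fits inside one bus cycle, so it is the bus cycle length, not the chip rating, that sets the per-byte cost.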

With faster systems, there are a lot of variables to take into account. Not all 286/386 machines use the same memory, and different chipsets can also affect things like memory performance. And then there's caching. In short, I think you can just 'wing it', by implementing an approximate emulation of a cache and memory controller, and giving the user some control over the speed (which they tended to have on more advanced BIOSes anyway, where you could select waitstates, and fast/slow refresh etc). Nobody would expect an exact speed for 286+ machines, because in practice they were all different. PC/XTs and clones didn't have that much variation. A lot of early clones used the exact same chipset as the IBMs, or a very close approximation of it.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 3 of 9, by vladstamate


Thank you Jepael and Scali (as usual, very helpful!).

Got the 808x family manual.

Scali guessed exactly what I want to do. I want to "wing it" while still being correct for the cycle values that I DO know (like instruction execution time) and approximate everything else.

I will check the 286 and 386 family user manual for cycles required for byte transfer (taking into account 16bit vs 32bit bus and individual CPU family ability to "swallow" bytes from the data bus).


Reply 4 of 9, by vladstamate


Here is my pseudocode, as accurate as I can get it, for instruction timing on 808x/286/386.

Whenever an instruction gets executed, it sets how many cycles to wait to simulate execution and the number of bytes to read or write. Based on those two pieces of information, plus CPU-specific details such as clocks per bus cycle and bus width, I can simulate the proper "speed" more or less accurately (with emphasis on less).

Input:

clocks_to_wait: how long an instruction should take (execution plus EA calculation only)
data_transfer_size: bytes to transfer per instruction
clock_per_byte_transfer: CPU specific
clock_count: how many clocks since power on

clock_count++;
// every 4th cycle for 8086
if((clock_count % clock_per_byte_transfer) == 0)
{
    // data_transfer_size is set by the instruction that just got executed, e.g. 4 bytes in case of ADD [BX], AX
    if(data_transfer_size)
        data_transfer_size--;
    else
    {
        // when there is nothing to transfer, fill the prefetch buffer
        if(!prefetch_full)
        {
            prefetch_buffer[prefetch_current] = read_data(CS:IP+prefetch_current);
            prefetch_current++;
            if(prefetch_current == prefetch_length)
                prefetch_full = true;
        }
    }
}
if(clocks_to_wait)
    clocks_to_wait--;
else
    if(prefetch_current > 0) // do we have at least 1 byte in the prefetch buffer?
        execute_next_instruction(prefetch_buffer);


Reply 5 of 9, by Scali


I would like to add that you should also check for alignment of the data.
A 286 can only read 16-bit aligned words.
If you read a 16-bit word from an odd address, it actually reads two aligned words, and extracts a byte from each, to generate the unaligned word in the register. So this takes twice as long.
For bytes there is no way to be misaligned, so it always just needs one access.

For 386 the same goes, except its (d)words are 32-bit, so you have 4-byte alignment.
I'm not entirely sure how it handles 16-bit accesses. In theory it could handle certain cases of unaligned 16-bit words in 1 cycle, as long as they fit into an aligned 32-bit dword.
But I think in practice it may not be 'clever' enough to access words from anywhere in the dword, so the same alignment rules apply as on 286.
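The alignment rule can be sketched in C as counting how many aligned bus-width units an access touches. The function name is made up. Note that this sketch assumes the "clever" behaviour for a 32-bit bus (a word that fits inside one aligned dword costs a single bus cycle), which is exactly the part that is uncertain above.

```c
#include <assert.h>

/* Number of bus cycles needed to transfer `size` bytes starting at
 * `address` over a bus that moves `bus_width` aligned bytes per cycle.
 * ASSUMPTION: any access contained in one aligned bus-width unit takes
 * a single cycle, including odd-address words on a 32-bit bus. */
unsigned bus_cycles_for_access(unsigned address, unsigned size,
                               unsigned bus_width)
{
    assert(size > 0);
    assert(bus_width == 1 || bus_width == 2 || bus_width == 4);

    unsigned first_unit = address / bus_width;
    unsigned last_unit = (address + size - 1) / bus_width;
    return last_unit - first_unit + 1;
}
```

With a 16-bit bus, a word at an even address gives 1 and a word at an odd address gives 2, matching the 286 description. With a 32-bit bus, a word at offset 1 inside a dword gives 1, while a word at offset 3 straddles two dwords and gives 2. If the 386 turns out to follow the stricter 286-style rule for words, the function would need a special case.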


Reply 6 of 9, by vladstamate

User metadata

Good point. I can roll that into "data_transfer_size": I will count the transfer twice if I see an unaligned read or write. For example, reading 16 bits from an odd address will add 4 to "data_transfer_size", whereas from an even address it will only add 2. My code also did not account for bus width. Here is an update with those two things corrected.

CPU_BYTES_PER_BUS_TRANSACTION =
- 1 for 8088
- 2 for 8086, 80186, 80286 and 80386SX
- 4 for 80386DX

EDIT: the prefetch buffer also fills up at the specific CPU bus width.

clock_count++;
// every 4th cycle for 8086
if((clock_count % clock_per_byte_transfer) == 0)
{
    // data_transfer_size is set by the instruction that just got executed, e.g. 4 bytes in case of ADD [BX], AX
    if(data_transfer_size)
    {
        // move one bus transaction's worth of data, clamping at 0
        if(data_transfer_size > CPU_BYTES_PER_BUS_TRANSACTION)
            data_transfer_size -= CPU_BYTES_PER_BUS_TRANSACTION;
        else
            data_transfer_size = 0;
    }
    else
    {
        // when there is nothing to transfer, fill the prefetch buffer
        if(!prefetch_full)
        {
            prefetch_buffer[prefetch_current] = read_data(CS:IP+prefetch_current);
            prefetch_current += CPU_BYTES_PER_BUS_TRANSACTION;
            if(prefetch_current >= prefetch_length)
                prefetch_full = true;
        }
    }
}
if(clocks_to_wait)
    clocks_to_wait--;
else
    if(prefetch_current > 0) // do we have at least 1 byte in the prefetch buffer?
        execute_next_instruction(prefetch_buffer);
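For anyone who wants to try the idea, here is a compilable C sketch of the same tick loop with the state gathered into a struct. All names are hypothetical, and the prefetch buffer is reduced to a fill counter (no bytes are actually fetched). One deliberate difference from the pseudocode: the helper keeps ticking until both the execution clocks and the pending data transfers are done, which is only one possible way to end an instruction.

```c
#include <assert.h>

/* Hypothetical state for the per-clock bus model sketched above. */
struct bus_model {
    unsigned clock_count;          /* clocks since power on */
    unsigned clocks_per_bus_cycle; /* 4 on 808x */
    unsigned bytes_per_bus_cycle;  /* 1 on 8088, 2 on 8086 */
    unsigned clocks_to_wait;       /* execution + EA clocks still to burn */
    unsigned data_transfer_size;   /* operand bytes still to move */
    unsigned prefetch_fill;        /* bytes currently in the prefetch buffer */
    unsigned prefetch_length;      /* prefetch buffer capacity */
};

/* Advance the model by one CPU clock. */
void bus_model_tick(struct bus_model *m)
{
    assert(m->clocks_per_bus_cycle > 0 && m->bytes_per_bus_cycle > 0);

    m->clock_count++;
    if (m->clock_count % m->clocks_per_bus_cycle == 0) {
        if (m->data_transfer_size > 0) {
            /* Operand data has priority; clamp so the count never underflows. */
            unsigned moved = m->bytes_per_bus_cycle;
            if (moved > m->data_transfer_size)
                moved = m->data_transfer_size;
            m->data_transfer_size -= moved;
        } else if (m->prefetch_fill < m->prefetch_length) {
            /* Idle bus cycle: top up the prefetch buffer instead. */
            m->prefetch_fill += m->bytes_per_bus_cycle;
            if (m->prefetch_fill > m->prefetch_length)
                m->prefetch_fill = m->prefetch_length;
        }
    }
    if (m->clocks_to_wait > 0)
        m->clocks_to_wait--;
}

/* Run one instruction to completion and return the clocks it consumed.
 * ASSUMPTION: an instruction ends when both counters reach zero. */
unsigned bus_model_run_instruction(struct bus_model *m, unsigned clocks,
                                   unsigned bytes)
{
    unsigned start = m->clock_count;
    m->clocks_to_wait = clocks;
    m->data_transfer_size = bytes;
    while (m->clocks_to_wait > 0 || m->data_transfer_size > 0)
        bus_model_tick(m);
    return m->clock_count - start;
}
```

Starting from a zeroed model with 4 clocks per bus cycle and 1 byte per cycle (8088-like), running an instruction with 21 clocks and 4 bytes takes 21 clocks, because the four transfers overlap with execution. A 2-clock instruction with the same 4 bytes takes 16 clocks, because the bus becomes the bottleneck.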


Reply 7 of 9, by vladstamate

Scali wrote:

For 386 the same goes, except its (d)words are 32-bit, so you have 4-byte alignment.
I'm not entirely sure how it handles 16-bit accesses. In theory it could handle certain cases of unaligned 16-bit words in 1 cycle, as long as they fit into an aligned 32-bit dword.
But I think in practice it may not be 'clever' enough to access words from anywhere in the dword, so the same alignment rules apply as on 286.

I found this document, which (partially) explains the unknowns for the 386:

- it can write 16 bits at odd addresses (I assume with no cycle penalty)
- 32-bit writes do not have to be word aligned
- it takes 2 CPU cycles per bus cycle (as opposed to 4 on 808x).


Reply 8 of 9, by Scali

vladstamate wrote:
I found this document which (partially) explains the unknowns for 386:

- it can write 16bits at odd addresses (I assume with no cycle penalty)
- 32bit writes do not have to be word aligned
- it takes 2 CPU cycles per bus cycles (as opposed to 808x which is 4).

Note that this document is for the 386SX, which has a 16-bit bus. So indeed, it does not suffer from 32-bit alignment problems.
386DX will work differently.
Perhaps Michael Abrash's Black Book is of some use to you: http://twimgs.com/ddj/abrashblackbook/gpbb11.pdf


Reply 9 of 9, by vladstamate


Thank you Scali.

Oh man, that chapter is GREAT! Among other things it reminded me of something I am ashamed to have forgotten: jumps clear the pre-fetch queue.
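In emulator terms, that could look something like the C sketch below. The struct and function names are made up for illustration, and the queue sizes in the comment (6 bytes on 8086, 4 on 8088) are the only hardware facts used.

```c
#include <assert.h>
#include <stdint.h>

#define PREFETCH_MAX 6  /* 8086 queue is 6 bytes; the 8088 uses 4 of them */

/* Hypothetical prefetch queue state for an emulator. */
struct prefetch_queue {
    uint8_t bytes[PREFETCH_MAX];
    unsigned length;    /* capacity for the emulated CPU (4 or 6) */
    unsigned fill;      /* how many bytes are currently valid */
    uint16_t fetch_ip;  /* offset the next prefetch will read from */
};

/* Call on any taken jump, call, return or interrupt: the queued bytes
 * belong to the old instruction stream, so drop them and restart
 * fetching at the branch target. */
void prefetch_flush(struct prefetch_queue *q, uint16_t new_ip)
{
    assert(q->length <= PREFETCH_MAX);

    q->fill = 0;
    q->fetch_ip = new_ip;
}
```

The timing consequence is that every byte of the instruction at the jump target has to come over the bus before it can execute, so a taken branch costs more than its listed clock count suggests.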
