Reply 80 of 124, by Scali
wrote:That is pretty interesting. What I don't really understand is how the logic works to re-order the instructions. This logic is presumably hardware-based and on the CPU itself. I would imagine it is quite complex, but what's the crux of the algorithm? Is it like a little ASIC on the CPU die? Also, to reorder the instructions, the CPU must gather multiple instructions for the reordering. What exactly does this mean? Are there, for example, a half-dozen instructions laid out in the pipeline, probably in a shift register, that then get reordered within that shift register? And what ordering would be optimal? It is a fascinating subject that I wish I knew more about from a high-level perspective.
The CPU is basically 'split in two'.
The first part decodes instructions into a buffer, into what is commonly known as 'micro ops'. These are simple RISC-like instructions for the internal execution pipeline.
The second part contains a scheduler. This is basically a form of 'scoreboarding': the CPU keeps track of which operands and execution units are ready for execution.
So the CPU will constantly scan the buffer of decoded instructions for instructions that it can execute. The order is basically determined by availability of the resources. It is not necessarily the 'optimal' ordering for the entire program, but it is reasonably optimal for the instructions in the window.
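To make the idea concrete, here is a toy sketch of that selection logic in Python. This is nothing like real hardware (no register renaming, every micro-op takes one 'cycle', operands and unit counts are made up for the example), it just shows "scan the window, issue whatever has its inputs ready":

```python
def schedule(window, ready, num_units=2):
    """Toy out-of-order scheduler.
    window: micro-ops as (name, registers_read, register_written), in program order.
    ready: set of registers whose values are already available.
    Returns the order in which the micro-ops actually issue."""
    ready = set(ready)
    pending = list(window)
    order = []
    while pending:
        # Each 'cycle': pick up to num_units micro-ops whose inputs are all ready.
        issued = [u for u in pending if all(r in ready for r in u[1])][:num_units]
        if not issued:
            raise RuntimeError("no micro-op is ready; broken dependency chain")
        for u in issued:
            pending.remove(u)
            order.append(u[0])
        for _, _, written in issued:
            ready.add(written)  # results only become visible the next cycle
    return order

# Example window: add1 depends on mul1, but mul2 is independent,
# so the scheduler issues mul2 ahead of add1, out of program order.
uops = [("mul1", {"a", "b"}, "t1"),    # t1 = a*b
        ("add1", {"t1", "c"}, "t2"),   # needs t1 -> has to wait
        ("mul2", {"c", "d"}, "t3"),    # independent -> can go early
        ("add2", {"t2", "t3"}, "t4")]  # joins both chains, goes last
print(schedule(uops, {"a", "b", "c", "d"}))
# -> ['mul1', 'mul2', 'add1', 'add2']
```

Note that the result is exactly what the post describes: the order falls out of resource availability per cycle, not from any global plan for the whole program.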
This makes optimizing for these CPUs an interesting problem: instead of trying to reduce your algorithm to the minimum number of instructions, you now need to think about the out-of-order logic. Two things are important:
1) Use the simplest encoding of instructions whenever possible (some x86 instructions map 1:1 to micro ops, others decode into multiple dependent micro ops, taking more resources both for decoding and execution).
2) Reduce dependencies of instructions on previous results to a minimum. The CPU only has a limited 'window' of instructions it can pre-decode and buffer, so the input for the out-of-order logic is limited. If you fill the buffer with instructions that each depend on the previous one, then it cannot take advantage of instruction-level parallelism.
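Point 2 is easy to see with a toy critical-path calculation (again a made-up model, one 'cycle' per micro-op, unlimited execution units): summing four values as one serial chain versus two independent partial sums that are combined at the end.

```python
def critical_path(uops):
    """Length of the longest dependency chain in a list of
    (name, registers_read, register_written) micro-ops, i.e. the minimum
    number of 'cycles' even with unlimited execution units."""
    depth = {}  # register -> cycle in which its value is ready
    longest = 0
    for _, reads, written in uops:
        d = 1 + max((depth.get(r, 0) for r in reads), default=0)
        depth[written] = d
        longest = max(longest, d)
    return longest

# Summing x0..x3 serially: every add waits for the previous result.
serial = [("add", {"x0", "x1"}, "s1"),
          ("add", {"s1", "x2"}, "s2"),
          ("add", {"s2", "x3"}, "s3")]

# Same sum with two independent partial sums: the first two adds
# have no dependency on each other, so they can issue together.
paired = [("add", {"x0", "x1"}, "a"),
          ("add", {"x2", "x3"}, "b"),
          ("add", {"a", "b"}, "s")]

print(critical_path(serial))  # -> 3
print(critical_path(paired))  # -> 2
```

Same number of instructions, same result, but the second version gives the out-of-order logic something to work with inside its window.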
Hyperthreading is a clever trick to ease the problem in 2): by decoding not one but two (or more) threads at the same time, the scheduler has a lot more independent instructions to choose from. Instructions from different threads are independent by definition.
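Extending the same toy model (still made up: one-cycle micro-ops, two execution units, no renaming or other real-hardware details), mixing two threads' micro-ops in one window shows the effect. A serial chain from one thread keeps one unit idle every cycle; add a second thread's chain, which touches only its own registers, and the same number of cycles retires twice the work:

```python
def cycles(uops, ready, num_units=2):
    """Toy cycle count: each 'cycle', issue up to num_units micro-ops
    (name, registers_read, register_written) whose inputs are ready;
    results become available the following cycle."""
    ready = set(ready)
    pending = list(uops)
    n = 0
    while pending:
        issued = [u for u in pending if all(r in ready for r in u[1])][:num_units]
        if not issued:
            raise RuntimeError("no micro-op is ready; broken dependency chain")
        for u in issued:
            pending.remove(u)
        for _, _, written in issued:
            ready.add(written)
        n += 1
    return n

# Thread 0: a serial chain of 4 adds -> one unit sits idle every cycle.
t0 = [("add", {"a0"}, "a1"), ("add", {"a1"}, "a2"),
      ("add", {"a2"}, "a3"), ("add", {"a3"}, "a4")]
# Thread 1: same shape, but its own registers -> independent by definition.
t1 = [("add", {"b0"}, "b1"), ("add", {"b1"}, "b2"),
      ("add", {"b2"}, "b3"), ("add", {"b3"}, "b4")]

print(cycles(t0, {"a0"}))             # -> 4 cycles for 4 micro-ops
print(cycles(t0 + t1, {"a0", "b0"}))  # -> 4 cycles for 8 micro-ops
```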