Reply 80 of 279, by gdjacobs
wrote:No, this is a common misconception, going back to the days of the Pentium 4 (AMD claimed they didn't need the technology, because they didn't have the technology. Everyone bought into that story, except me. Because I do critical thinking).
Did you ever bother to think it through in the context of a modern x86?
A Core i7 has a lot of execution units per core. The legacy two-operand instruction set of x86, however, is inadequate to feed all of these units every cycle.
This is not a 'pipeline stall' in the traditional sense, but you do have many units that are sitting idle every cycle.
By feeding two or more instruction streams, you can reach better utilization of these units.
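The quoted utilization argument is easy to sketch in miniature: a single dependency chain is latency-bound and leaves most execution units idle, while independent chains (whether from extra ILP in one thread, or from a second hardware thread) can fill them. A rough C++ illustration, not a benchmark; build with something like g++ -O2, and expect the exact speedup to vary by microarchitecture:

```cpp
#include <chrono>
#include <cstdio>

// One long dependency chain: every add waits on the one before it, so most
// of the core's execution units sit idle each cycle.
double one_chain(long n) {
    double a = 0.0;
    for (long i = 0; i < n; ++i)
        a += 1e-9;
    return a;
}

// The same number of adds split into four independent chains: the
// out-of-order core can issue these in parallel, which is exactly the kind
// of independent work a second SMT thread also supplies.
double four_chains(long n) {
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    for (long i = 0; i < n; i += 4) {
        a += 1e-9; b += 1e-9; c += 1e-9; d += 1e-9;
    }
    return a + b + c + d;
}

static double time_it(double (*f)(long), long n, const char* label) {
    auto t0 = std::chrono::steady_clock::now();
    volatile double sink = f(n);   // keep the optimizer from dropping the loop
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%-25s %.3fs\n", label, s);
    return s;
}

int main() {
    const long n = 1000000000L;   // one billion additions either way
    time_it(one_chain, n, "one dependency chain:");
    time_it(four_chains, n, "four independent chains:");
    return 0;
}
```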
CPUs starting with Core 2 were also able to benefit from wider vector instructions, which I believe are much more beneficial in terms of performance. Sharing common resources between two (or more) threads is certainly a thing, though. Even AMD adopted that approach with the floating point unit of Piledriver; they just made it quite anemic. It just strikes me as more of a workaround for improperly vectorized code.
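To put the vectorization point next to it: the same reduction written scalar vs. with SIMD intrinsics. This uses AVX, which is later than Core 2's 128-bit SSE, but it's the same idea; it assumes an AVX-capable CPU and something like g++ -O2 -mavx, and note the 8-wide sum reassociates the additions, so rounding can differ slightly:

```cpp
#include <immintrin.h>   // AVX intrinsics
#include <cstddef>

// Scalar reduction: one addition retired per element.
float sum_scalar(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// AVX reduction: eight additions per vector instruction. Properly vectorized
// code gets its throughput from one stream, without leaning on SMT to keep
// the wide units busy.
float sum_avx(const float* a, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3]
            + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    for (; i < n; ++i)   // scalar tail for n not divisible by 8
        s += a[i];
    return s;
}
```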
wrote:Which is why HT works at least as well on modern Core i7 CPUs as it did on the Pentium 4, while Core i7 is the complete opposite of the Pentium 4 in terms of pipeline design, stalls, and cost of these stalls.
(Why would Intel have brought back HT otherwise? It was gone from the Core 2, because Core 2 was not based on the Pentium 4 architecture, so HT had to be migrated over, which they did in Nehalem, and it has stayed there ever since. If there were no merit to it, it wouldn't be here today, and AMD certainly wouldn't be trying to copy it.)
My contention wasn't that it's completely ineffective, just that it's most profitable when it helps mitigate pipeline stalls. How large is the cost in die area? If it's inexpensive, it may be desirable even with more limited benefit. Plus, shorter pipelines still have a stall penalty; it's just smaller. And in some cases, SMT might help keep the execution units stuffed.
You're right that I haven't looked at this in depth lately. Have you got any reference material for i7 which looks at performance deltas with SMT?
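In the meantime, a rough way to measure the delta yourself on Linux: pin two copies of a latency-bound kernel either to two sibling hyperthreads or to two separate cores, and compare wall-clock times. This is only a sketch; the logical CPU numbers (0 and 4 as siblings, 0 and 1 as separate cores) are assumptions, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your own machine, and build with g++ -O2 -pthread:

```cpp
// Linux-only sketch: uses pthread_setaffinity_np to pin threads.
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Latency-bound kernel: one serial dependency chain, so a lone thread leaves
// execution units idle and a sibling hyperthread has room to run in them.
static double chain(long iters) {
    double x = 1.0;
    for (long i = 0; i < iters; ++i)
        x = x * 1.0000001 + 1e-9;   // each step depends on the previous one
    return x;
}

static void run_pinned(int cpu, long iters, double* out) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    *out = chain(iters);
}

static double time_pair(int cpu_a, int cpu_b, long iters) {
    double r0 = 0.0, r1 = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(run_pinned, cpu_a, iters, &r0);
    std::thread b(run_pinned, cpu_b, iters, &r1);
    a.join();
    b.join();
    auto t1 = std::chrono::steady_clock::now();
    std::printf("(checksum %.6f)\n", r0 + r1);   // keep the work observable
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const long iters = 400000000L;
    // CPU numbers below are assumptions; verify the topology first.
    std::printf("siblings:       %.3fs\n", time_pair(0, 4, iters));
    std::printf("separate cores: %.3fs\n", time_pair(0, 1, iters));
    return 0;
}
```

If the sibling pair lands close to the separate-core pair on a chain like this, SMT is filling the idle units nicely; rerun with a throughput-bound kernel and the siblings should start slowing each other down instead.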
wrote:As for RISC, there's no such thing anymore. We are well into the post-RISC era, where even 'true' RISC architectures are now very similar to x86 in how they execute legacy code: a lot of instructions aren't implemented in hardware, but are decoded to microcode, or even emulated in software altogether.
x86 itself has of course also been using a RISC backend since the Pentium Pro era. The boundaries between the two are fading fast.
But if you want to argue about older RISC: if anything, the main characteristic of RISC has been very short and simple pipelines, with few and short stalls. Yet RISC is where you saw SMT first, on the DEC Alpha architecture, which is where Intel got their HT technology from. Of course, it was originally developed by IBM and also implemented on their POWER RISC architecture. Both have considerably shorter pipelines than the Pentium 4, in fact very similar to modern Core i7 pipelines (in the range of 14-16 stages, where the Pentium 4 was 28+).
Ah, that's why I wasn't aware of SMT coming from the Alpha team: the sweet, sweet unreleased EV8.
I agree completely regarding the state of modern architectures. Essentially everyone has adopted a RISC backend. Intel and AMD simply have substantial hardware decoding strapped to the front end. That's what I meant by everyone riffing on the same tune these days.
wrote:You might want to update your knowledge, because it all sounds more than 10 years out of date.
Any pointers, just point them my way!
All hail the Great Capacitor Brand Finder