Reply 40 of 81, by Scali
wrote:They not only back-ported SSE2 but also the already mentioned quad-pumped FSB and also the multi-core feature (Core 2 Duo) and hyperthreading (Core i, if you still consider that P6 architecture - IMO it is more related to P6 than to Netburst). (Admittedly the dual-core Pentium D is not two integrated cores like the Core Duo, more like the two dual-cores slapped together in the Core 2 Quad.)
AFAIR Netburst was not only limited by the long pipeline, but also by the narrow decoder which was partially compensated by the L1 code cache being a trace cache. As long as code is executed from L1 everything is fine, but when running code residing only in L2 the decoder can be the bottleneck (L2 having very high bandwidth).
They did back-port the trace cache idea to the Core line as well, starting with Nehalem.
The difference is that it is now used only for loops, so they don't rely on the trace cache for all execution. That reliance was somewhat of a bottleneck; then again, the Pentium 4 had only one decoder, where Core has 3 or 4.
They don't call it a trace cache anymore, but it is the same thing, more or less. They have a 'loop stream detector', and when a loop is detected, its instructions are cached in uOp form rather than as x86 opcode bytes. The cache is just a lot smaller, because it only has to store one loop at a time (originally 28 uOps per thread; Skylake holds 64 uOps per thread). So it is just a simple uOp instruction queue.
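
To make the idea concrete, here is a toy model in Python of a uOp queue with loop detection. It is only a sketch under my own assumptions, not how the hardware is actually built: the function name `simulate_lsd`, the use of plain integers as uOp "addresses", and the rule "a backward jump to an address still in the queue means a loop" are all made up for illustration. It just counts how many uOps have to go through the decoder versus how many can be replayed from the queue.

```python
from collections import deque


def simulate_lsd(trace, capacity=28):
    """Toy loop-stream model (hypothetical, for illustration only).

    trace:    uOp addresses in execution order (plain integers).
    capacity: number of uOps the queue can hold.
    Returns (decoded, streamed): uOps that went through the decoder
    versus uOps replayed from the queue while a loop was locked.
    """
    queue = deque(maxlen=capacity)  # most recently decoded uOps
    loop = None                     # locked loop body, or None
    pos = 0                         # next expected index in the loop body
    decoded = streamed = 0
    prev = None

    for addr in trace:
        if loop is not None:
            if addr == loop[pos]:
                # Still inside the locked loop: replay from the queue.
                streamed += 1
                pos = (pos + 1) % len(loop)
                prev = addr
                continue
            # Execution left the loop: unlock and fall back to decoding.
            loop = None
            queue.clear()

        decoded += 1
        # A backward jump to an address still held in the queue means
        # the whole loop body fits, so lock it.
        if prev is not None and addr <= prev and addr in queue:
            items = list(queue)
            loop = items[items.index(addr):]
            pos = 1 % len(loop)  # this uOp was the loop head
        else:
            queue.append(addr)
        prev = addr

    return decoded, streamed


if __name__ == "__main__":
    small_loop = [0, 1] + [10, 11, 12, 13] * 100 + [14]
    big_loop = list(range(100, 140)) * 10  # 40-uOp body, 10 iterations

    print(simulate_lsd(small_loop, capacity=28))
    print(simulate_lsd(big_loop, capacity=28))
    print(simulate_lsd(big_loop, capacity=64))
```

In this model a 4-uOp loop is decoded once and then streamed from the queue, while a 40-uOp loop never locks with a 28-entry queue and has to be decoded on every iteration. The same 40-uOp loop does lock with 64 entries, which is the practical effect of the larger queue.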