From what I understand, the NetBurst architecture was inefficient due to its very long pipeline, which allowed for high clock rates but made branch mispredictions extremely expensive, since every miss means flushing and refilling the whole pipeline. As far as I can tell, Intel actually claimed the NetBurst branch predictor missed about a third less often than the one in the Pentium III, but with 20 to 31 stages to flush, each miss cost far more cycles. You can find more details on the NetBurst architecture here https://en.wikipedia.org/wiki/NetBurst_ ... hitecture) - but long story short, NetBurst was inefficient compared to its predecessor and its competitors, and no amount of re-engineering could solve the problem.
As far as I can figure, the Nehalem architecture (successor of the Core architecture), on which the first i7 and i5 CPUs are based, borrows from both the P6 line (wider core / multiple parallel pipelines) and the P7 (NetBurst) architecture (longer pipeline: 20-24 stages for Nehalem vs 12-14 for the Penryn/Yonah processors). While the Nehalem pipeline still doesn't have as many stages as the Prescott P7 (31-stage pipeline), it is nearly double that of the Penryn cores, so you could say Nehalem is a compromise between the P6 and the P7 architectures.
Intel seems to have gone a different route with current Core CPUs. Ivy Bridge and Haswell CPUs have a wider and shorter pipeline, only 14-19 stages, which more closely resembles the Pentium M core's 12-14 stage pipeline. This approach seems to make CPUs more power-efficient while maintaining or even increasing performance, and in this iteration (Haswell) it still allows for high clock frequencies thanks to newer technologies such as 3D tri-gate transistors and the 22nm manufacturing process.
One architecture (P6) seems to favor parallelism, while the other (P7) favors higher clock frequencies. It seems that modern Intel CPUs, particularly the Haswell 4th generation Core processors, have reverted to short / wide pipelines.
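To put rough numbers on that trade-off, here is a toy model in Python. It is only a sketch: the function names, the 20% branch fraction, the 5% misprediction rate and the two example designs are all made-up assumptions for illustration, and it assumes that a mispredicted branch costs about as many cycles as the pipeline has stages.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Instructions per cycle once branch-misprediction stalls are included."""
    base_cpi = 1.0 / base_ipc
    # Average stall cycles added per instruction by mispredicted branches.
    stall_cpi = branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


def relative_performance(clock_ghz, base_ipc, pipeline_stages,
                         branch_fraction=0.20, mispredict_rate=0.05):
    """Billions of instructions per second for a hypothetical core."""
    # Assumption: the flush penalty is roughly equal to the pipeline depth.
    ipc = effective_ipc(base_ipc, branch_fraction, mispredict_rate, pipeline_stages)
    return clock_ghz * ipc


# Two hypothetical designs, not measurements of real chips:
# a deep, narrow, high-clocked core vs a short, wide, lower-clocked core.
deep_narrow = relative_performance(clock_ghz=3.8, base_ipc=1.0, pipeline_stages=31)
short_wide = relative_performance(clock_ghz=2.6, base_ipc=2.0, pipeline_stages=14)

print(f"deep/narrow: {deep_narrow:.2f}  short/wide: {short_wide:.2f}")
```

With these made-up inputs the short / wide design comes out ahead despite the much lower clock, because it loses fewer cycles per mispredicted branch and does more work per cycle. Real CPUs are far more complicated, so treat this as intuition only.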
Compared to the original Nehalem i7, Haswell has shorter / wider pipelines, but maintains other architectural elements, such as hyper-threading, 64KB of L1 cache and 256KB of L2 cache per core, a shared L3 cache, Intel QPI, and integrated memory, PCIe and DMI controllers, so I guess things haven't changed significantly since 2008 😀. To be fair, Haswell has a "wider" core than its predecessors: four ALUs, a second branch execution unit, a third address generation unit, as well as deeper buffers and an improved memory controller. Floating point performance is also greatly increased in Haswell over previous generations, although I can't seem to figure out why. In synthetic FPU benchmarks (Julia, VP8) alone, the 47W i7-4710HQ powering my laptop scores higher than a desktop 77W i7-3770K I've been playing with. This is impressive considering that the 4710HQ turbos up to 3.4GHz while the 3770K runs at 3.4GHz and goes as high as 3.9GHz with enough thermal headroom. By clock speed alone, the FPU Julia test should favor the 3770K, but it doesn't.
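One plausible explanation for the FPU gap, offered as a guess rather than something I've verified against the benchmark itself: Haswell added FMA and AVX2, and with two 256-bit FMA units per core it can in theory do 16 double-precision floating point operations per cycle, versus 8 for Ivy Bridge (one AVX add plus one AVX multiply). The small Python sketch below just does that arithmetic using the turbo clocks quoted above; `peak_gflops` is a hypothetical helper, and theoretical peak is not the same as what a benchmark actually achieves.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak double-precision GFLOPS for a CPU."""
    return cores * clock_ghz * flops_per_cycle


# Per-core, per-cycle double-precision peaks (assumed from the public
# microarchitecture descriptions, not measured):
IVY_BRIDGE_DP_FLOPS_PER_CYCLE = 8   # one 4-wide AVX add + one 4-wide AVX multiply
HASWELL_DP_FLOPS_PER_CYCLE = 16     # two 4-wide FMA units, 2 flops per lane each

# Both chips are quad-core; clocks are the turbo figures from the post.
haswell_4710hq = peak_gflops(cores=4, clock_ghz=3.4,
                             flops_per_cycle=HASWELL_DP_FLOPS_PER_CYCLE)
ivy_3770k = peak_gflops(cores=4, clock_ghz=3.9,
                        flops_per_cycle=IVY_BRIDGE_DP_FLOPS_PER_CYCLE)

print(f"i7-4710HQ peak: {haswell_4710hq:.1f} GFLOPS")
print(f"i7-3770K peak:  {ivy_3770k:.1f} GFLOPS")
```

On paper that works out to roughly 218 vs 125 GFLOPS, so if the benchmark uses FMA/AVX2 code paths, the lower-clocked laptop chip winning would make sense.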
The Broadwell and Skylake microarchitectures seem to be (as far as I can understand) die shrinks of Haswell plus new instructions and DDR4 support. Skylake-U is also rumored to feature an L4 cache.
So I guess the NetBurst architecture wasn't a total waste. It brought high clocks, hyper-threading and L3 cache, features that are still in use today.