I always find the Core 2 vs. P4 comparison discussions interesting. I have no love for the P4 architecture, but as an EE with a digital design background I can say that it was way ahead of its time (by about 20 years).
Intel likely realized that, with clock speed being king, they would have to lengthen the pipeline to maintain stable operation at what were then extremely high clock speeds. It was a matter of transistor-to-transistor latency that process improvements could never overcome: a large stage takes too long to finish its work within a single, fast clock tick. Past a certain clock speed, the next tick arrives before the current stage completes. To go faster you need smaller stages, either by breaking a complex stage into pieces or by designing a more efficient one. Intel opted to break up the complex stages (vice making them more efficient), likely knowing they would reach that point eventually anyway. The P4, as a result, performed poorly at its initial clock speeds, but ramped up to 3 GHz+ in relatively short order.
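To make the timing argument concrete, here is a back-of-envelope sketch. The clock period has to cover the slowest pipeline stage plus latch/flip-flop overhead, so splitting a slow stage raises the frequency ceiling. All the delay numbers below are made up for illustration, not real P4 figures:

```python
# f_max = 1 / (slowest_stage_delay + latch_overhead)
# Illustrative numbers only -- not actual P4 stage delays.

def max_clock_ghz(stage_delays_ns, latch_overhead_ns=0.05):
    """Highest stable clock (GHz) given per-stage delays in nanoseconds."""
    return 1.0 / (max(stage_delays_ns) + latch_overhead_ns)

# A shallow pipeline with one big, complex stage...
shallow = [0.30, 0.45, 0.30, 0.35]             # ns per stage
# ...versus the same work split across more, smaller stages.
deep    = [0.30, 0.22, 0.23, 0.30, 0.20, 0.15]

print(f"shallow pipeline: ~{max_clock_ghz(shallow):.1f} GHz ceiling")  # ~2.0 GHz
print(f"deep pipeline:    ~{max_clock_ghz(deep):.1f} GHz ceiling")     # ~2.9 GHz
```

The catch, of course, is that every extra stage adds latch overhead and makes branch mispredictions more expensive, which is exactly the trade the P4 made.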
Still, the P4's stage design was really meant for 5 GHz+ speeds that the manufacturing processes of the time couldn't support. The power draw, excessive for the era, was simply a function of process size, clock speed, transistor density, transistor usage, and voltage. You can see all of this at play today in the current 5 GHz+ designs from both Intel and AMD. Both companies employ much longer pipelines than the superscalar chips of the late 90s, and transistor density has pushed power consumption far higher than we felt comfortable with in those days. As I said in another thread, we don't bat an eye at 20-stage pipelines and 100 W+ chips these days, but we hate that it happened 20 years ago.
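The "function of process, clock, density, and voltage" bit is just the classic first-order dynamic power model, P ≈ α·C·V²·f. A quick sketch with made-up component values (the capacitance, voltage, and activity numbers are illustrative, not measurements of any real chip):

```python
# First-order switching power: P ~= alpha * C * V^2 * f
#   alpha = activity factor, C = total switched capacitance,
#   V = supply voltage, f = clock frequency.
# Values below are illustrative only.

def dynamic_power_w(c_farads, v_volts, f_hz, alpha=0.1):
    """First-order dynamic (switching) power estimate in watts."""
    return alpha * c_farads * v_volts**2 * f_hz

# Raising the clock scales power linearly; raising Vdd hurts quadratically.
base = dynamic_power_w(c_farads=100e-9, v_volts=1.4, f_hz=2.0e9)
fast = dynamic_power_w(c_farads=100e-9, v_volts=1.5, f_hz=3.2e9)
print(f"~{base:.0f} W -> ~{fast:.0f} W")  # ~39 W -> ~72 W
```

That quadratic voltage term is why pushing an old process to new clock speeds got ugly so quickly.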
BTW, I suspect that a Core 2 Duo manufactured on a 90 nm process and running at 3 GHz would have consumed 100 W+ too, if one could have run that fast. Just saying that history needs a little perspective.
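For what it's worth, the same P ≈ α·C·V²·f scaling makes that suspicion plausible. The baseline and scaling factors below are my own guesses (a 65 nm Core 2 Duo around 3 GHz at roughly 65 W, with somewhat higher Vdd and more switched capacitance on a 90 nm node), not measured data:

```python
# Rough sanity check of the "Core 2 on 90 nm at 3 GHz" claim.
# Baseline TDP and scaling factors are assumptions, not measurements.

P_65nm    = 65.0                 # W, assumed 65 nm Core 2 Duo near 3 GHz
cap_scale = 1.4                  # guess: more switched capacitance on 90 nm
v_scale   = (1.3 / 1.2) ** 2     # guess: ~1.3 V on 90 nm vs ~1.2 V on 65 nm

P_90nm = P_65nm * cap_scale * v_scale
print(f"~{P_90nm:.0f} W")        # lands around ~107 W
```

Crude as it is, it puts the hypothetical 90 nm part right in the 100 W+ neighborhood the P4 was criticized for.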