GL1zdA wrote:
It doesn't say that.
It says: "The P6 processor family uses a decoupled, 12-stage superpipeline..."
So that is the Pentium Pro, II and III. Intel does not consider the Pentium M and later to be P6-family: they are not listed in section 2.1.6 and have their own section, 2.1.9.
I suppose the confusion over 10 versus 12 pipeline stages comes down to what you're measuring. I believe 10 stages is the shortest path, while FPU/SSE/MMX instructions need two extra stages, bringing the worst case up to 12.
Likewise, the 31-stage figure for Prescott is the worst case, while 28 stages is the best case, IIRC. This is complicated further by the fact that the P4 splits its pipeline into a decoding part (before the trace cache) and an execution part. As a software developer, you don't 'see' what happens before the trace cache, so extra stages added to the decoder part don't affect the performance of your actual code as long as it runs from the trace cache.
That is why Intel doesn't distinguish between different versions of NetBurst in the optimization manual. They all follow the same optimization rules and instruction timings.
GL1zdA wrote:
I jumped to another topic too quickly. I meant that the original P6 architecture (Pentium Pro to Katmai) lasted longer than the first P4 architecture (Willamette to Northwood).
That's comparing apples and oranges. P6 should be compared to all of NetBurst, not just to Willamette-Northwood. That's why the optimization manual is laid out the way it is: this is how Intel classifies its CPUs into different microarchitectures, and how you need to optimize code for each microarchitecture.