VOGONS


Was the P4 architecture a dead end?


Reply 20 of 81, by Scali

User metadata
Rank l33t

I have always wondered if P4 was meant to be a dead end. At the time, Intel was working on the Itanium, and the idea was to migrate from x86 to Itanium completely. 32-bit x86 would be the end of the line.
It took Intel a very long time to come up with a successor to the P4 (the P4 is the longest-running x86 architecture in the history of x86, which is especially strange since it wasn't exactly the most competitive one). It could be that Intel had to reverse a decision to stop x86 at some point, before they finally started working on the Core 2 Duo.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 21 of 81, by Anonymous Coward

User metadata
Rank l33t

I would love to see some high-level internal memos from Intel leak out one day...even going as far back as iAPX and the 286 era.

"Will the highways on the internets become more few?" -Gee Dubya
V'Ger XT|Upgraded AT|Ultimate 386|Super VL/EISA 486|SMP VL/EISA Pentium

Reply 22 of 81, by GL1zdA

User metadata
Rank Oldbie
kanecvr wrote:

Besides, modern Intel Core CPUs (Sandy Bridge, Haswell, etc.) use a 100MHz FSB and QuickPath Interconnect, so the FSB isn't a performance-determining factor anymore.

There's no such thing as FSB on modern CPUs.

getquake.gif | InfoWorld/PC Magazine Indices

Reply 23 of 81, by xjas

User metadata
Rank l33t
GL1zdA wrote:
kanecvr wrote:

Besides, modern Intel Core CPUs (Sandy Bridge, Haswell, etc.) use a 100MHz FSB and QuickPath Interconnect, so the FSB isn't a performance-determining factor anymore.

There's no such thing as FSB on modern CPUs.

What do you mean by that? What speed does the motherboard run at then?

twitch.tv/oldskooljay - playing the obscure, forgotten & weird - most Tuesdays & Thursdays @ 6:30 PM PDT. Bonus streams elsewhen!

Reply 24 of 81, by alexanrs

User metadata
Rank l33t
kanecvr wrote:

They quad-pumped the FSB on the Pentium M just as they did on the Pentium 4, so that wasn't an issue. Besides, modern Intel Core CPUs (Sandy Bridge, Haswell, etc.) use a 100MHz FSB and QuickPath Interconnect, so the FSB isn't a performance-determining factor anymore.

I should've said "Pentium 3 AS IS", as I did not mean the architecture per se, but rather the implementation. They needed a new platform.

Reply 25 of 81, by GL1zdA

User metadata
Rank Oldbie
xjas wrote:
GL1zdA wrote:
kanecvr wrote:

Besides, modern Intel Core CPUs (Sandy Bridge, Haswell, etc.) use a 100MHz FSB and QuickPath Interconnect, so the FSB isn't a performance-determining factor anymore.

There's no such thing as FSB on modern CPUs.

What do you mean by that? What speed does the motherboard run at then?

The motherboard? There's no such thing as "motherboard speed". During the P6 generation (from the Pentium Pro until the Core 2) the FSB was the bus between the CPU and the Northbridge (as opposed to the Back Side Bus between the CPU and the L2 cache). Since Nehalem, the FSB doesn't exist anymore - there's either a high-bandwidth ring bus or a crossbar inside the CPU that connects the CPU cores, memory agents etc., and QPI for communication between CPUs in a multiprocessor system. You can read about it here: http://www.realworldtech.com/sandy-bridge/8/ .
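To put rough numbers on how the interconnects compare, here is a quick Python sketch. The figures are illustrative examples (a Core 2-era 1066 MT/s FSB vs. a Nehalem-era 6.4 GT/s QPI link), not taken from any single datasheet:

```python
# Rough peak-bandwidth comparison: classic quad-pumped FSB vs. QPI.

def fsb_bandwidth_gbs(base_clock_mhz, pumping, bus_bytes):
    """Quad-pumped FSB: 4 transfers per base clock on a 64-bit (8-byte) bus."""
    return base_clock_mhz * pumping * bus_bytes / 1000

def qpi_bandwidth_gbs(gigatransfers, payload_bytes):
    """QPI: 16 data bits (2 bytes) per transfer, per direction."""
    return gigatransfers * payload_bytes

print(fsb_bandwidth_gbs(266, 4, 8))  # 1066 MT/s FSB -> ~8.5 GB/s shared
print(qpi_bandwidth_gbs(6.4, 2))     # 6.4 GT/s QPI  -> 12.8 GB/s per direction
```

Note the FSB figure is shared by all traffic (memory, I/O, snoops), while QPI is per direction per link - one reason the point-to-point design scales better in multiprocessor systems.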

getquake.gif | InfoWorld/PC Magazine Indices

Reply 26 of 81, by alexanrs

User metadata
Rank l33t
Scali wrote:

I have always wondered if P4 was meant to be a dead end. At the time, Intel was working on the Itanium, and the idea was to migrate from x86 to Itanium completely. 32-bit x86 would be the end of the line.
It took Intel a very long time to come up with a successor to the P4 (the P4 is the longest-running x86 architecture in the history of x86, which is especially strange since it wasn't exactly the most competitive one). It could be that Intel had to reverse a decision to stop x86 at some point, before they finally started working on the Core 2 Duo.

It just hit me - Netburst is not the longest running architecture in the history of x86, P6 is. It started with the Pentium Pro in 1995, went on as the Pentium 2, the Pentium 3, the Pentium M and Core Solo/Duo (Yonah), and was retired at the same time Netburst was: 2008 when the P6-based Intel Core architecture (Core2 Duo) replaced them. It is weird to realize Intel held onto P6 for thirteen years!

This is, of course, not counting stuff like ancient processors that kept being produced for embedded applications long after their prime, like 80386s.

Reply 27 of 81, by Scali

User metadata
Rank l33t
alexanrs wrote:

It just hit me - Netburst is not the longest running architecture in the history of x86, P6 is. It started with the Pentium Pro in 1995, went on as the Pentium 2, the Pentium 3, the Pentium M and Core Solo/Duo (Yonah), and was retired at the same time Netburst was: 2008 when the P6-based Intel Core architecture (Core2 Duo) replaced them. It is weird to realize Intel held onto P6 for thirteen years!

I think that is up for interpretation.
For me, P6 ended with the PIII. Pentium M and especially Core Solo/Duo made a lot of changes to the architecture. It's similar, but not the same. For one, they made the pipeline longer to scale to higher clockspeeds. They also modified the decoder to implement op-fusion. I see these as quite fundamental changes, and we should no longer treat them as the same micro-architecture.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 28 of 81, by alexanrs

User metadata
Rank l33t
Scali wrote:

I think that is up for interpretation.
For me, P6 ended with the PIII. Pentium M and especially Core Solo/Duo made a lot of changes to the architecture. It's similar, but not the same. For one, they made the pipeline longer to scale to higher clockspeeds. They also modified the decoder to implement op-fusion. I see these as quite fundamental changes, and we should no longer treat them as the same micro-architecture.

Fair enough.
Still, there is one Tualatin Pentium III-M ULV that was released in 2003 (and the P-III M was produced until 2003 - 2004 is Pentium M territory), so the original P6 also spanned 8 years. Therefore both P6 and Netburst ran for about the same time. Still impressive considering Intel's tick-tock roadmap, with architectures being replaced after 2-3 years.

Reply 29 of 81, by GL1zdA

User metadata
Rank Oldbie
Scali wrote:

I think that is up for interpretation.
For me, P6 ended with the PIII. Pentium M and especially Core Solo/Duo made a lot of changes to the architecture. It's similar, but not the same. For one, they made the pipeline longer to scale to higher clockspeeds. They also modified the decoder to implement op-fusion. I see these as quite fundamental changes, and we should no longer treat them as the same micro-architecture.

That's not entirely true. The P6 received an update with the Coppermine - the pipeline was shortened, but then it was longer again in Banias. Same with the P4. Willamette's 20 stage pipeline grew to 31 stages in Prescott. So if you count it seems the original P6 (from Pentium Pro till Katmai) was longer than the first P4 (Willamette-Northwood).

getquake.gif | InfoWorld/PC Magazine Indices

Reply 30 of 81, by Scali

User metadata
Rank l33t
GL1zdA wrote:

That's not entirely true. The P6 received an update with the Coppermine - the pipeline was shortened

First time I hear of this. Can't find any reference to that online. Coppermine made the pipeline more efficient, by removing some stall conditions, but that's not the same as shortening the pipeline. It's still 10 stages.

GL1zdA wrote:

So if you count it seems the original P6 (from Pentium Pro till Katmai) was longer than the first P4 (Willamette-Northwood).

Erm, how do you figure?
The P6 pipeline is 10 stages, nowhere near a P4. Even the Willamette is twice as long.
Banias/Dothan were never specified by Intel, but Intel did say they were 'longer than P3, shorter than P4'. It is commonly accepted that they are 12-14 stages.

GL1zdA wrote:

Willamette's 20 stage pipeline grew to 31 stages in Prescott.

This is too simplified. The active part of the pipeline during execution is mostly the same between Willamette and Prescott. I would direct you to Intel's manuals for details on that, eg: http://download.intel.com/support/processors/ … sb/25366521.pdf
Or Agner Fog's documentation on it: http://www.agner.org/optimize/microarchitecture.pdf

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 31 of 81, by vmunix

User metadata
Rank Member
Anonymous Coward wrote:

I would not give Intel the benefit of the doubt, especially considering they have a history of scamming (see 486 Overdrive Socket, i487).
AMD is a company run by scammy lawyers. I have no love for them either.

Intel 486DX with a disabled FPU = 486SX
Intel RapidCAD with the 486 instructions removed.
Processors marked below their achievable rating just to fill market niches (how else do you explain a 300MHz part being clocked at 450MHz and beyond perfectly fine?)
CPUs with half of their perfectly functioning cores or caches disabled (AMD too), and the list goes on...

Trailing edge computing.

Reply 32 of 81, by Scali

User metadata
Rank l33t
Anonymous Coward wrote:

I would not give Intel the benefit of the doubt, especially considering they have a history of scamming (see 486 Overdrive Socket, i487).
AMD is a company run by scammy lawyers. I have no love for them either.

Yea, speaking of which, apparently someone is starting a class action against AMD for selling their CPUs as 8-core, while not delivering the expected performance of an 8-core system: http://legalnewsline.com/stories/510646458-am … tion-of-new-cpu

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 33 of 81, by GL1zdA

User metadata
Rank Oldbie
Scali wrote:
GL1zdA wrote:

That's not entirely true. The P6 received an update with the Coppermine - the pipeline was shortened

First time I hear of this. Can't find any reference to that online. Coppermine made the pipeline more efficient, by removing some stall conditions, but that's not the same as shortening the pipeline. It's still 10 stages.

Here's an unsourced Wikipedia article: https://en.wikipedia.org/wiki/P6_%28microarchitecture%29 . I googled it a while ago by trying various codenames and expressions like "10-stage", "12-stage" and "14-stage". The Pentium Pro's pipeline is 12 stages long: http://download.intel.com/design/processor/ma … uals/253665.pdf .

Scali wrote:
GL1zdA wrote:

So if you count it seems the original P6 (from Pentium Pro till Katmai) was longer than the first P4 (Willamette-Northwood).

Erm, how do you figure?
The P6 pipeline is 10 stages, nowhere near a P4. Even the Willamette is twice as long.
Banias/Dothan were never specified by Intel, but Intel did say they were 'longer than P3, shorter than P4'. It is commonly accepted that they are 12-14 stages.

I jumped to another topic too quickly. I meant that the original P6 architecture (Pentium Pro-Katmai) lasted longer than the first P4 architecture (Willamette-Northwood).

Scali wrote:
GL1zdA wrote:

Willamette's 20 stage pipeline grew to 31 stages in Prescott.

This is too simplified. The active part of the pipeline during execution is mostly the same between Willamette and Prescott. I would direct you to Intel's manuals for details on that, eg: http://download.intel.com/support/processors/ … sb/25366521.pdf
Or Agner Fog's documentation on it: http://www.agner.org/optimize/microarchitecture.pdf

Good article, will have to read through it.

getquake.gif | InfoWorld/PC Magazine Indices

Reply 34 of 81, by kanecvr

User metadata
Rank Oldbie
GL1zdA wrote:

The motherboard? There's no such thing as "motherboard speed". During the P6 generation (from the Pentium Pro until the Core 2) the FSB was the bus between the CPU and the Northbridge (as opposed to the Back Side Bus between the CPU and the L2 cache). Since Nehalem, the FSB doesn't exist anymore - there's either a high-bandwidth ring bus or a crossbar inside the CPU that connects the CPU cores, memory agents etc., and QPI for communication between CPUs in a multiprocessor system. You can read about it here: http://www.realworldtech.com/sandy-bridge/8/ .

I was thinking about the clock the CPU needs to multiply to get its operating frequency - the reference clock - which in the case of modern Core CPUs is 100MHz. Calling it the FSB might be technically wrong. But consider that CPUs like the Athlon64 also use a dedicated bus to communicate with the northbridge (HyperTransport), and still use an FSB clock to generate their operating frequency - and it's still called the FSB. Besides, just like with the Athlon64, changing this clock will also affect the QPI speed, causing system instability. Intel might have changed the name from Front Side Bus to the technically correct Reference Clock, but essentially it does the same thing - it gives the CPU, QPI, HyperTransport, uncore clock, memory controller and AGP/PCI-E controller a reference frequency they can multiply or divide.

To me, FSB = Reference clock. Then again, I'm a doctor, not an engineer, so I don't pay as much attention to these things as I probably should 😜
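The multiply/divide relationship can be sketched like this. The multiplier ratios below are illustrative assumptions, not values from any specific CPU datasheet:

```python
# One reference clock (BCLK) fans out to several clock domains via
# per-domain multipliers/dividers. Ratios are illustrative only.

BCLK_MHZ = 100  # typical base clock on modern Intel Core platforms

domains = {
    "CPU core (x34)":  BCLK_MHZ * 34,     # 3400 MHz core clock
    "Memory (x13.33)": BCLK_MHZ * 13.33,  # ~1333 MT/s DDR3 data rate
    "PCIe (x1)":       BCLK_MHZ * 1,      # 100 MHz
}

for name, mhz in domains.items():
    print(f"{name}: {mhz:g} MHz")
```

This is also why raising the reference clock destabilizes everything at once: every derived domain speeds up together unless its ratio is lowered to compensate.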

Last edited by kanecvr on 2015-11-09, 10:29. Edited 1 time in total.

Reply 35 of 81, by Scali

User metadata
Rank l33t
GL1zdA wrote:

The Pentium Pro's pipeline is 12-stage long http://download.intel.com/design/processor/ma … uals/253665.pdf .

It doesn't say that.
It says: "The P6 processor family uses a decoupled, 12-stage superpipeline..."
So that is Pentium Pro, II and III (Intel does not consider Pentium M and later to be P6-family, they are not listed in paragraph 2.1.6, but have their own section in 2.1.9).

I suppose the confusion over 10 or 12 pipeline stages depends on what you're measuring. I believe 10 stages is the shortest path, where FPU/SSE/MMX need two extra stages, bringing the worst case up to 12.
Likewise, the 31-stage number for Prescott is the worst case, where 28 stages is the best case, IIRC. However, this is complicated further by the fact that the P4 splits its pipeline into a decoding part (before the trace cache) and an execution part. As a software developer, you don't 'see' what happens before the trace cache, so extra stages added to the decoder part don't affect the performance of your actual code as long as it runs from the trace cache.
Which is why Intel doesn't distinguish between different versions of Netburst in the optimization manual. They all follow the same optimization rules and instruction timings.
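One way to see why the stage count matters in practice: a branch mispredict costs roughly a pipeline refill, so deeper pipelines lose more throughput on branchy code. A toy model in Python (the base IPC, mispredict rate, and penalty figures are all assumptions for illustration, not measurements):

```python
# Back-of-the-envelope: throughput impact of mispredict penalty,
# which grows roughly with pipeline depth.

def effective_ipc(base_ipc, mispredicts_per_kinst, penalty_cycles):
    """IPC after charging a refill penalty for each mispredicted branch."""
    cycles_per_kinst = 1000 / base_ipc + mispredicts_per_kinst * penalty_cycles
    return 1000 / cycles_per_kinst

# Same code, same predictor accuracy, different pipeline depths:
print(round(effective_ipc(1.0, 10, 12), 2))  # ~P6-ish 12-stage refill
print(round(effective_ipc(1.0, 10, 31), 2))  # Prescott-ish 31-stage refill
```

The deeper pipeline only pays off if the clock speed gained outweighs this per-branch tax, which is exactly the bet Netburst made.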

GL1zdA wrote:

I jumped to another topic to quickly. I meant that the original P6 architecture (Pentium Pro-Katmai) lasted longer then the first P4 architecture (Willamette-Northwood).

That's comparing apples and oranges. P6 should be compared to all of Netburst, not just to Willamette-Northwood (that's why the optimization manual is laid out like that: it's how Intel classifies their CPUs in terms of microarchitectures, and how you need to optimize code for each one).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 36 of 81, by xjas

User metadata
Rank l33t
kanecvr wrote:
GL1zdA wrote:

The motherboard? There's no such thing as "motherboard speed". During the P6 generation (from the Pentium Pro until the Core 2) the FSB was the bus between the CPU and the Northbridge (as opposed to the Back Side Bus between the CPU and the L2 cache). Since Nehalem, the FSB doesn't exist anymore - there's either a high-bandwidth ring bus or a crossbar inside the CPU that connects the CPU cores, memory agents etc., and QPI for communication between CPUs in a multiprocessor system. You can read about it here: http://www.realworldtech.com/sandy-bridge/8/ .

I was thinking about the clock the CPU needs to multiply to get its operating frequency - the reference clock - which in the case of modern Core CPUs is 100MHz. Calling it the FSB might be technically wrong. But consider that CPUs like the Athlon64 also use a dedicated bus to communicate with the northbridge (HyperTransport), and still use an FSB clock to generate their operating frequency - and it's still called the FSB. Besides, just like with the Athlon64, changing this clock will also affect the QPI speed, causing system instability. Intel might have changed the name from Front Side Bus to the technically correct Reference Clock, but essentially it does the same thing - it gives the CPU, QPI, HyperTransport, uncore clock, memory controller and AGP/PCI-E controller a reference frequency they can multiply or divide.

To me, FSB = Reference clock. Then again, I'm a doctor, not an engineer, so I don't pay as much attention to these things as I probably should 😜

Ah, so the clock is still there but it doesn't drive most of the functions that allowed us to originally call it the "front side bus" - that makes sense to me now. I don't really follow the bleeding-edge tech development any more so I miss some things.

twitch.tv/oldskooljay - playing the obscure, forgotten & weird - most Tuesdays & Thursdays @ 6:30 PM PDT. Bonus streams elsewhen!

Reply 37 of 81, by GL1zdA

User metadata
Rank Oldbie
Rank
Oldbie
kanecvr wrote:
GL1zdA wrote:

The motherboard? There's no such thing as "motherboard speed". During the P6 generation (from the Pentium Pro until the Core 2) the FSB was the bus between the CPU and the Northbridge (as opposed to the Back Side Bus between the CPU and the L2 cache). Since Nehalem, the FSB doesn't exist anymore - there's either a high-bandwidth ring bus or a crossbar inside the CPU that connects the CPU cores, memory agents etc., and QPI for communication between CPUs in a multiprocessor system. You can read about it here: http://www.realworldtech.com/sandy-bridge/8/ .

I was thinking about the clock the CPU needs to multiply to get its operating frequency - the reference clock - which in the case of modern Core CPUs is 100MHz. Calling it the FSB might be technically wrong. But consider that CPUs like the Athlon64 also use a dedicated bus to communicate with the northbridge (HyperTransport), and still use an FSB clock to generate their operating frequency - and it's still called the FSB. Besides, just like with the Athlon64, changing this clock will also affect the QPI speed, causing system instability. Intel might have changed the name from Front Side Bus to the technically correct Reference Clock, but essentially it does the same thing - it gives the CPU, QPI, HyperTransport, uncore clock, memory controller and AGP/PCI-E controller a reference frequency they can multiply or divide.

To me, FSB = Reference clock. Then again, I'm a doctor, not an engineer, so I don't pay as much attention to these things as I probably should 😜

I think that's the problem. Technical websites insisted on talking about the FSB when discussing the Athlon64's HyperTransport bus, while technically, the thing that replaced it is inside the CPU, usually given generic names by Intel/AMD like "crossbar" or "ring bus". The FSB is much more than just the reference clock - it's the main data path in P6 systems.

getquake.gif | InfoWorld/PC Magazine Indices

Reply 38 of 81, by idspispopd

User metadata
Rank Oldbie
xjas wrote:

Maybe I'm interpreting this wrong... I just read that the modern Core CPU is a (distant) relative of the Pentium M, which is based on the old P6 (Pentium III/II/Pro/etc.)

So what happened to the Pentium 4 (P7?)? Did it just vanish into history, or evolve into anything else? Obviously they "back"-ported some of its features (e.g. SSE2) into the P6... but I still remember the truly massive amount of hype around it when it was launched, so it's strange to see it obsoleted by its own predecessor.

They not only back-ported SSE2, but also the already-mentioned quad-pumped FSB, as well as multi-core (Core 2 Duo) and hyper-threading (Core i, if you still consider that P6 architecture - IMO it is more related to P6 than to Netburst). (Admittedly the dual-core Pentium D is not two integrated cores like the Core Duo - it's more like the two dual-core dies slapped together in the Core 2 Quad.)

AFAIR Netburst was not only limited by the long pipeline, but also by the narrow decoder, which was partially compensated for by the L1 code cache being a trace cache. As long as code executes from L1 everything is fine, but when running code that resides only in L2, the decoder can be the bottleneck (even though L2 has very high bandwidth).
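That front-end bottleneck can be sketched with a toy model. The rates used here (up to 3 uops/cycle from the trace cache, roughly 1 x86 instruction/cycle through the single decoder) are commonly cited approximations, not official Intel figures:

```python
# Toy model of the Netburst front-end: uop supply as a weighted mix
# of trace-cache delivery and slow-path decoding. Rates are assumed.

def frontend_uops_per_cycle(trace_hit_rate, tc_rate=3.0, decode_rate=1.0):
    """Average uop delivery given the fraction of time code hits the trace cache."""
    return trace_hit_rate * tc_rate + (1 - trace_hit_rate) * decode_rate

print(frontend_uops_per_cycle(1.00))  # hot loop fully in trace cache
print(frontend_uops_per_cycle(0.50))  # half the time paying the decode path
```

The model makes the point plain: a hot loop inside the trace cache keeps the back end fed, but code with a large footprint degrades toward the single decoder's rate regardless of how fast L2 can stream bytes.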

I suppose we don't have to talk about the Atom being the only architecture after P3 which is not Netburst or derived from P6.

Reply 39 of 81, by kanecvr

User metadata
Rank Oldbie

This thread gave me the urge to whip out my Pentium D 945 and Pentium 641 chips 😀 - currently playing with them in a Zotac 780i SLI motherboard. The 945 runs Quake 4 at 1280x1024 @ ultra extremely well with no lag whatsoever, and it won't go over 45°C under my Tuniq Tower cooler. I am also able to OC the 945 to 220x17 (3740MHz) with a tiny voltage boost (1.26v) and it's perfectly stable. Gonna try for 4GHz (233x17) 😁. At 3.74GHz, FPU Julia performance is a little under an Athlon X2 4000+, and all 2005-2006 games I tried run great.

I forgot to mention I run two Geforce 7950 512MB cards in SLI on said board. 3DMark scores aren't as high as they would be on the 8400 I previously had in there, but games run well.

I'm actually going to whip out my E6600 as well and do some comparative benchmarking. I believe both the Pentium D 9xx series and E6xxx series came out the same year so this should be interesting.
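For reference, the overclock arithmetic quoted earlier in the post is just core clock = bus clock x multiplier:

```python
# Core clock = bus clock x multiplier, as in the 220x17 overclock above.
bus_mhz, multiplier = 220, 17
core_mhz = bus_mhz * multiplier
print(core_mhz)          # 3740 MHz - the stable overclock quoted
print(233 * multiplier)  # 3961 MHz - the roughly-4GHz next target
```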

Last edited by kanecvr on 2015-11-09, 12:34. Edited 2 times in total.