Was the P4 architecture a dead end?

Reply 40 of 81, by Scali

Posted on 2015-11-09, 10:58

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

idspispopd wrote:
They not only back-ported SSE2 but also the already mentioned quad-pumped FSB and also the multi-core feature (Core 2 Duo) and hyperthreading (Core i, if you still consider that P6 architecture - IMO it is more related to P6 than to Netburst). (Admittedly the dual-core Pentium D is not two integrated cores like the Core Duo, more like the two dual-cores slapped together in the Core 2 Quad.)

AFAIR Netburst was not only limited by the long pipeline, but also by the narrow decoder which was partially compensated by the L1 code cache being a trace cache. As long as code is executed from L1 everything is fine, but when running code residing only in L2 the decoder can be the bottleneck (L2 having very high bandwidth).

They did back-port trace cache to the Core-line as well, starting with Nehalem.
The difference is that it is now used only for loops. So they don't rely on trace cache for all execution (which was somewhat of a bottleneck, then again the Pentium 4 had only one decoder, where Core has 3 or 4).
They don't call it trace cache anymore, but it is the same thing, more or less. They have a 'loop stream detector', and when a loop is detected, instructions are cached in uOp form, rather than as x86 opcode bytes. The cache is just a lot smaller, because it only has to store one loop at a time (originally 28 uOps per thread, Skylake does 64 uOps per thread). So it is just a simple uOp instruction queue.
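
A toy model of the capacity point above, not a description of the real hardware: a loop can only be replayed from the uOp queue if its whole body fits. The capacities are the per-thread figures quoted in the post; the function names and sample loop sizes are made up for illustration, and "decoded once" is a simplification (a real detector needs some iterations before it locks onto a loop).

```python
# Toy model of a loop buffer: a loop can be streamed from the uOp queue
# only if its entire body fits. Capacities are the per-thread figures
# quoted in the post; everything else is a hypothetical illustration.
LOOP_BUFFER_CAPACITY = {"original": 28, "skylake": 64}


def fits_in_loop_buffer(loop_uops: int, generation: str) -> bool:
    """Return True if a loop body of `loop_uops` uOps fits in the queue."""
    return 0 < loop_uops <= LOOP_BUFFER_CAPACITY[generation]


def uops_decoded(loop_uops: int, iterations: int, generation: str) -> int:
    """Count uOps that have to pass through the x86 decoders.

    Simplification: a loop that fits is decoded once and then replayed
    from the queue; a loop that does not fit is decoded every iteration.
    """
    if fits_in_loop_buffer(loop_uops, generation):
        return loop_uops
    return loop_uops * iterations


if __name__ == "__main__":
    for size in (20, 40, 80):
        for generation in LOOP_BUFFER_CAPACITY:
            print(size, generation, uops_decoded(size, 1000, generation))
```

In this model a 40-uOp loop misses the 28-entry queue but fits the 64-entry one, which is the kind of case a larger queue helps with.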

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 41 of 81, by Gamecollector

Posted on 2015-11-09, 11:32

Rank: Oldbie
Posts: 1413
Joined: 2010-10-06, 22:17
Location: RU

kanecvr wrote:
This thread gave me the urge to whip out my Pentium D 945 and Pentium 641 chips

SL9QB or SL9QQ?

Asus P4P800 SE/Pentium4 3.2E/2 Gb DDR400B,
Radeon HD3850 Agp (Sapphire), Catalyst 14.4 (XpProSp3).
Voodoo2 12 MB SLI, Win2k drivers 1.02.00 (XpProSp3).

Reply 42 of 81, by kanecvr

Posted on 2015-11-09, 12:34

Rank: Oldbie
Posts: 1957
Joined: 2015-04-22, 20:30
Location: Bucharest, Romania

Gamecollector wrote:
kanecvr wrote:
This thread gave me the urge to whip out my Pentium D 945 and Pentium 641 chips

SL9QB or SL9QQ?

C1 stepping, SL9QB

Unfortunately my E6600 is not a Conroe, but a Wolfdale. I was sure I had a Conroe. No matter, the Wolfdale is clocked at 3GHz and has 2MB cache, while the Conroe has 4MB and is clocked slower at 2.4GHz. I guess I can emulate a C2D E6600 by disabling two of the cores on my Q6600.

In the meantime, here are some results (the Pentium D is running at 3.74GHz (220x17), while the E6600 is running at stock):

CPU Queen:
2x E6600 = 12869
2x Pentium D 945 = 5512

CPU Photoworxx:
2x E6600 = 3101 mpixel/sec
2x Pentium D 945 = 2603 mpixel/sec

FPU VP8:
2x E6600 = 1816
2x Pentium D 945 = 1276

FPU Julia:
2x E6600 = 3704
2x Pentium D 945 = 2314
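
To put those numbers side by side, here is a small sketch that computes the E6600's lead per benchmark, plus a crude per-clock figure. The scores are copied from the list above; the clock speeds are assumptions taken from the thread (Pentium D overclocked to 3.74GHz, E6600 at its stock 3.06GHz as given in the follow-up post), and the per-clock number assumes scores scale linearly with clock speed, which is only roughly true.

```python
# Benchmark scores as posted: (E6600, Pentium D 945).
SCORES = {
    "CPU Queen": (12869, 5512),
    "CPU Photoworxx": (3101, 2603),
    "FPU VP8": (1816, 1276),
    "FPU Julia": (3704, 2314),
}

# Clock speeds in GHz. Assumed from the thread, not measured here.
E6600_GHZ = 3.06
PENTIUM_D_GHZ = 3.74


def speedup(e6600: float, pentium_d: float) -> float:
    """Raw ratio of the E6600 score to the Pentium D score."""
    return e6600 / pentium_d


def per_clock_speedup(e6600: float, pentium_d: float) -> float:
    """Same ratio after dividing each score by its clock speed.

    Assumes linear scaling with clock, so treat it as a rough estimate.
    """
    return (e6600 / E6600_GHZ) / (pentium_d / PENTIUM_D_GHZ)


if __name__ == "__main__":
    for name, (e6600, pentium_d) in SCORES.items():
        raw = speedup(e6600, pentium_d)
        clock_for_clock = per_clock_speedup(e6600, pentium_d)
        print(f"{name}: {raw:.2f}x raw, {clock_for_clock:.2f}x per clock")
```

Because the Pentium D is running at the higher clock, the per-clock gap comes out larger than the raw gap in every test.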

Reply 43 of 81, by kanecvr

Posted on 2015-11-09, 12:53

Rank: Oldbie
Posts: 1957
Joined: 2015-04-22, 20:30
Location: Bucharest, Romania

Here are some Doom 3 benches as well:

The Pentium D 945 at 3.74 GHz:

1024x768 - high settings:

1280x1024 - ultra settings:

The Pentium Dual-Core (Wolfdale) E6600 at 3.06 GHz:

1024x768 - high settings:

1280x1024 - ultra settings:

As you can see, the Pentium D 945 is perfectly suitable for this game, but the video cards are kind of bottlenecking the E6600 😁

Reply 44 of 81, by Standard Def Steve

Posted on 2015-11-10, 20:35

Rank: Oldbie
Posts: 1432
Joined: 2012-09-15, 08:04

I always thought it went a little like this:
-PPro-PIII, P-M, and Core Duo were all P6 based
-Core 2 was its own thing, very different from Core Duo and earlier CPUs (and the very impressive SIMD performance and 64-bit capabilities proved it).
-First gen i7 (Nehalem) was very similar to Core 2, but added an integrated memory controller and L3 cache.
-The Sandy Bridge and later CPUs were a brand new design.

That being said, one thing I've noticed is that the PII, PIII, P-M, Core Duo, C2D/C2Q, and even my i7-4930K are all part of Intel Family 6 according to CPU-Z. P4 and P-D are the only CPUs not part of the 6 clan; they're listed as Family F!

94 MHz NEC VR4300 | SGI Reality CoPro | 8MB RDRAM | Each game gets its own SSD - nooice!

Reply 45 of 81, by kanecvr

Posted on 2015-11-10, 21:17

Rank: Oldbie
Posts: 1957
Joined: 2015-04-22, 20:30
Location: Bucharest, Romania

Yes, everything including Core i7 (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) is based on the P6 architecture to some degree.

Reply 46 of 81, by Scali

Posted on 2015-11-10, 21:26

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Standard Def Steve wrote:
That being said, one thing I've noticed is that the PII, PIII, P-M, Core Duo, C2D/C2Q, and even my i7-4930K are all part of Intel Family 6 according to CPU-Z. P4 and P-D are the only CPUs not part of the 6 clan; they're listed as Family F!

This is deliberate.
The reason for this is that the Intel compiler (and possibly others) can compile multiple versions of your code, optimized for different CPUs. On startup, it uses the CPUID instruction to determine which CPU it is running on, and it will then pick the optimized version for that CPU, if available.
The Pentium 4 is vastly different from a P6 architecture, so it needs very different code. Reporting a different CPU family takes care of that.
However, the Core 2 and newer CPUs are similar to the P6, and run that code very well. So Intel makes them report family 6, so that they appear as Pentium Pro-derivatives, allowing legacy code to pick the most optimal codepath for these CPUs.

TL;DR: The CPUID instruction does not necessarily report a new family, even if the CPU uses a different architecture.
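
For anyone who wants to see where the "Family 6" and "Family F" numbers come from, here is a minimal sketch that decodes the processor signature (the EAX value returned by CPUID leaf 1) following Intel's documented rules. It only decodes an integer you pass in; it does not execute CPUID itself. The sample signatures are values I believe correspond to a Northwood Pentium 4, a Conroe Core 2 and an Ivy Bridge-E i7, but treat them as illustrative rather than authoritative.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Split the EAX value from CPUID leaf 1 into family/model/stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    if base_family == 0xF:
        family = base_family + ext_family
    else:
        family = base_family

    # Intel's rule: the extended model applies to base families 6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    # Sample signatures (assumed, for illustration only).
    samples = {
        "Pentium 4 (Northwood)": 0x00000F29,
        "Core 2 (Conroe)": 0x000006F6,
        "Core i7 (Ivy Bridge-E)": 0x000306E4,
    }
    for name, eax in samples.items():
        fields = decode_cpuid_signature(eax)
        print(f"{name}: family {fields['family']:X}h, "
              f"model {fields['model']:X}h, stepping {fields['stepping']}")
```

Note how the newer chips keep family 6 and grow the model number through the extended-model bits instead.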

Reply 47 of 81, by Standard Def Steve

Posted on 2015-11-11, 17:41

Rank: Oldbie
Posts: 1432
Joined: 2012-09-15, 08:04

That actually makes perfect sense. Learned something today.

Reply 48 of 81, by BSA Starfire

Posted on 2015-11-11, 17:45

Rank: Oldbie
Posts: 923
Joined: 2014-03-22, 05:20
Location: UK

Makes you wonder how much code is still optimised for the P4; it often seems to do better than the P-M in modern software when it really has no right to.

286 20MHz,1MB RAM,Trident 8900B 1MB, Conner CFA-170A.SB 1350B
386SX 33MHz,ULSI 387,4MB Ram,OAK OTI077 1MB. Seagate ST1144A, MS WSS audio
Amstrad PC 9486i, DX/2 66, 16 MB RAM, Cirrus SVGA,Win 95,SB 16
Cyrix MII 333,128MB,SiS 6326 H0 rev,ESS 1869,Win ME

Reply 49 of 81, by Scali

Posted on 2015-11-11, 17:47

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Standard Def Steve wrote:
That actually makes perfect sense. Learned something today.

I just checked my Celeron N2830, which is a Bay Trail-M architecture (Silvermont, so a cousin of Atom).
It also reports family 6, even though it is not related to the Pentium Pro.

Reply 50 of 81, by BSA Starfire

Posted on 2015-11-11, 17:54

Rank: Oldbie
Posts: 923
Joined: 2014-03-22, 05:20
Location: UK

VIA C3 also reports as Family 6, and it is a long, long way from the Pentium Pro, being based on the WinChip C6.

Reply 51 of 81, by Standard Def Steve

Posted on 2015-11-11, 18:06

Rank: Oldbie
Posts: 1432
Joined: 2012-09-15, 08:04

OK, I just checked my S939 Opteron 185 and it's part of Family F, same as the P4! What the heck? 😦

BSA Starfire wrote:
Makes you wonder how much code is still optimised for the P4; it often seems to do better than the P-M in modern software when it really has no right to.

I'm pretty sure it's because of hyper-threading. The software of today takes advantage of HT like nothing from the XP era ever could.

Windows 7 itself runs noticeably faster on two or more threads. There's a pretty major difference in Win7 and web browser performance between my single core 2.8GHz Athlon 64 and 3GHz dual-core Opteron, both on 939.

Reply 52 of 81, by BSA Starfire

Posted on 2015-11-11, 18:20

Rank: Oldbie
Posts: 923
Joined: 2014-03-22, 05:20
Location: UK

I think there must be more to it than hyper-threading/multi-core though, as many of the Atoms are dual-core CPUs?

Reply 53 of 81, by stamasd

Posted on 2015-11-13, 17:35

Rank: l33t
Posts: 2030
Joined: 2014-08-31, 19:59
Location: Connecticut

Anonymous Coward wrote:
I would love to see some high-level internal memos from Intel leak out one day...even going as far back as iAPX and the 286 era.

iAPX432? There is no such thing, and there never was according to Intel. Seriously, they are known to deny that it ever existed, and you won't find any reference to it anywhere on any Intel-owned website.

I/O, I/O,
It's off to disk I go,
With a bit and a byte
And a read and a write,
I/O, I/O

Reply 54 of 81, by carlostex

Posted on 2015-11-13, 22:10

Rank: l33t
Posts: 2411
Joined: 2010-04-03, 21:39
Location: Portugal

dr_st wrote:
Say all the bad things you want about P4s, but they had their place, and they had their moment in time when they were the top-of-the-line. And, if you disregard things like clock frequency and power usage, in terms of raw performance of the last of the single-core CPUs, a high-end P4 still wins against a high-end PM, and often even a high-end K8 (Athlon 64).

The only applications where Pentium 4s would win are the ones that used unfair compiler dispatching. In reality there's hardly a situation where the P4 micro-architecture is better than K8 or PM.

Reply 55 of 81, by swaaye

Posted on 2015-11-13, 22:26

Rank: l33t++
Posts: 8161
Joined: 2002-07-22, 21:24
Location: WI, USA

carlostex wrote:
The only applications where Pentium 4s would win are the ones that used unfair compiler dispatching. In reality there's hardly a situation where the P4 micro-architecture is better than K8 or PM.

Pentium M wasn't really in the same class because of clock speed limits and wimpy FPU. Athlon 64 was ahead by a bit, but Pentium 4 was more of an Athlon XP challenger and it had no problem there. Pentium 4 / Pentium D / P4EE were still fairly competitive with A64 anyway.

What Core 2 did to Athlon 64 was worse than what Athlon 64 did to Pentium 4. 😁

Reply 56 of 81, by Scali

Posted on 2015-11-13, 22:36

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

carlostex wrote:
The only applications where Pentium 4s would win are the ones that used unfair compiler dispatching. In reality there's hardly a situation where the P4 micro-architecture is better than K8 or PM.

It doesn't have to be unfair, and not everyone uses compilers.
The SSE2 units on a Pentium 4 were incredibly fast for their time. Some tasks, such as video encoding/decoding, were hand-optimized in assembly for Pentium 4 with SSE2, and even when the Core 2 Duo came out, it still had its hands full in those particular applications.
See here, for example: http://www.anandtech.com/show/2045/12
Pentium D and XE are quite competitive there.

3dsmax7 rendering is another case making heavy use of well-optimized SSE2, where they aren't too bad:
http://www.anandtech.com/show/2045/11

They aren't doing that well in most other benchmarks.

Reply 57 of 81, by xjas

Posted on 2015-11-14, 07:15

Rank: l33t
Posts: 2564
Joined: 2015-09-07, 02:29

Interesting discussion! Honestly I wasn't a fan of the P4 when it was current, but I got a bunch of them (boards + CPUs) for free a while ago and have been using them in things. I built one up as a games machine and I *do* think it punches above its weight a little, especially with HT on stuff that "requires" a dual-core (the way-overpowered video card helps too.)

The most recent thing I've gotten to run on it is the 2D version of the Vanguard V demo, which is only a year old. Runs at buttery smooth 60FPS the whole way through. My 2012 Macbook Pro (Intel HD4000 gfx) can't even do that.

twitch.tv/oldskooljay - playing the obscure, forgotten & weird - most Tuesdays & Thursdays @ 6:30 PM PDT. Bonus streams elsewhen!

Reply 58 of 81, by Sedrosken

Posted on 2015-11-14, 07:54

Rank: Member
Posts: 243
Joined: 2013-05-13, 13:53
Location: The Sticks

I still think the Northwood was the best generation of P4: the Prescotts ran too hot and the Willamettes were underwhelming performance-wise. If you could live without the SSE3, x86-64 support, and larger cache, then the Northwood was almost always better performing because it didn't run quite as close to its thermal limits.

That said, the Pentium M still surprises me with how much fight it has. It still powers my daily-driver laptop, the 2GHz one I believe, and it handles Windows 7 like a champ. I'm sure if I bothered to replace the hard drive with something faster it would impress me even more, but it is in no way unsatisfactory except for when I try to queue up a YouTube video in the HTML5 codec. It doesn't like that very much at all, and I don't blame it one bit for that: it even likes to bind up my Phenom II. Give me the flash codec any day. Wow, never thought I'd say that.

Nanto: H61H2-AM3, 4GB, GTS250 1GB, SB0730, 512GB SSD, XP USP4
Rithwic: EP-61BXM-A, Celeron 300A@450, 768MB, GF2MX400/V2, YMF744, 128GB SD2IDE, 98SE (Kex)
Cragstone: Alaris Cougar, 486BL2-66, 16MB, GD5428 VLB, CT2800, 16GB SD2IDE, 95CNOIE

Reply 59 of 81, by carlostex

Posted on 2015-11-14, 19:32

Rank: l33t
Posts: 2411
Joined: 2010-04-03, 21:39
Location: Portugal

Scali wrote:
It doesn't have to be unfair, and not everyone uses compilers.
The SSE2 units on a Pentium 4 were incredibly fast for their time. Some tasks, such as video encoding/decoding, were hand-optimized in assembly for Pentium 4 with SSE2, and even when the Core 2 Duo came out, it still had its hands full in those particular applications.
See here, for example: http://www.anandtech.com/show/2045/12
Pentium D and XE are quite competitive there.

3dsmax7 rendering is another case making heavy use of well-optimized SSE2, where they aren't too bad:
http://www.anandtech.com/show/2045/11

They aren't doing that well in most other benchmarks.

I didn't say P4s weren't fast with SSE2. The question is:

Do people around here realize that the ICC compiler (which some mainstream and benchmark applications were built with) has an incredibly unfair dispatcher that relies on the CPUID vendor string? So imagine you write a bench application in C++ and you use certain specific libraries. You compile the code to optimize for SSE2.

AFAIK both Athlon 64 and Pentium 4 support SSE2, right? So here's what happens:

The compiler creates two independent code paths, and the choice between them depends on the CPUID vendor string. If the check for GenuineIntel returns true, the optimized path with all the SSE2 bells and whistles runs. If AuthenticAMD or any other vendor is detected (in other words: non-Intel), the optimal code path is not used. In some cases 386 opcodes are used when better SIMD is available.

http://www.agner.org/optimize/blog/read.php?i=49#121
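
A minimal sketch of the difference between the two approaches, to make the complaint concrete. This is a hypothetical model, not ICC's actual dispatcher: the `Cpu` class, the function names and the path labels are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    """Hypothetical description of a CPU as a dispatcher would see it."""
    vendor: str      # CPUID vendor string, e.g. "GenuineIntel"
    has_sse2: bool   # SSE2 feature flag reported by CPUID


def pick_path_by_vendor(cpu: Cpu) -> str:
    """Vendor-gated dispatch: the SSE2 path is only used on Intel CPUs."""
    if cpu.vendor == "GenuineIntel" and cpu.has_sse2:
        return "sse2"
    return "generic"


def pick_path_by_feature(cpu: Cpu) -> str:
    """Feature-based dispatch: any CPU that reports SSE2 gets the SSE2 path."""
    return "sse2" if cpu.has_sse2 else "generic"


if __name__ == "__main__":
    pentium4 = Cpu(vendor="GenuineIntel", has_sse2=True)
    athlon64 = Cpu(vendor="AuthenticAMD", has_sse2=True)
    for cpu in (pentium4, athlon64):
        print(cpu.vendor, pick_path_by_vendor(cpu), pick_path_by_feature(cpu))
```

With the vendor-gated version the Athlon 64 ends up on the generic path even though it reports SSE2; with the feature-based version both CPUs run the same code.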

Anand Lal Shimpi was one of the tech journalists who decided to ignore this whole debacle. I can understand Michael Dell lying in court to favor Intel, but a tech journalist should be unbiased.

Interesting article as well showing VIA Nano's ability to change the CPUID string:

http://arstechnica.com/gadgets/2008/07/atom-nano-review/6/
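
For context on what "the CPUID string" actually is: CPUID leaf 0 returns twelve ASCII bytes spread over the EBX, EDX and ECX registers, in that order. The sketch below reassembles the string from three register values passed in as integers (it does not execute CPUID itself); the function name is made up for the example.

```python
import struct


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Rebuild the 12-character vendor string from CPUID leaf 0 registers.

    Each register holds four ASCII bytes in little-endian order, and the
    registers are concatenated as EBX, EDX, ECX.
    """
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


if __name__ == "__main__":
    print(vendor_string(0x756E6547, 0x49656E69, 0x6C65746E))
    print(vendor_string(0x68747541, 0x69746E65, 0x444D4163))
```

A dispatcher that compares this string against one vendor name will treat a CPU differently as soon as the string changes, regardless of which instruction sets the CPU supports.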

Even more interesting is that compilers other than Intel's can sometimes exhibit the same problem:

http://www.extremetech.com/computing/193480-i … ing-shenanigans

I always take benchmarks with a grain of salt.
