VOGONS


Was the P4 architecture a dead end?


Reply 60 of 81, by Scali

Rank: l33t

What is 'unfair'? A little knowledge can be a dangerous thing.
Which is why I wrote a blog about this a few years ago, which tells the *whole* story: https://scalibq.wordpress.com/2010/01/05/inte … -fanboy-idiocy/

Most important here: hardly any software uses the ICC in the first place.

As for your 'theory' on SSE2: I have to disappoint you. As I said, this is *hand-optimized assembly*, not compiler-generated.
Just because Athlon64 also implements SSE2 doesn't mean it's just as fast. The Pentium 4 really benefits here from its large caches, aggressive prefetching, and the fact that its SSE2 units are clocked so high.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 61 of 81, by swaaye

Rank: l33t++

Yeah I've seen it mentioned several times that Athlon 64's SSE2 is not very impressive compared to P4. x264 refers to Athlon 64 SSE2 as "SSE2Slow".

And I remember this.
https://web.archive.org/web/20040218085414/ht … jsp?id=60000258

For that matter, Pentium M and Core 1 don't perform well with SSE2 either.

Reply 62 of 81, by Scali

Rank: l33t
swaaye wrote:

Yeah I've seen it mentioned several times that Athlon 64's SSE2 is not very impressive compared to P4. x264 refers to Athlon 64 SSE2 as "SSE2Slow".

And I remember this.
https://web.archive.org/web/20040218085414/ht … jsp?id=60000258

Yes, also pay attention to the Athlon XP 3200+ there, with SSE.
Although the XP supports SSE, the performance is very poor, and 3dnow! is actually much faster on these CPUs (about the same as the Athlon 64 3200+).

Or look at the Pentium 4 scalar SSE2 results. The hand-optimized x87 code is actually faster there.

Which is what my article is about: you can't just detect whether a CPU supports 'SSE' or 'SSE2' or whatever, and then run some pre-baked code for that, assuming it will be faster.
This is why the Intel compiler does a lot more than just checking for support of certain instructions. It needs to know *exactly* what CPU it's running on in order to pick the fastest codepath. But they can only do that for their own products (which should be pretty obvious anyway, this being the Intel Compiler).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 63 of 81, by Scali

Rank: l33t
swaaye wrote:

Funny by the way that this review includes ScienceMark as well...
As my blog points out, ScienceMark has been modified by AMD employees. And it shows pretty much the same thing as Sysmark, but the other way around: heavily inflated benchmark scores for AMD.
I link to these results in my blog: http://www.extremetech.com/computing/50878-in … top-to-bottom/7
As you can see, the Athlon64 FX-62 beats all but the X6800 there.
This is way inflated compared to every other benchmark out there, and completely unrepresentative of the CPU's actual performance level.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 65 of 81, by kanecvr

Rank: Oldbie
swaaye wrote:

Car engine discussion moved to milliways.

Very appropriate and somehow funny 😁 I kind of noticed I'd inadvertently steered the thread towards car engine discussions, and I was wondering if anyone would ask us to change the subject.

Reply 66 of 81, by Standard Def Steve

Rank: Oldbie

This is a very interesting thread. I didn't know the P4's SIMD was ahead of K8's. It really doesn't seem to matter much outside of synthetic tests, though. Even in the SSE2-heavy computing landscape of today, my dual-core K8 @ 3GHz quite easily surpasses my 3.2GHz PD-935 in, well, everything. For example, 1080p HTML5/VP9 video is doable on the K8 without dropped frames, while the P-D tops out at 720p. But to be fair, Pentium M does even worse than Pentium D; even 720p is choppy on the Dothan.

Scali wrote:

The SSE2 units on a Pentium 4 were incredibly fast for its time. Some tasks, such as video encoding/decoding, were hand-optimized in assembly for Pentium 4 with SSE2, and even when the Core2 Duo came out, it still had its hands full in those particular applications.
See here, for example: http://www.anandtech.com/show/2045/12
Pentium D and XE are quite competitive there.

3dsmax7 rendering is another case making heavy use of well-optimized SSE2, where they aren't too bad:
http://www.anandtech.com/show/2045/11

They aren't doing that well in most other benchmarks.

But wasn't Core 2's SSE2 performance up to twice as fast as Netburst? It's been a while, so I might be completely wrong about this, but I remember reading something, somewhere (anandtech maybe?) about Core 2 having a 128-bit SIMD unit, whereas Netburst (and K8) were stuck with 64-bit SIMD. I can't remember if software had to be rewritten to fully take advantage of Core 2's wider SSE2 unit.

As for the linked benchmark results, I think the only reason the PXE-965 looks good there is because it's the only 2c/4t CPU of the bunch. The regular 2C/2T 3.6GHz Pentium D is only clocked 133MHz lower than the XE, yet it gets outperformed by the 1.86GHz Core 2 E6300 in all of the encoding and 3D rendering tests.

94 MHz NEC VR4300 | SGI Reality CoPro | 8MB RDRAM | Each game gets its own SSD - nooice!

Reply 67 of 81, by swaaye

Rank: l33t++
Standard Def Steve wrote:

But wasn't Core 2's SSE2 performance up to twice as fast as Netburst? It's been a while, so I might be completely wrong about this, but I remember reading something, somewhere (anandtech maybe?) about Core 2 having a 128-bit SIMD unit, whereas Netburst (and K8) were stuck with 64-bit SIMD. I can't remember if software had to be rewritten to fully take advantage of Core 2's wider SSE2 unit.

http://techreport.com/review/10351/intel-core … e-processors/15

Reply 68 of 81, by Standard Def Steve

Rank: Oldbie
swaaye wrote:
Standard Def Steve wrote:

But wasn't Core 2's SSE2 performance up to twice as fast as Netburst? It's been a while, so I might be completely wrong about this, but I remember reading something, somewhere (anandtech maybe?) about Core 2 having a 128-bit SIMD unit, whereas Netburst (and K8) were stuck with 64-bit SIMD. I can't remember if software had to be rewritten to fully take advantage of Core 2's wider SSE2 unit.

http://techreport.com/review/10351/intel-core … e-processors/15

Oh dear god. There we go. 😁

94 MHz NEC VR4300 | SGI Reality CoPro | 8MB RDRAM | Each game gets its own SSD - nooice!

Reply 69 of 81, by Scali

Rank: l33t
Standard Def Steve wrote:

Even in the SSE2-heavy computing landscape of today my dual-core K8 @ 3GHz quite easily surpasses my 3.2GHz PD-935 in, well, everything.

Well, that's rather obvious, isn't it? K8 didn't run anywhere near 3 GHz back when the Pentium D was around.
Sure, if you have near-parity in clockspeed, K8 will be faster. But back when the Pentium D was around, it had a huge lead in clockspeed over its competitors.

Standard Def Steve wrote:

But wasn't Core 2's SSE2 performance up to twice as fast as Netburst?

No.

Standard Def Steve wrote:

but I remember reading something, somewhere (anandtech maybe?) about Core 2 having a 128-bit SIMD unit, whereas Netburst (and K8) were stuck with 64-bit SIMD.

Yes, but again, there was no clockspeed-parity. The Core2 needed that wider unit to compensate for its lower clockspeed.

Standard Def Steve wrote:

I can't remember if software had to be rewritten to fully take advantage of Core 2's wider SSE2 unit.

No, SSE/SSE2 were always designed to be 128-bit. But early implementations would split the operations into two 64-bit operations internally.

Standard Def Steve wrote:

The regular 2C/2T 3.6GHz Pentium D is only clocked 133MHz lower than the XE, yet it gets outperformed by the 1.86GHz Core 2 E6300 in all of the encoding and 3D rendering tests.

Yes, but we were comparing K8, not Core2.
I merely pointed to benchmarks that also included Core2, showing that even Core2 with its twice-as-wide SSE-implementation was still having some trouble with Pentium 4/D.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 70 of 81, by swaaye

Rank: l33t++

Yeah those high-clocked Preslers have their moments.

On the other hand sometimes 1.86 GHz E6300 matches the $1000 wonders of 2005. I ran one of those at 3.2 GHz on stock voltage for a few years. Forget the Celeron A, it's all about legendary E6300. 🤣

Reply 71 of 81, by brassicGamer

Rank: Oldbie
swaaye wrote:

Yeah those high-clocked Preslers have their moments.

On the other hand sometimes 1.86 GHz E6300 matches the $1000 wonders of 2005. I ran one of those at 3.2 GHz on stock voltage for a few years. Forget the Celeron A, it's all about legendary E6300. 🤣

The E5200 also had an excellent reputation for overclocking - you can get them for a fiver on eBay at the moment. Looks like I won't be selling mine then - I'll be happy with the decision in 20 years.

Check out my blog and YouTube channel for thoughts, articles, system profiles, and tips.

Reply 72 of 81, by Standard Def Steve

Rank: Oldbie
Scali wrote:
Standard Def Steve wrote:

Even in the SSE2-heavy computing landscape of today my dual-core K8 @ 3GHz quite easily surpasses my 3.2GHz PD-935 in, well, everything.

Well, that's rather obvious, isn't it? K8 didn't run anywhere near 3 GHz back when the Pentium D was around.
Sure, if you have near-parity in clockspeed, K8 will be faster. But back when the Pentium D was around, it had a huge lead in clockspeed over its competitors.

It's actually a 2.6GHz Opteron 185 for S939, which was around during the Pentium D days. I just happen to have it overclocked to 3GHz because it handles well @ default voltage. But even at stock clocks, it outperforms the 3.2 P-D. 1080 VP9 is still playable on it.

Scali wrote:
Standard Def Steve wrote:

But wasn't Core 2's SSE2 performance up to twice as fast as Netburst?

No.

It sure seems that pure SSE2 can sometimes be over twice as fast, at least in the synthetic benchmarks swaaye linked to. 😀

Scali wrote:
Standard Def Steve wrote:

The regular 2C/2T 3.6GHz Pentium D is only clocked 133MHz lower than the XE, yet it gets outperformed by the 1.86GHz Core 2 E6300 in all of the encoding and 3D rendering tests.

Yes, but we were comparing K8, not Core2.
I merely pointed to benchmarks that also included Core2, showing that even Core2 with its twice-as-wide SSE-implementation was still having some trouble with Pentium 4/D.

A 1.86GHz Core 2 beating a 3.6GHz P-D in SSE-heavy media encoding and rendering tests is considered to be having trouble? If it had hyper-threading, it would be beating that P-XE as well.

Standard Def Steve wrote:

But to be fair, Pentium M does even worse than Pentium D; even 720p is choppy on the Dothan.

Replying to myself here. Adding a second core to the Pentium M makes it more than competitive. I pulled out my mid-2006 iMac, powered by a 2GHz Core Duo "Yonah" and running Win7. Surprisingly enough, this machine had no problem handling 1080p VP9 in Chrome. CPU usage was at around 90-95%, but other than at the very beginning of the clip, it didn't drop frames. So yeah, at least in video decode performance, it outperformed the PD-935. I dunno; it just seems that pure SSE performance doesn't matter too, too much outside of synthetics and maybe x264. Unless Google's video decoder isn't SSE heavy.

94 MHz NEC VR4300 | SGI Reality CoPro | 8MB RDRAM | Each game gets its own SSD - nooice!

Reply 73 of 81, by dr_st

Rank: l33t
Standard Def Steve wrote:
Standard Def Steve wrote:

But to be fair, Pentium M does even worse than Pentium D; even 720p is choppy on the Dothan.

Replying to myself here. Adding a second core to the Pentium M makes it more than competitive. I pulled out my mid-2006 iMac, powered by a 2GHz Core Duo "Yonah" and running Win7. Surprisingly enough, this machine had no problem handling 1080p VP9 in Chrome.

It's not so surprising. As I said earlier, from my experience, even a P4-HT does a bit better than a P-M in typical office workload, and that's without an actual second core, only with the hyperthreading. Nothing strange that a real dual core P-D improves on top of that, or that a Core Duo, which is essentially a dual-core P-M (two cores, and a better architecture) goes even further.

https://cloakedthargoid.wordpress.com/ - Random content on hardware, software, games and toys

Reply 74 of 81, by Scali

Rank: l33t
Standard Def Steve wrote:

It's actually a 2.6GHz Opteron 185 for S939, which was around during the Pentium D days. I just happen to have it overclocked to 3GHz because it handles well @ default voltage. But even at stock clocks, it outperforms the 3.2 P-D. 1080 VP9 is still playable on it.

It's a faster CPU in general, that's not the point. We were looking at the SSE2-performance only.

Standard Def Steve wrote:

It sure seems that pure SSE2 can sometimes be over twice as fast, at least in the synthetic benchmarks swaaye linked to. 😀

Are we looking at the same benchmarks? Because the ones I'm looking at have two Mandelbrot benchmarks: one is integer, the other is SSE2. Core2 Duo totally cleans up in the integer one, but with SSE2, Pentium D is a whole lot closer.
Of course this is not a 'pure SSE2' benchmark, since it is a complete Mandelbrot program. There's still a lot of non-SSE2 code in there, where the Pentium D takes a hit. If you were purely benchmarking the SSE2 code, you'd see that Pentium D is doing quite well in that part, and losing most in the integer part (as should be obvious from the comparison with the integer version).

Standard Def Steve wrote:

A 1.86GHz Core 2 beating a 3.6GHz P-D in SSE-heavy media encoding and rendering tests is considered to be having trouble?

Given the overwhelming performance advantage that Core2 has over Pentium D in most other benchmarks, yes, I'd say it's struggling here. Relatively speaking of course.
Also, please don't do any comparisons based on clock speed. I thought we'd dealt with the MHz myth already...

Standard Def Steve wrote:

So yeah, at least in video decode performance, it outperformed the PD-935. I dunno; it just seems that pure SSE performance doesn't matter too, too much outside of synthetics and maybe x264. Unless Google's video decoder isn't SSE heavy.

It tends to matter more in encoding than in decoding. But then there are still codecs that aren't very well-optimized.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 75 of 81, by swaaye

Rank: l33t++

There's something else to ponder - Itanium. It was developed concurrently with these P3 and P4 projects. Intel probably wanted desktop Itanium at some point. Itanium certainly didn't go as well for them as P4.

And also, since CPU design takes years, Conroe was certainly no kneejerk recovery move. It probably started design in 2002 or so. But perhaps it was initially intended as another mobile-only chip? It's loaded with power saving technologies, and the missing hyperthreading was a curiosity.

You have to wonder how many projects Intel runs at once and what's been canned over the years.

Reply 77 of 81, by Scali

Rank: l33t
swaaye wrote:

There's something else to ponder - Itanium. It was developed concurrently with these P3 and P4 projects. Intel probably wanted desktop Itanium at some point. Itanium certainly didn't go as well for them as P4.

Yea, I said the same thing here 😀

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 79 of 81, by Scali

Rank: l33t
F2bnp wrote:

So, do you guys think that maybe Intel meant for the Pentium 4 to be a stop-gap product, with an Itanium compatible desktop design coming up after a few years?

Yup.
In fact, Intel was already there, most of the way. If you look at this HP Itanium workstation for example: https://en.wikipedia.org/wiki/Itanium#/media/ … kstation_12.jpg
It already is a desktop system, and it can run the Itanium-version of Windows.

The main problem was cost at that point. But that would come down as soon as people started demanding Itaniums (there would be no path to 64-bit with x86, so you'd have to go Itanium at some point). And Intel could also do some cost-cutting still for more low-end desktop CPUs, by removing some of the cache.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/