VOGONS

AMD drops the mic

Reply 80 of 279, by gdjacobs

Rank: l33t++
Scali wrote:

No, this is a common misconception, going back to the days of the Pentium 4 (AMD claimed they didn't need the technology, because they didn't have the technology. Everyone bought into that story, except me. Because I do critical thinking).
Did you ever bother to think it through in the context of a modern x86?
A Core i7 has a lot of execution units per core. The legacy two-operand instruction set of x86, however, is inadequate to feed all of these units every cycle.
This is not a 'pipeline stall' in the traditional sense, but you do have many units that are sitting idle every cycle.
By feeding two or more instruction streams, you can reach better utilization of these units.

CPUs starting with Core 2 were also able to benefit from wider vector instructions, which I believe are much more beneficial in terms of performance. Sharing common resources between two (or more) threads is certainly a thing, though. Even AMD adopted that approach with the floating-point unit of Piledriver; they just made it quite anemic. It just strikes me as more of a workaround for improperly vectorized code.

Scali wrote:

Which is why HT works at least as well on modern Core i7 CPUs as it did on the Pentium 4, while Core i7 is the complete opposite of the Pentium 4 in terms of pipeline design, stalls, and cost of these stalls.
(Why would Intel have brought back HT otherwise? It was gone from the Core2, because Core2 was not based on the Pentium 4 architecture, and had to be migrated. Which they did in Nehalem, and it has stayed there ever since. If there was no merit to it, it wouldn't have been here today, and AMD certainly wouldn't be trying to copy it).

My contention wasn't that it's completely ineffective, just that it's most profitable when it helps mitigate pipeline stalls. How large is the cost in die size? If it's inexpensive, it may be desirable even with more limited benefit. Plus, shorter pipelines do have a stall penalty; it's just smaller. And in some cases, SMT might help keep execution units stuffed.

You're right that I haven't looked at this in depth lately. Have you got any reference material for i7 which looks at performance deltas with SMT?

Scali wrote:

As for RISC, There's no such thing anymore. We are well into the post-RISC era, where even 'true' RISC architectures are now very similar to x86 in how they execute legacy code: A lot of instructions aren't implemented in hardware, but are decoded to microcode, or even emulated in software altogether.
x86 itself has of course also been using a RISC backend since the Pentium Pro era. The boundaries between the two are fading fast.
But if you want to argue about older RISC, if anything, the main characteristic of RISC has been very short and simple pipelines, with few and short stalls. Yet RISC is where you saw SMT first, on the DEC Alpha architecture. Which is where Intel got their HT technology from. Of course it was originally developed by IBM and also implemented on their POWER RISC architecture. Both have considerably shorter pipelines than the Pentium 4, and in fact, very similar to modern Core i7 pipelines (in the range of 14-16 stages, where Pentium 4 was 28+).

Ah, that's why I wasn't aware of SMT coming from the Alpha team. The sweet, sweet unreleased EV8.

I agree completely regarding the state of modern architectures. Essentially everyone has adopted a RISC backend. Intel and AMD simply have substantial hardware decoding strapped to the front end. That's what I meant by everyone riffing on the same tune these days.

Scali wrote:

You might want to update your knowledge, because it all sounds more than 10 years out of date.

Any pointers, just point them my way!

All hail the Great Capacitor Brand Finder

Reply 81 of 279, by Scali

Rank: l33t
gdjacobs wrote:

CPUs starting with Core 2 were also able to benefit from wider vector instructions which I believe are much more beneficial in terms of performance.

That is completely separate from SMT technology.
Vector instructions need to be used explicitly by the software.
SMT, on the other hand, is just a different way of handling multiple threads from regular multiprocessor or multicore technology. It does not require changes to the software.
Even vector instructions can benefit from SMT, because you can feed instructions from multiple threads. The key here is that each thread is by definition independent from any other thread at the instruction level.
The limit of out-of-order execution is long chains of dependent instructions (using the same variables/registers). By using SMT, you get to feed extra independent instructions into your execution core.
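
To make the dependency point concrete, here is a minimal C sketch (my own illustration, not code from this thread, with made-up constants): the first loop is one long dependent chain, so even a wide out-of-order core cannot overlap much of it, while the second loop carries two independent chains, which is roughly the kind of extra independent work a second SMT thread supplies.

#include <stdio.h>

/* One long dependent chain: every iteration needs the previous value of s,
   so most execution units sit idle while they wait for it. */
static double chain(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = s * 1.0001 + a[i];
    return s;
}

/* Two independent chains: the core can issue both in parallel, much like
   instructions arriving from a second, independent thread. */
static double two_chains(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 = s0 * 1.0001 + a[i];
        s1 = s1 * 1.0001 + a[i + 1];
    }
    return s0 + s1;
}

int main(void)
{
    double a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i * 0.5;
    printf("%f %f\n", chain(a, 1024), two_chains(a, 1024));
    return 0;
}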

gdjacobs wrote:

Sharing common resources between two (or more) threads is certainly a thing, though. Even AMD adopted that approach with the floating-point unit of Piledriver; they just made it quite anemic. It just strikes me as more of a workaround for improperly vectorized code.

My theory on that has always been two-fold:
1) Building an SMT-capable x86-compatible processor is very complicated. There's so much legacy code out there, and so many corner cases (even Intel got a few things wrong with early implementations, leading to some potential security exploits).
2) SMT was patented technology, so you probably had to pay Intel and IBM a lot of money to implement full SMT.

Some of the patents have run out now. And AMD might have wanted to get to grips with SMT in two steps, rather than just one.

gdjacobs wrote:

My contention wasn't that it's completely ineffective, just that it's most profitable when it helps mitigate pipeline stalls.

If it is most profitable with pipeline stalls, then it would benefit the Pentium 4 far more than current CPUs, given that the Pentium 4 has a much longer pipeline, where stalls are far more expensive, and it is also a less sophisticated architecture, so it is not as good at branch prediction and prefetching and other ways to avoid or mitigate stalls as modern CPUs are.
That's my point. I can't make any sense of your statement other than saying it's ineffective on current CPUs. Which it isn't, as we can tell from benchmarks. As I said, HT gives the same or even better gains on modern CPUs than it did on the Pentium 4.

gdjacobs wrote:

How large is the cost in die size? If it's inexpensive, it may be desirable even with more limited benefit. Plus, shorter pipelines do have a stall penalty; it's just smaller.

There's the problem with your earlier assumption: you assumed that SMT mainly gets its performance gains from pipeline stalls. If we assume that to be true, then pipelines with smaller stall penalties would by definition give SMT less opportunity to enhance performance. If the pipeline doesn't stall, there are no resources for a second thread to use, under your assumption.

gdjacobs wrote:

You're right that I haven't looked at this in depth lately. Have you got any reference material for i7 which looks at performance deltas with SMT?

I would think that Core i7 reviews on the usual tech sites (ExtremeTech, AnandTech, Tom's Hardware, Ars Technica, etc.) would usually include SMT scaling benchmarks.

gdjacobs wrote:

Intel and AMD simply have substantial hardware decoding strapped to the front end.

Modern 'RISC' CPUs have only slightly less substantial hardware decoding front ends. The only difference is that RISC has a few years' 'head start', because the RISC instruction sets are pretty clean and simple to decode. But like x86, they have been extended a lot over the years, and aren't implemented directly in hardware anymore. The backends do not resemble the original instruction set all that much anymore. They have added out-of-order execution, SMT and various other 'modern' trickery and optimizations, which decouple the backend from the frontend. Since the backend and frontend are decoupled anyway, it makes sense to optimize away instructions that are rarely used, and just implement them in microcode or even software fallbacks, to keep the execution core compact and efficient. People tend to refer to this as post-RISC.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 82 of 279, by gdjacobs

Rank: l33t++
Scali wrote:

If it is most profitable with pipeline stalls, then it would benefit the Pentium 4 far more than current CPUs, given that the Pentium 4 has a much longer pipeline, where stalls are far more expensive, and it is also a less sophisticated architecture, so it is not as good at branch prediction and prefetching and other ways to avoid or mitigate stalls as modern CPUs are.
That's my point. I can't make any sense of your statement other than saying it's ineffective on current CPUs. Which it isn't, as we can tell from benchmarks. As I said, HT gives the same or even better gains on modern CPUs than it did on the Pentium 4.

Never said it was ineffective, just less effective. Compare Power6 to Power5 and Power7. Power5 and Power7 use OOO execution and SMT to keep processor resources maximally occupied. Power6 depends much more heavily on SMT as it can't reorder instructions to keep the pipeline stuffed.

Scali wrote:

There's the problem with your earlier assumption: You assumed that SMT mainly gets its performance gains from pipeline stalls. If we assume that to be true, then pipelines with less stall penalties, would by definition give SMT less opportunity to enhance performance. If the pipeline doesn't stall, there are no resources for a second thread to use, under your assumption.

SMT gets its performance gains from keeping CPU resources maximally utilized. Pipeline stalls are just the most glaring example, particularly with long pipelines and in-order execution.

SMT also apparently has a negative impact with highly structured codebases that don't suffer much from branch misprediction, cache misses, etc. Rybka, for instance.

Scali wrote:

I would think that Core i7 review on the usual tech sites (Extremetech, Anandtech, Tomshardware, Ars Technica etc) would usually include SMT scaling benchmarks.

Ars hasn't been nearly as good for this stuff since Jon Stokes left. AnandTech is almost the only media outlet that performs this type of deep architecture analysis. Ace's Hardware used to be my favorite due to their excellent coverage of both PC and non-PC architectures.

Scali wrote:

Modern 'RISC' CPUs have only slightly less substantial hardware decoding front ends. The only difference is that RISC has a few years' 'head start', because the RISC instruction sets are pretty clean and simple to decode. But like x86, they have been extended a lot over the years, and aren't implemented directly in hardware anymore. The backends do not resemble the original instruction set all that much anymore. They have added out-of-order execution, SMT and various other 'modern' trickery and optimizations, which decouple the backend from the frontend. Since the backend and frontend are decoupled anyway, it makes sense to optimize away instructions that are rarely used, and just implement them in microcode or even software fallbacks, to keep the execution core compact and efficient. People tend to refer to this as post-RISC.

True. The failure of RISC (except perhaps in the case of ARM or embedded MIPS) has been due to the decreasing significance of the decoder frontend relative to the rest of the transistor budget. The fundamental premise of RISC was that savings on the front end could be ploughed back into more execution capability, but this doesn't matter as much anymore.

All hail the Great Capacitor Brand Finder

Reply 83 of 279, by vladstamate

Rank: Oldbie
gdjacobs wrote:

SMT also apparently has a negative impact with highly structured codebases that don't suffer much from branch misprediction, cache misses, etc. Rybka, for instance.

That is why at the height of my chess programming I preferred an AMD Phenom X6 to the Core i7 with 4 cores (and 8 HT threads). For my chess engine needs I wanted 6 proper cores and not 8 half cores. There are (fairly) few cache misses and not a lot of branch mispredictions. Hell, there isn't even any floating-point math. It comes down to how fast you can AND, XOR, popcount and shift on as many cores as possible at once. And there the Phenom X6 did very well.
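
For readers unfamiliar with that style of code, here is a tiny, hypothetical bitboard-style kernel in C (not vladstamate's engine; __builtin_popcountll assumes GCC or Clang): everything is 64-bit integer AND/XOR/shift/popcount work, with no floating point and few hard-to-predict branches.

#include <stdint.h>
#include <stdio.h>

/* Count the enemy pieces a side attacks: pure integer bit manipulation. */
static int count_captures(uint64_t own, uint64_t enemy, uint64_t attacks)
{
    uint64_t targets = attacks & enemy & ~own;   /* AND / AND-NOT  */
    targets ^= targets & (targets << 8);         /* XOR and shift  */
    return __builtin_popcountll(targets);        /* population count */
}

int main(void)
{
    printf("%d\n", count_captures(0x00FFULL, 0xFF00ULL, 0x0FF0ULL));
    return 0;
}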

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 84 of 279, by Scali

Rank: l33t
gdjacobs wrote:

Never said it was ineffective, just less effective.

Same thing: being less effective doesn't show in the benchmarks.

gdjacobs wrote:

Compare Power6 to Power5 and Power7. Power5 and Power7 use OOO execution and SMT to keep processor resources maximally occupied. Power6 depends much more heavily on SMT as it can't reorder instructions to keep the pipeline stuffed.

It's one or the other... If it can't reorder instructions, it can't do OOO execution.
SMT can be implemented in multiple ways, by the way. The Atom series also have 'HT', but they are (or at least were) not OOO-architectures, so they had more of an interleaving approach. The decoder could look ahead one instruction, and then the pipeline logic would pick the most appropriate instruction for each cycle. So it was more about deciding which of the two threads to advance.
Of course you can't do direct comparisons of such different implementations. So I fail to see the relevance.

gdjacobs wrote:

SMT gets its performance gains from keeping CPU resources maximally utilized. Pipeline stalls are just the most glaring example, particularly with long pipelines and in-order execution.

To people with no clue perhaps.
I already told you to look at the huge number of execution units in a modern x86, and the fact that the x86 instruction set is completely inadequate for utilizing all of them at the same time.
One of the biggest issues here is the two-operand model. There are always dependencies between your instructions, because one of your operands is both source and destination.
I always say x86 is like flying a Space Shuttle with a pair of tweezers.
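
As a minimal illustration of that two-operand constraint (assuming GCC or Clang on x86-64; this is a hand-written sketch, not what any particular compiler emits), computing c = a + b needs an extra copy, because add overwrites one of its sources:

#include <stdio.h>

int main(void)
{
    long a = 2, b = 3, c;
    /* x86 has no three-operand "add dst, src1, src2" for plain integer adds:
       the destination is also a source, so a must be copied first. */
    __asm__("movq %1, %0\n\t"     /* c = a     */
            "addq %2, %0"         /* c = c + b */
            : "=&r"(c)
            : "r"(a), "r"(b));
    printf("%ld\n", c);           /* prints 5 */
    return 0;
}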

gdjacobs wrote:

SMT also apparently has a negative impact with highly structured codebases that don't suffer much from branch misprediction, cache misses, etc. Rybka, for instance.

That depends on a lot of factors.
Instruction mix is an interesting factor.
If you just start a bunch of threads that run the exact same code, and try to run it at the exact same time, that implies that they are all trying to run the same type of instructions, and are contending for the same resources.
That is worst case for SMT of course (and best-case for GPGPU). And you'd be a pretty hopeless software engineer if you deliberately tried to run such code on SMT, expecting it to perform well.
If however you offset your threads, so you get a more interesting instruction mix, things start to pick up speed.
An interesting approach with SMT would be to make one thread with integer instructions, and another with FPU/SIMD instructions. Then run both threads on a physical core. They won't be contending for resources much this way.
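
A rough sketch of that idea (Linux and GCC assumed; the workloads are made up, and whether logical CPUs 0 and 1 are really SMT siblings of the same physical core depends on the machine, so check lscpu or /proc/cpuinfo first):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Integer-heavy thread: shifts, XORs, adds. */
static void *int_work(void *arg)
{
    (void)arg;
    pin_to_cpu(0);
    unsigned long x = 1;
    for (long i = 0; i < 100000000L; i++)
        x = (x ^ (unsigned long)i) + (x << 1);
    printf("int result: %lu\n", x);
    return NULL;
}

/* FP-heavy thread: a multiply-add chain. */
static void *fp_work(void *arg)
{
    (void)arg;
    pin_to_cpu(1);                /* assumed SMT sibling of CPU 0 */
    double s = 0.0;
    for (long i = 0; i < 100000000L; i++)
        s = s * 0.999999 + 1.0;
    printf("fp result: %f\n", s);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, int_work, NULL);
    pthread_create(&t2, NULL, fp_work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}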

But on a recent Core i7, with a recent OS (preferably Windows; I don't know how well other OSes deal with HT, but Windows has very good HT-aware scheduling at least), you'd have to work pretty hard to get any significant negative impact (negative of course being 'it runs faster if I disable HT', because you should see 4 cores with HT as a 4-core CPU that may gain extra performance in certain cases. 4-core performance is the baseline. You can't compare with an 8-core CPU of course, because it doesn't physically have 8 cores. It has 4 cores, and a transistor count to match that core count).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 85 of 279, by Scali

Rank: l33t
vladstamate wrote:

That is why at the height of my chess programming I preferred an AMD Phenom X6 to the Core i7 with 4 cores (and 8 HT threads). For my chess engine needs I wanted 6 proper cores and not 8 half cores. There are (fairly) few cache misses and not a lot of branch mispredictions. Hell, there isn't even any floating-point math. It comes down to how fast you can AND, XOR, popcount and shift on as many cores as possible at once. And there the Phenom X6 did very well.

Which is not surprising at first sight.
A CPU with 6 cores simply has more execution units than one with 4 cores.
If we were to assume that they were the same architecture, then a 4-core CPU with HT would need to yield 50% extra performance from HT just to get to the base level of a 6-core. That's a pretty tall order obviously.
I wouldn't chalk that up as a deficiency of HT.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 86 of 279, by vladstamate

Rank: Oldbie
Scali wrote:

Which is not surprising at first sight.
A CPU with 6 cores simply has more execution units than one with 4 cores.
If we were to assume that they were the same architecture, then a 4-core CPU with HT would need to yield 50% extra performance from HT just to get to the base level of a 6-core. That's a pretty tall order obviously.
I wouldn't chalk that up as a deficiency of HT.

Agreed. My comment was not meant as a slight on HT's capability/usefulness, just as support for what gdjacobs said about HT not having as much gain (still some, not zero) on certain types of codebases. I admit my use case is a corner case though, and in general I would much rather have an Intel CPU with HT than not.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 87 of 279, by gdjacobs

Rank: l33t++
vladstamate wrote:

Agreed. My comment was not meant as a slight on HT's capability/usefulness, just as support for what gdjacobs said about HT not having as much gain (still some, not zero) on certain types of codebases. I admit my use case is a corner case though, and in general I would much rather have an Intel CPU with HT than not.

This is the evaluation of Rybka users and developers, not mine. Presumably the codebase is optimized and predictable enough that the ALUs can be kept at full boil without sharing them amongst different threads.
http://rybkaforum.net/cgi-bin/rybkaforum/topi … ow.pl?tid=17836
http://www.agner.org/optimize/blog/read.php?i=6#103

All hail the Great Capacitor Brand Finder

Reply 88 of 279, by Scali

Rank: l33t
gdjacobs wrote:

This is the evaluation of Rybka users and developers, not mine. Presumably the codebase is optimized and predictable enough that the ALUs can be kept at full boil without sharing them amongst different threads.
http://rybkaforum.net/cgi-bin/rybkaforum/topi … ow.pl?tid=17836
http://www.agner.org/optimize/blog/read.php?i=6#103

It all sounds so over-simplified.
There can be many reasons why HT may or may not get significant gains on a particular piece of code. A CPU is far more than ALUs.
There's caching, TLBs, decoding bandwidth, other threads/processes affecting your code, and many, many other aspects that affect how code performs.
I don't care to get into this kind of 'discussion' where things are so simplified and shoehorned into some kind of black/white view that they lose all meaning.
Nor do I care to put any effort into studying some kind of random application that I've never heard of and have no need for, to find out how it may or may not be optimized properly, and how its developers may or may not have an actual clue.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 90 of 279, by Scali

Rank: l33t
gdjacobs wrote:

Well, empirical results are usually quite condensed.

How can I tell from empirical results whether the code is properly optimized or runs head-first into some obvious pitfall?
Also, yay for 7-year-old threads.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 91 of 279, by archsan

Rank: Oldbie

Just another tease... but this was the OFFICIAL LAUNCH and they're boasting some SERIOUS CLAIMS there. How could they? Is this the biggest scam in history? ... An IPC improvement of 52%, surpassing the 40% target (granted, that's over the poor predecessor). Outperforming the $1k i7-6900K. Yadda yadda yadda. More salt for Intel diehards, I'd say. Either way, I'm enjoying this. 😎

But a spare $500 that you can put into RAM/SSD/graphics/other stuff? Definitely not running to buy an Intel this weekend. Not even that 7700K.

http://www.tomshardware.com/news/amd-ryzen-7- … 800x,33702.html
http://www.anandtech.com/show/11143/amd-launc … -sale-march-2nd

😁

"Any sufficiently advanced technology is indistinguishable from magic."—Arthur C. Clarke
"No way. Installing the drivers on these things always gives me a headache."—Guybrush Threepwood (on cutting-edge voodoo technology)

Reply 92 of 279, by Scali

Rank: l33t
archsan wrote:

Outperforming $1k i7-6900K.

As if that CPU really has to cost $1k.
Wake me up when Intel doesn't adjust its prices to remain competitive.
And another thing: that's a 6xxx series, as in Broadwell-based architecture.
They haven't refreshed their 6- and 8-core architectures yet, because of little interest. The market mainly demanded 4-core CPUs so far.
So they're basically comparing to Intel CPUs that are two(!) generations behind.

Let's just say that I wouldn't be surprised if Kaby Lake still has the edge in IPC, and 4c/8t performs well enough in most consumer software that AMD's 8c/16t doesn't actually make a compelling argument for the average consumer (just like consumers never were really interested in the above-mentioned 6xxx series with 6/8 cores).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 93 of 279, by archsan

Rank: Oldbie

^Oh, I thought you said competition was not needed? Sry if I misunderstood your words though, I was just skimming.

"Any sufficiently advanced technology is indistinguishable from magic."—Arthur C. Clarke
"No way. Installing the drivers on these things always gives me a headache."—Guybrush Threepwood (on cutting-edge voodoo technology)

Reply 94 of 279, by Scali

Rank: l33t
archsan wrote:

^Oh, I thought you said competition was not needed? Sry if I misunderstood your words though, I was just skimming.

Pretty sure this post already said that Intel can lower prices if need be: AMD drops the mic

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 95 of 279, by archsan

Rank: Oldbie
Scali wrote:

Pretty sure this post already said that Intel can lower prices if need be: AMD drops the mic

OK, I wonder, why do you see it from Intel, the corporation's perspective though?

I mean.. y'know, we as consumers usually speak from the consumer's perspective. At least that's what I do. Competition = better services, better prices etc.

Here's the situation, for example: a 65W (though I'm still a bit skeptical about that) 8-core R7 1700 vs a 91W 4-core i7 7700K, at the same price. Do I as a non-academic consumer really need to know which one has the better IPC? Damn nopes. But as a well-informed consumer, I'll just look at the end results, price/perf, efficiency, etc., as usual. I mean, this is just a tool we're talking about.

If I'm a gamer and current games turn out to play better with the 7700K, that's where the money should go. But if what I care about are multithreaded applications that work faster with the extra 4 cores, then there it should go. Pretty sure I'm not the only person having this rather simplistic perspective.

As to the 5960X/6900K, the $1K barrier has now been thrashed due to competition (however late it is), and 8-core is now going to become mainstream. Is this not happy news for consumers? Developers? Anyway, more benches in a week.

"Any sufficiently advanced technology is indistinguishable from magic."—Arthur C. Clarke
"No way. Installing the drivers on these things always gives me a headache."—Guybrush Threepwood (on cutting-edge voodoo technology)

Reply 96 of 279, by Scali

Rank: l33t
archsan wrote:

OK, I wonder, why do you see it from Intel, the corporation's perspective though?

Do I?
I'm just making the observation that Intel is selling their more high-end CPUs at a considerable profit margin.
I don't see what that has to do with any kind of "perspective", since as far as I'm concerned, it is an objective observation, based on the fact that Intel's CPUs have a rather exponential price buildup, which seems to bear no relation to die size or manufacturing cost.

archsan wrote:

As to the 5960X/6900K, the $1K barrier has now been thrashed due to competition (however late it is) and 8-core is now going to become mainstream. Is this not a happy news for consumers? Developers? Anyway, more benches in a week.

I think you are jumping to conclusions rather quickly. 8-core CPUs may have become 'affordable', but that does not necessarily make them mainstream.
Heck, Intel's 6-core offerings have been at those price ranges for quite a while now, but they haven't seen much of an uptake so far. So I'm not too sure if 8-cores will suddenly take off now.
Perhaps the majority will just be happy that they can now pay less for a 4c/8t CPU.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 97 of 279, by archsan

Rank: Oldbie

Er, right. By "mainstream" what I had in mind was actually either workstation crowds or gamers/"enthusiasts" (where the PC market has been actually growing despite what Apple is saying...). And having competent 4-cores replacing dual-cores everywhere would be most welcome still.

"Any sufficiently advanced technology is indistinguishable from magic."—Arthur C. Clarke
"No way. Installing the drivers on these things always gives me a headache."—Guybrush Threepwood (on cutting-edge voodoo technology)

Reply 98 of 279, by Scali

Rank: l33t
archsan wrote:

Er, right. By "mainstream" what I had in mind was actually either workstation crowds or gamers/"enthusiasts" (where the PC market has been actually growing despite what Apple is saying...).

Funny, the rest of the world seems to consider workstation and enthusiast (possibly more high-end gamers as well) to be market segments above mainstream: http://jonpeddie.com/download/media/slides/An … _GPU_Market.pdf

Desktops, for example, have four sub-segments: mainstream, performance, enthusiast, and workstation.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 99 of 279, by archsan

Rank: Oldbie

I was conceding my little 'slip of mind', and now you want to beat that out? Ooohh...

#triggered

"Any sufficiently advanced technology is indistinguishable from magic."—Arthur C. Clarke
"No way. Installing the drivers on these things always gives me a headache."—Guybrush Threepwood (on cutting-edge voodoo technology)