Incredible how even fairly modern processors don't cut it

Reply 40 of 57, by Scali

Posted on 2016-06-27, 13:32

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Kahenraz wrote:
... for the advantage of higher integer IPC throughput per-process without the additional need of a hypervisor or marshal service.

etc..

That doesn't mean anything to me.
What does a hypervisor or 'marshal service' have to do with all this?
No offense, but it sounds like random namedropping of tech terms.
A hypervisor is a mechanism for managing virtual machines: https://en.wikipedia.org/wiki/Hypervisor
How does that have anything to do with IPC whatsoever?

As for 'marshal service'...
The only use of the term 'marshalling' I am familiar with within the realm of computer science is this: https://en.wikipedia.org/wiki/Marshalling_(computer_science)
This is generally done when two different types of systems/languages have to communicate with each other, and translation of objects/parameters is required.
Again, I don't see how any of that is even remotely related to IPC.

So you'd have to explain the following:
1) How can a 'hypervisor' or 'marshaling service' affect 'IPC throughput per-process' (sic) in the first place?
2) How does the AMD FX differ from other CPUs here in not requiring these for 'higher IPC throughput per-process'?
Because that is what you are implying, isn't it?

Edit: my personal evaluation of AMD's FX line is well-documented, and can basically be summed up as follows:
It has less decoder bandwidth and fewer ALUs/FPU/SIMD units available per thread than older AMD architectures and Intel architectures, and therefore it obviously cannot reach the same levels of IPC (it would have to do more with less just to break even). There is nothing you can do about that.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 41 of 57, by Kahenraz

Posted on 2016-06-27, 18:04

Rank: l33t
Posts: 4586
Joined: 2004-01-22, 04:57

You don't have to understand it. I was simply giving an example as to why the FX was a good match for my personal use case.

Reply 42 of 57, by Scali

Posted on 2016-06-27, 19:15

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Kahenraz wrote:
You don't have to understand it. I was simply giving an example as to why the FX was a good match for my personal use case.

Except I studied CS and have been doing assembly-level optimizations for decades on various x86 and other architectures.
I would say I am pretty well-versed in the matter, and it's my job to know about IPC and how to extract the most from a given CPU.
I should be able to understand your statements. Yet I don't. And you can't explain them.

Reply 43 of 57, by ynari

Posted on 2016-06-29, 11:54

Rank: Member
Posts: 430
Joined: 2014-05-29, 12:38
Location: Manchester, UK

Note also that's a Conroe architecture processor. I'm about to go from a Q6700 (Kentsfield) to an X3370 (Yorkfield). There's a 300MHz clock and 333MHz FSB difference, but for a couple of benchmarks the performance of the X3370 is much higher than expected.

The reason? Kentsfield only goes up to SSSE3, whilst Yorkfield adds SSE4.1. For some applications, this makes a considerable difference.

Reply 44 of 57, by Scali

Posted on 2016-06-29, 13:06

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

ynari wrote:
Note also that's a Conroe architecture processor. I'm about to go from a Q6700 (Kentsfield) to an X3370 (Yorkfield). There's a 300MHz clock and 333MHz FSB difference

I 'cheated' on my Conroe E6600 though 😀
It was 2.4 GHz stock, at 1066 MHz bus.
But it had no trouble running at 1333 MHz bus, giving me 3 GHz clockspeed. The faster bus made quite a difference.
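
The arithmetic behind that overclock can be sketched roughly as follows. This assumes a quad-pumped bus (so a "1066 MHz" bus means a base clock of about 266.67 MHz) and a fixed multiplier of 9, which is inferred from 2.4 GHz at the stock bus speed; the function names are made up for illustration.

```python
def base_clock_mhz(effective_fsb_mhz: float) -> float:
    """Base clock of a quad-pumped front-side bus (assumption: 4 transfers per clock)."""
    return effective_fsb_mhz / 4.0


def core_clock_ghz(effective_fsb_mhz: float, multiplier: float) -> float:
    """Core clock = base clock x CPU multiplier, converted from MHz to GHz."""
    return base_clock_mhz(effective_fsb_mhz) * multiplier / 1000.0


# Assumption: multiplier of 9, inferred from 2.4 GHz at a 1066 MHz bus.
MULTIPLIER = 9

stock = core_clock_ghz(1066.67, MULTIPLIER)
overclocked = core_clock_ghz(1333.33, MULTIPLIER)

print(f"stock:       {stock:.2f} GHz")
print(f"overclocked: {overclocked:.2f} GHz")
```

Since the multiplier stays the same, the core clock and the bus speed both rise by the same 25%.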

Reply 45 of 57, by Snayperskaya

Posted on 2016-06-30, 18:17

Rank: Member
Posts: 277
Joined: 2015-05-10, 04:27

Anything above a C2D-class CPU is overkill unless you are into heavy gaming (newer games @1080p or higher with 60 FPS), 3D modeling or anything else that makes use of the extra power. One of the best upgrades for a daily use PC is an SSD (and enough memory so your system doesn't swap all the time - 8GB+ for no swap at all).

Last edited by Snayperskaya on 2016-06-30, 18:23. Edited 1 time in total.

Reply 46 of 57, by PhilsComputerLab

Posted on 2016-06-30, 18:22

Rank: l33t++
Posts: 6174
Joined: 2014-09-28, 03:33
Location: Western Australia

Snayperskaya wrote:
Anything above a C2D-class CPU is overkill unless you are into heavy gaming (newer games @1080p or higher with 60 FPS), 3D modeling or anything. One of the best upgrades for a daily use PC is an SSD (and enough memory so your system doesn't swap all the time - 8GB+ for no swap at all).

Agreed on the SSD!

Prices have really come down and if you can wait for a sale / deal, they can be had for little money.

Reply 47 of 57, by agent_x007

Posted on 2016-07-01, 06:40

Rank: Oldbie
Posts: 1691
Joined: 2016-01-19, 11:06

Kahenraz wrote:
You don't have to understand it. I was simply giving an example as to why the FX was a good match for my personal use case.

It's a "good match" because it has more ALU resources (ie. eight actual Integer threads), BUT more threads does NOT equal better IPC.
Also, it will suck at float/FPU operations, because these require the FPU, which is shared between two integer threads.

Basically: it's the difference between thread-level parallelism and instruction-level parallelism.
The first one needs, for example, Hyper-Threading or "more modules" in the case of AMD's Bulldozer-type architecture. The latter requires bigger buffers for OoO entries, a faster decoding stage, more ALU/FPU resources (available to each single thread of any given CPU), etc.
Hope I didn't mess up the TLP and ILP definitions in translation 😀

Reply 48 of 57, by Scali

Posted on 2016-07-01, 07:37

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

agent_x007 wrote:
BUT more threads does NOT equal better IPC.

More threads also does not necessarily mean more ALU resources.
Eg, a Phenom X6 has 6 cores with 3 ALUs each. That is 18 ALUs in total. An FX8xxx has 8 cores with 2 ALUs each, which is only 16 ALUs.

Aside from that, more ALU resources does not necessarily mean more performance.
Eg, a Core i7 has 3 ALUs per core and 4 cores. So you have only 12 ALUs. But it outperforms the Phenom X6 and FX 8-cores in pretty much everything.
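
The ALU counts above are simply cores multiplied by ALUs per core. A minimal sketch of that tally, using only the figures quoted in this post (the `Cpu` class and its field names are hypothetical, just for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    name: str
    cores: int
    alus_per_core: int

    @property
    def total_alus(self) -> int:
        # Total integer ALUs on the chip = cores x ALUs per core.
        return self.cores * self.alus_per_core


# Figures as quoted in the post above.
cpus = [
    Cpu("Phenom X6", cores=6, alus_per_core=3),
    Cpu("FX8xxx", cores=8, alus_per_core=2),
    Cpu("Core i7", cores=4, alus_per_core=3),
]

for cpu in sorted(cpus, key=lambda c: c.total_alus, reverse=True):
    print(f"{cpu.name:10s} {cpu.cores} cores x {cpu.alus_per_core} ALUs = {cpu.total_alus}")
```

Sorting by the total puts the Core i7 last, which is the point: the raw ALU count says nothing about how well each core keeps those ALUs busy.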

Reply 49 of 57, by agent_x007

Posted on 2016-07-01, 09:31

Rank: Oldbie
Posts: 1691
Joined: 2016-01-19, 11:06

Scali wrote:
More threads also does not necessarily mean more ALU resources.
Eg, a Phenom X6 has 6 cores with 3 ALUs each. That is 18 ALUs in total. An FX8xxx has 8 cores with 2 ALUs each, which is only 16 ALUs.

Aside from that, more ALU resources does not necessarily mean more performance.
Eg, a Core i7 has 3 ALUs per core and 4 cores. So you have only 12 ALUs. But it outperforms the Phenom X6 and FX 8-cores in pretty much everything.

Indeed, it's 18 vs. 16, but you are comparing a Hex Core chip, to the fastest Bulldozer (ie. 4 x "CMT").

Still (as you said), the number of ALUs alone doesn't tell us the actual integer performance.
For example, keeping all of the execution units busy can be a problem (if the OoO engine isn't big enough to take advantage of the additional back-end resources).

PS. Since Haswell, Intel has four ALU units per core : LINK

Reply 50 of 57, by Scali

Posted on 2016-07-01, 09:46

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

agent_x007 wrote:
Indeed, it's 18 vs. 16, but you are comparing a Hex Core chip, to the fastest Bulldozer (ie. 4 x "CMT").

That was exactly my point: Bulldozer has 8 threads, but the 6-thread Phenom has more ALUs anyway.
So the statement "because it has more ALU resources (ie. eight actual Integer threads)" is invalid. Bulldozer has more integer threads, but fewer ALU resources than Phenom.

agent_x007 wrote:
PS. Since Haswell, Intel has four ALU units per core : LINK

Yup, which only proves my first point further: ALU resources are not necessarily related to the number of threads. I just thought I'd look at earlier Core i7s, from the Phenom/Bulldozer era. They prove the second point, that fewer ALU resources on paper can still perform better, depending on how the CPU makes use of these ALUs.

Reply 51 of 57, by agent_x007

Posted on 2016-07-01, 10:25

Rank: Oldbie
Posts: 1691
Joined: 2016-01-19, 11:06

How is the statement "because it has more ALU resources (ie. eight actual Integer threads)" invalid?
I added the "(ie. eight actual Integer threads)" part to clarify that.
I count the additional threads (compared to a normal quad core) as an "ALU resource" (I know, I'm not good at wording things sometimes).

Reply 52 of 57, by Scali

Posted on 2016-07-01, 10:28

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

agent_x007 wrote:
How is "because it has more ALU resources (ie. eight actual Integer threads)" statement invalid ?

Because a thread is not an ALU resource. ALU units are ALU resources. And as we've discussed before, different CPUs have different numbers of ALU units per thread.
This is also exactly why Bulldozer is so bad: AMD added threads *at the cost of ALUs*. The result is that each individual thread is very slow. And because scaling software to 8 threads does not work that well in practice, it loses to CPUs with fewer threads, but more ALUs.
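
One way to put rough numbers on that trade-off is Amdahl's law. It is not mentioned in this thread and is brought in here purely as an illustration. The per-thread speeds and the parallel fraction below are made-up values, not measurements of any real CPU.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's law: speedup over one thread when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def relative_throughput(per_thread_speed: float, parallel_fraction: float, threads: int) -> float:
    """Overall throughput = speed of one thread x how well the work scales across threads."""
    return per_thread_speed * amdahl_speedup(parallel_fraction, threads)


# Hypothetical workload in which 60% of the work can run in parallel.
PARALLEL_FRACTION = 0.6

# Hypothetical chips: 8 slower threads versus 4 faster threads.
many_slow = relative_throughput(per_thread_speed=0.7, parallel_fraction=PARALLEL_FRACTION, threads=8)
few_fast = relative_throughput(per_thread_speed=1.0, parallel_fraction=PARALLEL_FRACTION, threads=4)

print(f"8 slow threads: {many_slow:.2f}")
print(f"4 fast threads: {few_fast:.2f}")
```

Under these assumed numbers the 4 faster threads come out ahead, because the serial part of the workload runs at single-thread speed no matter how many threads are available.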

Reply 53 of 57, by agent_x007

Posted on 2016-07-01, 10:42

Rank: Oldbie
Posts: 1691
Joined: 2016-01-19, 11:06

Well, I described that additional thread as an "additional ALU resource" because it contains ALU units and can only do the integer part of its job (the "float part" goes to the shared FPU).
If Bulldozer had been made like Deneb/Thuban, it wouldn't have that second integer-only thread with its ALUs (if I remember correctly, it's possible to switch the second thread off in BIOS/UEFI).
It may not be the best way of putting it, but that's how I see it 😀

Reply 54 of 57, by Scali

Posted on 2016-07-01, 10:57

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

The problem there is that hardware threads don't work like "build it and they will come". Software needs to explicitly create extra threads and divide the workload in order to use extra threads for integer or any other kind of instructions. These extra threads also result in extra overhead for creation, synchronization etc.
Since these threads share some resources, software also needs to be aware of this, otherwise it will try to schedule extra FPU worker threads that the hardware can't handle.
This is why fewer hardware threads with more ALUs will generally perform better.
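
To illustrate what "explicitly create extra threads and divide the workload" means in practice, here is a minimal sketch using Python's standard `concurrent.futures` module. The helper names are hypothetical. Note that CPython's global interpreter lock means this particular example will not actually run faster on more cores; it only shows the extra splitting and combining work the software has to do.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n: int, parts: int) -> list:
    """Split range(n) into `parts` contiguous (start, stop) chunks of near-equal size."""
    step, extra = divmod(n, parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks get one additional element each.
        stop = start + step + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def sum_of_squares(bounds: tuple) -> int:
    """The work done by a single worker thread on its own chunk."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def threaded_sum_of_squares(n: int, workers: int) -> int:
    """Create the workers, hand each one a chunk, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, split_range(n, workers))
        return sum(partials)


print(threaded_sum_of_squares(1000, workers=8))
```

None of the splitting, thread creation or final combining step exists in the single-threaded version, which is the overhead being described above.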

Reply 55 of 57, by agent_x007

Posted on 2016-07-01, 11:00

Rank: Oldbie
Posts: 1691
Joined: 2016-01-19, 11:06

^True.

Reply 56 of 57, by Carlos S. M.

Posted on 2016-07-07, 01:56

Rank: Oldbie
Posts: 725
Joined: 2016-05-25, 17:01
Location: Canary Islands, Spain

Then imagine a Socket 478 P4 vs modern games. Basically all software should run on at least a P4-class CPU, since that is the oldest arch with SSE2 instructions. The Pentium III and the Athlon XP can't run some of the newer games because of the SSE2 requirement. I haven't really seen a game/program which strictly needs anything above SSE2.
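
As a rough illustration of how a program could check for such a requirement, here is a sketch that parses a Linux-style `/proc/cpuinfo` "flags" line. The sample text is invented for the example and the function names are hypothetical; on a real Linux system you would read `/proc/cpuinfo` instead, and other operating systems expose this information differently.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags: set, required: set) -> set:
    """Features the program needs that the CPU does not report."""
    return required - flags


# Invented sample data: a Pentium III-like and a Pentium 4-like flags line.
P3_LIKE = "model name : example P3-class CPU\nflags : fpu tsc mmx sse\n"
P4_LIKE = "model name : example P4-class CPU\nflags : fpu tsc mmx sse sse2\n"

REQUIRED = {"sse2"}

for label, text in (("P3-like", P3_LIKE), ("P4-like", P4_LIKE)):
    missing = missing_features(parse_cpu_flags(text), REQUIRED)
    status = "OK" if not missing else "missing " + ", ".join(sorted(missing))
    print(f"{label}: {status}")
```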

What is your biggest Pentium 4 Collection?
Socket 423/478 Motherboards with Universal AGP Slot
Socket 478 Motherboards with PCI-E Slots
LGA 775 Motherboards with AGP Slots
Experiences and thoughts with Socket 423 systems

Reply 57 of 57, by agent_x007

Posted on 2016-07-07, 06:11

Rank: Oldbie
Posts: 1691
Joined: 2016-01-19, 11:06

I think the SSE2 push is for speed.
Sure, you can rewrite SSE2 code as SSE1, but it may be unusable because of execution speed.
Also, testing modern programs on P3s and Athlon XPs would be "a pain" from a dev's standpoint.

I know that Steam requires SSE2, but you should be able to overcome that limitation (I read about it somewhere...).