VOGONS


First post, by emosun

User metadata
Rank Oldbie

I'm playing with a DL580 G2 server with 4x Xeon MP 2GHz CPUs. The Xeon MP was one of the first CPUs to support hyperthreading.

I've tested the CPUs with hyperthreading enabled and disabled, and with hyperthreading enabled they are about 16% slower in total. I used the current version of PassMark as well as the older 6.1, and they are slower in every aspect of the test. I've also tried Windows Server 2003 and 2008 in case it was an issue with the OS.

Also, every CPU core and thread shows up correctly, and I've monitored the core speeds during the test to make sure they stay the same.

Was early hyperthreading just not very good? I was expecting hyperthreading to ADD a tiny smidge of speed instead of hindering performance by 16%.

That being said, the machine scored a 700 for the CPU, which is pretty damn good for a 2003 machine.

Reply 1 of 16, by cyclone3d

User metadata
Rank l33t++

It really depends on the program as to whether or not Hyperthreading helps or hurts.

If you have Hyperthreading enabled, from my testing, it effectively cuts the available CPU cache in half for each thread.

Now if you have a program that either doesn't fill up the cache or doesn't use a lot of the same data over and over again, it should be faster provided that the program you are running is multithreaded.

The absolute maximum performance gain when using Hyperthreading is going to be about 20%. And that is only if you have a program that is specifically written to be able to use Hyperthreading efficiently.

Basically, what Hyperthreading does is add an extra virtual CPU core in order to try to keep the pipeline full.
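If you want to see those extra virtual cores from software, here's a minimal Win32 sketch using GetLogicalProcessorInformation (a real API, but note it needs XP SP3 / Server 2003 or later if I remember right; older systems would need a CPUID-based check instead):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION info[256];
    DWORD len = sizeof(info);
    DWORD i, count, cores = 0;
    SYSTEM_INFO si;

    GetSystemInfo(&si);  /* dwNumberOfProcessors = logical CPU count */

    if (!GetLogicalProcessorInformation(info, &len)) {
        fprintf(stderr, "GetLogicalProcessorInformation failed (%lu)\n",
                GetLastError());
        return 1;
    }

    count = len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
    for (i = 0; i < count; i++)
        if (info[i].Relationship == RelationProcessorCore)
            cores++;  /* one record per physical core */

    printf("logical CPUs  : %lu\n", si.dwNumberOfProcessors);
    printf("physical cores: %lu\n", cores);
    return 0;
}

On the 4x Xeon MP box with HT on, you'd expect 8 logical / 4 physical.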

Yamaha modified setupds and drivers
Yamaha XG repository
YMF7x4 Guide
Aopen AW744L II SB-LINK

Reply 2 of 16, by emosun

User metadata
Rank Oldbie
cyclone3d wrote:

It really depends on the program as to whether or not Hyperthreading helps or hurts.

If you have Hyperthreading enabled, from my testing, it effectively cuts the available CPU cache in half for each thread.

Now if you have a program that either doesn't fill up the cache or doesn't use a lot of the same data over and over again, it should be faster provided that the program you are running is multithreaded.

The absolute maximum performance gain when using Hyperthreading is going to be about 20%. And that is only if you have a program that is specifically written to be able to use Hyperthreading efficiently.

Basically, what Hyperthreading does is add an extra virtual CPU core in order to try to keep the pipeline full.

Yeah, and I'm saying that in this instance it appears to be slower across the board. These benchmarks are very compatible with hyperthreading, and I made sure to use a more period-correct version as well to ensure it wasn't the software itself.

Reply 4 of 16, by Ozzuneoj

User metadata
Rank l33t

I used to swear by passmark but I have been seeing lots of situations lately where it is quite far off from real world performance. This is mostly with graphics cards, but CPUs can certainly be off as well.

I'd recommend using something you see used on hardware review sites, like Cinebench or something like that.

Also, I think it'd be interesting to see such a system running Windows 10 to compare an ancient many-threaded system to a modern one in modern use cases. With tablet and phone SoCs moving toward tons of slower cores and operating systems (and games) being able to make some use of that (though per-core performance is still king in gaming), I bet that system would perform admirably. I doubt it has AGP or PCI-E though, so you'd need the most modern PCI video card you could find.

Now for some blitting from the back buffer.

Reply 5 of 16, by emosun

User metadata
Rank Oldbie
Ozzuneoj wrote:

I used to swear by passmark but I have been seeing lots of situations lately where it is quite far off from real world performance.

I haven't had issues with it in any version.

Ozzuneoj wrote:

I doubt it has AGP or PCI-E though, so you'd need the most modern PCI video card you could find.

I can just add a PCI-E slot and bus to the machine if needed.

Reply 6 of 16, by derSammler

User metadata
Rank l33t

Keep in mind that HT does not really give you more processing power. It just uses idle/waiting time to work on a second task, making the CPU more efficient (simply speaking). If a piece of software, like a benchmark tool, already maxes out the CPU, the overhead of HT can indeed hurt performance a little.

Reply 8 of 16, by noshutdown

User metadata
Rank Oldbie

HT may be slower if the benchmark doesn't use all your logical processors.
Say your rig has 4-way Xeon MPs with HT disabled, and assume a benchmark supports 4 threads max: all 4 cores run at full load and everything is right.
With HT enabled, each physical core is presented as 2 logical cores, so you have 8 logical cores in total. Since the benchmark uses only 4 of them, its threads may well land on 4 logical cores that are actually provided by just 2 physical cores, so only 2 of your 4 physical cores run at full load.
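To test for that, you could pin one thread per physical core yourself and see if the 4-thread score recovers. Here's a minimal Win32 sketch; the assumption that logical CPUs 2n and 2n+1 are HT siblings of one physical core is flagged in the comments, since the real enumeration varies by BIOS:

#include <windows.h>
#include <stdio.h>

#define NUM_WORKERS 4

DWORD WINAPI worker(LPVOID arg)
{
    volatile unsigned long acc = 0;
    unsigned long i;
    (void)arg;
    for (i = 0; i < 100000000UL; i++)  /* stand-in for real work */
        acc += i;
    return 0;
}

int main(void)
{
    HANDLE threads[NUM_WORKERS];
    int i;

    for (i = 0; i < NUM_WORKERS; i++) {
        threads[i] = CreateThread(NULL, 0, worker, NULL,
                                  CREATE_SUSPENDED, NULL);
        /* ASSUMPTION: logical CPUs 2n and 2n+1 are HT siblings of one
           physical core. Verify the mapping on your board before
           relying on this. */
        SetThreadAffinityMask(threads[i], (DWORD_PTR)1 << (2 * i));
        ResumeThread(threads[i]);
    }
    WaitForMultipleObjects(NUM_WORKERS, threads, TRUE, INFINITE);
    for (i = 0; i < NUM_WORKERS; i++)
        CloseHandle(threads[i]);
    return 0;
}

If the pinned threads score like the HT-disabled run, the scheduler placement was the problem rather than HT itself.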

Reply 9 of 16, by dexvx

User metadata
Rank Oldbie

In most scenarios with early Xeon MPs, hyperthreading is a slight performance penalty. It also heavily depends on the amount of L2/L3 cache the CPU has. I see from a wiki that a Xeon MP 2.0GHz is Gallatin with 512K L2 and 1M or 2M L3. Since Intel's cache hierarchy is not mutually exclusive (the data in L2 is duplicated in L3), the jump from 1M to 2M L3 is actually a 3x increase in cache availability (512K to 1.5M).
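To spell out that arithmetic (assuming a strictly inclusive hierarchy, where the L3 always holds a copy of everything in the L2):

effective unique cache = L3 size - L2 size
1 MB L3:  1024 KB - 512 KB =  512 KB
2 MB L3:  2048 KB - 512 KB = 1536 KB  ->  1536 / 512 = 3x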

Most of the cache changes (double L1, large slower L2) and pipeline changes in Prescott were to address Northwood HT issues (along with some other proprietary branch prediction improvements and the core HT itself).

Edit: The 400 MHz FSB shared between 4 CPUs didn't help the situation either; Xeon MPs were quite literally starved. It's amazing it took years to move to a star topology, despite the fact that it had been a huge problem since the 4-way Pentium III Xeon. Probably some idiot PE who thought his designs on paper were the best, and subsequently ignored real-world, practical data.

Reply 10 of 16, by Bobolaf

User metadata
Rank Newbie

A big problem you get is with more cores, be they logical or physical. This video shows quite well how more cores can equate to less performance: https://www.youtube.com/watch?v=PVl8Eupbr_E In general, the more physical cores you have, the less likely it is that hyperthreading will help. Like other people have said, the specific program you're using will have a massive effect. I used to do SETI@home (the original one, that is) and hyperthreading made a massive difference. If I recall correctly it was something like a 50% gain in productivity, but more often than not it was only single-digit % improvements for multi-core optimised programs. This is a test on Quake 3, for example: http://img.clubic.com/00055342-photo-p4ht-quake-3.jpg


Reply 11 of 16, by Scali

User metadata
Rank l33t

I don't think using synthetic tests like PassMark would give a very accurate representation of what HyperThreading can do for you in real applications.
As long as you use a HyperThreading-aware OS (that is, Windows XP or newer), HT can range from a small performance hit in the worst case to 20-30% gains in good cases, and IBM has even reported 50+% gains in extreme cases.
You'll have to measure the applications you actually want to run on the system, to decide whether or not it's useful to turn HT on.
I personally leave it on at all times, because for me, in most cases it gives small gains. Besides, HT is a pretty cool technology, especially those early versions, pretty much the reason to play with old P4/Xeon CPUs 😀

As for dividing caches and everything... Not really.
What happens is that enabling HT divides the out-of-order buffer of each core into two buffers, one for each thread. Since the P4 had a very large OOO-window anyway, even half the buffer is generally good enough to get the same single-core performance.

As for cache: it is set-associative, so it doesn't really 'belong' to any thread.
If you are running two completely independent threads, then yes, effectively the cache will be 'split'... how the split goes depends on the associativity: if one of the threads only ever accesses the same 4K block, for example, then at most that 4K's worth of sets goes to that thread, and the rest of the cache can be used by the other thread. This may lead to worst-case scenarios though, since the two threads may be competing for cache.
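To make the set-mapping concrete, here's a tiny sketch with made-up cache geometry (these are NOT the exact Xeon MP parameters, just round numbers for illustration):

#include <stdio.h>

#define LINE_SIZE 64u    /* bytes per cache line (assumed) */
#define NUM_SETS  512u   /* e.g. 256 KB, 8-way: 256K / (64 * 8) = 512 sets */

/* Which set an address lands in: sets are selected by address bits,
   not by which thread issued the access. */
unsigned set_index(unsigned long addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    /* A thread looping over one 4 KB block touches only 4096/64 = 64
       distinct sets; the other 448 sets stay usable by the sibling. */
    unsigned long base = 0x10000UL;
    printf("first line of block -> set %u\n", set_index(base));
    printf("last line of block  -> set %u\n", set_index(base + 4096 - 64));
    return 0;
}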

If you are running a parallelized algorithm, where two threads are both working on related data, then the cache behaves like a shared cache of the two cores. This can lead to a best-case scenario, since any data that is cached by one thread is implicitly available to the other thread as well.
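A trivial sketch of that best case (illustrative code, not taken from the demo below): two threads walk the same array, one taking the minimum and one the maximum, so whichever thread pulls a line into the shared cache first has effectively prefetched it for its sibling.

#include <windows.h>
#include <float.h>
#include <stdio.h>

#define N (1 << 20)

static float field[N];
static float result_min, result_max;

DWORD WINAPI find_min(LPVOID arg)
{
    float m = FLT_MAX;
    int i;
    (void)arg;
    for (i = 0; i < N; i++)
        if (field[i] < m) m = field[i];
    result_min = m;
    return 0;
}

DWORD WINAPI find_max(LPVOID arg)
{
    float m = -FLT_MAX;
    int i;
    (void)arg;
    for (i = 0; i < N; i++)
        if (field[i] > m) m = field[i];
    result_max = m;
    return 0;
}

int main(void)
{
    HANDLE t[2];
    int i;
    for (i = 0; i < N; i++)
        field[i] = (float)((i * 37) % 1000);  /* dummy scalar field */
    t[0] = CreateThread(NULL, 0, find_min, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, find_max, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    printf("min=%f max=%f\n", result_min, result_max);
    CloseHandle(t[0]);
    CloseHandle(t[1]);
    return 0;
}

On two separate physical cores each thread has to fetch the whole array through its own cache; on one HT core with a shared cache, most of the second thread's fetches can hit.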

Here's an old demo I made: https://www.dropbox.com/s/71aavpktpw1e1vh/Prosaic.zip?dl=0
I originally optimized the Marching Cubes algorithm it uses for the blobs for the Core2 Duo and its shared L2-cache.
But I found that when I ran it on a single-core P4 with HT enabled, it would run about 25-30% faster than with HT disabled.
You could try, it may show similar gains on your system.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 12 of 16, by dexvx

User metadata
Rank Oldbie

I've contributed to performance white papers, and suffice it to say, unless your workload is an exact duplicate, you should take them with extreme prejudice. By the time a white paper is published publicly, most meaningful engineering data has already been scrubbed clean (unless it conformed to marketing's lofty goals). However, the configuration options and methodology are useful for an end user re-running the tests with their own workloads.

That said, I'm trying to find early Netburst era (e.g. pre-Prescott) whitepapers and online reviews, and unfortunately, seems like a lot of them are gone or the images/graphs are missing.

I did come across this IBM whitepaper:

https://www.ibm.com/developerworks/library/l-htl/index.html

Immediate red flag: they are using a Xeon MP in a single-CPU configuration, which is ridiculously far from real-world (end users do not pay a massive premium for a 4S system to use 1 CPU). Based on my experience with marketing magic, the original engineering team probably tested 1-CPU, 2-CPU, and 4-CPU configurations. The 2-CPU and 4-CPU results likely did not make the marketing cutoff, so a decision was made to only publish the 1-CPU Xeon MP results. If the 2-CPU and 4-CPU data were as good as the 1-CPU data, there would be no reason to withhold them, as they would presumably help sales and drive ASP.

Also, the reason I brought up cache as a huge problem with early Netburst (which had small amounts of L1/L2) is that even though in an SMT scenario you don't snoop the cache of the same logical core, you can still run into cache contention. The cache itself is not SMT-aware, so you may run into cases where one thread cache-starves the other, which then just limps on to completion. The absolute worst case is when the two threads reference completely different memory.

Now moving on to early Xeon MP-specific issues: their biggest problem was bus constraints (all 4 CPUs shared the FSB to the MCH) and cache coherency. Cache matters more in an MP scenario than in SP, as you may need to keep up to 4 processors coherent. In fact, as I recall, a significant amount of FSB bandwidth was wasted just on snooping for the early Xeons, which in turn leaves less than the desired amount for memory access. Generally, the worst performers with HT enabled are memory-bandwidth-intensive applications. I don't think this was completely resolved until Intel went to NUMA on Nehalem.

Reply 13 of 16, by emosun

User metadata
Rank Oldbie
Scali wrote:

I don't think using synthetic tests like PassMark would give a very accurate representation of what HyperThreading can do for you in real applications.

Normally I'd agree, but PassMark is a very hyperthreading-compatible benchmark and scales just fine on other hyperthreading systems.

I'm more inclined to agree that they simply are not designed to all be pushed at once like they are here.

I also noticed that with every CPU added to the benchmark, the results diminish. Meaning if one CPU alone scores a 250, adding all 4 yields a 700. It's quite clear they are having to share or fight for resources when they're all taxed together.

Reply 14 of 16, by Scali

User metadata
Rank l33t
emosun wrote:

I also noticed that with every CPU added to the benchmark, the results diminish. Meaning if one CPU alone scores a 250, adding all 4 yields a 700. It's quite clear they are having to share or fight for resources when they're all taxed together.

That's what's known as Amdahl's law.
Scaling is never quite 100%, and you can only get reasonably close to 100% if there are no bottlenecks whatsoever.
In any practical multi-CPU/multi-core system, you always have to share caches, memory controllers, disk I/O and such.
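As a rough worked example with the scores from this thread (assuming the PassMark score is proportional to throughput, and lumping all contention into the 'serial' part): one CPU scores 250 and four score 700, a speedup of 2.8x. Solving Amdahl's law for the parallel fraction p:

S(N) = 1 / ((1 - p) + p/N)
2.8  = 1 / ((1 - p) + p/4)
(1 - p) + p/4 = 1/2.8 ≈ 0.357
1 - 0.75p ≈ 0.357  =>  p ≈ 0.86

So by this model roughly 86% of the benchmark scales, and the other ~14% caps the total.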

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 15 of 16, by gdjacobs

User metadata
Rank l33t++

However, Amdahl's Law can generally be worked around if the working set is grown along with the parallel compute resources on hand. This is why all HPC machines these days are MPP architectures.
https://en.wikipedia.org/wiki/Gustafson%27s_law
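In formula form, with s being the serial fraction of the (scaled) workload on N processors:

S(N) = N - s(N - 1)

e.g. s = 0.1 on 4 CPUs gives S = 4 - 0.1 x 3 = 3.7; unlike Amdahl's fixed-size case, growing the problem along with N keeps s small.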

All hail the Great Capacitor Brand Finder

Reply 16 of 16, by Matth79

User metadata
Rank Oldbie

The worst possible result is where the scheduler is not HT-aware, so that on a quad with HT disabled a 4-thread workload runs one thread per core, but with HT enabled it may put two of those threads on the same core.

The gain from HT is normally greater on single- or dual-core CPUs.