Why is parallel compilation performance with HT worse than without?

Question

I measured how long it takes to compile Wine with HyperThreading enabled and disabled in the BIOS, on my Core i7 930 @2.8GHz (quad-core) running Linux 2.6.39 x86_64. Each measurement went like this:

git clean -xdf
./configure --prefix=/usr
time make -j$N

where N is a number from 1 to 8.

Here are the results ("speed" is 60/real from time(1)):

[Graph: "speed" versus number of make jobs N (1 to 8), one curve with HT disabled and one with HT enabled]

Here the blue line corresponds to HT disabled and purple one to HT enabled. It appears that when HT is enabled, using 1-4 threads is slower than without HT. I guess this might be related to the kernel not distributing the processes to different cores and reusing second threads of already busy cores.
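For anyone wanting to reproduce the measurement, here is a minimal Python sketch of the sweep described above. It assumes that "real" is the wall-clock time in seconds; the helper names (speed_from_real, time_command, sweep) and the optional prepare callback are hypothetical and not part of the original post. The caller is expected to restore a clean tree (git clean, ./configure) before each run, for example through prepare.

```python
import subprocess
import time


def speed_from_real(real_seconds):
    """The question's metric: speed = 60 / real (real assumed to be in seconds)."""
    if real_seconds <= 0:
        raise ValueError("real time must be positive")
    return 60.0 / real_seconds


def time_command(argv, cwd=None):
    """Run argv once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True)
    return time.perf_counter() - start


def sweep(make_argv, job_counts, cwd=None, prepare=None):
    """Time `make_argv + ['-jN']` for each N and return {N: speed}.

    `prepare` is an optional callable (hypothetical hook) invoked before
    each run, e.g. to re-run `git clean -xdf` and `./configure`.
    """
    results = {}
    for n in job_counts:
        if prepare is not None:
            prepare()
        real = time_command(list(make_argv) + ["-j%d" % n], cwd=cwd)
        results[n] = speed_from_real(real)
    return results
```

A call such as sweep(["make"], range(1, 9), cwd="wine") would cover N = 1 to 8; the "wine" directory name is only a placeholder.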

So, my question: how can I make the kernel prefer scheduling one process per core over placing additional processes on the second hardware thread of an already busy core? Or, if my reasoning is wrong, how can I get performance with HT that is no worse than without HT for 1-4 processes running in parallel?

score 3 · Answer 1 · answered Dec 15 '13 at 22:50

Hyper-threading on Intel chips is implemented as duplication of some of the elements of a physical core, but without enough electronics to be an independent core (e.g. they may share an instruction decoder, but I can't recall the specifics of Intel's implementation).

Imagine a physical core with HT as 1.5 physical cores that your OS sees as 2 real cores. This doesn't equate to 1.5x speed, though; the gain varies with the use case.

In your example, non-HT is faster up to 4 threads because none of the cores are sharing work with their HT pipeline. You see a flatline above 4 threads because now you only have 4 execution threads and you get a little extra overhead context switching between threads.

In the HT example you are a bit slower up to 4 threads, probably because some of those threads are being assigned to a real core and its HT sibling, so you are losing performance as those two execution threads share physical resources. Above 4 threads you are seeing the benefit of the extra execution threads, but you also see the beginning of diminishing returns.
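The reasoning in the last two paragraphs can be put into a toy model. This is only a sketch under stated assumptions, not a description of how the hardware or the scheduler actually behaves: a core running one thread delivers 1.0 unit of throughput, a core running two HT siblings delivers ht_gain units in total, and the scheduler either spreads jobs over idle cores first or packs them in pairs. The default ht_gain of 1.25 is an assumption loosely based on the comment in this thread that HT effectively adds about one core to a quad-core; the function name and parameters are hypothetical.

```python
def throughput(jobs, cores=4, ht=True, ht_gain=1.25, spread=True):
    """Relative throughput of `jobs` CPU-bound jobs in a toy HT model.

    ht_gain is the assumed combined throughput of two sibling threads
    sharing one core (1.0 would mean HT gives no benefit at all).
    """
    if not ht:
        # Without HT, extra jobs beyond the core count just time-share.
        return float(min(jobs, cores))
    jobs = min(jobs, 2 * cores)
    if spread:
        # Ideal scheduler: fill every idle core before doubling up.
        doubled = max(0, jobs - cores)
        single = min(jobs, cores) - doubled
    else:
        # Worst case: jobs are packed in pairs onto sibling threads.
        doubled = jobs // 2
        single = jobs % 2
    return single * 1.0 + doubled * ht_gain
```

In this model, four jobs on an HT machine reach 4.0 when spread across cores but only 2.5 when packed onto two cores, which is the kind of gap that imperfect placement would produce for 1-4 jobs.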

You could probably match performance in both cases for up to 4 threads, but likely not with a compilation job: too many processes are being spawned for processor affinity to be set up for each of them, I think. If you instead ran a real parallel job using OpenMP or MPI with X<=4 threads bound to specific real CPU cores, I think you'd see similar performance between HT-off and HT-on.
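As a rough illustration of binding work to real cores, the following Python sketch picks one logical CPU per physical core from the Linux sysfs topology files and restricts the current process to them. It is a sketch, assuming a Linux system that exposes /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; the helper names are hypothetical. Because the affinity mask is inherited by child processes, a build launched from the pinned process would stay on those CPUs, which sidesteps setting affinity per spawned compiler process. Whether this actually closes the gap for a compilation job is not something this thread establishes.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of HT siblings."""
    return sorted({min(parse_cpu_list(s)) for s in sibling_lists if s.strip()})


def read_sibling_lists():
    """Read every CPU's thread_siblings_list from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


def pin_to_physical_cores():
    """Restrict this process (and future children) to one CPU per core."""
    cpus = one_cpu_per_core(read_sibling_lists())
    if cpus and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
    return cpus
```

After calling pin_to_physical_cores(), running the build with subprocess from the same script keeps it on the selected CPUs. The sibling numbering (for example CPU 0 paired with CPU 4) differs between machines, so it is read from sysfs rather than assumed.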

The other thing to add to this is that HyperThreading allows multiple threads per core and likely causes cache contention issues. Likely other threads (possibly a LOT if you're eating up all the cores and have a lot of background processes, like if you have X Server going) get scheduled on the same physical core and evict a ton of the compiler's cache lines. (Otherwise a very nice explanation). — CrazyCasta, Nov 09 '19 at 06:11

score 0 · Answer 2 · answered Dec 15 '13 at 20:44

Given a number of threads <= the number of real cores, using HT should be slower because (considered crudely) you are potentially cutting the speed of your cores in half.¹

Keep in mind that generally more cores is NOT better than FASTER cores. In fact, the only reason so much work was put into developing multi-core systems is that it became increasingly difficult to make faster and faster ones. So if you cannot have a 20 GHz processor, then 8 x 3 GHz ones will have to do.

HT is, I believe, primarily intended as an advantage in contexts where each thread is not necessarily gobbling as much processor as it can; it's doing some particular task that's governed by interaction with a user, such as CAD stuff, video games, etc; these are the kind of applications that benefit from multi-tasking. By contrast, server platforms -- wherein the primary applications tend to thread independent tasks that are not governed by a dependence on anything else, hence are optimally run as fast as possible -- do not benefit directly from multi-tasking; they benefit from speed. make is in the same category, although with a perhaps greater degree of interdependence between threads, which is why you see an advantage for HT from 4-8 threads.

¹ This is a simplification. HT doesn't simply double the number of cores and halve their speed, but whatever dynamic is used, the total number of processor cycles per second for the system is not improved. It's the same -- only more fragmented.

CodeClown42


Well, you seem to be trying to say that HT doesn't speed up anything. But this clearly is untrue by definition of this technology, and also contradicts the observation (see graph for threads>4, compare two curves). From my measurements it _effectively_ adds one more core although physically there are only 4 present - in cases when all 8 threads are busy working. – Ruslan Dec 15 '13 at 20:53
You're right, I interpreted the graph backward, lol, I'll edit this. But: the graph still demonstrates my general point, which is that hyper-threading doesn't -- *can't* -- increase the total number of processor cycles available. It obviously does scale dynamically, such that if you run 4 threads on a quad-core w/ HT, those 4 threads are, *ideally, more or less the same* as 4 threads w/o HT. The "ideally, more or less" is what makes the difference -- the *ideal* in that situation is 4 fast cores. You have 4 fast cores without HT, enabling it can't make it better, but could make it worse. – CodeClown42 Dec 15 '13 at 21:11
The graph shows, for hyper-threading and for this task, an approximately linear speed increase as the number of threads increases up to 4, then a lesser increase after that (but still an increase). I did not expect this either, but look at the data. – ctrl-alt-delor Dec 15 '13 at 22:21
Sorry but you don't understand HT at all. Using HT with <= number of real cores has little effect because unless the kernel is brain-dead it's going to do its best to distribute the heavy workload across cores. The slowdown is likely some combination of the kernel not quite getting it right and cache contention. Your concept of cycles/sec is nonsense, it hasn't existed on the x86 platform since the Pentium Pro and in general since the 60s. Practically all modern processors execute multiple instructions in parallel (ILP on 1 thread, not cores). HT is basically just two threads using ILP. – CrazyCasta Nov 09 '19 at 06:08
