3

I have a native multithreaded Win32 application written in C++ with about 3 relatively busy threads and 4 to 6 threads that don't do much. When it runs in normal mode, total CPU usage adds up to about 15% on an 8-core machine and the application finishes in about 30 seconds. When I restrict the application to a single core by setting the affinity mask to 0x01, it completes faster, in about 23 seconds.
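
For reference, the restriction is just a process-wide affinity mask set at startup, roughly like this (a minimal sketch; error handling mostly omitted):

```cpp
#include <windows.h>
#include <iostream>

int main()
{
    // Bit N of the mask allows the process's threads to run on logical CPU N,
    // so 0x01 confines every thread to logical CPU 0.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x01))
        std::cerr << "SetProcessAffinityMask failed: " << GetLastError() << '\n';

    // ... run the normal workload here ...
    return 0;
}
```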

I'm guessing it has something to do with the synchronization being cheaper when restricted to one physical core and/or some concurrent memory access issues.

I'm running Windows 7 x64; the application is 32-bit. The CPU is a Xeon X5570 with 4 cores and hyper-threading enabled.

Could anyone explain that behavior in detail? Why does it happen, and how can I predict that kind of behavior ahead of time?

Update: I guess my question wasn't very clear. I would like to know why the application gets faster on one physical core, not why it doesn't get above 15% CPU usage on multiple cores.

rationalboss
detunized
  • Maybe the cores were using separate storage areas for the same variables; when you got down to a single core, it did not need to merge the separate areas back into the original. – huseyin tugrul buyukisik Sep 07 '12 at 15:10
  • Probably L1 and L2 cache paying dividends on the same core and your data set being reasonably small to fit in there. Maybe someone more experienced in performance can second that. – Jon Sep 07 '12 at 15:13
  • Possibly much less lock contention when you force all the threads to run on the same core. If your three threads are all busy and contention-free on an 8-core machine, you should see something like 37% CPU usage. The fact that you're seeing 15% suggests either that they aren't really busy or that they keep getting stuck waiting at a lock. – Adrian McCarthy Sep 07 '12 at 16:10
  • Adrian, I should only see 37% if there's enough work for all these threads. One of the threads is a UI thread and doesn't always have work to do. Another one only starts doing something when the results from the third thread are ready. I'd still expect more than 15%, but that's not what I was interested in. I wanted to find out why it becomes faster, not why it doesn't go above 15%. – detunized Sep 07 '12 at 19:37
  • There's a good chance that the reason CPU isn't going above 15% in the multi-core scenario is also the reason it's running slower in that scenario, and it may be easier to debug. I'm no expert, but I don't think memory concurrency is enough to explain that much of a difference, and the synchronization primitives (if used properly) are no more efficient when confined to a single core, so I think you've got some sort of bug or design fault; for example, the less busy threads could be hogging a lock, or you could be unnecessarily waking multiple threads at once. – Harry Johnston Sep 10 '12 at 03:22
  • I don't think it's possible to be much more specific without seeing the code, or at least knowing significantly more detail about the nature of the workflow. – Harry Johnston Sep 10 '12 at 03:22
  • You're either blocking a lot on locks (which is expensive) or accessing nearby memory areas from different cores (which bounces cache lines). You need to post some code (at least the inner loops) or do some profiling. Use a sampling-based profiler, like Intel VTune, or perf on Linux. – cdleonard Sep 11 '12 at 12:11
  • Without code it's almost impossible to tell. It could be many things. – Tudor Sep 11 '12 at 12:18
  • You realize that "why is it faster on a single core" and "why is it slower on multiple cores" are logically equivalent questions, right? – Harry Johnston Sep 11 '12 at 22:02
  • Harry, yes, these questions are equivalent. But the answers I got (some of them are deleted now) have been exploring a different problem: why it doesn't go above 15% on multiple cores. The answer to that one I already know. – detunized Sep 12 '12 at 00:00
  • You say it is `an 8-core machine` but a single socket X5570 is only a 4-core machine, hyper-threading allows two threads to share a core so you can have 8 threads running on 4 cores simultaneously. Is the machine a single or dual socket system? – amdn Sep 12 '12 at 15:46
  • When you don't restrict the application to only one core, are you setting the affinity mask for each thread or letting the O/S scheduler do the assignment? – amdn Sep 12 '12 at 19:20
  • PuraVida, OS does everything. And, yes, I said `The CPU is Xeon X5570 with 4 cores and HT enabled`. – detunized Sep 13 '12 at 10:46

5 Answers

7

The question is extremely vague, so here are just some guesses based on typical threading problems.

An obvious candidate is lock contention: the threads fighting over a lock and in effect running serially instead of in parallel. You end up paying for the thread context switches while gaining no benefit. This is an easy problem to miss in C++; there's a lot of low-level locking going on in the CRT and the C++ standard library, both originally designed without any regard for threading.
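
A contrived sketch of that failure mode (not your code, just the shape of the problem): three "busy" threads that do nearly all of their work inside one shared lock, so they run one at a time no matter how many cores are available:

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_lock;      // one hot lock shared by every worker
long long g_total = 0;  // shared state the lock protects

void worker()
{
    for (int i = 0; i < 1000000; ++i)
    {
        // Nearly all of the loop body executes under the lock, so the
        // "parallel" threads actually take turns, plus context switches.
        std::lock_guard<std::mutex> hold(g_lock);
        g_total += i;
    }
}

int main()
{
    std::vector<std::thread> pool;
    for (int t = 0; t < 3; ++t)
        pool.emplace_back(worker);
    for (auto& th : pool)
        th.join();
    return 0;
}
```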

A problem that's common on CPU cores with a strong memory model, like x86 and x64, is "false sharing". It occurs when multiple threads update memory locations that lie within the same L1 cache line. The processor then spends a lot of horsepower keeping the per-core caches synchronized.
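
A minimal sketch of false sharing with hypothetical counters: both fields sit in one 64-byte cache line, so two threads hammering them from different cores keep invalidating each other. Padding each field onto its own line (the Padded variant) typically removes the penalty:

```cpp
#include <atomic>
#include <thread>

// Both counters sit in the same 64-byte cache line, so two cores
// updating them independently keep stealing the line from each other.
struct Shared
{
    std::atomic<long long> a{0};
    std::atomic<long long> b{0};
};

// The usual fix: force each counter onto its own cache line.
struct Padded
{
    alignas(64) std::atomic<long long> a{0};
    alignas(64) std::atomic<long long> b{0};
};

int main()
{
    Shared s;  // swap in Padded and compare the wall-clock time
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i)
                             s.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i)
                             s.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return 0;
}
```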

You only gain a benefit from multiple execution cores if the program is actually execution bound. You cannot get a benefit if it is memory bound. Your machine still has only one memory bus, and it's a strong bottleneck if the data you manipulate cannot fit in the CPU caches. The cores simply stall, waiting for the bus to catch up. The stalls are still counted as CPU time, so they won't be visible in CPU usage statistics, but little real work is getting done.
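
A rough illustration of a memory-bound workload (buffer sizes are arbitrary, chosen only to exceed any L3 cache): each thread merely streams through a large array, so adding threads mostly adds bus traffic rather than throughput:

```cpp
#include <thread>
#include <vector>

// Summing a buffer far larger than the caches: the cores spend most of
// their time stalled on the memory bus, not executing instructions.
long long sum(const std::vector<int>& data)
{
    long long s = 0;
    for (int v : data)
        s += v;
    return s;
}

int main()
{
    const size_t big = 16 * 1024 * 1024;  // 64 MB of ints each, well past L3
    std::vector<int> a(big, 1), b(big, 1);
    long long ra = 0, rb = 0;

    // Two threads, two buffers, no sharing: still limited by one memory bus.
    std::thread t1([&] { ra = sum(a); });
    std::thread t2([&] { rb = sum(b); });
    t1.join();
    t2.join();
    return static_cast<int>((ra + rb) & 0x7f);
}
```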

Clearly you'll need a good profiler to chase this kind of problem.

Hans Passant
2

Without knowing more about the application, it is difficult to guess what is causing it to run slowly. For a detailed analysis, consider the following factors:

  • Inter-processor communication: How much do the threads in your application communicate with each other? If they communicate very often, you will incur overhead from that traffic.

  • Processor cache architecture: This is another important factor. You should know how the processor's caches are affected by threads running on different cores, and how much thrashing will happen in the shared caches.

  • Page faults: Maybe running on a single processor causes fewer page faults because of the sequential nature of your program?

  • Locks: Are there lock overheads in your code? On their own these should not cause a slowdown, but added to the factors above they might contribute some overhead.

  • NoC on the processor: If you allocate different threads to different processor cores and they communicate, you need to know what path the traffic takes. Is there a dedicated connection between them? Perhaps you should have a look at this link. (A sketch for experimenting with thread placement follows this list.)

  • Processor load: Last but not least, I hope you don't have other tasks running on the other cores, causing lots of context switches. A context switch is typically very expensive.

  • Temperature: One more effect to consider is the processor clock slowing down if a CPU core heats up. I don't think you'll see this effect here, but it also depends heavily on the ambient temperature.
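
If you want to experiment with placement for the factors above, here is a minimal Win32 sketch for pinning individual threads (the CPU index is just an example):

```cpp
#include <windows.h>
#include <iostream>

// Pin the calling thread to a single logical CPU so you can A/B test
// how thread placement changes communication and cache behavior.
bool PinCurrentThreadTo(DWORD cpuIndex)
{
    DWORD_PTR mask = DWORD_PTR(1) << cpuIndex;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}

int main()
{
    if (!PinCurrentThreadTo(2))  // hypothetical choice: logical CPU 2
        std::cerr << "Pinning failed: " << GetLastError() << '\n';
    // ... thread workload ...
    return 0;
}
```

Pinning two communicating threads first to hyper-threaded siblings and then to different physical cores makes the communication-path cost directly measurable.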

Raj
  • Raj, that's a lot of possibilities. I don't know which one (or a combination) of them causes the effect, but it's not very important for me in this case. Thanks for ideas. – detunized Sep 17 '12 at 17:30
2

It's almost certainly to do with caching, given the huge effect memory latency has on performance.

By staying on a single core, the first- and second-level caches are kept particularly hot, much more so than when you spread the work over multiple cores.

The third-level cache is shared between all cores, so it won't behave any differently, but it is of course a lot slower than L1 and L2, so you gain a lot by moving locality into the first- and second-level caches.
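
To see what the hierarchy looks like on a particular machine, a small sketch (assuming a Windows build, since the question targets Win32) can dump each cache level, its size, and which logical processors share it:

```cpp
#include <windows.h>
#include <iostream>
#include <vector>

int main()
{
    // First call with a null buffer just reports the required size.
    DWORD bytes = 0;
    GetLogicalProcessorInformation(nullptr, &bytes);

    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &bytes))
        return 1;

    for (const auto& e : info)
        if (e.Relationship == RelationCache)
            std::cout << "L" << int(e.Cache.Level) << " cache, "
                      << e.Cache.Size / 1024 << " KB, shared by mask 0x"
                      << std::hex << e.ProcessorMask << std::dec << '\n';
    return 0;
}
```

On an X5570 you would expect per-core L1/L2 entries and a single large L3 shared by all cores.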

2

"When it runs in a normal mode total CPU usage adds up to about 15% on an 8-core machine"

The mere 15% usage suggests another possible explanation: don't your threads do I/O? My guess is that I/O operations determine the overall run time of your application, not the CPU work. In most cases, I/O-intensive apps become slower when the I/O jobs are multithreaded (just think about copying two files at the same time versus one after the other).
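
A crude experiment to test this theory (the file names below are placeholders; you would need two large files on the same disk): time the copies sequentially and then concurrently. On a single spinning disk the concurrent run is often slower because the head seeks back and forth between the two streams:

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <thread>

void copyFile(const char* src, const char* dst)
{
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    out << in.rdbuf();
}

int main()
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;

    auto t0 = clock::now();
    copyFile("big1.bin", "copy1.bin");  // placeholder file names
    copyFile("big2.bin", "copy2.bin");
    auto sequential = std::chrono::duration_cast<ms>(clock::now() - t0);

    t0 = clock::now();
    std::thread a(copyFile, "big1.bin", "copy1a.bin");
    std::thread b(copyFile, "big2.bin", "copy2a.bin");
    a.join();
    b.join();
    auto concurrent = std::chrono::duration_cast<ms>(clock::now() - t0);

    std::cout << "sequential: " << sequential.count() << " ms\n"
              << "concurrent: " << concurrent.count() << " ms\n";
    return 0;
}
```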

peterfoldi
1

As far as the problem is concerned, the threads communicate with each other while running on multiple cores, which makes the process execute relatively slowly. Limiting the threads to a single physical core means that communication never has to cross cores, so the process speeds up.

This also depends on the tasks being performed: if the threads' resource demands are low, this may hold; otherwise limiting the process to one physical core may not be fruitful in all cases.
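
One way to put a rough number on cross-core communication (a sketch, not a rigorous benchmark): bounce a token between two threads through a shared atomic and time the round trips, once with both threads confined to one core and once spread across cores, e.g. with the affinity calls shown in the other answers:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Two threads bounce a token through one shared atomic. Every round
// trip forces the cache line holding 'token' to migrate between the
// cores the threads run on, so placement changes the timing directly.
std::atomic<int> token(0);

int main()
{
    const int rounds = 1000000;

    std::thread pong([&] {
        for (int i = 0; i < rounds; ++i) {
            while (token.load(std::memory_order_acquire) != 1) { }
            token.store(0, std::memory_order_release);
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        token.store(1, std::memory_order_release);
        while (token.load(std::memory_order_acquire) != 0) { }
    }
    auto dt = std::chrono::steady_clock::now() - t0;
    pong.join();

    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count() / rounds
              << " ns per round trip\n";
    return 0;
}
```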

Nat Ritmeyer