
Consider the following example where the individual jobs are independent (no synchronization needed between the threads):

#pragma omp parallel num_threads(N)
{
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < jobs; ++i)
    {
        ...
    }
}

If N = 4 and jobs = 3, I doubt there will be much of a performance hit from having the extra thread created and destroyed, but if N = 32 then I wonder about the impact of creating/destroying the unused threads. Is it something we should even worry about?

RyGuyinCA
  • Good question, but I doubt that any new threads are created by that code; I would guess the number is only for the benefit of the local scheduler, i.e. limiting it from the maximum. – Surt Oct 14 '15 at 15:58
  • Is there a reason why you wouldn't just do `#pragma omp parallel num_threads(std::min(N, jobs))`? If you're really worried about a performance hit, this seems the easiest way out to me. – NoseKnowsAll Oct 14 '15 at 16:09
  • The question is mostly to satisfy a curiosity about how OpenMP manages this. We have run some tests to compare the wall clock times, but I think maybe someone would be able to quote something from the OpenMP standard. – RyGuyinCA Oct 14 '15 at 16:51
  • I don't think the OpenMP standard says anything about how the threads are created, destroyed, or managed. But I can tell you what I have seen from experience with GCC and MSVC. I did this by looking at the list of threads attached to the process. – Z boson Oct 16 '15 at 09:31
  • The first time your code enters a parallel region, it creates a team of threads equal to the [number of threads you explicitly or implicitly tell it](http://stackoverflow.com/a/22816325/2542702). In the next parallel region you enter, if you tell it to use more threads it expands the pool, but if you tell it to use fewer threads it does not shrink the pool. So there is extra overhead to grow the pool, but none to shrink it; the additional threads just idle. – Z boson Oct 16 '15 at 09:33
  • So if you want to decrease the overhead, the first parallel region you enter should use the maximum number of threads you plan to use anywhere in your code. – Z boson Oct 16 '15 at 09:35
  • Just to be clear about your question: the title seems to be about the benefits of oversubscribing, but the body seems to be only about the overhead of creating/destroying many more threads than will be used in a parallel region. What exactly do you want to know? – Z boson Oct 16 '15 at 09:38
  • OK, that's great to know. The threads are created in a pool which is reused over the course of the program execution. Thanks. – RyGuyinCA Oct 19 '15 at 20:32
  • `OMP_WAIT_POLICY=active` could be worth considering in an answer. – Z boson May 23 '18 at 11:28

1 Answer


First of all, the most general way to express your code is:

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < jobs; ++i)
{
    ...
}

Assume that the implementation has a good default.

Before you go any further: measure. Sure, sometimes it can be necessary to help the implementation out, but don't do so blindly. Most of what follows is implementation dependent, so looking at the standard won't help you much.

If you still want to specify the number of threads manually, you might as well give it `std::min(N, jobs)`, so you never request more threads than there are jobs to hand out.

Here are some things to look out for that could influence the performance in your case:

  • Don't worry too much about the overhead of spawning unnecessary threads. Implementations mitigate it with thread pools. That doesn't mean the mitigation is always perfect, so measure.
  • Do not oversubscribe unless you know what you are doing. Use at most as many threads as you have cores. This is general advice.
  • OMP_WAIT_POLICY matters in your case, as it defines how waiting threads behave. Here, the excess threads will wait at the implicit barrier at the end of the parallel region. Implementations are free to do what they want with the setting, but you may assume that with active, threads use some form of busy waiting, and with passive, threads sleep. A busy-waiting thread can consume resources needed by the computing threads, e.g. power budget that could otherwise be used to raise the turbo frequency of the computing threads; it also wastes energy. Under oversubscription, the impact of active threads is much worse.
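
The wait policy is set through the environment, for example (the variable names are from the OpenMP spec; `./my_program` is a placeholder, and the precise effect of each setting is implementation defined):

```shell
# Ask idle threads to sleep rather than spin.
OMP_WAIT_POLICY=passive ./my_program

# Or keep them spinning for the lowest wake-up latency,
# at the cost of burning CPU time while they wait.
OMP_WAIT_POLICY=active ./my_program
```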
Zulan