
I've distilled my problem down to its bare essentials. Here is the first piece of example code:

#include <vector>
#include <math.h>
#include <thread>

std::vector<double> vec(10000);

void run(void) 
{
    // Repeat the parallel loop many times so the run time is measurable.
    for(int l = 0; l < 500000; l++) {

    #pragma omp parallel for
        for(int idx = 0; idx < vec.size(); idx++) {

            vec[idx] += cos(idx);
        }
    }
}

int main(void)
{
    // Empty OpenMP parallel region: it should be a no-op, yet it triggers
    // the slowdown described below.
    #pragma omp parallel
    {
    }

    std::thread threaded_call(&run);
    threaded_call.join();

    return 0;
}

Compile this as (on Ubuntu 20.04): g++ -fopenmp main.cpp -o main

EDIT: Version: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Running on a Ryzen 3700X (8 cores, 16 threads): run time ~43s, with all 16 logical cores reported in System Monitor at ~80%.

Next, take out the #pragma omp parallel directive, so the main function becomes:

int main(void)
{
    std::thread threaded_call(&run);
    threaded_call.join();

    return 0;
}

Now the run time is ~9s, with all 16 logical cores reported in System Monitor at 100%.

I've also compiled this using MSVC on Windows 10; there, CPU utilization is always ~100%, irrespective of whether the #pragma omp parallel directive is present. Yes, I am fully aware this line should do absolutely nothing, yet with g++ it causes the behaviour above, and only when run is called on a std::thread, not directly (see the sketch below). I experimented with various compilation flags (-O levels), but the problem remains. I suppose looking at the assembly code is the next step, but I can't see how this is anything but a bug in g++. Can anyone shed some light on this, please? It would be much appreciated.
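To be explicit about "not directly": keeping the bogus directive but calling run on the main thread, as in this sketch, does not show the slowdown:

int main(void)
{
    #pragma omp parallel
    {
    }

    run();    // called directly, not via std::thread: full speed

    return 0;
}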

Furthermore, calling omp_set_num_threads(1) in run, just before the loop, in order to check how long a single thread takes, gives a ~70s run time with only one thread at 100% (as expected).
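For reference, that modification looks like this (omp_set_num_threads is declared in <omp.h>, which has to be included):

#include <omp.h>

void run(void)
{
    omp_set_num_threads(1);    // limit the team spawned by this thread to one

    for(int l = 0; l < 500000; l++) {

    #pragma omp parallel for
        for(int idx = 0; idx < vec.size(); idx++) {

            vec[idx] += cos(idx);
        }
    }
}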

A further, possibly related problem (although this might be a lack of understanding on my part): calling omp_set_num_threads(1) in main (before threaded_call is constructed) does nothing when compiling with g++, i.e. all 16 threads still execute the for loop, irrespective of the bogus #pragma omp parallel directive. When compiling with MSVC, this causes only one thread to run, as expected. According to the documentation for omp_set_num_threads I thought this should be the correct behaviour, but not so with g++. Why not? Is this a further bug?

EDIT: I understand this last problem now (see Overriding OMP_NUM_THREADS from code - for real), but the original problem is still outstanding.
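In short (a minimal sketch based on that link): with g++/libgomp the call has to be made in the thread that actually encounters the parallel region, not in main:

#include <omp.h>

int main(void)
{
    omp_set_num_threads(1);        // g++/libgomp: affects only parallel regions
                                   // encountered by the main thread itself

    std::thread threaded_call([]() {
        omp_set_num_threads(1);    // this is what limits the team used in run()
        run();
    });
    threaded_call.join();

    return 0;
}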

qshn
  • You should add the g++ version that you are using as well – dreamcrash Nov 24 '20 at 16:01
  • Mixing OpenMP with any other threading paradigm such as POSIX threads or the C++ threading library is outside the scope of the OpenMP specification. It may work, or it may not work and/or result in strange effects. In your case, it is the latter. – Hristo Iliev Nov 24 '20 at 17:06
  • Calling `omp_set_num_threads(1)` in `main()` doesn't work since it only affects parallel regions encountered in the thread where the call was made. – Hristo Iliev Nov 24 '20 at 17:10
  • @Hristo Iliev I understand what you're saying, and would be happy to accept this as the answer (I've since modified my project to avoid the problem). However, it seems to me this is a really poor specification. OpenMP is very widely used for parallelism on shared-memory devices and even comes with g++, and std::thread is standard and also very widely used. It's not beyond reason that the two would be used together, e.g. dispatching one or more processes to run computations, each with a set number of OpenMP threads, is useful. – qshn Nov 24 '20 at 18:05
  • @Hristo Iliev Yes thank you on omp_set_num_threads, also found here (https://stackoverflow.com/questions/56361293/overriding-omp-num-threads-from-code-for-real) – qshn Nov 24 '20 at 18:09
  • It is not a poor specification. OpenMP is a complex threading technology that relies on many low-level optimisations and hence requires full control over the application threading. The different implementations are allowed to relax this and implement interoperability with other techniques, but this is not guaranteed to be portable. Mind that OpenMP targets a very wide class of devices and implementations, and requiring interoperability with C++ threading is quite restricting. – Hristo Iliev Nov 24 '20 at 18:13
  • If you need a justification from a higher authority, you can ask Michael Klemm (the author of the top answer to that other question), as he is the CEO of the OpenMP ARB (Architecture Review Board) :) – Hristo Iliev Nov 24 '20 at 18:23
  • @Hristo Iliev I guess the lesson here must be: never mix OpenMP and std::thread if you care about portable code. I suppose I can live with this, but it just brings up more questions. 1) Why would they bother making num-threads a thread-private ICV if OpenMP is not meant to be used with std::thread; why not just keep it global as with MSVC/OpenMP 2.0? 2) The above example uses a single thread which runs OpenMP code; there are no issues with clashes between different threads running OpenMP. 3) g++ should not be generating any extra/different code, since "#pragma omp parallel {}" does nothing. – qshn Nov 24 '20 at 19:27
  • 1) OpenMP has the concept of nested parallelism. When you call `omp_set_num_threads()` from a thread in a team executing a parallel region, it affects only the number of threads in nested parallel regions encountered by that same thread (if nested parallelism is enabled) and not that of its siblings. 2/3) The constructor of `libgomp` creates a thread pool rooted in the main process thread. If you spawn a thread not from within OpenMP runtime, the thread-local pointer to the thread pool doesn't get initialised and so the library creates yet another thread pool rooted in the new thread. – Hristo Iliev Nov 24 '20 at 23:09
  • The MSVC OpenMP runtime uses the thread pool implementation provided by the Win32 API. Before Vista, each process could have only a single thread pool - the default thread pool, and that runtime certainly predates (or targets Windows versions predating) Vista, which is probably the reason MSVC behaves like that. – Hristo Iliev Nov 24 '20 at 23:16
  • `#pragma omp parallel {}` actually does a lot in g++. It results in a call to `GOMP_parallel()`, which requires linking with `libgomp`. `libgomp` has a constructor (ELF shared library constructor, not a C++ constructor) that sets up a lot of things way before control has been passed to `main()`. Also, threads may linger in a busy state after the end of the parallel region - that helps them wake up faster in consecutive parallel regions. – Hristo Iliev Nov 24 '20 at 23:32

1 Answer


Thank you to Hristo Iliev for the useful comments; I now understand this, and would like to answer my own question in case it's of use to anyone having similar issues.

The problem is that if any OpenMP code is executed in the main program thread, the OpenMP runtime's state becomes "polluted": after the empty #pragma omp parallel region, the team of 16 OpenMP threads it created remains in a busy state. This degrades the performance of all OpenMP code run from std::thread threads, each of which spawns its own team of OpenMP threads that must then compete for the CPU. Since the main thread exists for the entire lifetime of the program, this performance issue persists for the entire program execution. Thus, if you use OpenMP together with std::thread, make sure absolutely no OpenMP code executes in the main program thread.

To demonstrate this, consider the following modified example code:

#include <vector>
#include <math.h>
#include <thread>
#include <chrono>

std::vector<double> vec(10000);

void run(void) 
{
    for(int l = 0; l < 500000; l++) {

    #pragma omp parallel for
        for(int idx = 0; idx < vec.size(); idx++) {

            vec[idx] += cos(idx);
        }
    }
}

void state(void)
{
    // Spawn (and immediately idle) a team of OpenMP threads in this thread.
#pragma omp parallel
    {
    }

    // Keep this thread alive for 5s so its OpenMP team lingers in a busy state.
    std::this_thread::sleep_for(std::chrono::milliseconds(5000));
}

int main(void)
{
    std::thread state_thread(&state);
    state_thread.detach();

    std::thread threaded_call(&run);
    threaded_call.join();

    return 0;
}

This code runs at 80% CPU utilization for the first 5 seconds, then at 100% CPU utilization for the remainder of the program. This is because the first std::thread spawns a team of 16 OpenMP threads which remain in a busy state, degrading the performance of the OpenMP code in the second std::thread. As soon as the first std::thread terminates (after the 5-second sleep), the performance of the second std::thread is no longer affected, since its team of 16 OpenMP threads no longer has to compete for CPU access with the first. When the offending code was in the main thread, the issue persisted until the end of the program.
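As Jim Cownie points out in the comments below, a possible mitigation (which I have not benchmarked) is to ask idle OpenMP threads to sleep in the kernel rather than spin, by setting the standard OMP_WAIT_POLICY environment variable before launching the program:

OMP_WAIT_POLICY=passive ./main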

qshn
  • Note that OpenMP 5.1 (which came out earlier this month, so won't yet be widely available) has a solution to this resource-sharing problem, in that it allows you to explicitly release OpenMP resources. See https://www.openmp.org/spec-html/5.1/openmpse36.html#x201-2340003.6 – Jim Cownie Nov 26 '20 at 10:53
  • You may also be able to reduce the impact of the OpenMP threads by asking them to sleep in the kernel sooner. That can be achieved by setting the environment variable OMP_WAIT_POLICY=passive (see https://www.openmp.org/spec-html/5.1/openmpse64.html#x330-5050006.7) – Jim Cownie Nov 26 '20 at 10:59