
*** UPDATE: changing code to a real case that reproduces the problem ***

I'm working in some preexisting code that mixes several multi-threading techniques: std::thread, Intel TBB's tbb::task_group, and OpenMP. It looks like I've hit a race condition in std::thread::join(), and potentially one in OpenMP as well. (But of course those libraries were written by smart people, so if the bug is actually in the code I'm working with, I hope you can help me figure it out.)

The scenario is that the main thread kicks off a bunch of I/O worker std::threads, which themselves initiate some tasks, and the tasks have some segments of code that use OpenMP for parallelism. The main thread calls std::thread::join() to wait for the std::threads, then tbb::task_group::wait() to wait for the tasks to complete.

#include <Windows.h>
#include <tbb/task_group.h>
#include <tbb/concurrent_vector.h>
#include <iostream>
#include <memory>   // for std::shared_ptr
#include <sstream>
#include <thread>

void DoCPUIntensiveWork(int chunkIndex);

int main ()
{
    unsigned int hardwareConcurrency = 64;
    tbb::concurrent_vector<std::shared_ptr<std::thread>> ioThreads;
    tbb::task_group taskGroup;

    wprintf(L"Starting %u IO threads\n", hardwareConcurrency);
    for (unsigned int cx = 0; cx < hardwareConcurrency; ++cx)
    {
        ioThreads.push_back(std::shared_ptr<std::thread>(new std::thread([&taskGroup, cx]
            {
                wprintf(L"IO thread %u starting\r\n", GetCurrentThreadId());
            
                // Not doing any actual IO

                taskGroup.run([cx]
                    {
                        wprintf(L"CPU task %u starting\r\n", GetCurrentThreadId());
                        DoCPUIntensiveWork(cx);
                        wprintf(L"CPU task %u done\r\n", GetCurrentThreadId());
                    });

                //Sleep(1000);  Un-commenting this will make the program terminate
                wprintf(L"IO thread %u done\r\n", GetCurrentThreadId());
            })));
    }

    // Join the IO workers
    for (std::shared_ptr<std::thread>& thread : ioThreads)
    {
        std::stringstream ss;
        ss << thread->get_id();
        wprintf(L"Wait for thread %S\r\n", ss.str().c_str());
        thread->join();  // main thread hangs here
    }

    wprintf(L"IO work complete\n");

    // And then wait for the CPU tasks
    taskGroup.wait();

    wprintf(L"CPU work complete\n");

    return 0;
}

And the CPU-intensive work includes usage of OpenMP. (Note: the result is the same if I remove the schedule(static).)

// Note: I shrunk these numbers down until the amount of work is actually
// small, not CPU-intensive at all, and it still hangs
static const int GlobalBufferChunkSize = 64;
static const int NumGlobalBufferChunks = 64;
static const int StrideSize = 16;
static const int OverwriteCount = 4;
BYTE GlobalBuffer[NumGlobalBufferChunks * GlobalBufferChunkSize];

void DoCPUIntensiveWork(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);

#pragma omp parallel for schedule(static)
    for (int i = 0; i < (GlobalBufferChunkSize / StrideSize); i++)
    {
        BYTE* pStride = pChunk + (i * StrideSize);
        for (int j = 0; j < OverwriteCount; j++)
        {
            memset(pStride, i, StrideSize);
        }
    }  // Task thread hangs here
}

This code hangs: the main thread waits on thread->join() forever, even in a test case with only a single IO job / CPU-intensive task. I added the wprintf calls you see above, and the output shows that the IO job finished quickly, its thread exited, and then the CPU-intensive task spun up with the same thread ID before the main thread even got into the join() call.

Starting 64 IO threads
IO thread 3708 starting
IO thread 23728 starting
IO thread 23352 starting
IO thread 3588 starting
IO thread 3708 done
IO thread 23352 done
IO thread 22080 starting
IO thread 23728 done
IO thread 3376 starting
IO thread 3588 done
IO thread 27436 starting
IO thread 10092 starting
IO thread 22080 done
IO thread 10480 starting
CPU task 3708 starting
IO thread 3376 done
IO thread 27436 done
IO thread 10092 done
IO thread 10480 done
Wait for thread 3708
... hang forever ...

The IO thread ID was reused for a task after the thread finished, and the thread->join() call is still sitting there waiting. When I look in the debugger, thread->join() is waiting on a thread with ID 3708, and a thread with that ID does exist, but that thread is executing the task instead of the IO work. So it appears the primary thread of the process is actually waiting for the task instead of waiting for the IO thread due to the ID reuse. (I can't find docs or code to see if std::thread::join() waits based on the ID or the handle, but it seems like it uses the ID, which would be a bug.)
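To illustrate what I mean (purely hypothetical; I haven't found the actual MSVC STL internals, so this is only a sketch of the failure mode with raw Win32 calls, not real STL code):

// Hypothetical illustration only: waiting by ID can attach to a recycled
// thread, while waiting by handle cannot.
#include <Windows.h>

void JoinById(DWORD threadId)
{
    // If the original thread already exited and the OS reused its ID,
    // OpenThread() resolves to the *new* thread carrying that ID...
    HANDLE h = OpenThread(SYNCHRONIZE, FALSE, threadId);
    if (h != nullptr)
    {
        WaitForSingleObject(h, INFINITE);  // ...so this waits on the wrong thread
        CloseHandle(h);
    }
}

void JoinByHandle(HANDLE threadHandle)
{
    // A handle stays bound to the thread it was created for, even if the
    // numeric ID is later reused, so this always waits on the right thread.
    WaitForSingleObject(threadHandle, INFINITE);
}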

Second funny thing: that task never completes. When I look at the thread that's executing the task in the debugger, it's sitting at the end of the OpenMP parallel region, and I don't see any other threads executing parallel work. There are a number of threads from vcomp140[d].dll sitting around in ntdll.dll code, for which I don't have symbols; I presume these are just waiting for new work, not doing my task. The CPU is at 0%, and I'm pretty confident nobody is looping. So the TBB task is hung somewhere in the OpenMP multi-threading implementation.

But, just to make life complicated, the task doesn't seem to hang UNLESS the thread ID from the IO thread happens to be reused for the task. So, somewhere between std::thread and TBB tasks and OpenMP parallelism there's a race condition triggered by thread ID reuse.

I have found two workarounds that make the hang go away:

  1. Put a Sleep(1000) at the end of the IO thread, so IO thread IDs aren't reused by the tasks. The bug is still there waiting for bad timing, of course.
  2. Remove the use of OpenMP parallelism.

A co-worker has suggested a third potential option, to replace OpenMP parallelism with TBB parallel_for. We may do that. Of course this is all layers of code from different sources that we want to touch as little as possible.
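For reference, a sketch of what that third option might look like (untested; the function name DoCPUIntensiveWorkTBB is mine, and this assumes the same globals as the repro above):

#include <tbb/parallel_for.h>
#include <cstring>

// Sketch of the co-worker's suggestion: the same strided memset loop,
// parallelized with tbb::parallel_for instead of an OpenMP parallel for.
void DoCPUIntensiveWorkTBB(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);

    tbb::parallel_for(0, GlobalBufferChunkSize / StrideSize, [pChunk](int i)
        {
            BYTE* pStride = pChunk + (i * StrideSize);
            for (int j = 0; j < OverwriteCount; j++)
            {
                memset(pStride, i, StrideSize);
            }
        });
}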

I'm reporting this more as a possible bug report than as a cry for help.

  • It seems like a bug that std::thread::join() can end up waiting for the wrong thread if a thread ID is reused. It should be waiting by handle, not by ID.
  • It seems like there's a bug or incompatibility between TBB tasks and OpenMP, such that the OpenMP main thread can hang if it's run on a TBB task that happens to have a thread ID that was reused.
Sue Loh
  • I'd assume a bug as well. One way to find out is to extract a [mcve] from the actual code. Post that in your question so that others can read it and reproduce the problem. – Ulrich Eckhardt Feb 24 '21 at 07:26
  • I was afraid that would get rid of the race condition, but you're right that I should at least try. – Sue Loh Feb 24 '21 at 16:53
  • Edited above to be a complete standalone exe that still reproduces the problem. You'd need to link with TBB to make it build. – Sue Loh Feb 24 '21 at 18:44
  • Is `tbb::task_group` somehow synchronized? You're accessing it from different threads without further synchronization. – Ulrich Eckhardt Feb 24 '21 at 19:09
  • Good question; unfortunately that's a question for Intel since they created TBB, but considering it's Threading Building Blocks I'd kind of expect it. I don't see it in the doc (https://www.threadingbuildingblocks.org/docs/help/reference/task_groups/task_group_cls.html). If that were the (only) problem, the Sleep(1000) wouldn't fix this, though. :-/ – Sue Loh Feb 24 '21 at 19:25
  • Just checked the docs as well and there's no indication that it is. However, your argument is wrong: Most types are not safe to access from different threads, like e.g. `shared_ptr`, `thread` or `unique_lock`. – Ulrich Eckhardt Feb 24 '21 at 19:27
  • Join............:(( – Martin James Feb 24 '21 at 19:30
  • Well you're right that the taskGroup.run() call is potentially problematic here. I don't see any other code inside the threads/tasks that looks unsafe. And that still doesn't explain why the Sleep() makes the program complete. – Sue Loh Feb 24 '21 at 20:28
  • Looks like TBB task_groups are thread safe (https://stackoverflow.com/questions/23845594/how-to-multithread-tail-call-recursion-using-tbb). Also I experimented by adding a mutex lock/unlock around the taskGroup.run() call and that didn't fix anything. – Sue Loh Feb 24 '21 at 21:19

2 Answers


UPDATE: the supposition about oversubscription was incorrect. See https://github.com/oneapi-src/oneTBB/issues/353

I think the issue might be caused by OpenMP semantics: by default, each OpenMP parallel region creates as many threads as the hardware concurrency.

TBB will create std::thread::hardware_concurrency() worker threads, and OpenMP will create std::thread::hardware_concurrency() threads for each TBB worker thread it is called from. I.e., in the example we can have up to std::thread::hardware_concurrency() * std::thread::hardware_concurrency() threads (plus the 64 IO threads). If the machine is relatively big, e.g. 32+ hardware threads, that is 32*32 = 1024 threads in the application overall, which is near a typical default ulimit (though this is Windows). In any case, such big oversubscription, combined with the OpenMP barrier semantics at the end of a parallel region, can lead to a really long execution time (e.g. minutes or even hours).
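A quick way to see what those per-layer defaults multiply out to on a given machine (my sketch, assuming the OpenMP runtime is linked so <omp.h> is available):

#include <omp.h>
#include <thread>
#include <cstdio>

int main()
{
    // TBB sizes its default worker pool from the hardware concurrency.
    unsigned tbbWorkers = std::thread::hardware_concurrency();

    // Default OpenMP team size for a parallel region on this machine.
    int ompTeam = omp_get_max_threads();

    printf("TBB workers: %u, OpenMP team size: %d, worst case: %u threads\n",
        tbbWorkers, ompTeam, tbbWorkers * static_cast<unsigned>(ompTeam));
    return 0;
}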

Why does Sleep(1000) help? I am not sure, but it might give the system some CPU resources to move forward.

To check this idea, add a num_threads(1) clause, i.e. #pragma omp parallel for num_threads(1), to limit the number of threads created by the OpenMP runtime.
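Applied to the DoCPUIntensiveWork loop from the question, the change would look like this:

void DoCPUIntensiveWork(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);

// Cap the OpenMP team at one thread to test whether oversubscription is the cause
#pragma omp parallel for schedule(static) num_threads(1)
    for (int i = 0; i < (GlobalBufferChunkSize / StrideSize); i++)
    {
        BYTE* pStride = pChunk + (i * StrideSize);
        for (int j = 0; j < OverwriteCount; j++)
        {
            memset(pStride, i, StrideSize);
        }
    }
}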

Alex
  • Thank you for the reply! Hmm, I just tried with num_threads(1) and the program terminated. num_threads(2) hangs. But remember that this is actually NOT a big workload - my CPU is idle, not busy. I also only have (GlobalBufferChunkSize=64) / (StrideSize=16) = 4 parallel strides of work to do with OpenMP, so I *think* there will only be 4 OpenMP threads per TBB thread anyway. I also just tried a second test with 4 IO threads, so 4 TBB tasks, with num_threads(2) so 4*2 = 8 OpenMP threads. It still hangs. It actually feels like there's a special case here where OpenMP deadlocks. – Sue Loh Feb 26 '21 at 07:29
  • I just edited the title to reflect the OpenMP tie - thanks again for the tip. – Sue Loh Feb 26 '21 at 07:35
  • As for `num_threads(2)`, it seems strange; can you share the output for that case? I am not sure if SO is suitable for this sort of conversation. Maybe it makes sense to reach us via the oneTBB GitHub (https://github.com/oneapi-src/oneTBB#engineering-team-contacts)? – Alex Feb 26 '21 at 07:57
  • Happy to move to TBB GitHub. Are you suggesting I e-mail the contact listed there, or open a new issue? Output for my latest case will look bad in a comment, but here goes: Starting 4 IO threads Wait for thread 10352 IO thread 10352 starting IO thread 26316 starting IO thread 25128 starting IO thread 26316 done IO thread 10352 done IO thread 7216 starting CPU task 26316 starting IO thread 25128 done IO thread 7216 done CPU task 23080 starting CPU task 12684 starting – Sue Loh Feb 26 '21 at 08:25
  • _Are you suggesting I e-mail the contact listed there, or open a new issue?_ -- up to you – Alex Feb 26 '21 at 14:42
  • OK, moved to https://github.com/oneapi-src/oneTBB/issues/353. – Sue Loh Feb 26 '21 at 18:55

Moving this issue to the TBB GitHub, at https://github.com/oneapi-src/oneTBB/issues/353

Sue Loh