*** UPDATE: changing code to a real case that reproduces the problem ***
I'm working in some preexisting code that uses a number of multi-threading techniques: std::thread, plus Intel TBB task_group, plus OpenMP. It looks like I've hit a race condition in std::thread::join(), and potentially one with OpenMP as well. (But of course those libraries were written by smart people, so if there's a bug in the code I'm working with, I hope you can help me figure it out.)
The scenario is that the main thread kicks off a bunch of I/O worker std::threads, which themselves initiate some tasks, and those tasks have some segments of code that use OpenMP for parallelism. The main thread calls std::thread::join() to wait for the std::threads, then tbb::task_group::wait() to wait for the tasks to complete.
#include <Windows.h>
#include <tbb/task_group.h>
#include <tbb/concurrent_vector.h>
#include <cstdio>
#include <iostream>
#include <memory>
#include <sstream>
#include <thread>

void DoCPUIntensiveWork(int chunkIndex);

int main()
{
    unsigned int hardwareConcurrency = 64;
    tbb::concurrent_vector<std::shared_ptr<std::thread>> ioThreads;
    tbb::task_group taskGroup;

    wprintf(L"Starting %u IO threads\n", hardwareConcurrency);
    for (unsigned int cx = 0; cx < hardwareConcurrency; ++cx)
    {
        ioThreads.push_back(std::shared_ptr<std::thread>(new std::thread([&taskGroup, cx]
        {
            wprintf(L"IO thread %u starting\r\n", GetCurrentThreadId());
            // Not doing any actual IO
            taskGroup.run([cx]
            {
                wprintf(L"CPU task %u starting\r\n", GetCurrentThreadId());
                DoCPUIntensiveWork(cx);
                wprintf(L"CPU task %u done\r\n", GetCurrentThreadId());
            });
            // Sleep(1000); Un-commenting this will make the program terminate
            wprintf(L"IO thread %u done\r\n", GetCurrentThreadId());
        })));
    }

    // Join the IO workers
    for (std::shared_ptr<std::thread>& thread : ioThreads)
    {
        std::stringstream ss;
        ss << thread->get_id();
        wprintf(L"Wait for thread %S\r\n", ss.str().c_str());
        thread->join(); // main thread hangs here
    }
    wprintf(L"IO work complete\n");

    // And then wait for the CPU tasks
    taskGroup.wait();
    wprintf(L"CPU work complete\n");
    return 0;
}
And the CPU-intensive work includes usage of OpenMP. (Note: the result is the same if I remove the schedule(static).)
// Note: I shrunk these numbers down until the amount of work is actually
// small, not CPU-intensive at all, and it still hangs
static const int GlobalBufferChunkSize = 64;
static const int NumGlobalBufferChunks = 64;
static const int StrideSize = 16;
static const int OverwriteCount = 4;
BYTE GlobalBuffer[NumGlobalBufferChunks * GlobalBufferChunkSize];

void DoCPUIntensiveWork(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (GlobalBufferChunkSize / StrideSize); i++)
    {
        BYTE* pStride = pChunk + (i * StrideSize);
        for (int j = 0; j < OverwriteCount; j++)
        {
            memset(pStride, i, StrideSize);
        }
    } // Task thread hangs here
}
This code hangs: the main thread waits on thread->join() forever, even in a test case that has only a single IO job / CPU-intensive task. I added the wprintfs you see above, and the output showed that the IO job finished quickly, that thread exited, and then the CPU-intensive task spun up with the same thread ID before the main thread even got into the join() call.
Starting 64 IO threads
IO thread 3708 starting
IO thread 23728 starting
IO thread 23352 starting
IO thread 3588 starting
IO thread 3708 done
IO thread 23352 done
IO thread 22080 starting
IO thread 23728 done
IO thread 3376 starting
IO thread 3588 done
IO thread 27436 starting
IO thread 10092 starting
IO thread 22080 done
IO thread 10480 starting
CPU task 3708 starting
IO thread 3376 done
IO thread 27436 done
IO thread 10092 done
IO thread 10480 done
Wait for thread 3708
... hang forever ...
The IO thread ID was reused for a task after the thread finished, and the thread->join() call is still sitting there waiting. When I look in the debugger, thread->join() is waiting on a thread with ID 3708, and a thread with that ID does exist, but that thread is executing the task instead of the IO work. So it appears the primary thread of the process is actually waiting for the task instead of waiting for the IO thread due to the ID reuse. (I can't find docs or code to see if std::thread::join() waits based on the ID or the handle, but it seems like it uses the ID, which would be a bug.)
Second funny thing: that task never completes, and when I look at the thread that's executing the task in the debugger, it's sitting at the end of the OpenMP parallel region. I don't see any other threads executing the parallel work. There are a number of threads from vcomp140[d].dll sitting around in ntdll.dll code, for which I don't have symbols; I presume these are just waiting for new work, not doing my task. The CPU is at 0%, and I'm pretty confident nobody is looping. So the TBB task is hung somewhere in the OpenMP multi-threading implementation.
But, just to make life complicated, the task doesn't seem to hang UNLESS the thread ID from the IO thread happens to be reused for the task. So, somewhere between std::thread and TBB tasks and OpenMP parallelism there's a race condition triggered by thread ID reuse.
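As an aside, thread ID reuse itself is easy to observe on Windows. Here's a small standalone sketch, separate from the repro above and purely illustrative: it runs one thread to completion, remembers its ID, and then keeps spawning short-lived threads until the OS hands the same ID back.
#include <Windows.h>
#include <cstdio>
#include <thread>

int main()
{
    // Run one thread to completion and remember its OS thread ID
    DWORD firstId = 0;
    std::thread first([&firstId] { firstId = GetCurrentThreadId(); });
    first.join();
    wprintf(L"First thread ID: %lu\n", firstId);

    // Keep spawning short-lived threads until one gets the same ID back
    for (int attempt = 1; attempt <= 100000; ++attempt)
    {
        DWORD newId = 0;
        std::thread probe([&newId] { newId = GetCurrentThreadId(); });
        probe.join();
        if (newId == firstId)
        {
            wprintf(L"ID %lu reused after %d threads\n", newId, attempt);
            return 0;
        }
    }
    wprintf(L"No reuse observed\n");
    return 0;
}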
I have found two workarounds that make the hang go away:
- Put a Sleep(1000) at the end of the IO thread, so IO thread IDs aren't reused by the tasks. The bug is still there waiting for bad timing, of course.
- Remove the use of OpenMP parallelism.
A co-worker has suggested a third potential option: replace the OpenMP parallelism with TBB parallel_for (a rough sketch of what that could look like is below). We may do that. Of course this is all layers of code from different sources that we want to touch as little as possible.
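For reference, here's roughly what that third option might look like. This is a hedged sketch only: it assumes the same globals as the repro above, the function name DoCPUIntensiveWorkTbb is just for illustration, and I haven't verified whether it avoids the hang.
#include <Windows.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstring>

// Same chunk loop as DoCPUIntensiveWork above, but using tbb::parallel_for
// instead of the OpenMP pragma. Hypothetical name; assumes GlobalBuffer and
// the constants from the repro are visible in this translation unit.
void DoCPUIntensiveWorkTbb(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);
    tbb::parallel_for(
        tbb::blocked_range<int>(0, GlobalBufferChunkSize / StrideSize),
        [pChunk](const tbb::blocked_range<int>& range)
        {
            for (int i = range.begin(); i != range.end(); ++i)
            {
                BYTE* pStride = pChunk + (i * StrideSize);
                for (int j = 0; j < OverwriteCount; j++)
                {
                    memset(pStride, i, StrideSize);
                }
            }
        });
}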
I'm reporting this more as a possible bug report than as a cry for help.
- It seems like a bug that std::thread::join() can end up waiting for the wrong thread if a thread ID is reused. It should be waiting on the handle, not the ID (see the handle-based sketch after this list).
- It seems like there's a bug or incompatibility between TBB tasks and OpenMP, such that the OpenMP main thread can hang if it's run on a TBB task that happens to have a thread ID that was reused.
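For the first point, the kind of handle-based wait I have in mind would look something like the sketch below. It's only illustrative: it assumes MSVC on Windows, where std::thread::native_handle() returns a Win32 HANDLE, and it sidesteps join() entirely by detaching once the handle is signaled. I haven't verified that this avoids the hang.
#include <Windows.h>
#include <tbb/concurrent_vector.h>
#include <memory>
#include <thread>

// Hypothetical replacement for the join loop in the repro: wait on each
// thread's Win32 handle (which refers to the thread object and can't be
// confused by ID reuse), then detach, since the OS thread is known to have
// exited by that point.
void WaitForIoThreadsByHandle(tbb::concurrent_vector<std::shared_ptr<std::thread>>& ioThreads)
{
    for (std::shared_ptr<std::thread>& thread : ioThreads)
    {
        HANDLE handle = static_cast<HANDLE>(thread->native_handle());
        WaitForSingleObject(handle, INFINITE); // wait on the handle, not the thread ID
        thread->detach(); // release ownership; the thread has already finished
    }
}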