1

In my project I have multiple threads feeding data to a worker pool. The worker threads wait, using a condition variable, for new data. This works well for hours and sometime days, but from time to time a worker blocks forever in its std::condition_variable::wait_for(). No timeout and no notify wakes it. I verified that the worker thread is still there, it just never leaves the wait_for().

The workers job usually takes 4-6 milliseconds. On average every 5 milliseconds a new job is added by pushWork(). Those new jobs are not created at regular intervals, but in bursts of 10-20 jobs. Instead of the complex real job I use a dummy with a sleep in my example code below.

Why can this sporadic blocking forever happen?

Simplified version of my code, reduced to the important parts:

#include <chrono>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <vector>
#include <iostream>
#include <cstdlib>
#include <atomic>

class TestWorker
{
public:
  enum JobState
  {
    free = 0,
    processing
  };

  TestWorker() : workerThread(&TestWorker::doWork, this)
  {
  }

  ~TestWorker()
  {
    stopWorker = true;
    if (workerThread.joinable())
    {
      workerTrigger.notify_one();
      workerThread.join();
    }
  }

  void pushWork(const int data)
  {
    while (JobState::free != state)
    {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    std::lock_guard<std::mutex> lock(workerMutex);
    while (JobState::free != state)
    {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // Set dummy data for job ...
    dummyData = data;
    state = JobState::processing;
    workerTrigger.notify_one();
  }

  void doWork()
  {
    std::unique_lock<std::mutex> triggerLock(workerMutex);
    while (!stopWorker)
    {
      if (JobState::processing == state)
      {
        // Do something for 4-6ms with dummy job data
        ++dummyData;
        std::this_thread::sleep_for(std::chrono::milliseconds(5));  // Dummy sleep instead of real job
        state = JobState::free;
      }
      else
      {
        if (!workerTrigger.wait_for(triggerLock,
                               std::chrono::milliseconds(500),
                               [this]
        { return (stopWorker || (JobState::processing == state)); }))
        {
          if (!stopWorker)
          {
            std::cout << "TIMEOUT" << std::endl;
          }
          else
          {
            std::cout << "EXIT" << std::endl;
          }
        }
      }
    }
  }

  std::mutex workerMutex;
  std::condition_variable workerTrigger;
  bool stopWorker = false;
  std::thread workerThread;
  std::atomic<JobState> state = { JobState::free };
  // Dummy data for job ...
  int dummyData = 0;
};

int main()
{
  std::cout << "START" << std::endl;

  std::vector<TestWorker> testWorkers(10);

  for (int i = 1; i <= 1000; ++i)
  {
    if (0 == i % 10)
    {
      std::cout << "Loop #" << i << std::endl;
    }
    for (auto it = testWorkers.begin(); it != testWorkers.end(); ++it)
    {
      it->pushWork(i);
    }
  }

  std::cout << "END" << std::endl;

  return 0;
}

Compiled with: g++ workerTriggerTest.cc -o workerTriggerTest -pthread -std=c++11

gcc-Version 4.8.5 20150623 (Red Hat 4.8.5-36)

CentOS Linux release 7.6.1810

Jarod42
  • 203,559
  • 14
  • 181
  • 302
SKCoder
  • 409
  • 3
  • 10
  • How can it be that it _works well for hours and sometime days_, when the shown code does just 1000 loop cycles which are finished in a few seconds? In other words: The error may not be reproducible with the shown code. – Armali Feb 03 '20 at 10:29
  • This is a very simplified extraction from a bigger project to show how I did the synchronization and the wait/trigger part. The productive code runs "forever" and does of cause meaningful calculations and has less predictable timings. – SKCoder Feb 03 '20 at 10:41
  • Two things: First, `stopWorker` should be atomic, or protected by the mutex; it's currently neither (look at your destructor). Second, your `pushWork` is sketchy. The sleep_for's make no sense, when in reality a lock, load, notify, unlock, should be part of that function, just as lock, read, process, reset, notify, should be the inner loop of doWork. That said, were I tasked with a work crew queue, not going to lie, this isn't how I'd do it, but I'm sure you have reasons. "Sporadic blocking" only appears to happen when what you think should signaled a wakeup.. didn't. Or someone else got it. – WhozCraig Feb 03 '20 at 10:43
  • 1
    Another mistake here is that `workerThread` starts before the subsequent member variables have initialized. `workerThread` should either be made the last member or be initialized in the constructor body and not in the initializer list. – Maxim Egorushkin Feb 03 '20 at 10:53
  • 1
    Not sure why your question is marked as 'duplicate' and got closed -- the main issue described in the other question is that the "stop" function holds the lock while waiting for consumer threads to join, if the consumer thread is waiting for the lock, deadlock happens. That is clearly not your case. I'd be interested in seeing the call stacks once your deadlock happens. It should be fairly straightforward to figure out why a deadlock happen from the stacktrace. – SPD Feb 03 '20 at 12:17
  • Valid concerns about the quality of the example code were raised so far, but the core question is, how can std::condition_variable::wait_for() be called and never return, despite of timeout and external notifications, even after working for hours without a problem. So I rule out any start-up or initialization issues. – SKCoder Feb 03 '20 at 12:17
  • @SKCoder,are you able reproduce the deadlock with this sample code? If so, can you share the stack trace? – SPD Feb 03 '20 at 12:24
  • I wrote this example code on how I use the synchronization just for posting this question. I have it running since then, but as I mentioned it can take days to hang up. If I get the issue with the example, I try to post the stack trace. – SKCoder Feb 03 '20 at 12:26
  • Seems I ruled out the order of the members to early. After I rearranged the member's order and initialization (plus the recommended additional atomics), the code runs without issues since about a week. Thanks all. Unfortunately I can no longer promote this as an answer, as the question was closed prematurely as a "duplicate". – SKCoder Feb 10 '20 at 08:29

0 Answers0