For the measurement below I've been using x86_64 GNU/Linux with kernel 4.4.0-109-generic #132-Ubuntu SMP running on the AMD FX(tm)-8150 Eight-Core Processor (which has a 64 byte cache-line size).
The full source code can be obtained here: https://github.com/CarloWood/ai-threadsafe-testsuite/blob/master/src/condition_variable_test.cxx
which is independent of other libraries. Just compile with:
g++ -pthread -std=c++11 -O3 condition_variable_test.cxx
What I really tried to do here is measure how long a call to notify_one() takes when one or more threads are actually waiting, relative to how long it takes when no thread is waiting on the condition_variable used.
To my astonishment I found that both cases are in the microsecond range: when one thread is waiting, a call takes about 14 to 20 microseconds; when no thread is waiting it apparently takes less, but still at least 1 microsecond.
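To make concrete what is being timed: the sketch below is not the actual benchmark from the repository linked above, just a minimal, hand-written illustration of the idea (one thread parked in wait() while another thread times repeated notify_one() calls):

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main()
{
  std::mutex m;
  std::condition_variable cv;
  std::atomic<bool> stop{false};

  // A consumer that does nothing but sit in wait() until told to stop.
  std::thread waiter([&]{
    std::unique_lock<std::mutex> lk(m);
    while (!stop.load())
      cv.wait(lk);
  });

  // Give the waiter some time to actually block inside wait().
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  int const iterations = 100000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    cv.notify_one();
  auto end = std::chrono::steady_clock::now();

  std::cout << "average notify_one(): "
            << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() / iterations
            << " ns\n";

  stop.store(true);
  { std::lock_guard<std::mutex> lk(m); }  // make sure the waiter is inside wait() before the final notify
  cv.notify_one();
  waiter.join();
}

This compiles the same way as the full test (g++ -pthread -std=c++11 -O3). The number it prints is only indicative, since some of the notify_one() calls will land while the waiter is busy re-acquiring the mutex rather than actually waiting.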
In other words, suppose you have a producer/consumer scenario where you let the consumer call wait() every time there is nothing for it to do, and you call notify_one() every time a producer writes something new to the queue, assuming that the implementation of std::condition_variable will be smart enough not to spend a lot of time when no thread is waiting in the first place. Then, oh horror, your application will become a lot slower than with the code that I wrote to TEST how long a call to notify_one() takes when a thread is waiting!
It seems that the code I used is a must to speed up such scenarios, and that confuses me: why on earth isn't the code that I wrote already part of std::condition_variable?
The code in question is the following. Instead of doing:
// Producer thread:
add_something_to_queue();
cv.notify_one();

// Consumer thread:
if (queue.empty())
{
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk);
}
you can gain a factor-of-1000 speedup by doing:
// Producer thread:
add_something_to_queue();
int waiting;
// Only grab the mutex and call notify_one() when at least one consumer has
// registered itself as idle; otherwise the notification is skipped entirely.
while ((waiting = s_idle.load(std::memory_order_relaxed)) > 0)
{
  // Claim one idle consumer; retry if another producer raced us to it.
  if (!s_idle.compare_exchange_weak(waiting, waiting - 1, std::memory_order_relaxed, std::memory_order_relaxed))
    continue;
  std::unique_lock<std::mutex> lk(m);
  cv.notify_one();
  break;
}

// Consumer thread:
if (queue.empty())
{
  std::unique_lock<std::mutex> lk(m);
  // Register as idle (while holding the mutex) before blocking, so that
  // producers know a notify_one() is actually needed.
  s_idle.fetch_add(1, std::memory_order_relaxed);
  cv.wait(lk);
}
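For reference, both snippets assume shared state along these lines; the exact types are my assumption (the element type in particular is arbitrary), not copied from the linked benchmark:

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>

std::queue<int> queue;       // the producer/consumer queue
std::mutex m;                // mutex associated with the condition variable
std::condition_variable cv;  // consumers block on this
std::atomic<int> s_idle{0};  // number of consumers that are (about to be) waiting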
Am I making some horrible mistake here? Or are my findings correct?
Edit:
I forgot to add the output of the benchmark program (DIRECT=0):
All started!
Thread 1 statistics: avg: 1.9ns, min: 1.8ns, max: 2ns, stddev: 0.039ns
The average time spend on calling notify_one() (726141 calls) was: 17995.5 - 21070.1 ns.
Thread 1 finished.
Thread 5 finished.
Thread 8 finished.
Thread 7 finished.
Thread 6 finished.
Thread 3 statistics: avg: 1.9ns, min: 1.7ns, max: 2.1ns, stddev: 0.088ns
The average time spend on calling notify_one() (726143 calls) was: 17207.3 - 22278.5 ns.
Thread 3 finished.
Thread 2 statistics: avg: 1.9ns, min: 1.8ns, max: 2ns, stddev: 0.055ns
The average time spend on calling notify_one() (726143 calls) was: 17910.1 - 21626.5 ns.
Thread 2 finished.
Thread 4 statistics: avg: 1.9ns, min: 1.6ns, max: 2ns, stddev: 0.092ns
The average time spend on calling notify_one() (726143 calls) was: 17337.5 - 22567.8 ns.
Thread 4 finished.
All finished!
And with DIRECT=1:
All started!
Thread 4 statistics: avg: 1.2e+03ns, min: 4.9e+02ns, max: 1.4e+03ns, stddev: 2.5e+02ns
The average time spend on calling notify_one() (0 calls) was: 1156.49 ns.
Thread 4 finished.
Thread 5 finished.
Thread 8 finished.
Thread 7 finished.
Thread 6 finished.
Thread 3 statistics: avg: 1.2e+03ns, min: 5.9e+02ns, max: 1.5e+03ns, stddev: 2.4e+02ns
The average time spend on calling notify_one() (0 calls) was: 1164.52 ns.
Thread 3 finished.
Thread 2 statistics: avg: 1.2e+03ns, min: 1.6e+02ns, max: 1.4e+03ns, stddev: 2.9e+02ns
The average time spend on calling notify_one() (0 calls) was: 1166.93 ns.
Thread 2 finished.
Thread 1 statistics: avg: 1.2e+03ns, min: 95ns, max: 1.4e+03ns, stddev: 3.2e+02ns
The average time spend on calling notify_one() (0 calls) was: 1167.81 ns.
Thread 1 finished.
All finished!
The '0 calls' in the latter output actually correspond to around 20,000,000 calls.