For the measurement below I've been using x86_64 GNU/Linux with kernel 4.4.0-109-generic #132-Ubuntu SMP running on the AMD FX(tm)-8150 Eight-Core Processor (which has a 64 byte cache-line size).
The full source code can be obtained here: https://github.com/CarloWood/ai-threadsafe-testsuite/blob/master/src/condition_variable_test.cxx
which is independent of other libraries. Just compile with:
g++ -pthread -std=c++11 -O3 condition_variable_test.cxx
What I really tried to do here is measure how long a call to notify_one() takes when one or more threads are actually waiting, relative to how long it takes when no thread is waiting on the condition_variable used.
To my astonishment I found that both cases are in the microsecond range: when one thread is waiting, a call takes about 14 to 20 microseconds; when no thread is waiting it apparently takes less, but still at least 1 microsecond.
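To make concrete what is being timed: the sketch below is not the actual benchmark from the repository linked above, just a minimal, hand-written illustration of the idea (one thread parked in wait() while another thread times repeated notify_one() calls):

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main()
{
  std::mutex m;
  std::condition_variable cv;
  std::atomic<bool> stop{false};

  // A consumer that does nothing but sit in wait() until told to stop.
  std::thread waiter([&]{
    std::unique_lock<std::mutex> lk(m);
    while (!stop.load())
      cv.wait(lk);
  });

  // Give the waiter some time to actually block inside wait().
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  int const iterations = 100000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    cv.notify_one();
  auto end = std::chrono::steady_clock::now();

  std::cout << "average notify_one(): "
            << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() / iterations
            << " ns\n";

  stop.store(true);
  { std::lock_guard<std::mutex> lk(m); }  // make sure the waiter is inside wait() before the final notify
  cv.notify_one();
  waiter.join();
}

This compiles the same way as the full test (g++ -pthread -std=c++11 -O3). The number it prints is only indicative, since some of the notify_one() calls will land while the waiter is busy re-acquiring the mutex rather than actually waiting.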
In other words, suppose you have a producer/consumer scenario where you let the consumer call wait() every time there is nothing for it to do, and you call notify_one() every time a producer writes something new to the queue, assuming that the implementation of std::condition_variable will be smart enough not to spend a lot of time when no thread is waiting in the first place. Then, oh horror, your application will become a lot slower than with the code that I wrote to TEST how long a call to notify_one() takes when a thread is waiting!
It seems that the code I used is a must to speed up such scenarios, and that confuses me: why on earth isn't the code that I wrote already part of std::condition_variable?
The code in question is the following. Instead of doing:
// Producer thread:
add_something_to_queue();
cv.notify_one();

// Consumer thread:
if (queue.empty())
{
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk);
}
you can gain a factor-of-1000 speedup by doing:
// Producer thread:
add_something_to_queue();
int waiting;
// Only grab the mutex and call notify_one() when at least one consumer has
// registered itself as idle; otherwise the notification is skipped entirely.
while ((waiting = s_idle.load(std::memory_order_relaxed)) > 0)
{
  // Claim one idle consumer; retry if another producer raced us to it.
  if (!s_idle.compare_exchange_weak(waiting, waiting - 1, std::memory_order_relaxed, std::memory_order_relaxed))
    continue;
  std::unique_lock<std::mutex> lk(m);
  cv.notify_one();
  break;
}

// Consumer thread:
if (queue.empty())
{
  std::unique_lock<std::mutex> lk(m);
  // Register as idle (while holding the mutex) before blocking, so that
  // producers know a notify_one() is actually needed.
  s_idle.fetch_add(1, std::memory_order_relaxed);
  cv.wait(lk);
}
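For reference, both snippets assume shared state along these lines; the exact types are my assumption (the element type in particular is arbitrary), not copied from the linked benchmark:

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>

std::queue<int> queue;       // the producer/consumer queue
std::mutex m;                // mutex associated with the condition variable
std::condition_variable cv;  // consumers block on this
std::atomic<int> s_idle{0};  // number of consumers that are (about to be) waiting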
Am I making some horrible mistake here? Or are my findings correct?
Edit:
I forgot to add the output of the benchmark program (DIRECT=0):
All started!
Thread 1 statistics: avg: 1.9ns, min: 1.8ns, max: 2ns, stddev: 0.039ns
The average time spend on calling notify_one() (726141 calls) was: 17995.5 - 21070.1 ns.
Thread 1 finished.
Thread 5 finished.
Thread 8 finished.
Thread 7 finished.
Thread 6 finished.
Thread 3 statistics: avg: 1.9ns, min: 1.7ns, max: 2.1ns, stddev: 0.088ns
The average time spend on calling notify_one() (726143 calls) was: 17207.3 - 22278.5 ns.
Thread 3 finished.
Thread 2 statistics: avg: 1.9ns, min: 1.8ns, max: 2ns, stddev: 0.055ns
The average time spend on calling notify_one() (726143 calls) was: 17910.1 - 21626.5 ns.
Thread 2 finished.
Thread 4 statistics: avg: 1.9ns, min: 1.6ns, max: 2ns, stddev: 0.092ns
The average time spend on calling notify_one() (726143 calls) was: 17337.5 - 22567.8 ns.
Thread 4 finished.
All finished!
And with DIRECT=1:
All started!
Thread 4 statistics: avg: 1.2e+03ns, min: 4.9e+02ns, max: 1.4e+03ns, stddev: 2.5e+02ns
The average time spend on calling notify_one() (0 calls) was: 1156.49 ns.
Thread 4 finished.
Thread 5 finished.
Thread 8 finished.
Thread 7 finished.
Thread 6 finished.
Thread 3 statistics: avg: 1.2e+03ns, min: 5.9e+02ns, max: 1.5e+03ns, stddev: 2.4e+02ns
The average time spend on calling notify_one() (0 calls) was: 1164.52 ns.
Thread 3 finished.
Thread 2 statistics: avg: 1.2e+03ns, min: 1.6e+02ns, max: 1.4e+03ns, stddev: 2.9e+02ns
The average time spend on calling notify_one() (0 calls) was: 1166.93 ns.
Thread 2 finished.
Thread 1 statistics: avg: 1.2e+03ns, min: 95ns, max: 1.4e+03ns, stddev: 3.2e+02ns
The average time spend on calling notify_one() (0 calls) was: 1167.81 ns.
Thread 1 finished.
All finished!
The '0 calls' in the latter output actually correspond to around 20,000,000 calls.