12

Hi, I am looking into thread handover using a fast and reliable producer-consumer queue. I am working on Windows with VC++.

I based my design on Anthony Williams' queue, that is, basically a boost::mutex with a boost::condition_variable. Typically the time between notify_one() and the consumer waking up varies between 10 (rare) and 100 microseconds, with most values around 50 microseconds. However, about 1 in 1000 takes over one millisecond, with some taking longer than 5 milliseconds.
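For reference, the design described (a mutex plus condition variable, in the style of Anthony Williams' concurrent queue) can be sketched roughly as follows. This sketch uses std::mutex/std::condition_variable rather than the boost:: equivalents mentioned in the question; the interfaces are near-identical:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Minimal blocking producer-consumer queue: mutex + condition variable.
template <typename T>
class concurrent_queue {
    std::queue<T>           q;
    mutable std::mutex      m;
    std::condition_variable cv;
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(std::move(value));
        }
        cv.notify_one(); // the wake latency discussed is measured from here...
    }

    T wait_and_pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); }); // ...to waking up here
        T value = std::move(q.front());
        q.pop();
        return value;
    }
};
```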

I was just wondering whether these are typical values. Is there a faster way of signalling, short of spinning? Is it all down to managing thread priorities from here? I haven't started playing with priorities yet, but is there a chance of getting this into a fairly stable region of about 10 microseconds?

Thanks

EDIT: With SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS), the average wake time is still roughly 50 microseconds, but there are far fewer outliers; most of them are around 150-200 microseconds now. Except for one freak outlier at 7 ms. Hmmm... not good.

reuben
reuben
Cookie
  • Can you use additional 3rd party libraries? I've found that generally a lockless implementation is much more performant, and there is a good one in Intel's Thread Building Blocks. – Chad Aug 05 '11 at 15:51
  • Off-topic, but maybe you'd like to check out Sutter's [wait-free queue](http://drdobbs.com/high-performance-computing/212201163), which doesn't use locks. – Kerrek SB Aug 05 '11 at 15:53
  • @Chad: There should hardly be any contention on the lock, and my understanding is that boost mutex on windows is fairly cheap without contention and remains in userspace, so switching to lockless might not improve this scenario all that much... I think the issue here is more around finding the fastest way to wake the consumer thread. – Cookie Aug 05 '11 at 15:59
  • @Kerrek: Thanks for the pointer, good article, but the reason I would like to avoid spinning is because the number of producer consumer queues could be a fair bit larger than the number of cores. My understanding is that this is not a good scenario to spin in? – Cookie Aug 05 '11 at 16:01
  • @Cookie: indeed it's not, spinning will hinder the progress of the threads really working in this case, you could perhaps use microsleeps. – Matthieu M. Aug 05 '11 at 16:27
  • @Matthieu M.: Is that possible? I thought even select() had an accuracy of around 1 ms or worse on Windows... And it would only be beneficial if the thread scheduler could actually fit other threads in between the polling... – Cookie Aug 05 '11 at 16:32
  • @Cookie: I don't know for Windows, exactly, for Linux a `sleep(0)` or `nanosleep` will simply stop the thread execution, and scheduling will move to another thread. It is still wasteful in that it means scheduling the thread and thus involve a context switch. – Matthieu M. Aug 05 '11 at 16:52
  • @Cookie: I'm not sure how the performance relates to the number of threads, though the article does provide some measurements of the throughput of the waitfree queue. I believe he says that the spinlock is acceptable because it's only waiting to perform one atomic operation, so it's hardly "locking", but I don't know what the impact of that is. On the positive side, the spinlock doesn't cause a context switch, so it may or may not be a good idea. Allegedly there are also other waitfree queues that don't spinlock at all. – Kerrek SB Aug 05 '11 at 17:06
  • @Matthieu: That might be interesting, but I think I read that while sleep(0) will stop the thread executing it will not lead to lower priority threads being activated (that is if the calling thread is running fairly high prio, it would get re-scheduled straight away, effectively still starving other threads) – Cookie Aug 05 '11 at 18:00
  • @Cookie: you are right about the immediate rescheduling, it is not a worry if all threads have the same priority though. – Matthieu M. Aug 05 '11 at 18:10
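The compromise discussed in the comments above (spin briefly, then yield instead of burning a core) can be sketched as follows. The function name and the spin count of 4000 iterations are arbitrary illustration choices, not recommendations from the thread:

```cpp
#include <atomic>
#include <thread>

// Hybrid wait: spin for a bounded number of iterations in the hope the
// producer signals quickly, then fall back to yielding the timeslice so
// other threads can make progress (relevant when queues outnumber cores).
inline void spin_then_yield_wait(std::atomic<bool>& ready)
{
    for (int i = 0; i < 4000; ++i) {                 // short spin phase
        if (ready.load(std::memory_order_acquire))
            return;
    }
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();                   // let other threads run
}
```

Note the caveat from the comments still applies: on a strict priority scheduler, yield may reschedule the same high-priority thread immediately, so this only helps when the competing threads have comparable priority.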

3 Answers

3

One way to amortise the overhead of locking and thread wakeup is to add a second queue and implement a double-buffering approach. This enables batch processing on the consumer side:

template<typename F>
std::size_t consume_all(F&& f)
{
    // minimize the scope of the lock: just swap the two queues
    {
        std::lock_guard<std::mutex> lock(the_mutex);
        std::swap(the_queue, the_queue2);
    }

    // process all items from the_queue2 in a batch, without holding the lock
    for (auto& item : the_queue2)
    {
        f(item);
    }

    auto result = the_queue2.size();
    the_queue2.clear(); // clears the queue and preserves the capacity. perfect!
    return result;
}

Working sample code.

This does not fix the latency issue, but it can improve throughput. If a hiccup occurs, the consumer is presented with a large batch, which it can then process at full speed without any locking overhead. This allows the consumer to quickly catch up with the producer.
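Wrapped in a small self-contained class, the snippet above might look like this. The member names follow the snippet (the_mutex, the_queue, the_queue2); the surrounding class and push() are hypothetical scaffolding, not part of the original answer:

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Double-buffered queue: producers append to the_queue, the consumer
// swaps the buffers under the lock and then drains the_queue2 lock-free.
template <typename T>
class DoubleBufferQueue {
    std::mutex     the_mutex;
    std::vector<T> the_queue;   // producers push here
    std::vector<T> the_queue2;  // consumer drains from here
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(the_mutex);
        the_queue.push_back(std::move(item));
    }

    template <typename F>
    std::size_t consume_all(F&& f) {
        {
            std::lock_guard<std::mutex> lock(the_mutex);
            std::swap(the_queue, the_queue2);
        }
        for (auto& item : the_queue2)
            f(item);
        std::size_t result = the_queue2.size();
        the_queue2.clear(); // keeps the vector's capacity for the next batch
        return result;
    }
};
```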

StackedCrooked
2

The short answer is yes: from there it really is down to operating system management and thread scheduling. Real-time operating systems can bring those 50 microseconds down to about 15 and, more importantly, can get rid of the outliers. Otherwise spinning is the only answer. If there are more queues than cores, one idea is to have a fixed number of threads spinning so they can react immediately, with the remaining threads blocking. That would involve some kind of "master" queue thread that constantly spins to check all queues and either processes items itself or hands them over to worker threads, some of which could also spin to save those 50 microseconds. It gets complicated, though.

It would probably be best to just use a single lock-free multiple-producer-single-consumer queue with a spinning consumer thread. All items going into the queue would then probably need to derive from a common base type and carry some meta-information describing what to do with them.

Complicated, but possible. If I ever set it up, I might post some code as well.
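A minimal sketch of such a lock-free MPSC queue, based on Dmitry Vyukov's intrusive design (not code from this answer; the Node payload here is a plain int rather than the common base type suggested above):

```cpp
#include <atomic>

// Intrusive node: in a real version, queue items would derive from this
// and carry meta-information, as suggested in the answer.
struct Node {
    std::atomic<Node*> next{nullptr};
    int value{0};
};

// Lock-free multiple-producer-single-consumer queue (Vyukov's design).
class MpscQueue {
    std::atomic<Node*> head; // producers push here
    Node*              tail; // only the single consumer touches this
    Node               stub; // dummy node so the list is never empty
public:
    MpscQueue() : head(&stub), tail(&stub) {}

    // Safe to call from any number of producer threads.
    void push(Node* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        Node* prev = head.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release); // link into the list
    }

    // Must be called from the single consumer thread only.
    // Returns nullptr when the queue is (momentarily) empty.
    Node* pop() {
        Node* t = tail;
        Node* next = t->next.load(std::memory_order_acquire);
        if (t == &stub) {               // skip over the dummy node
            if (!next) return nullptr;  // queue is empty
            tail = next;
            t = next;
            next = t->next.load(std::memory_order_acquire);
        }
        if (next) { tail = next; return t; }
        Node* h = head.load(std::memory_order_acquire);
        if (t != h) return nullptr;     // a push is mid-flight; retry later
        push(&stub);                    // reinsert dummy to detach last node
        next = t->next.load(std::memory_order_acquire);
        if (next) { tail = next; return t; }
        return nullptr;
    }
};
```

The spinning consumer would simply loop on pop(), processing each returned node; note that pop() never blocks, so the consumer decides how to spin, yield, or back off between empty polls.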

Cookie
1

There are lots of things that might cause problems.

You should profile the app to see where the slowdowns are occurring.

Some notes:

  • Are the consumer and producer in the same process? If so, a Critical Section is much faster than a Mutex.
  • Try to ensure all the queue memory is resident in physical memory. Having to page it in will slow things right down.
  • Be very careful setting your process to real-time priority. That is supposed to be for system processes. If the process does too much work, it can prevent a critical system process from getting CPU time, which can end very badly. Unless you absolutely need real time, just use HIGH_PRIORITY_CLASS.
Michael J
  • Indeed -- Priority belongs to the user, and to a lesser extent to the system, not to the application developer. I would even recommend going against high priority -- that can cause problems using things like Task Manager, which serves as the user's escape hatch, and which runs at high priority by default. – Billy ONeal Aug 05 '11 at 20:21