If you have no experience with implementing a lockless queue, the only answer can be: The most efficient (and safe) way is to use a lock, as provided by pthreads (mutex or cond var).
Lockless algorithms will usually (but not necessarily) give you a little extra performance, but they can go terribly wrong if you don't know exactly what you're doing.
On the other hand, the pthreads implementation under Linux avoids locks where possible and falls back to a futex only when it has to (again, a futex is fast, but it's something where you should know what you're doing before touching it directly).
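For reference, such a lock-based task queue is only a few dozen lines. Here is a minimal sketch using a pthread mutex plus a condition variable; the type and function names (`task_t`, `queue_t`, `queue_push`, ...) are just illustrative:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct task {
    struct task *next;
    void (*run)(void *arg);
    void *arg;
} task_t;

typedef struct {
    task_t *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} queue_t;

void queue_init(queue_t *q) {
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

void queue_push(queue_t *q, task_t *t) {
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
    pthread_cond_signal(&q->nonempty);    /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

task_t *queue_pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)               /* loop: spurious wakeups happen */
        pthread_cond_wait(&q->nonempty, &q->lock);
    task_t *t = q->head;
    q->head = t->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return t;
}
```

Note that as long as nobody is waiting, push and pop never enter the kernel; the futex only comes into play when a consumer actually has to sleep.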
Such a queue has no trouble passing a hundred thousand tasks per second. That may sound limiting, but really, if you need to pass 10 million tasks per second, you are doing it wrong. Ideally you would pass only a few dozen to a hundred or so tasks per second through the queue (fewer tasks, but bigger units of work). For example, you want to create 50 tasks working on half a megabyte of data each, not 25 million tasks working on one byte each.
If you still insist on giving a lockless implementation a try (maybe out of academic interest), you will need an atomic compare-exchange operation (look up the "legacy __sync functions" in the GCC documentation if you're writing C; in C++ you would use the std::atomic operations instead).
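The heart of nearly every lockless structure is a compare-exchange retry loop. Here is a minimal sketch using the legacy `__sync_bool_compare_and_swap` built-in, pushing onto a Treiber-style stack (simpler than a full queue, but it shows the pattern; the names are illustrative):

```c
typedef struct node {
    struct node *next;
    void *payload;
} node_t;

static node_t *top;   /* shared head of the stack */

void push(node_t *n) {
    node_t *old;
    do {
        old = top;          /* snapshot the current head                    */
        n->next = old;      /* link our node in front of it                 */
        /* retry if another thread changed 'top' since the snapshot;
           the __sync CAS is a full barrier, so the loop re-reads 'top' */
    } while (!__sync_bool_compare_and_swap(&top, old, n));
}
```

The push side is the easy part; the pop side is where the subtleties of the next paragraph show up.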
Be sure to read up on subtle details like the ABA problem, which you usually defend against with some kind of pointer manipulation (e.g. storing a small counter in the unused low bits of the pointer) or with a double-word exchange that carries an explicit reference count.
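To make the double-word variant concrete, here is a rough sketch of a pop that pairs the head pointer with a counter and swaps both atomically. This assumes the compiler and target support a 16-byte compare-exchange (e.g. `gcc -mcx16` on x86-64, possibly linking `-latomic`), and it deliberately ignores safe memory reclamation, which is its own can of worms:

```c
#include <stdint.h>

typedef struct node {
    struct node *next;
    void *payload;
} node_t;

typedef struct {
    node_t   *ptr;   /* the actual head pointer                  */
    uintptr_t tag;   /* bumped on every successful update        */
} __attribute__((aligned(16))) tagged_ptr_t;

static tagged_ptr_t top;

node_t *pop(void) {
    tagged_ptr_t old, desired;
    __atomic_load(&top, &old, __ATOMIC_ACQUIRE);
    do {
        if (old.ptr == NULL)
            return NULL;                 /* stack is empty                        */
        /* NB: assumes old.ptr has not been freed in the meantime (reclamation!) */
        desired.ptr = old.ptr->next;
        desired.tag = old.tag + 1;       /* counter change defeats pointer reuse  */
        /* on failure, 'old' is refreshed with the current value of 'top' */
    } while (!__atomic_compare_exchange(&top, &old, &desired, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));
    return old.ptr;
}
```

Even if the same node address reappears at the head, the counter will have moved on, so a stale compare-exchange fails instead of silently corrupting the list.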
Alternatively, if you are willing to make certain assumptions (e.g. exactly one producer and one consumer), a lockless queue can be implemented with only an atomic add, or with no atomic read-modify-write operations at all (look up the "FastForward" queue if you're curious). However, these only work as long as the assumptions hold, and they are even more error-prone, so best stay away from them.
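Just to show how much weight those assumptions carry, here is a very rough sketch of the single-producer/single-consumer idea behind FastForward: each slot's NULL/non-NULL state is the only shared flag, so no compare-exchange or locked instruction is needed. The names, the fixed ring size, and the (conservative) acquire/release accesses are my own choices, not taken from the paper; on x86 those compile down to plain loads and stores anyway. Break the one-producer/one-consumer assumption and it falls apart immediately:

```c
#include <stddef.h>

#define QSIZE 1024                    /* fixed capacity, power of two      */

static void  *slots[QSIZE];           /* NULL = empty, non-NULL = full     */
static size_t head;                   /* only ever written by the producer */
static size_t tail;                   /* only ever written by the consumer */

/* producer side: returns 0 if the queue is full; item must not be NULL */
int spsc_push(void *item) {
    if (__atomic_load_n(&slots[head], __ATOMIC_ACQUIRE) != NULL)
        return 0;                     /* consumer hasn't drained this slot yet */
    __atomic_store_n(&slots[head], item, __ATOMIC_RELEASE);
    head = (head + 1) % QSIZE;
    return 1;
}

/* consumer side: returns NULL if the queue is empty */
void *spsc_pop(void) {
    void *item = __atomic_load_n(&slots[tail], __ATOMIC_ACQUIRE);
    if (item == NULL)
        return NULL;                  /* nothing produced yet              */
    __atomic_store_n(&slots[tail], NULL, __ATOMIC_RELEASE); /* hand slot back */
    tail = (tail + 1) % QSIZE;
    return item;
}
```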