If you have no experience with implementing a lockless queue, the only answer can be: The most efficient (and safe) way is to use a lock, as provided by pthreads (mutex or cond var).
Lockless algorithms will usually (but not necessarily) give you a little extra performance, but they can go terribly wrong if you don't know exactly what you're doing.
On the other hand, the pthreads implementation under Linux avoids locks where possible and falls back to a futex only when it has to (again, a futex is fast, but it's something where you should know what you're doing before touching it directly).
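For reference, such a lock-based task queue is only a few dozen lines. Here is a minimal sketch using a pthread mutex plus a condition variable; the type and function names (`task_t`, `queue_t`, `queue_push`, ...) are just illustrative:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct task {
    struct task *next;
    void (*run)(void *arg);
    void *arg;
} task_t;

typedef struct {
    task_t *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} queue_t;

void queue_init(queue_t *q) {
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

void queue_push(queue_t *q, task_t *t) {
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
    pthread_cond_signal(&q->nonempty);    /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

task_t *queue_pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)               /* loop: spurious wakeups happen */
        pthread_cond_wait(&q->nonempty, &q->lock);
    task_t *t = q->head;
    q->head = t->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return t;
}
```

Note that as long as nobody is waiting, push and pop never enter the kernel; the futex only comes into play when a consumer actually has to sleep.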
Such a queue has no trouble passing a hundred thousand tasks per second. That may sound limiting, but really, if you need to pass 10 million tasks per second, you are doing it wrong. Ideally you would pass only a few dozen to a hundred or so tasks per second through the queue (fewer tasks, but bigger units of work). For example, you want to create 50 tasks working on half a megabyte of data each, not 25 million tasks working on one byte each.
If you still insist on giving a lockless implementation a try (maybe out of academic interest), you will need an atomic compare-exchange operation (look up the "legacy __sync functions" in the GCC documentation if you're writing C; in C++ you would use the std::atomic operations instead).
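The heart of nearly every lockless structure is a compare-exchange retry loop. Here is a minimal sketch using the legacy `__sync_bool_compare_and_swap` built-in, pushing onto a Treiber-style stack (simpler than a full queue, but it shows the pattern; the names are illustrative):

```c
typedef struct node {
    struct node *next;
    void *payload;
} node_t;

static node_t *top;   /* shared head of the stack */

void push(node_t *n) {
    node_t *old;
    do {
        old = top;          /* snapshot the current head                    */
        n->next = old;      /* link our node in front of it                 */
        /* retry if another thread changed 'top' since the snapshot;
           the __sync CAS is a full barrier, so the loop re-reads 'top' */
    } while (!__sync_bool_compare_and_swap(&top, old, n));
}
```

The push side is the easy part; the pop side is where the subtleties of the next paragraph show up.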
Be sure to read up on subtle details like the ABA problem, which you usually defend against with some kind of pointer manipulation (e.g. storing a small counter in the unused low bits of the pointer) or with a double-word exchange that carries an explicit reference count.
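To make the double-word variant concrete, here is a rough sketch of a pop that pairs the head pointer with a counter and swaps both atomically. This assumes the compiler and target support a 16-byte compare-exchange (e.g. `gcc -mcx16` on x86-64, possibly linking `-latomic`), and it deliberately ignores safe memory reclamation, which is its own can of worms:

```c
#include <stdint.h>

typedef struct node {
    struct node *next;
    void *payload;
} node_t;

typedef struct {
    node_t   *ptr;   /* the actual head pointer                  */
    uintptr_t tag;   /* bumped on every successful update        */
} __attribute__((aligned(16))) tagged_ptr_t;

static tagged_ptr_t top;

node_t *pop(void) {
    tagged_ptr_t old, desired;
    __atomic_load(&top, &old, __ATOMIC_ACQUIRE);
    do {
        if (old.ptr == NULL)
            return NULL;                 /* stack is empty                        */
        /* NB: assumes old.ptr has not been freed in the meantime (reclamation!) */
        desired.ptr = old.ptr->next;
        desired.tag = old.tag + 1;       /* counter change defeats pointer reuse  */
        /* on failure, 'old' is refreshed with the current value of 'top' */
    } while (!__atomic_compare_exchange(&top, &old, &desired, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));
    return old.ptr;
}
```

Even if the same node address reappears at the head, the counter will have moved on, so a stale compare-exchange fails instead of silently corrupting the list.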
Alternatively, if you are willing to make certain assumptions (e.g. exactly one producer and one consumer), a lockless queue can be implemented with only an atomic add, or with no atomic read-modify-write operations at all (look up the "FastForward" queue if you're curious). However, these only work as long as the assumptions hold, and they are even more error-prone, so best stay away from them.
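Just to show how much weight those assumptions carry, here is a very rough sketch of the single-producer/single-consumer idea behind FastForward: each slot's NULL/non-NULL state is the only shared flag, so no compare-exchange or locked instruction is needed. The names, the fixed ring size, and the (conservative) acquire/release accesses are my own choices, not taken from the paper; on x86 those compile down to plain loads and stores anyway. Break the one-producer/one-consumer assumption and it falls apart immediately:

```c
#include <stddef.h>

#define QSIZE 1024                    /* fixed capacity, power of two      */

static void  *slots[QSIZE];           /* NULL = empty, non-NULL = full     */
static size_t head;                   /* only ever written by the producer */
static size_t tail;                   /* only ever written by the consumer */

/* producer side: returns 0 if the queue is full; item must not be NULL */
int spsc_push(void *item) {
    if (__atomic_load_n(&slots[head], __ATOMIC_ACQUIRE) != NULL)
        return 0;                     /* consumer hasn't drained this slot yet */
    __atomic_store_n(&slots[head], item, __ATOMIC_RELEASE);
    head = (head + 1) % QSIZE;
    return 1;
}

/* consumer side: returns NULL if the queue is empty */
void *spsc_pop(void) {
    void *item = __atomic_load_n(&slots[tail], __ATOMIC_ACQUIRE);
    if (item == NULL)
        return NULL;                  /* nothing produced yet              */
    __atomic_store_n(&slots[tail], NULL, __ATOMIC_RELEASE); /* hand slot back */
    tail = (tail + 1) % QSIZE;
    return item;
}
```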