I've been banging my head against (my attempt) at a lock-free multiple producer multiple consumer ring buffer. The basis of the idea is to use the innate overflow of unsigned char and unsigned short types, fix the element buffer to either of those types, and then you have a free loop back to beginning of the ring buffer.
The problem is - my solution doesn't work for multiple producers (it does though work for N consumers, and also single producer single consumer).
#include <atomic>
template<typename Element, typename Index = unsigned char> struct RingBuffer
{
std::atomic<Index> readIndex;
std::atomic<Index> writeIndex;
std::atomic<Index> scratchIndex;
Element elements[1 << (sizeof(Index) * 8)];
RingBuffer() :
readIndex(0),
writeIndex(0),
scratchIndex(0)
{
;
}
bool push(const Element & element)
{
while(true)
{
const Index currentReadIndex = readIndex.load();
Index currentWriteIndex = writeIndex.load();
const Index nextWriteIndex = currentWriteIndex + 1;
if(nextWriteIndex == currentReadIndex)
{
return false;
}
if(scratchIndex.compare_exchange_strong(
currentWriteIndex, nextWriteIndex))
{
elements[currentWriteIndex] = element;
writeIndex = nextWriteIndex;
return true;
}
}
}
bool pop(Element & element)
{
Index currentReadIndex = readIndex.load();
while(true)
{
const Index currentWriteIndex = writeIndex.load();
const Index nextReadIndex = currentReadIndex + 1;
if(currentReadIndex == currentWriteIndex)
{
return false;
}
element = elements[currentReadIndex];
if(readIndex.compare_exchange_strong(
currentReadIndex, nextReadIndex))
{
return true;
}
}
}
};
The main idea for writing was to use a temporary index 'scratchIndex' that acts a pseudo-lock to allow only one producer at any one time to copy-construct into the elements buffer, before updating the writeIndex and allowing any other producer to make progress. Before I am called heathen for implying my approach is 'lock-free' I realise that this approach isn't exactly lock-free, but in practice (if it would work!) it is significantly faster than having a normal mutex!
I am aware of a (more complex) MPMC ringbuffer solution here http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue, but I am really experimenting with my idea to then compare against that approach and find out where each excels (or indeed whether my approach just flat out fails!).
Things I have tried;
- Using compare_exchange_weak
- Using more precise std::memory_order's that match the behaviour I want
- Adding cacheline pads between the various indices I have
- Making elements std::atomic instead of just Element array
I am sure that this boils down to a fundamental segfault in my head as to how to use atomic accesses to get round using mutex's, and I would be entirely grateful to whoever can point out which neurons are drastically misfiring in my head! :)