Let's imagine a lock-free concurrent SPSC (single-producer / single-consumer) queue.
- The producer thread reads head, tail, cached_tail and writes head, cached_tail.
- The consumer thread reads head, tail, cached_head and writes tail, cached_head.
Note that cached_tail is accessed only by the producer thread, just like cached_head is accessed only by the consumer thread. They can be thought of as private, thread-local variables, so they are unsynchronized and therefore not declared as atomic.
The data layout of the queue is the following:
#include <atomic>
#include <cstddef>
#include <new> // std::hardware_destructive_interference_size
#include <thread>

struct spsc_queue
{
    /// ...

    // Producer variables
    alignas(std::hardware_destructive_interference_size) std::atomic<size_t> head; // shared
    size_t cached_tail; // non-shared

    // Consumer variables
    alignas(std::hardware_destructive_interference_size) std::atomic<size_t> tail; // shared
    size_t cached_head; // non-shared
    std::byte padding[std::hardware_destructive_interference_size - sizeof(tail) - sizeof(cached_head)];
};
Since I want to avoid false sharing, I aligned head and tail to the L1 cache line size.
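As a sanity check of that layout (just to convince myself the members land where I think they do), something like the following static_asserts should hold, assuming spsc_queue stays standard-layout so that offsetof is well-defined (offsetof comes from the already-included <cstddef>):

// head and tail must start on different cache lines ...
static_assert(offsetof(spsc_queue, tail) - offsetof(spsc_queue, head)
                  >= std::hardware_destructive_interference_size,
              "head and tail share a cache line");
// ... while cached_tail is expected to sit right next to head,
// on the producer's cache line.
static_assert(offsetof(spsc_queue, cached_tail) - offsetof(spsc_queue, head)
                  < std::hardware_destructive_interference_size,
              "cached_tail is not on the producer cache line");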
The pseudo-code-ish implementation of the push/pop operations, written as member functions with the actual element copy elided, is the following:
bool spsc_queue::push(const void* elems, size_t n)
{
    size_t h = head.load(std::memory_order_relaxed);
    if (num_remaining_storage(h, cached_tail) < n)
    {
        // The local snapshot of tail is stale: refresh it from the shared index.
        cached_tail = tail.load(std::memory_order_acquire);
        if (num_remaining_storage(h, cached_tail) < n)
            return false;
    }
    // ... copy n elements from elems into the buffer ...
    head.store(h + n, std::memory_order_release);
    return true;
}

bool spsc_queue::pop(void* elems, size_t n)
{
    size_t t = tail.load(std::memory_order_relaxed);
    if (num_stored_elements(cached_head, t) < n)
    {
        // The local snapshot of head is stale: refresh it from the shared index.
        cached_head = head.load(std::memory_order_acquire);
        if (num_stored_elements(cached_head, t) < n)
            return false;
    }
    // ... copy n elements from the buffer into elems ...
    tail.store(t + n, std::memory_order_release);
    return true;
}

void spsc_queue::wait_and_push(const void* elems, size_t n)
{
    size_t h = head.load(std::memory_order_relaxed);
    // Spin until enough free storage becomes available.
    while (num_remaining_storage(h, cached_tail) < n)
        cached_tail = tail.load(std::memory_order_acquire);
    // ... copy n elements from elems into the buffer ...
    head.store(h + n, std::memory_order_release);
}

void spsc_queue::wait_and_pop(void* elems, size_t n)
{
    size_t t = tail.load(std::memory_order_relaxed);
    // Spin until enough elements become available.
    while (num_stored_elements(cached_head, t) < n)
        cached_head = head.load(std::memory_order_acquire);
    // ... copy n elements from the buffer into elems ...
    tail.store(t + n, std::memory_order_release);
}
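For context, the intended usage is exactly one producer thread and one consumer thread sharing a queue instance. A rough sketch (pretending the queue stores raw bytes, since the element handling is elided above):

void example_usage(spsc_queue& q)
{
    std::thread producer([&q]
    {
        const char msg[4] = {'p', 'i', 'n', 'g'};
        q.wait_and_push(msg, 4); // spins until there is room for 4 elements
    });
    std::thread consumer([&q]
    {
        char msg[4];
        q.wait_and_pop(msg, 4);  // spins until 4 elements are available
    });
    producer.join();
    consumer.join();
}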
At initialization (not listed here), all the indices are set to 0.
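Concretely, I mean a constructor along these lines (declared in the elided part of the struct):

// All indices start at 0: the queue is empty and the cached copies
// agree with the shared indices.
spsc_queue::spsc_queue()
    : head(0)
    , cached_tail(0)
    , tail(0)
    , cached_head(0)
    , padding{}
{
}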
The functions num_remaining_storage and num_stored_elements are const functions performing simple calculations based on the passed arguments and the immutable queue capacity; they do not perform any atomic reads or writes.
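They are not shown, but for completeness: with free-running (non-wrapping) head/tail counters they boil down to roughly the following, where capacity stands for the immutable queue capacity (the member name is just a placeholder):

// Hypothetical helpers -- the real ones are not shown here.
// head and tail are treated as free-running counters, so unsigned
// subtraction gives the element count even after wrap-around.
size_t spsc_queue::num_stored_elements(size_t h, size_t t) const
{
    return h - t;               // pushed but not yet popped
}

size_t spsc_queue::num_remaining_storage(size_t h, size_t t) const
{
    return capacity - (h - t);  // free slots left in the buffer
}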
Now the question is: do I need to align cached_tail and cached_head as well to completely avoid false sharing on any of the indices, or is it okay as it is? Since cached_tail is producer-private and cached_head is consumer-private, I think cached_tail can live in the same cache line as head (the producer cache line), just like cached_head can live in the same cache line as tail (the consumer cache line), without false sharing ever occurring.
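For comparison, the more conservative layout I am trying to avoid would put every index on its own cache line, roughly:

// Alternative layout under consideration: each index gets a full cache line,
// at the cost of a noticeably larger struct.
struct spsc_queue_conservative
{
    alignas(std::hardware_destructive_interference_size) std::atomic<size_t> head;
    alignas(std::hardware_destructive_interference_size) size_t cached_tail;
    alignas(std::hardware_destructive_interference_size) std::atomic<size_t> tail;
    alignas(std::hardware_destructive_interference_size) size_t cached_head;
};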
Am I missing something?