
Let's imagine a lock-free concurrent SPSC (single-producer / single-consumer) queue.

  • The producer thread reads head, tail, cached_tail and writes head, cached_tail.
  • The consumer thread reads head, tail, cached_head and writes tail, cached_head.

Note that cached_tail is accessed only by the producer thread, just as cached_head is accessed only by the consumer thread. They can be thought of as private, thread-local variables; since each is touched by a single thread, they need no synchronization and are therefore not declared atomic.

The data layout of the queue is the following:

#include <atomic>
#include <cstddef>
#include <new> // std::hardware_destructive_interference_size
#include <thread>

struct spsc_queue
{
    /// ...

    // Producer variables
    alignas(std::hardware_destructive_interference_size) std::atomic<size_t> head; // shared
    size_t cached_tail; // non-shared

    // Consumer variables
    alignas(std::hardware_destructive_interference_size) std::atomic<size_t> tail; // shared
    size_t cached_head; // non-shared

    std::byte padding[std::hardware_destructive_interference_size - sizeof(tail) - sizeof(cached_head)];
};

Since I want to avoid false sharing, I aligned head and tail to the L1 cache line size.
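The separation can also be sanity-checked at compile time. The following is a sketch, not part of the real queue: the `layout_check` struct mirrors only the index members, and the `cache_line` constant (with an assumed fallback of 64 for standard libraries that lack the interference-size constants) is illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Fall back to a common x86 value if the standard library does not ship
// the interference-size constants (guarded by the feature-test macro).
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t cache_line = 64;
#endif

// Minimal mirror of the queue's index members, used only for layout checks.
struct layout_check
{
    alignas(cache_line) std::atomic<std::size_t> head; // producer line
    std::size_t cached_tail;

    alignas(cache_line) std::atomic<std::size_t> tail; // consumer line
    std::size_t cached_head;
};

// head/cached_tail and tail/cached_head must land on different cache lines,
// while each cached_* variable shares a line with its same-thread index.
static_assert(offsetof(layout_check, tail) - offsetof(layout_check, head) >= cache_line,
              "producer and consumer variables share a cache line");
static_assert(offsetof(layout_check, cached_tail) - offsetof(layout_check, head) < cache_line,
              "cached_tail should sit on the producer line");
```

If these asserts compile, the alignas directives are doing what the comments above claim on the target platform.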

The pseudo-code-ish implementation of the push/pop operations is the following:

bool push(const void* elems, size_t n)
{
    size_t h = atomic_load(head, relaxed);

    if (num_remaining_storage(h, cached_tail) < n)
    {
        cached_tail = atomic_load(tail, acquire);

        if (num_remaining_storage(h, cached_tail) < n)
            return false;
    }

    // write from elems

    atomic_store(head, h + n, release);
    return true;
}

bool pop(void* elems, size_t n)
{
    size_t t = atomic_load(tail, relaxed);

    if (num_stored_elements(cached_head, t) < n)
    {
        cached_head = atomic_load(head, acquire);

        if (num_stored_elements(cached_head, t) < n)
            return false;
    }

    // read into elems

    atomic_store(tail, t + n, release);
    return true;
}

void wait_and_push(const void* elems, size_t n)
{
    size_t h = atomic_load(head, relaxed);

    while (num_remaining_storage(h, cached_tail) < n)
        cached_tail = atomic_load(tail, acquire);

    // write from elems

    atomic_store(head, h + n, release);
}

void wait_and_pop(void* elems, size_t n)
{
    size_t t = atomic_load(tail, relaxed);

    while (num_stored_elements(cached_head, t) < n)
        cached_head = atomic_load(head, acquire);

    // read into elems

    atomic_store(tail, t + n, release);
}

At initialization (not listed here), all the indices are set to 0. The functions num_remaining_storage and num_stored_elements are const functions performing simple calculations based on the passed arguments and the immutable queue capacity - they do not perform any atomic reads or writes.
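For reference, here is a plausible implementation of these helpers, assuming head and tail are free-running counters that are reduced modulo the capacity only when indexing the storage; unsigned wraparound then keeps the subtraction correct even after the counters overflow. In the real queue they are const member functions; the capacity parameter is passed explicitly here just to keep the sketch self-contained.

```cpp
#include <cstddef>

// Elements pushed but not yet popped; well-defined under unsigned wraparound.
std::size_t num_stored_elements(std::size_t head, std::size_t tail)
{
    return head - tail;
}

// Free slots left for push: capacity minus the number of stored elements.
std::size_t num_remaining_storage(std::size_t head, std::size_t tail, std::size_t capacity)
{
    return capacity - (head - tail);
}
```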

Now the question is: do I need to align cached_tail and cached_head as well to completely avoid false sharing on any of the indices, or is it okay as it is? Since cached_tail is producer-private and cached_head is consumer-private, I think cached_tail can live in the same cache line as head (the producer cache line), just as cached_head can live in the same cache line as tail (the consumer cache line), without any false sharing ever happening.

Am I missing something?

plasmacel
  • FWIW, I'm pretty sure you don't need to align the cached values, but didn't get enough conviction yet. – Jeffrey Apr 29 '20 at 17:48
  • You're probably going to want 3 cache lines: writer-only, reader-only, and stuff accessed by both. Ideally you can avoid having both threads write the shared line frequently. Even one writing and the other reading keeps flipping it from Modified to Shared and requiring the thread doing a store to do an RFO to regain exclusive ownership. But having the non-shared vars in separate cache lines could allow a bit more progress to be made / in-flight while waiting on those share requests or RFOs. – Peter Cordes Apr 30 '20 at 01:24
  • @PeterCordes I don't entirely understand your comment. Do you suggest to put both `cached_tail` and `cached_head` into a common, but third cache line? Even if they are not shared across threads, they are handled by different threads (`cached_tail` by the producer, `cached_head` by the consumer). Look at the beginning of my question at the access patterns. – plasmacel Apr 30 '20 at 01:59
  • I should have said "producer only" / "consumer only" / stuff accessed by both. Obviously variables that are only accessed by one of the threads should go in a cache line that the other thread never touches. IDK if you thought I was saying to break it down by write-only vs. read-only data? No, I meant writer = producer. Anyway, the hard question is how to split up other variables that are actually shared, and whether there's anything to gain or lose from having more lines so you can have vars that only the producer writes (read by consumer) in a line that the producer can always at least read. – Peter Cordes Apr 30 '20 at 02:07
  • @PeterCordes Ah, okay, it's clear now. Well, `head` and `cached_tail` are always updated together in `push()` - just like `tail` and `cached_head` in `pop()`. But `head` and `tail` are accessed by both threads. At the worst case I can use four cache lines, one for each. According to your suggestion `head` and `tail` should go to the same cache line, since they are accessed by both threads. That would definitely cause false sharing. I don't think it is a good idea, also all lock-free implementations I've seen put them (`head` and `tail`) into separate cache lines. – plasmacel Apr 30 '20 at 02:14
  • @PeterCordes What's the problem with the current 2 cache lines version? – plasmacel Apr 30 '20 at 02:19
  • `push` has to *modify* `cached_tail`, but `pop` will sometimes read `head` in that same line, taking it out of MESI Modified state. Your caching strategy makes that infrequent so it may be fine or at least not worth another extra cache line. Re: `head` and `tail` in the same line: I didn't think that through fully, or I was for some reason thinking of a linked-list implementation where pop detects empty via a NULL pointer in a node instead of by looking at the producer's state. So yes, as I eventually got to saying, it's a question of how to split up the vars that both threads R and/or W. – Peter Cordes Apr 30 '20 at 02:38
  • Oh, and in the fast path `push` only reads `cached_tail`, and that's still fast with the line in MESI Shared state. So yes this is probably good. If the queue is near-empty most of the time then contention is unavoidable and separate cache lines for the fully private `cached_tail` and `cached_head` are unlikely to help. Although they *might*? Also if we're lucky, the compiler can keep the cached head / tail in a register if called inside a fairly tight loop that doesn't inline too much other stuff. – Peter Cordes Apr 30 '20 at 02:44
  • @PeterCordes Haha, that was a typo. Fixed now. – plasmacel Apr 30 '20 at 13:16

1 Answer


Thank you for providing the pseudocode - it is still lacking some details, but I think I get the basic idea. You have a bounded SPSC queue where the indexes can wrap around, and you use the cached_tail variable in push to check if there are free slots, so you can avoid loading tail from a potentially invalidated cache line (and vice versa for pop).

I would suggest putting head and cached_tail next to each other (i.e., on the same cache line), and tail and cached_head on a different one. push always reads both variables (head and cached_tail), so it makes sense to keep them close together. cached_tail is updated only when there are no more free slots and we have to reload tail.

Your code is a bit thin on details, but it seems that there is some room for optimization:

bool push(const void* elems, size_t n)
{
    size_t h = atomic_load(head);

    if (num_remaining_storage(h, cached_tail) < n)
    {
        auto t = atomic_load(tail);
        if (t == cached_tail)
            return false;

        // We only have to update our cached_tail if the reloaded value
        // is different - and in this case it is guaranteed that there
        // is a free slot, so we don't have to perform a recheck.
        cached_tail = t;
    }

    // write from elems

    atomic_store(head, h + n);
    return true;
}

That way cached_tail is only ever updated when head is also updated, so this is another reason for them to be on the same cache line. Of course the same kind of optimization can be applied to pop as well.
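Sketched in the same pseudocode style, pop would look like this (note that skipping the recheck is only valid for single-element operations; a bulk pop with n > 1 would still have to re-test the condition after reloading head):

```
bool pop(void* elems, size_t n)
{
    size_t t = atomic_load(tail);

    if (num_stored_elements(cached_head, t) < n)
    {
        auto h = atomic_load(head);
        if (h == cached_head)
            return false;

        // A changed head guarantees at least one newly stored element,
        // so for n == 1 no recheck is needed; for n > 1 the condition
        // num_stored_elements(cached_head, t) < n would have to be re-tested.
        cached_head = h;
    }

    // read into elems

    atomic_store(tail, t + n);
    return true;
}
```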

That is exactly why I wanted to see some code: the access pattern is crucial for determining which variables should share a cache line and which should not.

mpoeter
  • Good idea with delaying update of `cached_tail` or `cached_head`. That removes any possible benefit to having 4 separate cache lines. (Without that I was wondering if it made any sense to split them up so spinning on a retry of push or pop isn't writing the line the other thread needs to write. In a full or empty queue situation.) Even then I'm not sure it's significant. If the line is in Shared state the other thread could read it more easily, but would still have to RFO on the store anyway, whether we invalidate it again after their read or not. – Peter Cordes Apr 30 '20 at 11:40
  • (I think that same optimization can work for `pop`, or at least a similar one.) – Peter Cordes Apr 30 '20 at 11:41
  • @PeterCordes yes, of course the same optimization can be applied for `pop` too. I did want to mention that, but apparently forgot. Thanks for the note, I have updated my answer. – mpoeter Apr 30 '20 at 11:58
  • Great answer and idea on the further optimization! I updated my question with some more details, including memory ordering and waiting versions of `push` and `pop` (`wait_and_push`, `wait_and_pop`). If you have any more suggestions about them, don't hesitate to share with me. ;] – plasmacel Apr 30 '20 at 13:01
  • I think I could use the same optimization you suggested to the spinwait versions as well. – plasmacel Apr 30 '20 at 13:22
  • "we only have to update our cached_tail if the reloaded value is different" - this is only true when `n == 1`, otherwise it only means that there is more space, but after the re-load it can be still not enough for `n`, so re-testing the condition `num_remaining_storage(h, cached_tail) < n` is required if `n > 1`. Still, it is a great optimization when `n == 1`. I have separate functions for batch (bulk) and single-value operations so there is no need for extra branches. – plasmacel Apr 30 '20 at 14:39
  • Yes you are right, I did not realize from your pseudocode that you want to perform bulk operations. However, I don't think that this optimization works for the blocking functions, since you are actually waiting for head/tail to be updated. FWIW: the memory orderings look fine. – mpoeter Apr 30 '20 at 15:06
  • @mpoeter There is a short chat about the blocking functions here, if you are interested (the point is that the delayed assignment of the cached indices is still advantageous in this case): https://chat.stackoverflow.com/rooms/212865/room-for-plasmacel-and-peter-cordes – plasmacel Apr 30 '20 at 15:29
  • Ah, yes you can of course avoid repeated writes to the cached values. In fact I would use a local variable to track the index and only write to the cached variable just once when there is enough space to proceed, as Peter Cordes already suggested. – mpoeter Apr 30 '20 at 20:54