
I'm trying to implement a simple yet fast producer/consumer collection whose contents can be copied from another thread. To make it as fast as possible, I am not using any locking mechanisms.

Basically the code looks like this (simplified):

var pc = new ProducerConsumer();
pc.Add(1);

var producerTask = Task.Run(() =>
{
    var loop = 1;
    while (true)
    {
        pc.Add(loop);
        if (pc.count == 5000) // example wraparound
        {
            loop++;
            pc.array[0] = loop;
            pc.count = 1; // clear to one element for simplicity of this example
        }
    }
});

var consumerTask = Task.Run(() =>
{
    while (true)
    {
        Console.WriteLine(pc.ToArray().First());
    }
});

Task.WaitAll(producerTask, consumerTask);

class ProducerConsumer
{
    public volatile int count = 0;
    public readonly int[] array = new int[5000];

    // not meant to be thread-safe, designed to be used by just one producer
    public void Add(int value)
    {
        array[count] = value;
        count++;
    }

    // should be thread-safe, used by multiple readers
    public int[] ToArray() => array[..count];
}

The idea is that the reader thread(s) should be free to access anything in array[0..count] at any time and get a valid value. The size of the array is constant.

It seems to work fine and is 6x faster than ConcurrentBag<T> in benchmarks, but do I have a guarantee that these two lines:

array[count] = value;
count++;

are always going to execute in this specific order and will not, for example, get optimized into:

array[count++] = value;

by a compiler or CPU?

Do I need Interlocked.MemoryBarrier() between these two instructions?

Do I need to mark count as volatile?

I would like to avoid unnecessary instructions, as a single MemoryBarrier slows the loop down by 25%. I tested it for some time in Release, without the debugger and with all optimizations enabled, and it seems that neither makes any difference.


Actually, List<T> from .NET would work great for my needs, but for some reason .Add is implemented as:

_version++;
T[] array = _items;
int size = _size;
if ((uint)size < (uint)array.Length)
{
    _size = size + 1;
    array[size] = item;
}
else
{
    AddWithResize(item);
}

Is there a reason why it was implemented like this? Simply reordering the instructions in lines 6 and 7 (the _size = size + 1; and array[size] = item; statements) could improve the thread-safety of the generic List (even though I know it's not meant to be thread-safe), as sketched below.
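
To be explicit, the reordering I mean is just the two statements inside the if swapped (hypothetical, not the actual List<T> source):

if ((uint)size < (uint)array.Length)
{
    array[size] = item; // write the element first...
    _size = size + 1;   // ...then publish the new size
}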

  • Improving thread safeness (whatever that means) without actually providing thread safety is just a waste of resources. List is not thread-safe, so why waste additional resources? Your code is not thread-safe either. Depending on the CPU architecture, the consumer may never see changes to count or array if they are cached in the CPU. Also, I'm not exactly sure that memory barriers are enough here. Why is performance so important? Honestly, I would just throw a lock here and not worry about it. – freakish Mar 21 '23 at 06:16
  • Do you have any real life documented examples of such behavior? – Adassko Mar 21 '23 at 06:19
  • Please be aware that your code is not thread-safe - in contrast to `ConcurrentBag`. This might be the reason your code is 6x faster. I don't think that the compiler will collapse that code into your one-liner, because that would change the runtime behaviour, and the compiler will not understand those implications. I'm pretty sure that such an optimization would break much more code than yours. – Sebastian Schumann Mar 21 '23 at 06:41
  • If performance is such a problem then remove `ToArray`, implement `IEnumerable` instead, and run from 0 to count - 1 without creating a copy of the array. If the caller needs an array, it can call `ToArray()` on that enumerable. – Sebastian Schumann Mar 21 '23 at 06:44
  • @SebastianSchumann: yes, I have this implemented but it's not a part of the question. I am aware that the code is not fully thread-safe, what I really need is a List (or rather a bag - I only need to append items) with occasional peeking functionality. "Peeking" is a very rare situation that needs to create an instant snapshot of the list content accessible from another thread. I don't want to introduce locks just because of this rare occurrence - adding a lock makes the usual flow of the code 25x slower – Adassko Mar 21 '23 at 06:54
  • It seems to me in your example there is only one thread that is appending. All other threads are performing read-only operations. Since `int` gets atomic writes, your code should be thread-safe as it is, assuming you aren't that picky about how quickly the read-only threads get the latest data. The data may be slightly stale but won't ever be invalid. – John Wu Mar 21 '23 at 07:01

1 Answer


(This answer was posted during the 2nd revision of the question, before the OP added the volatile keyword to the ProducerConsumer.count field.)

You don't need a full fence (Interlocked.MemoryBarrier), but you do need at least one-way ordering of your memory operations when you interact with the ProducerConsumer class. The producer should do a release-store (Volatile.Write), and the consumers should do an acquire-load (Volatile.Read), so the compiler/JIT will maintain the order at compile time and use instructions that stop the CPU from reordering at run-time. Without this, a consumer might read an array element before the = value store that filled it becomes visible. As long as you have acquire/release synchronization between the threads, and the count is only increasing, your class should be safe for single-producer/multiple-consumers scenarios.
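
As a minimal sketch (assuming the count only ever increases, and keeping the layout of the class from the question), that pattern could look like this:

using System.Threading;

class ProducerConsumer
{
    private int _count = 0;
    private readonly int[] _array = new int[5000];

    // Not thread-safe (single producer only)
    public void Add(int value)
    {
        _array[_count] = value; // plain store of the element
        Volatile.Write(ref _count, _count + 1); // release-store: publishes the element
    }

    // Thread-safe for multiple consumers, provided _count only ever grows
    public int[] ToArray()
    {
        // Acquire-load: everything stored before the matching release-store is now visible
        int count = Volatile.Read(ref _count);
        return _array[..count];
    }
}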

But apparently the count is not always increasing. When the array is full, the producer resets the count to 1. This is a huge problem that cannot be solved with barriers alone. The problem is that the producer has no way of knowing when all consumers have finished reading the array, so that it can safely reset it. When you write multithreaded code you should always keep in mind that the operating system can suspend any thread at any time, for a duration in the range of 10-30 milliseconds, and sometimes more (demo). So a consumer that executes the line array[..count] can read the count, then get suspended for 30 msec, and then resume and use the obsolete count value to copy the first count items of the array. So it might read 4,500 items at a time when the current count is 500. This is unlikely to be what you want.
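
To visualize the window, here is the expression-bodied ToArray from the question unrolled into two steps, with one such hypothetical interleaving in the comments:

public int[] ToArray()
{
    int count = this.count; // reads count == 4500
    // <-- the OS suspends this thread for ~30 msec here; meanwhile the
    //     producer wraps around, resets count to 1, and keeps overwriting
    //     the array from index 0
    return array[..count];  // copies 4500 items from a half-rewritten array
}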

One way to solve this problem is to store the state of the ProducerConsumer in an internal State class, and implement the reset by swapping the old State with a new one. This way any consumer that has grabbed a reference to a State will continue interacting with that state even if it gets suspended immediately after grabbing it. So it will read consistent data, although potentially a bit stale. Here is an example:

public class ProducerConsumer
{
    private State _state = new State();

    private class State
    {
        public readonly int[] Items = new int[5000];
        public int Count = 0;
    }

    // Not thread-safe (it is intended for a single producer)
    public void Add(int value)
    {
        if (_state.Count >= _state.Items.Length - 1)
        {
            State newState = new();
            newState.Items[0] = value;
            newState.Count = 1;
            Volatile.Write(ref _state, newState);
            return;
        }
        Volatile.Write(ref _state.Items[_state.Count], value);
        Volatile.Write(ref _state.Count, _state.Count + 1);
    }

    // Thread-safe (multiple consumers are OK)
    public int[] ToArray()
    {
        State state = Volatile.Read(ref _state);
        int count = Volatile.Read(ref state.Count);
        int[] array = new int[count];
        for (int i = 0; i < count; i++)
            array[i] = state.Items[i];
        return array;
    }
}
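
A minimal usage sketch of this class, mirroring the loops from the question:

var pc = new ProducerConsumer();

var producerTask = Task.Run(() =>
{
    for (int i = 0; ; i++)
        pc.Add(i); // the wraparound is handled internally by swapping states
});

var consumerTask = Task.Run(() =>
{
    while (true)
    {
        int[] snapshot = pc.ToArray(); // always internally consistent, possibly slightly stale
        if (snapshot.Length > 0) Console.WriteLine(snapshot[^1]);
    }
});

Task.WaitAll(producerTask, consumerTask);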

In a previous version of this answer, each and every element of the array was read with volatile semantics: array[i] = Volatile.Read(ref state.Items[i]);. According to a comment by Peter Cordes, this is not needed:

You don't need volatile reads of each state.Items[i]. You've already synchronized-with the thread that stored those items when you did int count = Volatile.Read(ref state.Count); (an acquire load that saw the value from a release store), so it's now safe to do non-atomic accesses to state.Items[0 .. count-1]; those array elements have already been written. The state reference/pointer came from one acquire-load, so we have a reference to an object that's at most being monotonically appended to, and the Count was written with a release store after the array elements.
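
Given that, the per-element loop in ToArray could just as well be a plain bulk copy. A sketch of the simplified method:

public int[] ToArray()
{
    State state = Volatile.Read(ref _state);
    int count = Volatile.Read(ref state.Count); // acquire-load; synchronizes with the producer
    int[] array = new int[count];
    Array.Copy(state.Items, array, count); // plain (non-volatile) copying is safe here
    return array;
}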

.NET contains an internal class named SingleProducerSingleConsumerQueue<T>, which is exactly what its name indicates. You could study its source code for educational purposes if you want.

  • (without release/acquire) *the compiler and Jitter are free to reorder the instructions* - and the CPU would be free to reorder the resulting memory operations triggered by those instructions, unless it happens to be x86. But note `Volatile.Write` isn't a release *fence*, it's a release *operation*. I'm not sure C# has an exact equivalent for C++ `std::atomic_thread_fence(std::memory_order_release)`, but note that that's stronger than just a release-store *operation* on some variable. – Peter Cordes Mar 21 '23 at 11:40
  • See https://preshing.com/20130922/acquire-and-release-fences/ and especially https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ for the important distinction between a fence and an operation (in a C++11 context, but C# `Volatile.Write` has AFAIK exactly the same ordering semantics as a C++ release *operation* like `foo.store(value, release)`, and thus can't be described as a release *fence*.) – Peter Cordes Mar 21 '23 at 11:42
  • (BTW, I haven't read the whole question, or the whole rest of your answer yet, but yeah resetting the count and replacing the array with a smaller one is highly problematic. Some RCU-style abandoning (to GC) of the whole array + count object and allocating a fresh one could fix it like I think you're suggesting, so any given `count` instance is monotonically increasing and any thread accessing it will also be accessing an array it goes with. And you allocate a new `count` along with a new smaller array and put them in a struct, and update a reference to that pair.) – Peter Cordes Mar 21 '23 at 11:45
  • Thank you for the answer, yes - this is not an actual implementation I made, it's a little more complex and actually involves swapping the collections when they get full so the `count` is actually always just incrementing. Isn't using `Volatile.Read` and `Volatile.Write` just the same as marking the field as `volatile`? – Adassko Mar 21 '23 at 13:46
  • @Adassko AFAIK marking a field as `volatile` is the same as using `Volatile.Read` and `Volatile.Write` in **all** interactions with the field. In my example the producer reads the `_state.Count` and the `_state.Items.Length` directly, and that's why I preferred using the static `Volatile` methods over the `volatile` keyword for the fields. – Theodor Zoulias Mar 21 '23 at 14:08
  • @PeterCordes I just read the two linked articles. Larry and Sergey look very much alike! I have trouble translating the C++ code to the C# world. What is the C++ equivalent of the C# code `Volatile.Write(ref count, 1);`? Is it (A) `count.store(1, std::memory_order_release);` or (B) `std::atomic_thread_fence(std::memory_order_release); count.store(1, std::memory_order_relaxed);`? – Theodor Zoulias Mar 21 '23 at 14:54
  • I'm pretty sure `Volatile.Write` is just a release *operation* like `count.store(1, release)`. The manual says "If a read or write appears before this method in the code, the processor cannot move it after this method." That one-way ordering isn't strong enough for a fence, despite the docs earlier saying "On systems that require it, inserts a *memory barrier*". I'm pretty sure that's misleading language, and it could compile to AArch64 `stlr`, not needing separate `dmb` + `str` (https://godbolt.org/z/PWfzcbsr6). Unfortunately https://sharplab.io/ can only target x64, not AArch64. – Peter Cordes Mar 21 '23 at 15:28
  • The thing that's missing in https://learn.microsoft.com/en-us/dotnet/api/system.threading.volatile.write?view=net-8.0 is anything that would stop a later store from reordering with the `Volatile.Write`, and then being able to reorder with loads and stores from before the `Volatile.Write`. A release fence has to block LoadStore and StoreStore reordering between all earlier loads/stores and all later stores, vs. a release *operation* just blocking that reordering with this one store. That store itself can reorder as far as it wants in one direction, leaving everything else mixed. – Peter Cordes Mar 21 '23 at 15:33
  • @PeterCordes I think that I should remove the *"release-fence"* and *"acquire-fence"* terms from the answer, instead of renaming them to *"release-operation"* and *"acquire-operation"*. These are ultra-subtle concepts that C# programmers are not familiar with. Is it right to call `Volatile.Write` and `Volatile.Read` half fences? If not, should I avoid comparing these `Volatile` methods with `Interlocked.MemoryBarrier` at all? I would like the answer to be practical, and avoid pointing C# people down the rabbit hole of reading and learning tons of things that they can't use! – Theodor Zoulias Mar 21 '23 at 15:47
  • You don't need volatile reads of each `state.Items[i]`. You've already synchronized-with the thread that stored those items when you did `int count = Volatile.Read(ref state.Count);` (an acquire load that saw the value from a release store), so it's now safe to do non-atomic accesses to `state.Items[0 .. count-1]`; those array elements have already been written. The `state` reference/pointer came from one acquire-load, so we have a reference to an object that's at most being monotonically appended to, and the `Count` was written with a release store after the array elements. – Peter Cordes Mar 21 '23 at 15:47
  • I don't know what terminology or mental models are standard in the C# community. AFAIK, C# doesn't have release or acquire fences (an equivalent for `atomic_thread_fence(acquire)`), only its full barrier or acquire/release *operations* vs. Interlocked seq_cst operations. Understanding the difference between an operation and a fence was kind of an eye-opener for me in terms of understanding why weakly-ordered ISAs can be more efficient for some cases, and the C++ standard's memory ordering rules. But it's been a while since then, and I don't remember how I thought about stuff before. – Peter Cordes Mar 21 '23 at 15:53
  • If you wanted to use `state.Items[i]` as an index into some other thing produced by another thread, then yeah perhaps you'd want to `Volatile.Read` it, or yes put a fence (`Interlocked.MemoryBarrier`) after copying the whole array. More than likely you'd be fine in practice without that, due to dependency ordering, which is something all CPUs except DEC Alpha do; `i = shared.load(consume);` `tmp = arr[i];` works like `acquire` but without barriers even on weakly-ordered ISAs (or would, if `consume` wasn't deprecated in C++ and treated as `acquire`). IDK if C# guarantees anything. – Peter Cordes Mar 21 '23 at 15:58
  • @PeterCordes thanks for the edit. I was trying to improve the same paragraph, but your edit is better. :-) – Theodor Zoulias Mar 21 '23 at 16:10
  • I took a stab at wording that first paragraph to explain that acq/rel synchronization is necessary and sufficient without going down the rabbit hole of operation vs. fence or "half-fence" or "acquire fence". I'm sure you're better able to put yourself in the shoes of a reader with C# experience who isn't familiar with some of that terminology, so have a look and let me know and/or edit yourself. I think the acquire and release concept is very important here, so it's a good idea to use that terminology even if it's new to some readers, and even though MS's Volatile.Write doc avoids them. – Peter Cordes Mar 21 '23 at 16:10
  • @PeterCordes FYI in a recent discussion on GitHub I posted a code fragment showing the combination `Interlocked.CompareExchange`+`Volatile.Write` used for synchronization purposes, and Stephen Toub [in a comment](https://github.com/dotnet/runtime/issues/81877#issuecomment-1435978758) referred to the `Volatile.Write` as *"store release fence"*. I am not sure that even Microsoft's top C# engineers are aware of the subtle difference between *"release fence"* and *"release operation"*! That discussion is now locked, so it's not possible to comment there. – Theodor Zoulias Mar 26 '23 at 23:39
  • To be slightly fair, C# doesn't *have* an equivalent for C++ `atomic_thread_fence(release)` so https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ would never come up. So hopefully when he says "release fence", he's attaching the correct concept to it, of ordering for the store itself, not creating release ordering between later plain stores and earlier plain loads/stores. Because outside of x86, it won't do that if it just compiles to AArch64 `stlr` or something. In the compiler, it does need one-way-fence semantics for compile-time ordering of the op. – Peter Cordes Mar 27 '23 at 00:00
  • Hmm, he goes on to say "*your numbers would look different on arm where there's real cost to the fence*". `stlr` isn't as cheap as `str`, but it's still just a plain store with some ordering semantics that don't include blocking StoreLoad (draining the store buffer); it's only as strong as what x86 always does. I'd assume it's handled fairly efficiently, as long as there isn't an `ldar` (seq_cst acquire load) soon after which *would* force it to wait for the store buffer to drain. (I haven't looked at the whole thread to get more context for those comments.) – Peter Cordes Mar 27 '23 at 00:15
  • I finally got curious enough to dig up some details on how C# compiles for AArch64: https://devblogs.microsoft.com/dotnet/arm64-performance-improvements-in-dotnet-7/ from Sep 2022 confirms that `Volatile.Write` does JIT to `stlr` on AArch64, not `dmb ish` ; `str`. But until a recent change, assignment to `volatile int` *did* use a full barrier. (The article correctly refers to `dmb ish` as a "two-way barrier", vs. "one-way barrier using store-release semantics" to describe `stlr` used for `Volatile.Write`). Anyway, .NET 7 compiles `volatile` assignment the same as Volatile.Write now. – Peter Cordes Mar 27 '23 at 00:20
  • @PeterCordes interesting! Btw Stephen Toub makes a comparison with the [`SpinLock.Exit`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.spinlock.exit) method. You can see the source code [here](https://github.com/dotnet/runtime/blob/6a533c1a364e41309394d2253ad36b2e001ff23f/src/libraries/System.Private.CoreLib/src/System/Threading/SpinLock.cs#L507). When the `useMemoryBarrier` is `true`, the `Interlocked.Exchange` is used. Otherwise the `volatile _owner` field is just updated directly. – Theodor Zoulias Mar 27 '23 at 00:25
  • *whether a memory fence should be issued in order to immediately publish the exit operation to other threads* - That's the opposite of how fences work (they stall this thread's later memory access until earlier operations have completed). Reduced competition from later loads (and maybe stores) might make a store visible a tiny bit sooner, but at a much larger cost to this thread's throughput. But it's a common misconception. It's a SpinLock exit so it has to use at least a release operation, otherwise what you'd want is a relaxed store that can (on AArch64) commit before older stores. – Peter Cordes Mar 27 '23 at 00:41
  • Maybe Stephen has experimental evidence that contradicts my assumption that it would rarely be worth it to use a full barrier instead of just a release store (like Interlocked.Exchange instead of Volatile.Write) for this use-case. Like if later loads really do tend to compete with the RFO to get ownership of the cache line holding the lock variable. – Peter Cordes Mar 27 '23 at 00:45
  • @PeterCordes btw keep in mind that you are talking with a guy (myself) who has almost zero familiarity with assembly and hardware-related concepts. C# is as "close-to-the-metal" as I can go. So when you talk about stores and releases, or mention obscure three-letter instruction names, I do my best to guess what these terms might mean, but most likely my guesses are wrong. :-) – Theodor Zoulias Mar 27 '23 at 01:02
  • Ok, in higher level terms, that asm output confirms that `Volatile.Write` only has one-way ordering, that it's just a release *operation* in C++-style terminology, not a release fence that can keep other independent operations on either side of it from reordering with each other. In terms of how that Volatile.Write operation itself can reorder, you *can* describe that as a one-way "release fence" in C#, if you don't need a name for what `std::atomic_thread_fence(release)` does (a 2-way fence that doesn't include StoreLoad ordering). – Peter Cordes Mar 27 '23 at 01:27