
The following implementation is from Wikipedia:

volatile unsigned int produceCount = 0, consumeCount = 0;
TokenType buffer[BUFFER_SIZE];

void producer(void) {
    while (1) {
        while (produceCount - consumeCount == BUFFER_SIZE)
            sched_yield(); // buffer is full

        buffer[produceCount % BUFFER_SIZE] = produceToken();
        // a memory_barrier should go here, see the explanation above
        ++produceCount;
    }
}

void consumer(void) {
    while (1) {
        while (produceCount - consumeCount == 0)
            sched_yield(); // buffer is empty

        consumeToken(buffer[consumeCount % BUFFER_SIZE]);
        // a memory_barrier should go here, the explanation above still applies
        ++consumeCount;
    }
}

The accompanying explanation says that a memory barrier must be used between the line that accesses the buffer and the line that updates the count variable.

This is done to prevent the CPU from reordering the instructions above the fence with those below it. The count variable shouldn't be incremented before it is used to index into the buffer.

If a fence is not used, won't this kind of reordering violate the correctness of the code? The CPU shouldn't perform the increment of the count before the count is used to index into the buffer. Does the CPU not take care of data dependencies when reordering instructions?

Thanks

nishantsingh
    @user3286661: It means that no memory barrier can make the code above well-defined C++. It doesn't help with the answering, it explains why the premise of your question is flawed. – MSalters Jul 19 '16 at 09:35
  • @user3286661 We don't care about performance, right? This is just about correctness. (Because the performance of code that spins on `sched_yield` is going to be awful.) This must either be pseudocode or platform-specific code. There is no portable rule for how `volatile` interacts with memory barriers and threads. – David Schwartz Jul 19 '16 at 09:42
  • The question is about the concept of memory barriers. We don't care about performance. – nishantsingh Jul 19 '16 at 09:44
    @DavidSchwartz Wikipedia example is written in pseudocode, which really looks like Java. And there are several mentions of Java and no other languages in references and links. It is safe to assume that it is Java-like volatile which gives additional guarantees. – Revolver_Ocelot Jul 19 '16 at 09:53
  • @Revolver_Ocelot Yeah, Java's memory model is the same as C++'s in this respect. – 2501 Jul 19 '16 at 09:55
  • @2501 C++ and Java are completely different with regard to what `volatile` does with threads. – David Schwartz Jul 19 '16 at 09:56
  • @DavidSchwartz I'm talking about the model, not the keyword. You probably missed my comment complaining about volatile and C++. – 2501 Jul 19 '16 at 09:57
  • @2501 Are you suggesting a platform might have a memory barrier whose semantics aren't even defined with respect to `volatile`s? What use would such a thing be? – David Schwartz Jul 19 '16 at 09:59
  • "*Does the CPU not take care of data dependency while instruction reordering?*" -- Wrong question. The question is whether it takes care of data dependencies as that data might be seen on another core. And the answer is "no, that would be very, very difficult since that other core might be doing anything at all." – David Schwartz Oct 23 '16 at 05:09

3 Answers

3

If a fence is not used, won't this kind of reordering violate the correctness of the code? The CPU shouldn't perform the increment of the count before the count is used to index into the buffer. Does the CPU not take care of data dependencies when reordering instructions?

Good question.

In C++, unless some form of memory barrier is used (an atomic, a mutex, etc.), the compiler assumes that the code is single-threaded. In that case, the as-if rule says that the compiler may emit whatever code it likes, provided that the overall observable effect is as if your code had executed sequentially.

As mentioned in the comments, volatile does not necessarily alter this: it is merely an implementation-defined hint that the variable may change between accesses (which is not the same as being modified by another thread).

So if you write multi-threaded code without memory barriers, you get no guarantee that changes to a variable in one thread will even be observed by another thread, because as far as the compiler is concerned, no other thread should ever be touching the same memory.

What you will actually observe is undefined behaviour.
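To make that concrete, here is a hedged sketch of the Wikipedia loop rewritten with C++11 `std::atomic`, which gives both the compiler and the CPU the ordering the pseudocode only comments about. The buffer size, token values, and function signatures here are assumptions for illustration, not the article's code:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Illustrative sizes/values, not from the original article.
constexpr std::size_t BUFFER_SIZE = 8;
constexpr int N_TOKENS = 10000;

int buffer[BUFFER_SIZE];
std::atomic<unsigned> produceCount{0}, consumeCount{0};

void producer() {
    for (int i = 0; i < N_TOKENS; ++i) {
        // Spin while the buffer is full; acquire pairs with the
        // consumer's release, so the slot is known to be free.
        while (produceCount.load(std::memory_order_relaxed)
               - consumeCount.load(std::memory_order_acquire) == BUFFER_SIZE)
            std::this_thread::yield();
        unsigned p = produceCount.load(std::memory_order_relaxed);
        buffer[p % BUFFER_SIZE] = i;  // write the token first...
        // ...then publish it; the release store may not be reordered
        // before the buffer write.
        produceCount.store(p + 1, std::memory_order_release);
    }
}

void consumer(long long* sum) {
    for (int i = 0; i < N_TOKENS; ++i) {
        // Spin while the buffer is empty; acquire pairs with the
        // producer's release, so the token write is visible.
        while (produceCount.load(std::memory_order_acquire)
               - consumeCount.load(std::memory_order_relaxed) == 0)
            std::this_thread::yield();
        unsigned c = consumeCount.load(std::memory_order_relaxed);
        *sum += buffer[c % BUFFER_SIZE];  // read the token first...
        // ...then mark the slot reusable.
        consumeCount.store(c + 1, std::memory_order_release);
    }
}
```

The release store publishes the buffer access; the matching acquire load in the other thread is what makes that access visible in the right order.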

Richard Hodges
  • And what about the location of memory fence. Why is it between those two statements? – nishantsingh Jul 19 '16 at 09:41
  • @user3286661 the presence of a memory fence prevents reordering of memory writes across the fence, not just in the current thread, but also *as observed by the other thread*. This is the important part. It allows the memory update to be used as a signal across threads. – Richard Hodges Jul 19 '16 at 09:48
2

It seems that your question is "can incrementing produceCount and the assignment to buffer be reordered without changing the code's behavior?".

Consider the following code transformation:

int count1 = produceCount++;
buffer[count1 % BUFFER_SIZE] = produceToken();

Notice that this code behaves exactly like the original: one read from the volatile variable, one write to it, the read happens before the write, and the state of the program is the same. However, other threads will see a different picture regarding the order of the produceCount increment and the buffer modification.

Both the compiler and the CPU can perform that transformation in the absence of memory fences, so you need a fence to force those two operations into the correct order.
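As an illustrative sketch (the token source and buffer size are assumed stand-ins), a C++11 release fence between the two statements is one way to forbid exactly this transformation:

```cpp
#include <atomic>
#include <cassert>

constexpr unsigned BUFFER_SIZE = 16;   // assumed for illustration
int buffer[BUFFER_SIZE];
std::atomic<unsigned> produceCount{0};

int produceToken() { return 42; }      // stand-in token source

void produce_one() {
    unsigned p = produceCount.load(std::memory_order_relaxed);
    buffer[p % BUFFER_SIZE] = produceToken();
    // No memory operation above this fence may be reordered past
    // the store below, by the compiler or by the CPU.
    std::atomic_thread_fence(std::memory_order_release);
    produceCount.store(p + 1, std::memory_order_relaxed);
}
```

With the fence in place, another thread that observes the incremented produceCount (with a matching acquire) is guaranteed to also observe the buffer write.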

Revolver_Ocelot
2

If a fence is not used, won't this kind of reordering violate the correctness of the code?

Nope. Can you construct any portable code that can tell the difference?

The CPU shouldn't perform the increment of the count before the count is used to index into the buffer. Does the CPU not take care of data dependencies when reordering instructions?

Why shouldn't it? What would the payoff be for the costs incurred? Things like write combining and speculative fetching are huge optimizations, and disabling them is a non-starter.

If you're thinking that volatile alone should do it, that's simply not true. The volatile keyword has no defined thread synchronization semantics in C or C++. It might happen to work on some platforms and it might happen not to work on others. In Java, volatile does have defined thread synchronization semantics, but they don't include providing ordering for accesses to non-volatiles.

However, memory barriers do have well-defined thread synchronization semantics. We need to make sure that no thread can see that data is available before it sees the data itself. And we need to make sure that the marking of data as free to be overwritten is not seen before the consuming thread is finished with that data.
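A minimal sketch of the first guarantee, using C++11 acquire/release (the variable names are assumptions for illustration): the reader cannot observe `ready == true` without also observing the write to `data`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void writer() {
    data = 123;                                    // prepare the data first
    ready.store(true, std::memory_order_release);  // then mark it available
}

int reader() {
    // The acquire load pairs with the release store: once `ready` is
    // seen true, the write to `data` is guaranteed to be visible too.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    return data;
}
```

The second guarantee (don't reuse a slot before the consumer is done with it) is the mirror image: the consumer releases, the producer acquires.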

David Schwartz