
Following up on my idea that, by using both software and hardware memory barriers, I could disable the out-of-order optimization for a specific function inside code compiled with compiler optimization, and could therefore implement a software semaphore using algorithms like Peterson's or Dekker's that require no out-of-order execution, I have tested the following code, which contains both the SW barrier asm volatile("": : :"memory") and the gcc builtin HW barrier __sync_synchronize():

#include <stdio.h>
int main(int argc, char ** argv)
{
    int x=0;
    asm volatile("": : :"memory");
    __sync_synchronize();
    x=1;
    asm volatile("": : :"memory");
    __sync_synchronize();
    x=2;
    asm volatile("": : :"memory");
    __sync_synchronize();
    x=3;
    printf("%d",x);
    return 0;
}

But the compilation output file is:

main:
.LFB24:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    mfence
    mfence
    movl    $3, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    mfence
    call    __printf_chk
    xorl    %eax, %eax
    addq    $8, %rsp

And if I remove the barriers and compile again, I get:

main:
.LFB24:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $3, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
    xorl    %eax, %eax
    addq    $8, %rsp

Both were compiled with gcc -Wall -O2 on Ubuntu 14.04.1 LTS, x86-64.

I expected the output of the code containing the memory barriers to retain all of the assignments from my source code, with an mfence between each of them.

According to a related StackOverflow post -

gcc memory barrier __sync_synchronize vs asm volatile("": : :"memory")

When adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier

And later on:

However, when the CPU performs this code, it's permitted to reorder the operations "under the hood", as long as it does not break the memory ordering model. This means that the operations can be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.

But as you can see, the only difference between the code with the memory barriers and the code without them is that the former contains mfence instructions, placed in a way I did not expect to see, and not all of the assignments are included.

Why is the output for the file with the memory barriers not what I expected? Why has the order of the mfence instructions been altered? Why did the compiler remove some of the assignments? Is the compiler allowed to make such optimizations even when a memory barrier is applied between every single line of code?

References to the memory barrier types and usage:

izac89
    Terminology: **out-of-order execution is separate from memory reordering**. Even in-order CPUs are pipelined and benefit from a store buffer, especially for stores that miss in L1 (https://en.wikipedia.org/wiki/MESI_protocol#Memory_Barriers). Once they're known not to be speculative, they can be tracked only by memory-ordering logic (to enforce StoreStore and LoadStore ordering if needed) until they actually commit to L1 cache, after the pipeline has forgotten about them. `MFENCE` doesn't serialize the pipeline; it only serializes the order that memory operations become globally visible. – Peter Cordes Aug 03 '16 at 15:45

1 Answer


The memory barriers tell the compiler/CPU that instructions shouldn't be reordered across the barrier; they don't mean that writes which can be proven pointless have to be done anyway.

If you define your x as volatile, the compiler can't make the assumption that it's the only entity that cares about x's value, and it has to follow the rules of the C abstract machine, which require the memory writes to actually happen.

In your specific case you could then skip the barriers, because volatile accesses are already guaranteed not to be reordered against each other.

If you have C11 support, you are better off using _Atomic, which can additionally guarantee that normal assignments won't be reordered around accesses to x and that the accesses themselves are atomic.


EDIT: GCC (as well as clang) seems to be inconsistent in this regard and won't always do this optimization. I opened a GCC bug report regarding this.

a3f
    You wrote a much better answer than me. – 2501 Aug 03 '16 at 11:18
  • Right answer. I have tested it now with `volatile`, and the code with the memory barrier was compiled right as I expected (while the code without the memory barrier was still optimized a bit). Unfortunately I cannot test the `atomic` as I don't have C11 support. – izac89 Aug 03 '16 at 11:26
  • @2501 Thanks. Feel free to extend it, if you think something could be improved. :) – a3f Aug 03 '16 at 11:30
  • @a3f Are you sure? Try to compile with `gcc -std=c11`. – fuz Aug 03 '16 at 13:08
  • @FUZxxl What exactly should I be looking at? – a3f Aug 03 '16 at 13:18
  • @a3f If the compiler accepts this option, it supports C11. Note that this is not the default on old gcc versions (older than 5.0) and C11 features might not be available if the dialect isn't set to C11 or newer. – fuz Aug 03 '16 at 13:28
  • @Fuzxxl Uhm, Did you maybe mean to address OP? – a3f Aug 03 '16 at 13:41
  • 1
    @user2162550 Have you tried to compile with `gcc -std=c11`? – fuz Aug 03 '16 at 14:02
  • Now I wonder (and don't have gcc around for next few hours), if I would alias `int x;` as memory pointer `int *xptr = &x;` and use that one to write value, would the optimizer still manage to remove the x writes completely (understanding fully the uselessness of alias), or would it get afraid of pointer usage and do the write with `mfence`? – Ped7g Aug 03 '16 at 15:21
  • @Ped7g Pointer usage is optimized as expected. I found what looks like a missed optimization though and have edited the answer. – a3f Aug 03 '16 at 17:12
  • What you reported is not a bug. `static int y;` can be optimized away because no other functions in the compilation unit observe it. There are potential observers for the stores to (global) `int x;`, so it can't be optimized away. [If you include a `int gety(){return y;}` in the compilation unit, f() and g() compile the same](https://godbolt.org/g/xt5kGz). – Peter Cordes Aug 03 '16 at 18:09
  • @PeterCordes But GCC doesn't have to assume that these other observers are asynchronous. [What matters is the value that is stored into x at the end, see this example without barriers](https://godbolt.org/g/YkBokp). – a3f Aug 03 '16 at 18:45
  • @a3f: Right, unless you use asm statements that could contain code that observes the `x` or `y`. gcc has to assume that since you used a `"memory"` clobber. So that's part of the semantics of that form of compiler barrier. – Peter Cordes Aug 03 '16 at 18:50
  • 1
    @PeterCordes This makes sense, but if `x = 1` can be observed by the `asm` and therefore has to be written, why wouldn't the same apply to `static y`? – a3f Aug 03 '16 at 19:06
  • Hrm, good question. Yeah, I get a link error if I reference `y` when it's optimized away (https://godbolt.org/g/lsKvqh). See the notes and comments in that godbolt link. I'm not sure how to explain this, but I'm still sure that the stores are optimized away only when `y` doesn't exist at all. Obviously the better way to write such asm would use an input, output, or read-write operand with a constraint, instead of just referencing the symbol directly. That also avoids issues with C++ name mangling making it hard to actually reference `y`. But that isn't why the compiler assumes you don't. – Peter Cordes Aug 03 '16 at 19:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/119045/discussion-between-a3f-and-peter-cordes). – a3f Aug 03 '16 at 19:19