Is there a working example that shows the side effect of Store-Load reordering on x86_64?

Question

As is known, Store-Load reordering can happen on x86_64 if there is no MFENCE between the Store and the Load.

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

It is also known that Store-Load reordering can happen in an example like this:

c.store(relaxed) <--> b.load(seq_cst): https://stackoverflow.com/a/42857017/1558037

// Atomic load-store
void test() {
    std::atomic<int> b, c;
    c.store(4, std::memory_order_relaxed);          // movl 4,[c];
    int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
}

can be reordered to:

// Atomic load-store
void test() {
    std::atomic<int> b, c;
    int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
    c.store(4, std::memory_order_relaxed);          // movl 4,[c];
}

This is because no MFENCE is generated on x86_64:

clang 4.0.0 - x86_64: https://godbolt.org/g/N9CPyJ
gcc 7.0 - x86_64: https://godbolt.org/g/MdjvI0

But is there a working example that shows the side effect of Store-Load reordering on x86_64?

That is, an example that gives the correct result with Store(seq_cst), Load(seq_cst), but gives a wrong result with Store(relaxed), Load(seq_cst).

Or is Store-Load reordering allowed on x86_64 because it cannot be detected and shown in a program?

Maybe the example given in this answer could fail on x86_64: [A. Williams example](http://stackoverflow.com/a/14864466/5632316) – Oliv, Mar 20 '17 at 16:50
@Oliv Thank you. Yes, this is a well-known canonical example of `seq_cst`, but it has no single thread that performs the sequence of operations `store(), load()` – Alex, Mar 20 '17 at 17:06

score 4 · Accepted Answer · edited May 23 '17 at 11:55

Yes, there is an example of Store-Load reordering in C++11 on x86_64.

First, we strictly prove the correctness of our code. Then we remove the mfence barrier between the STORE and the LOAD in this code and see that the algorithm breaks down.

The code is a custom lock (a spin-lock) implemented without CAS/RMW operations, using only loads and stores, for a limited number of threads, each numbered 0-4:

// example of Store-Load reordering if used: store(release)
struct lock_t {
    static const size_t max_locks = 5;
    std::atomic<int> locks[max_locks];

    bool lock(size_t const thread_id) {

        locks[thread_id].store(1, std::memory_order_seq_cst);                     // Store
        // store(seq_cst): mov; mfence;
        // store(release): mov;

        for (size_t i = 0; i < max_locks; ++i)
            if (locks[i].load(std::memory_order_seq_cst) > 0 && i != thread_id) { // Load
                locks[thread_id].store(0, std::memory_order_release);   // undo lock
                return false;
            }
        return true;
    }

    void unlock(size_t const thread_id) {
        locks[thread_id].store(0, std::memory_order_release);
    }
};

First, we strictly prove that the algorithm is correct and has acquire-release semantics:

Then we show how our lock algorithm can be broken. The result should be 20000:
- Good example, with no Store-Load reordering (result: 20000): http://coliru.stacked-crooked.com/a/baba611d686f0320
- Bad example, with Store-Load reordering (result: 19976): http://coliru.stacked-crooked.com/a/99ff821b9f0127f4
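The linked programs are not reproduced here, so the following is only a sketch of what such a test could look like, not the code behind the links. It assumes five threads that each increment a plain, non-atomic counter under the lock, and the names `flag_lock` and `run_counter` are made up for illustration. The order of the first store is a template parameter, so the good and the bad variant come from the same source.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Same algorithm as lock_t above; only the memory order of the first store
// is configurable. (Hypothetical name, for illustration only.)
template <std::memory_order StoreOrder>
struct flag_lock {
    static constexpr std::size_t max_locks = 5;
    std::atomic<int> locks[max_locks];

    flag_lock() {
        for (auto& l : locks) l.store(0, std::memory_order_relaxed);
    }

    bool try_lock(std::size_t thread_id) {
        locks[thread_id].store(1, StoreOrder);                        // Store
        for (std::size_t i = 0; i < max_locks; ++i)
            if (i != thread_id &&
                locks[i].load(std::memory_order_seq_cst) > 0) {       // Load
                locks[thread_id].store(0, std::memory_order_release); // undo
                return false;
            }
        return true;
    }

    void unlock(std::size_t thread_id) {
        locks[thread_id].store(0, std::memory_order_release);
    }
};

// Each of the max_locks threads increments the counter per_thread times.
// Returns the final counter value.
template <std::memory_order StoreOrder>
int run_counter(int per_thread) {
    flag_lock<StoreOrder> lock;
    int counter = 0;  // deliberately non-atomic: protected only by the lock
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < flag_lock<StoreOrder>::max_locks; ++id)
        threads.emplace_back([&, id] {
            for (int n = 0; n < per_thread; ++n) {
                while (!lock.try_lock(id)) std::this_thread::yield();
                ++counter;
                lock.unlock(id);
            }
        });
    for (auto& t : threads) t.join();
    return counter;
}
```

With `run_counter<std::memory_order_seq_cst>(4000)` the result should always be 20000. With `std::memory_order_release` the lock no longer guarantees mutual exclusion, so the increments become a data race: a smaller value may appear, but any particular run can still happen to print 20000.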

C++ diff:

Then we show the difference in the generated assembly code:
- Good example, with no Store-Load reordering (there is an mfence): https://godbolt.org/g/WrCiyW
- Bad example, with Store-Load reordering (there is no mfence): https://godbolt.org/g/Eo3TXR

Asm x86_64 diff:

It is strictly proved that the "good" algorithm is correct, and we see that the "bad" algorithm does not work correctly (its result, 19976, is not equal to 20000). The only difference between them is the mfence barrier between the STORE and the LOAD. Therefore, we have shown an algorithm in which Store-Load reordering occurs.

There is also at least one other example of Store-Load reordering that is a bit like ours: Can x86 reorder a narrow store with a wider load that fully contains it?
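For comparison, the smallest form of this effect is the classic store-buffer litmus test: two threads each store to their own variable and then load the other one. Below is a minimal sketch of it, written for this answer rather than taken from the linked programs; `count_reorderings` and the round-based handshake are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

// Store-buffer litmus test. Two workers each store 1 to "their" variable and
// then load the other one. Returns the number of rounds in which both loads
// returned 0, which cannot happen if every store is ordered before the
// following load.
template <std::memory_order StoreOrder>
int count_reorderings(int rounds) {
    assert(rounds > 0);
    std::atomic<int> x{0}, y{0};
    std::atomic<int> start{0};  // number of the round the workers may run
    std::atomic<int> done{0};   // workers finished with the current round
    int r1 = 0, r2 = 0;         // load results, checked by the coordinator

    auto worker = [&](std::atomic<int>& mine, std::atomic<int>& other, int& r) {
        for (int i = 1; i <= rounds; ++i) {
            while (start.load(std::memory_order_acquire) != i)
                std::this_thread::yield();
            mine.store(1, StoreOrder);                  // Store
            r = other.load(std::memory_order_seq_cst);  // Load
            done.fetch_add(1, std::memory_order_release);
        }
    };
    std::thread t1(worker, std::ref(x), std::ref(y), std::ref(r1));
    std::thread t2(worker, std::ref(y), std::ref(x), std::ref(r2));

    int both_zero = 0;
    for (int i = 1; i <= rounds; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        done.store(0, std::memory_order_relaxed);
        start.store(i, std::memory_order_release);  // publish resets, open round i
        while (done.load(std::memory_order_acquire) != 2)
            std::this_thread::yield();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    t1.join();
    t2.join();
    return both_zero;
}
```

`count_reorderings<std::memory_order_seq_cst>(n)` must return 0 under the C++ memory model. With `std::memory_order_release` or `std::memory_order_relaxed` a nonzero count is allowed on x86_64, but how often it shows up depends on timing, so a single run returning 0 proves nothing.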

Another example: http://preshing.com/20120515/memory-reordering-caught-in-the-act/ – Alex, Mar 31 '17 at 19:47

Maxim Egorushkin · Answer 2 · 2017-03-21T10:36:04.460

0

The compiler does not reorder loads and stores around std::memory_order_seq_cst operations.

The CPU may reorder these because there are no dependencies between the store and the load. In other words, the store may complete after the load. However, there is no way to observe the difference because loads do not have side effects.

edited Mar 21 '17 at 10:36

answered Mar 20 '17 at 18:49


Thank you! But why do you think the compiler leaves this reordering to the CPU but does not do it itself? As quoted here from the C++ Standard, the compiler is allowed to do this reordering: http://stackoverflow.com/a/42857017/1558037 I showed in my answer above that this reordering can happen in a real example, where it significantly disrupts the work of the program. There is another example that shows `Store(release)-Load(seq)` reordering on C++ & x86_64: http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it/39007998#39007998 – Alex Mar 22 '17 at 19:09
seq_cst operations only have a total order with respect to other seq_cst operations. `atomic_signal_fence(mo_seq_cst)` blocks reordering, though. – Peter Cordes Aug 26 '17 at 15:57
