2
  • It is known that x86_64 allows Store-Load reordering when there is no MFENCE between the store and the load.

Intel® 64 and IA-32 Architectures

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

  • It is also known that Store-Load reordering can occur in an example such as this one:

c.store(relaxed) <--> b.load(seq_cst): https://stackoverflow.com/a/42857017/1558037

// Atomic load-store
void test() {
    std::atomic<int> b, c;
    c.store(4, std::memory_order_relaxed);          // movl 4,[c];
    int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
}

can be reordered to:

// Atomic load-store
void test() {
    std::atomic<int> b, c;
    int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
    c.store(4, std::memory_order_relaxed);          // movl 4,[c];
}

This is possible because no MFENCE is emitted between the store and the load on x86_64.


But is there a real, working example that shows the side effect of Store-Load reordering on x86_64?

That is, an example that gives the correct result with Store(seq_cst), Load(seq_cst), but gives a wrong result with Store(relaxed), Load(seq_cst).

Or is Store-Load reordering allowed on x86_64 precisely because it cannot be detected and demonstrated in a program?

Alex
  • Maybe the example given in this answer could fail on x86_64: [A. Williams example](http://stackoverflow.com/a/14864466/5632316) – Oliv Mar 20 '17 at 16:50
  • @Oliv Thank you. Yes, this is a well-known canonical example of `seq_cst`, but it has no single thread that performs the sequence `store(), load()` – Alex Mar 20 '17 at 17:06

2 Answers

4

Yes, there is an example of Store-Load reordering in C++11 on x86_64.

First, we strictly prove the correctness of our code. Then we remove the mfence barrier between the STORE and the LOAD in that code and see that the algorithm breaks down.

Consider a custom lock (a spin-lock) implemented without CAS/RMW operations, using only loads and stores, for a limited number of threads, where each thread is numbered 0-4:

// Example of Store-Load reordering when store(release) is used instead of store(seq_cst)
#include <atomic>
#include <cstddef>

struct lock_t {
    static const size_t max_locks = 5;
    std::atomic<int> locks[max_locks];

    bool lock(size_t const thread_id) {

        locks[thread_id].store(1, std::memory_order_seq_cst);                     // Store
        // store(seq_cst): mov; mfence;
        // store(release): mov;

        for (size_t i = 0; i < max_locks; ++i)
            if (locks[i].load(std::memory_order_seq_cst) > 0 && i != thread_id) { // Load
                locks[thread_id].store(0, std::memory_order_release);   // undo lock
                return false;
            }
        return true;
    }

    void unlock(size_t const thread_id) {
        locks[thread_id].store(0, std::memory_order_release);
    }
};

  1. First, we strictly prove that the algorithm is correct and has acquire-release semantics:

[image: proof of the lock algorithm's correctness]


  2. Then we show how the lock algorithm can be broken (the result should be 20000):

C++ diff:

[image: C++ diff between the correct and the broken version]


  3. Then we show the difference in the generated assembly code:

Asm x86_64 diff:

[image: x86_64 assembly diff between the correct and the broken version]

It is strictly proved that the "good" algorithm is correct, and we see that the "bad" algorithm does not work correctly (its result, 19976, is not equal to 20000). The only difference between them is the mfence barrier between the STORE and the LOAD. Therefore, this is an algorithm in which Store-Load reordering occurs and can be observed.
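The experiment can be sketched roughly as follows. This is not the original test program: `lock_sketch`, `run_counter_test`, the choice of two threads and 10000 increments per thread are assumptions made for illustration. The lock is the same algorithm as above, with the ordering of the first store turned into a template parameter and the array zeroed explicitly. The counter is updated with a separate relaxed load and store (deliberately not an RMW), so a broken lock shows up as lost updates.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical adaptation of lock_t: StoreOrder selects the ordering of the
// first store in lock(). seq_cst gives "mov; mfence" (or xchg) on x86_64,
// release gives a plain "mov".
template <std::memory_order StoreOrder>
struct lock_sketch {
    static constexpr std::size_t max_locks = 5;
    std::atomic<int> locks[max_locks];

    lock_sketch() {
        // Default-constructed atomics are not guaranteed to be zero before C++20.
        for (auto& l : locks) l.store(0, std::memory_order_relaxed);
    }

    bool lock(std::size_t thread_id) {
        locks[thread_id].store(1, StoreOrder);                                    // Store
        for (std::size_t i = 0; i < max_locks; ++i)
            if (locks[i].load(std::memory_order_seq_cst) > 0 && i != thread_id) { // Load
                locks[thread_id].store(0, std::memory_order_release);             // undo lock
                return false;
            }
        return true;
    }

    void unlock(std::size_t thread_id) {
        locks[thread_id].store(0, std::memory_order_release);
    }
};

// Runs `threads` workers that each increment a shared counter under the lock
// and returns the final counter value.
template <std::memory_order StoreOrder>
int run_counter_test(std::size_t threads, int increments_per_thread) {
    lock_sketch<StoreOrder> lock;
    assert(threads <= lock.max_locks);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (std::size_t id = 0; id < threads; ++id)
        pool.emplace_back([&, id] {
            for (int n = 0; n < increments_per_thread; ++n) {
                while (!lock.lock(id)) std::this_thread::yield();  // retry until acquired
                // Separate load and store: only mutual exclusion keeps this correct.
                counter.store(counter.load(std::memory_order_relaxed) + 1,
                              std::memory_order_relaxed);
                lock.unlock(id);
            }
        });
    for (auto& t : pool) t.join();
    return counter.load();
}
```

`run_counter_test<std::memory_order_seq_cst>(2, 10000)` should return 20000. With `std::memory_order_release` the first store is a plain `mov` on x86_64, and the count may come out lower; whether and how often that happens depends on the hardware and on timing.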

There is also at least one other example of Store-Load reordering that is somewhat similar to ours: Can x86 reorder a narrow store with a wider load that fully contains it?
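A smaller way to look for the same effect is the classic store-buffer litmus test: each of two threads stores to its own variable and then loads the other one. The sketch below is an illustration under assumptions, not code from any of the linked posts; `count_both_zero`, the start gate and the iteration count are hypothetical names and choices. It counts how often both loads return 0, which is impossible when every operation is `seq_cst`.

```cpp
#include <atomic>
#include <thread>

// Returns how many of `iterations` runs ended with both loads reading 0.
// StoreOrder is the ordering used for the two stores; the loads are seq_cst.
template <std::memory_order StoreOrder>
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int it = 0; it < iterations; ++it) {
        std::atomic<int> x{0}, y{0};
        std::atomic<int> ready{0};
        int r1 = -1, r2 = -1;

        // Simple start gate so both threads reach their store at about the same time.
        auto wait_for_peer = [&ready] {
            ready.fetch_add(1, std::memory_order_acq_rel);
            while (ready.load(std::memory_order_acquire) < 2) std::this_thread::yield();
        };

        std::thread t1([&] {
            wait_for_peer();
            x.store(1, StoreOrder);                   // Store
            r1 = y.load(std::memory_order_seq_cst);   // Load from the other location
        });
        std::thread t2([&] {
            wait_for_peer();
            y.store(1, StoreOrder);                   // Store
            r2 = x.load(std::memory_order_seq_cst);   // Load from the other location
        });
        t1.join();
        t2.join();

        // r1 == 0 && r2 == 0 means each load took effect before the other thread's store.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

`count_both_zero<std::memory_order_seq_cst>(n)` must return 0 for any `n`. With `std::memory_order_relaxed` or `std::memory_order_release` a non-zero count is allowed; since threads are created on every iteration here, the window is small and many iterations may be needed before it shows up, if it does at all.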

Alex
  • Another example: http://preshing.com/20120515/memory-reordering-caught-in-the-act/ – Alex Mar 31 '17 at 19:47
0

The compiler does not reorder loads and stores around a std::memory_order_seq_cst operation.

The CPU may reorder these because there are no dependencies between the store and the load. In other words, the store may complete after the load. However, there is no way to observe the difference because loads do not have side effects.

Maxim Egorushkin
  • Thank you! But why do you think the compiler leaves this reordering to the CPU but does not do it itself? As quoted here from the C++ Standard, the compiler is allowed to do this reordering: http://stackoverflow.com/a/42857017/1558037 I showed in my answer above that this reordering can happen in a real example, where it significantly disrupts the program. There is another example that shows `Store(release)-Load(seq)` reordering in C++ on x86_64: http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it/39007998#39007998 – Alex Mar 22 '17 at 19:09
  • seq_cst operations only have a total order with respect to other seq_cst operations. `atomic_signal_fence(mo_seq_cst)` blocks reordering, though. – Peter Cordes Aug 26 '17 at 15:57
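To illustrate the distinction in the last comment, here is a minimal sketch; the function names are hypothetical. `std::atomic_signal_fence` only constrains the compiler and emits no instruction, so the CPU's store buffer can still let the load take effect before the store becomes visible. `std::atomic_thread_fence(std::memory_order_seq_cst)` also orders the two at run time; on x86_64 compilers typically implement it with `mfence` or a locked instruction.

```cpp
#include <atomic>

// Compile-time barrier only: the compiler keeps the store before the load,
// but the hardware may still reorder them (Store-Load).
int store_then_load_signal_fence(std::atomic<int>& c, const std::atomic<int>& b) {
    c.store(4, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_seq_cst);
    return b.load(std::memory_order_relaxed);
}

// Full fence: blocks both compiler and CPU Store-Load reordering.
int store_then_load_thread_fence(std::atomic<int>& c, const std::atomic<int>& b) {
    c.store(4, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return b.load(std::memory_order_relaxed);
}
```

In a single thread the two functions behave identically; the difference is only visible to other threads, and in the generated assembly.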