
I am reading about memory barriers, and what I can summarize is that they prevent instruction reordering done by compilers.

So in user-space memory, let's say I have:

int add(int, int);  /* defined elsewhere */

int a, c;
int b = 0;

int main(void)
{
    a = 10;
    b = 20;
    c = add(a, b);
}

Can the compiler reorder this code so that the b = 20 assignment happens after c = add() is called?

Why do we not use barriers in this case? Am I missing something fundamental here?

Is virtual memory exempt from any reordering?

Extending the question further:

In a network driver:

        /*
         * Writing to TxStatus triggers a DMA transfer of the data
         * copied to tp->tx_buf[entry] above. Use a memory barrier
         * to make sure that the device sees the updated data.
         */
        wmb();
        RTL_W32_F (TxStatus0 + (entry * sizeof (u32)),
                   tp->tx_flag | max(len, (unsigned int)ETH_ZLEN));

When the comment says the device sees the updated data, how do I relate this to the multithreaded theory for the usage of barriers?

RootPhoenix
  • Memory barriers are not just about compiler reordering. In a multithreaded program, different threads can see different orderings of memory accesses (for most ISAs). For compiler reordering, the reordering cannot be visible within the one thread but may be visible in other threads. E.g., given the function uses constants, the compiler could precompute `add(a,b)` and store the result first, then store `a` and `b` so another thread would see the change to `c` before the changes to `a` and `b` even on a sequentially consistent processor. –  Mar 22 '16 at 12:38
  • In kernel code, I see barriers being used mostly with memory accesses to devices or RAM, so where does multithreaded programming fit here? – RootPhoenix Mar 22 '16 at 13:17
  • It's a wide question, but basically - there is an implicit order in a single-thread context, one that a compiler can and must preserve, but there's no implicit ordering between actions over different threads, so neither the compiler nor the hardware can impose one unless you tell them how to. The best they can do is decide on some random order, and make it appear consistent. – Leeor Mar 22 '16 at 14:09
  • re: your edit. Read the last paragraph of my answer. That's *exactly* the use case I was talking about: making sure that preceding stores happen (and will be visible to DMA) before triggering the DMA. – Peter Cordes Mar 24 '16 at 09:13

3 Answers


Short answer

Memory barriers are used less frequently in user-mode code than in kernel-mode code because user-mode code tends to use higher-level abstractions (for example, pthread synchronization operations).

Additional details

There are two things to consider when analyzing the possible ordering of operations:

  1. What order the thread that is executing the code will see the operations in
  2. What order other threads will see the operations in

In your example the compiler cannot reorder b=20 to occur after c=add(a,b) because the c=add(a,b) operation uses the results of b=20. However, it may be possible for the compiler to reorder these operations so that other threads see the memory location associated with c change before the memory location associated with b changes.

Whether this would actually happen or not depends on the memory consistency model that is implemented by the hardware.

As for when the compiler might do reordering you could imagine adding another variable as follows:

int add(int, int);  /* defined elsewhere */

int a, c, d;
int b = 0;

int main(void)
{
    a = 10;
    b = 20;
    d = 30;
    c = add(a, b);
}

In this case the compiler would be free to move the d=30 assignment to occur after c=add(a,b).

However, this entire example is too simplistic. The program doesn't do anything observable, so the compiler can eliminate all the operations and need not write anything to memory at all.

Addendum: Memory reordering example

In a multiprocessor environment multiple threads can see memory operations occur in different orders. The Intel Software Developer's Manual has some examples in Volume 3 section 8.2.3. I've copied a screenshot below that shows an example where loads and stores can be reordered. There is also a good blog post that provides some more detail about this example.

[Screenshot: loads reordered with earlier stores to different locations]

Gabriel Southern
  • You mean to say that since there is a dependency on b, reordering won't happen for this thread, but I didn't understand the part where you refer to other threads seeing it differently... Can you please point me to some good resources or ebooks where this can be understood in a ground-up manner? – RootPhoenix Mar 24 '16 at 07:06
  • @Vatvaghul I added some more detail. You may be interested in looking at some of the blog posts from Preshing on Programming (http://preshing.com/archives/) related to this topic. Many are written in a way that is fairly accessible even without a lot of background knowledge. – Gabriel Southern Mar 24 '16 at 16:45

The thread running the code will always act as if the effects of its own source lines happened in program order. This "as-if" rule is what enables most compiler optimizations.

Within a single thread, out-of-order CPUs track dependencies to give a thread the illusion that all its instructions executed in program order. The globally-visible (to threads on other cores) effects may be seen out-of-order by other cores, though.

Memory barriers (as part of locking, or on their own) are only needed in code that interacts with other threads through shared memory.

Compilers can similarly do any reordering / hoisting they want, as long as the results are the same. The C++ memory model is very weak, so compile-time reordering is possible even when targeting an x86 CPU. (But of course not reordering that produces different results within the local thread.) C11 <stdatomic.h> and the equivalent C++11 std::atomic are the best way to tell the compiler about any ordering requirements you have for the global visibility of operations. On x86, this usually just results in putting store instructions in source order, but the default memory_order_seq_cst needs an MFENCE on each store to prevent StoreLoad reordering for full sequential consistency.

In kernel code, memory barriers are also common to make sure that stores to memory-mapped I/O registers happen in a required order. The reasoning is the same: to order the globally-visible effects on memory of a sequence of stores and loads. The difference is that the observer is an I/O device, not a thread on another CPU. The fact that cores interact with each other through a cache coherency protocol is irrelevant.

Peter Cordes
  • The NIC is an external device, but none of our driver code runs on the NIC, right? It is only notified... so how does the compiler know about its ordering requirements? – RootPhoenix Mar 24 '16 at 17:37
  • @Vatvaghul: The compiler doesn't know; that's why you have to tell it. It's possible to have processes running *different* code interact via shared memory. The compiler controls the ordering of your code's actions. You can pretend the NIC is a CPU running its own code that you're communicating with. – Peter Cordes Mar 24 '16 at 20:34

The compiler cannot reorder (nor can the runtime or the CPU) so that b=20 happens after c=add(), since that would change the semantics of the method, and that is not permissible. For the compiler (or runtime or CPU) to act as you describe would make the behaviour random, which would be a bad thing.

This restriction on reordering applies only within the thread executing the code. As @GabrielSouthern points out, the ordering of the stores becoming globally visible is not guaranteed, if a, b, and c are all global variables.

Erik
  • So when does the compiler reorder? – RootPhoenix Mar 22 '16 at 13:16
  • This is not correct. It is possible for the `c=add()` operation to be visible to other threads before `b=20` depending on the memory consistency model. It won't happen on x86 that has a strong consistency model (TSO), but it can happen on hardware that has a weaker consistency model. – Gabriel Southern Mar 22 '16 at 18:01
  • @GabrielSouthern: I'm pretty sure Erik is talking about within a single thread, which appears to be the source of the OP's confusion. I was going to edit this answer to make the point more clearly, but then it turned into my own answer so I just posted it separately. – Peter Cordes Mar 23 '16 at 03:35
  • @PeterCordes I agree with what you posted in your answer. I still think this answer is potentially misleading, because usually questions about barriers relate to some inter-thread/process communication. – Gabriel Southern Mar 23 '16 at 04:40
  • @GabrielSouthern: yup, I figured out a way to make that point with a minor edit. So now it's not misleading, and is just less complete than our answers. – Peter Cordes Mar 23 '16 at 04:48
  • @GabrielSouthern What I was after was that the reordering proposed in the OP would change the value of c to something different from what it would be without reordering, and that cannot be permissible anywhere. It changes the semantics of the method. – Erik Mar 23 '16 at 08:03