
asm volatile("": : :"memory") is often used as a memory barrier (e.g. as seen in the Linux kernel's barrier() macro).

This sounds similar to what the GCC builtin __sync_synchronize does.

Are these two similar?

If not, what are the differences, and when would one be used over the other?
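
For concreteness, the two constructs in question are:

asm volatile("" : : : "memory");   /* the inline-asm compiler barrier */
__sync_synchronize();              /* the GCC full-barrier builtin */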

Lii

2 Answers


There's a significant difference - the first option (the inline asm) actually does nothing at runtime; no instruction is emitted for it and the CPU never sees it. It only serves at compile time, to tell the compiler not to move loads or stores past this point (in either direction) as part of its optimizations. It's called a SW (software) barrier.

The second barrier (the builtin __sync_synchronize) translates into a HW (hardware) barrier, typically a fence instruction (mfence on x86) or its equivalent on other architectures. The CPU itself may also perform various optimizations at runtime, the most important being out-of-order execution - this instruction tells it to make sure that loads or stores can't cross this point and must be observed on the correct side of the sync point.
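
To make the distinction concrete, here is a minimal sketch (assuming GCC; data, ready and the function names are illustrative) of the two barriers used to publish a value to another thread:

int data;
int ready;

void publish_sw(int value)
{
    data = value;
    asm volatile("" : : : "memory");  /* SW barrier: no instruction emitted,
                                         but gcc may not move the two stores
                                         across this point */
    ready = 1;
}

void publish_hw(int value)
{
    data = value;
    __sync_synchronize();             /* HW barrier: emits a fence (mfence on
                                         x86), constraining the CPU's runtime
                                         reordering as well */
    ready = 1;
}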

Here's another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barrier that is specific to multi-processor environments. The names of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers and on uni-processor systems, they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.
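
For reference, the barrier() macro mentioned above is essentially the inline asm from the question (paraphrasing its definition in include/linux/compiler.h):

#define barrier() __asm__ __volatile__("" : : : "memory")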

An example of when a SW barrier is useful: consider the following code -

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here's the assembly gcc 4.8.0 generated at -O3, using packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>
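
For comparison, here is the same loop with the compiler barrier inserted in each iteration (a sketch of the change described below):

for (i = 0; i < N; ++i) {
    a[i]++;
    asm volatile("" : : : "memory"); /* SW barrier: gcc may not reorder or
                                        combine memory accesses across it */
}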

However, with your inline assembly added on each iteration, gcc is not permitted to move memory operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>

However, even when the CPU executes this code, it's permitted to reorder the operations "under the hood", as long as it does not break the memory-ordering model. This means the operations can be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.
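
Replacing the compiler barrier with the builtin would add such a fence (a sketch; on x86, gcc emits an mfence for it, at a significant runtime cost per iteration):

for (i = 0; i < N; ++i) {
    a[i]++;
    __sync_synchronize(); /* HW barrier: constrains the CPU's runtime
                             reordering as well as the compiler's */
}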

Leeor
  • Thanks - this certainly clears up a few things. However, from this I cannot see when a software/compiler-only barrier could be useful. When would one not need a hardware barrier as well? –  Nov 13 '13 at 22:14
  • I see your edit for the SW barrier example, but I still have a hard time grasping why that would be desired. I can picture preventing auto-vectorization being situationally useful, but other than that, in what kind of situations would I want to move the load inside the loop - if e.g. this was used for a spin-lock implementation, or a read-copy-update mechanism, wouldn't a hardware memory barrier pretty much always be needed as well? –  Nov 13 '13 at 22:41
  • @user964970 - Changed the example to a simpler one (with assembly output), hope that's more understandable now. The loop-invariant variable didn't fit this scheme so I left it out of the example, but in theory it might be useful to prevent moving it out of the loop in case some other thread might update it in the middle of the loop. – Leeor Nov 13 '13 at 22:47
  • @user964970 - In a single-core environment (assuming no interaction with DMA data) the only re-ordering you need to be concerned with is that of the compiler. That is because, otherwise, all processors guarantee sequential semantics of operations, i.e., effects equivalent to instruction completion in program order. The latter would be enough for the programmer not to care even about compiler reordering, in the case of single-core w/o DMA-data-interaction, if it weren't for interrupts! Of course, this includes context switches. – kavadias May 25 '16 at 14:44
  • You only need to care about hardware memory reordering and barriers on a multi-core/multi-processor system. You also need to care about hardware memory reordering and barriers when communicating directly with memory-mapped device registers, or to determine DMA completion (when, in some hardware implementations, this is possible from the CPU). – kavadias May 25 '16 at 14:45
  • The broader case in which you need either kind of memory barrier (SW or HW) is when using some form of signaling among CPU threads (including locks) or between CPU and device, to enforce sequential semantics in CPU code. Of course, the problem is that we are not always conscious of the signaling we use or of the significance of sequential semantics for our thread or CPU-device interactions. – kavadias May 26 '16 at 11:53
  • Let me give you a case where a SW barrier is useful, on a single core with no multi-threading: I am optimising a program on a microcontroller without HW barriers. I have instrumented my code with many start/stop measuring points, but I am facing the problem that the boundary of the measurement is too weak: gcc will move code out of or into the start/stop measurement region, skewing the results. I need a proper SW barrier for both the start-measure and stop-measure operations (see the sketch after these comments). – Philippe F Jan 22 '19 at 10:23
  • The discussion misses the case of ARM processors, where individual CPUs can do out-of-order reads and writes as part of execution scheduling. Completely horrible in C/C++, where the volatile keyword only affects code generation and not instruction scheduling, per the standards. (C# and Java both do the right thing.) It would be nice to reflect how these features work on ARM processors. – Robin Davies Jan 22 '22 at 02:37
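
A sketch of Philippe F's measurement scenario above (read_cycle_counter and do_work are hypothetical names for the platform's timing primitive and the measured code):

unsigned long t0, t1;

t0 = read_cycle_counter();          /* hypothetical timing primitive */
asm volatile("" : : : "memory");    /* SW barrier: measured code may not be
                                       hoisted above the start point */
do_work();                          /* the code being measured */
asm volatile("" : : : "memory");    /* SW barrier: measured code may not
                                       sink below the stop point */
t1 = read_cycle_counter();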

A comment on the usefulness of SW-only barriers:

On some microcontrollers and other embedded platforms, you may have multitasking but no cache system or cache latency, and hence no HW barrier instructions. So you need to do things like SW spin-locks, and the SW barrier prevents compiler optimizations (read/write combining and reordering) in those algorithms.
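
A minimal sketch of such a SW spin-lock (assuming a single-core system where tasks preempt each other; a real implementation would also need an atomic test-and-set or interrupt masking around the acquire):

volatile int lock = 0;                    /* 0 = free, 1 = held */

void sw_lock(void)
{
    while (lock)                          /* spin until the lock looks free */
        ;
    lock = 1;                             /* take it (assumes the platform's
                                             task model makes this safe) */
    asm volatile("" : : : "memory");      /* SW barrier: critical-section
                                             accesses may not be hoisted
                                             above the acquire */
}

void sw_unlock(void)
{
    asm volatile("" : : : "memory");      /* SW barrier: critical-section
                                             accesses may not sink below
                                             the release */
    lock = 0;
}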

Sagar
Ivar Svendsen
  • Why would you not use the HW-barrier instructions anyway? They will (presumably) do the right thing on your monstrous little processor, and offer some hope of being portable to other processors for free? – Robin Davies Jan 22 '22 at 02:45
  • @RobinDavies The answer to that question is rather simple: There are no such instructions on these processors. – Jonathan S. Nov 16 '22 at 23:12
  • @JonathanS. If there are no such instructions on these processors, one would assume that the intrinsics won't generate any code (while still providing a barrier to compiler optimization)? – Robin Davies Jan 05 '23 at 18:15
  • @RobinDavies That's correct. The barrier will not emit any code, but it's still very much necessary to prevent the compiler from messing up. x86-64 is actually one such architecture that doesn't need hardware memory barriers in most cases (it's strongly ordered for data accesses), but barriers are still needed to keep the compiler in check. – Jonathan S. Jan 05 '23 at 19:07