9
int i = 0;
if(i == 10)  {...}  // [1]

std::atomic<int> ai{0};
if(ai == 10) {...}  // [2]
if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

Is statement [1] any faster than statements [2] & [3] in a multithreaded environment?
Assume that ai may or may not be written by another thread while [2] & [3] are executing.

Add-on: Provided that an accurate value of the underlying integer is not a necessity, what is the fastest way to read an atomic variable?

iammilind
  • Yes, [1] should be faster. [2] requires fence or lock instructions, depending on the architecture. – Igor Tandetnik Jan 03 '20 at 15:36
  • Depends on your system. Do you have a specific architecture you are targeting? – NathanOliver Jan 03 '20 at 15:36
  • Also, if `i` can be read and written to in multiple threads you have a data race and undefined behavior, which totally erases any performance gains. – NathanOliver Jan 03 '20 at 15:38
  • On my machine, `[1]` is about 40x faster than `[2]` when optimized. But that can vary greatly by platform and compiler (and the compiler's optimizations). – Eljay Jan 03 '20 at 15:46
  • @NathanOliver, the code is expected to run on various systems with Qt. Regarding your comment on UB, will only the accuracy of the data be affected, or can it crash the system (kind of UB)? – iammilind Jan 03 '20 at 15:48
  • Generally it will be the "you have no guarantee of the result you'll get" kind of UB, not a crash, but it's UB so anything can happen. Generally what happens is that without the synchronization the compiler can apply optimizations it couldn't otherwise, because it "knows" that the value cannot change (you basically told it there are no threads). For the atomic, if you don't care about the result but just want the safety, you might get a speedup using `if(ai.load(std::memory_order_relaxed) == 10)` – NathanOliver Jan 03 '20 at 15:54
  • The usual consequence if you don't use `atomic<>` when you should is stuff like [MCU programming - C++ O2 optimization breaks while loop](//electronics.stackexchange.com/a/387478) - a `while(!ready){}` loop turns into `if(!ready) infinite_loop();` by hoisting the load. – Peter Cordes Jan 03 '20 at 19:05
  • @iammilind The OS will generally prevent a crash of the system, unless your process has special privileges... But the program could really misbehave, because the compiler can assume that UB never happens. So, for example, if a branch leads to unavoidable UB, the compiler can assume it's never taken: the condition is assumed to be false in `if(cond) {unavoidable_UB}`. That can be the case even if the compiler knows `cond` is a true constant, which can lead to funny compilations. – curiousguy Jan 03 '20 at 20:12
  • Please refine the question. What's an "accurate value"? A relatively recent one? Recent how? What is the comparison about? Changing a non-atomic object in such a way is UB. (At least make it volatile.) – curiousguy Jan 03 '20 at 20:26

3 Answers

10

It depends on the architecture, but in general loads are cheap; pairing one with a store under a strict memory ordering can be expensive, though.

On x86_64, aligned loads and stores of up to 64 bits are atomic on their own (but read-modify-write operations decidedly are not).

As you have it, the default memory ordering in C++ is std::memory_order_seq_cst, which gives you sequential consistency, i.e. there is some single order in which all threads see loads/stores occurring. Accomplishing this on x86 (and indeed on all multi-core systems) requires a memory fence on stores, to ensure that loads occurring after the store read the new value.

Reading in this case does not require a memory fence on strongly-ordered x86, but writing does. On most weakly-ordered ISAs, even a seq_cst read requires some barrier instructions, though not a full barrier. If we look at this code:

#include <atomic>

int main(int argc, const char* argv[]) {
    std::atomic<int> num;

    num = 12;           // seq_cst store (the default ordering)
    if (num == 10) {    // seq_cst load (the default ordering)
        return 0;
    }
    return 1;
}

compiled with -O3:

   0x0000000000000560 <+0>:     sub    $0x18,%rsp
   0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
   0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
   0x0000000000000572 <+18>:    xor    %eax,%eax
   0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
   0x000000000000057c <+28>:    mfence 
   0x000000000000057f <+31>:    mov    0x4(%rsp),%eax
   0x0000000000000583 <+35>:    cmp    $0xa,%eax
   0x0000000000000586 <+38>:    setne  %al
   0x0000000000000589 <+41>:    mov    0x8(%rsp),%rdx
   0x000000000000058e <+46>:    xor    %fs:0x28,%rdx
   0x0000000000000597 <+55>:    jne    0x5a1 <main+65>
   0x0000000000000599 <+57>:    movzbl %al,%eax
   0x000000000000059c <+60>:    add    $0x18,%rsp
   0x00000000000005a0 <+64>:    retq

We can see that the read from the atomic variable at +31 doesn't require anything special, but because we wrote to the atomic at +20, the compiler had to insert an mfence instruction afterwards which ensures that this thread waits for its store to become visible before doing any later loads. This is expensive, stalling this core until the store buffer drains. (Out-of-order exec of later non-memory instructions is still possible on some x86 CPUs.)

If we instead use a weaker ordering (such as std::memory_order_release) on the write:

#include <atomic>

int main(int argc, const char* argv[]) {
    std::atomic<int> num;

    num.store(12, std::memory_order_release);  // release store: no fence needed on x86
    if (num == 10) {                           // the load is still seq_cst
        return 0;
    }
    return 1;
}

Then on x86 we don't need the fence:

   0x0000000000000560 <+0>:     sub    $0x18,%rsp
   0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
   0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
   0x0000000000000572 <+18>:    xor    %eax,%eax
   0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
   0x000000000000057c <+28>:    mov    0x4(%rsp),%eax
   0x0000000000000580 <+32>:    cmp    $0xa,%eax
   0x0000000000000583 <+35>:    setne  %al
   0x0000000000000586 <+38>:    mov    0x8(%rsp),%rdx
   0x000000000000058b <+43>:    xor    %fs:0x28,%rdx
   0x0000000000000594 <+52>:    jne    0x59e <main+62>
   0x0000000000000596 <+54>:    movzbl %al,%eax
   0x0000000000000599 <+57>:    add    $0x18,%rsp
   0x000000000000059d <+61>:    retq   

Note though, if we compile this same code for AArch64:

   0x0000000000400530 <+0>:     stp  x29, x30, [sp,#-32]!
   0x0000000000400534 <+4>:     adrp x0, 0x411000
   0x0000000000400538 <+8>:     add  x0, x0, #0x30
   0x000000000040053c <+12>:    mov  x2, #0xc
   0x0000000000400540 <+16>:    mov  x29, sp
   0x0000000000400544 <+20>:    ldr  x1, [x0]
   0x0000000000400548 <+24>:    str  x1, [x29,#24]
   0x000000000040054c <+28>:    mov  x1, #0x0
   0x0000000000400550 <+32>:    add  x1, x29, #0x10
   0x0000000000400554 <+36>:    stlr x2, [x1]
   0x0000000000400558 <+40>:    ldar x2, [x1]
   0x000000000040055c <+44>:    ldr  x3, [x29,#24]
   0x0000000000400560 <+48>:    ldr  x1, [x0]
   0x0000000000400564 <+52>:    eor  x1, x3, x1
   0x0000000000400568 <+56>:    cbnz x1, 0x40057c <main+76>
   0x000000000040056c <+60>:    cmp  x2, #0xa
   0x0000000000400570 <+64>:    cset w0, ne
   0x0000000000400574 <+68>:    ldp  x29, x30, [sp],#32
   0x0000000000400578 <+72>:    ret

When we write to the variable at +36, we use a Store-Release instruction (stlr), and loading at +40 uses a Load-Acquire (ldar). These each provide a partial memory fence (and together form a full fence).

You should only use atomic when you have to reason about access ordering on the variable. To answer your add-on question: use std::memory_order_relaxed to read the atomic. It gives no guarantees about synchronizing with writes; only atomicity is guaranteed.
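For example, a minimal sketch of such a relaxed read, reusing the question's ai (an illustration, not output from the original post):

#include <atomic>

std::atomic<int> ai{0};

int read_cheap() {
    // Atomicity only: no fence and no ordering with respect to other
    // reads/writes. Typically compiles to a plain mov (x86) or ldr (AArch64).
    return ai.load(std::memory_order_relaxed);
}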

Peter Cordes
gct
  • *[mo_seq_cst] requires a memory fence on stores to ensure that changes are visible.* Not exactly. It requires a memory fence to ensure that later loads from this thread don't happen until after the preceding store is globally visible. Stores *always* become visible on their own, barriers just make the current thread wait for them. – Peter Cordes Jan 03 '20 at 17:25
  • @PeterCordes I knew someone'd have quibbles with my wording =D, but you're right, I'll edit. – gct Jan 03 '20 at 18:10
  • Thanks; due to widespread misconceptions about how CPUs and atomics work under the hood (e.g. that CPU data caches can be out of sync; really it's that compilers "cache" values in *registers*), I think it's important to nitpick details like this. – Peter Cordes Jan 03 '20 at 18:49
  • @PeterCordes I agree completely the devil is in the details here – gct Jan 03 '20 at 18:51
  • Looks like you only changed one of the places that needed it; I changed the one I was commenting about which also had the suspicious phrase about caches taking "time to synchronize". Time to drain the store buffer can include time for RFO requests for cache lines that aren't owned by this core, but cache never gets out of sync; that's the whole point of a coherency protocol like MESI. – Peter Cordes Jan 03 '20 at 19:00
  • BTW, I wouldn't have written a `main`, I'd have just written a function that takes a reference or pointer to an `atomic`. Also, I would have compiled with `-fno-stack-protector` to declutter the asm. Not sure why GCC would be making a stack cookie when the only local is an `atomic`, but `mov %fs:0x28,%rax` is there nonetheless. Maybe `main` is special? (https://godbolt.org/ doesn't enable `-fstack-protector-strong` by default; I normally compile there to copy/paste to SO.) – Peter Cordes Jan 03 '20 at 19:03
2

The 3 cases presented have different semantics, so it may be pointless to reason about their relative performance, unless the value is never written after the threads have started.

Case 1:

int i = 0;
if(i == 10)  {...}  // may actually be optimized away since `i` is clearly 0 now

If i is accessed by more than one thread and at least one of those accesses is a write, the behavior is undefined.

In the absence of synchronization, the compiler is free to assume no other thread can modify i, and may reorder or optimize accesses to it. For example, it may load i into a register once and never re-read it from memory, or it may sink repeated writes out of a loop and write only once at the end, as in the sketch below.
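A sketch of the classic failure mode (actual output depends on the compiler and optimization level):

bool ready = false;  // plain, non-atomic: a data race if another thread writes it

void wait_for_ready() {
    // The compiler may load `ready` once and never re-read it,
    // legally transforming this loop into:  if (!ready) { for (;;) {} }
    while (!ready) {}
}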

Case 2:

std::atomic<int> ai{0};
if(ai == 10) {...}  // [2]

By default, reads and writes to an atomic use std::memory_order_seq_cst (sequentially-consistent) ordering. This means that not only are reads/writes of ai atomic, they also take part in a single total order that all threads observe, and reads/writes of other variables before/after the atomic access cannot be reordered across it.

So a seq_cst access acts as a memory fence. This, however, is much slower, since (1) the core must wait for its store buffer to drain before performing later loads (the caches themselves stay coherent) and (2) the compiler has much less freedom to optimize code around the atomic access.

Case 3:

std::atomic<int> ai{0};
if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

This mode guarantees only the atomicity of reads/writes of ai. The compiler is again free to reorder accesses around it, and writes are only guaranteed to become visible to other threads in a reasonable amount of time.

Its applicability is very limited, as it makes it very hard to reason about the order of events in a program. For example:

std::atomic<int> ai{0}, aj{0};

// thread 1
aj.store(1, std::memory_order_relaxed);
ai.store(10, std::memory_order_relaxed);

// thread 2
if(ai.load(std::memory_order_relaxed) == 10) {
  aj.fetch_add(1, std::memory_order_relaxed);
  // is aj 1 or 2 now??? no way to tell.
}

This mode is potentially (and often) slower than case 1, since the compiler must ensure each read/write is actually performed (it cannot keep the value in a register or optimize the access away), but it is faster than case 2, since it's still possible to optimize other variables around it. A fix for the ordering problem above is sketched below.
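When the order of those events does matter, a common fix (a sketch, not part of the original example) is to pair a release store with an acquire load, which on most ISAs is still cheaper than seq_cst:

std::atomic<int> ai{0}, aj{0};

// thread 1
aj.store(1, std::memory_order_relaxed);
ai.store(10, std::memory_order_release);   // publishes the write to aj

// thread 2
if(ai.load(std::memory_order_acquire) == 10) {  // synchronizes with the release store
  aj.fetch_add(1, std::memory_order_relaxed);
  // aj is now guaranteed to be 2
}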

For more details about atomics and memory ordering, see Herb Sutter's excellent atomic<> weapons talk.

rustyx
  • `seq_cst` atomic loads don't have to be full memory barriers. Usually only the store side of `seq_cst` atomics is given the burden of a full barrier, exactly so loads can be cheap (https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html). AArch64 is fairly unique: `stlr` (release-store) can't pass `ldar` (acquire-load), so you can get seq_cst without an actual full barrier anywhere. And great performance if you don't do a seq_cst load soon after a seq_cst store (because that would force draining the store buffer). i.e. `stlr` is a sequential-release, however it's implemented. – Peter Cordes Jan 03 '20 at 17:19
  • *ensure each read/write actually goes out to RAM.* To memory/RAM yes, but not *D*RAM. It doesn't have to flush or bypass cache, which is a common misconception, so I'd phrase it differently to discourage that misinterpretation of what you said. Cache is coherent, so just making sure the load or store happens in the asm, instead of being optimized away or keeping a value in a register, is all that's needed. `mo_relaxed` is *similar* to what you get from `volatile`, in case that helps anyone understand the optimization implications. – Peter Cordes Jan 03 '20 at 17:23
  • "_`std::memory_order_seq_cst` ... means that ... they are also visible to other threads in a timely manner_" And other memory orders don't imply that? – curiousguy Jan 03 '20 at 22:22
1

Regarding your comment on UB, will only the accuracy of the data be affected, or can it crash the system (kind of UB)?

The usual consequence of not using atomic<> when you should for reads is stuff like [MCU programming - C++ O2 optimization breaks while loop](//electronics.stackexchange.com/a/387478):

e.g. a while(!ready){} loop turns into if(!ready) infinite_loop(); by hoisting the load.

Just don't do that; manually hoist the atomic load in the source if/when that's OK, e.g. int localtmp = shared_var.load(std::memory_order_relaxed);
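A minimal sketch of that pattern (the ready flag is hypothetical, added for illustration):

#include <atomic>

std::atomic<bool> ready{false};   // hypothetical flag
std::atomic<int> shared_var{0};   // the answer's shared variable

int consumer() {
    // The atomic load is redone every iteration; the compiler can't hoist it.
    while (!ready.load(std::memory_order_relaxed)) {}

    // Manual hoisting: one atomic load, then reuse the local copy.
    // (relaxed = atomicity only; ordering would need acquire/release)
    int localtmp = shared_var.load(std::memory_order_relaxed);
    return localtmp;
}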

Peter Cordes