`std::atomic` solves 4 problems.

One is that loads and stores are atomic, meaning you don't get a torn value where, for example, you read 32 bits from before a store and the other 32 bits from after it. Normally everything up to register size is naturally atomic in that sense on the CPU itself. Things might break with unaligned access, potentially only when the access crosses a cache line. In `std::atomic<T>` implementations you will see the use of locks when the size of `T` exceeds the size the CPU reads/writes atomically on its own.
The second thing `std::atomic` does is synchronize access between threads. Just because one thread writes data to a variable doesn't mean another thread sees that data appear instantly. The writing CPU puts the data into its store buffer, hoping it just gets overwritten again or that adjacent memory gets written so the two writes can be combined. After a while the data goes to L1 cache, where it can stay even longer, then L2 and L3. Depending on the architecture, caches may or may not be shared between CPU cores, and they might not synchronize automatically. So when you want to access the same memory address from multiple cores you have to tell the CPU to synchronize the access with the other cores.
The third thing has to do with modern CPUs doing out-of-order execution and speculative execution. That means even if the code checks a variable and then reads a second variable, the CPU might read the second variable first. If the first variable acts as a semaphore signaling that the second variable is ready to be read, this can fail because the read happens before the data is ready. `std::atomic` adds barriers preventing the CPU from doing these reorderings, so reads and writes happen in a specific order in the hardware.
The fourth thing is much the same, but for the compiler: `std::atomic` prevents the compiler from reordering instructions across it, and from optimizing multiple reads or writes into just one.
All of this `std::atomic` does automatically for you if you just use it without specifying any memory order. The default memory order, `std::memory_order_seq_cst`, is the strongest one.
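In other words, leaving the order out is the same as spelling out `seq_cst`:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> x{0};

void f() {
    x.store(1);                             // same as the line below
    x.store(1, std::memory_order_seq_cst);  // the default, strongest order
    std::uint64_t a = x.load();             // likewise seq_cst by default
    (void)a;
}
```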
But when you use

```cpp
uint64_t a = x.load(std::memory_order_relaxed);
```

you tell the compiler to ignore most of those guarantees:
> Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed
So you have instructed the compiler not to care about synchronizing with other threads or caches, or about preserving the order the instructions are written in. All you care about is that reads and writes are not broken up into two or more parts where you could get mixed data. The `load` will get either the whole value from before the `store` or the whole value from after the `store` in the other thread; which of the two values you get is completely undefined. On a 64-bit CPU that is exactly what a plain 64-bit load/store gives you for free, so the generated code is identical.
Note: if you have multiple atomics, accessing one of them with a stronger memory order will synchronize the others as well. So you can see code that does one load with a strong order together with other loads using a weak order, and the same for groups of writes. This can speed up access, but it's hard to get right.