
According to this, a 64-bit load/store is considered to be an atomic access on arm64. Given that, is the following program still considered to have a data race (and thus able to exhibit UB) when compiled for arm64 (ignoring ordering with respect to other memory accesses)?

#include <cstdint>

uint64_t x;

// Thread 1
void f()
{
  uint64_t a = x;
}

// Thread 2
void g()
{
  x = 1;
}

If instead I switch this to using

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> x{};

// Thread 1
void f()
{
  uint64_t a = x.load(std::memory_order_relaxed);
}

// Thread 2
void g()
{
  x.store(1, std::memory_order_relaxed);
}

Is the second program considered data race free?

On arm64, it looks like the compiler ends up generating the same instruction for a plain 64-bit load/store and for a load/store of an atomic with memory_order_relaxed, so what's the difference?

ktqq99
    The compiler is allowed to perform memory reordering/elision optimizations in the v1 code, but not in the v2 code. – Raymond Chen Jun 24 '22 at 23:30
  • 1
    @RaymondChen Reordering (between the atomic variable and other ones) is actually possible in the second code because the memory order is relaxed. The only restriction is that accesses to the same atomic variable are not reordered. In short, there is no memory barrier. I agree about the elision. – Jérôme Richard Jun 25 '22 at 00:04

2 Answers


std::atomic solves 4 problems.

One is that the load/store itself is atomic, meaning you cannot observe a torn value, for example loading 32 bits from before a store and the other 32 bits from after it. Normally, everything up to register size is naturally atomic in that sense on the CPU itself. Things might break with unaligned access, potentially only when the access crosses a cache line. In std::atomic<T> implementations you will see the use of locks when the size of T exceeds the size the CPU can read/write atomically on its own.

The other thing std::atomic does is synchronize access between threads. Just because one thread writes data to a variable doesn't mean another thread sees that data appear instantly. The writing CPU puts the data into its store buffer, hoping it just gets overwritten again or adjacent memory gets written so the two writes can be combined. After a while the data goes to L1 cache, where it can stay even longer, then L2 and L3. Depending on the architecture, cache may or may not be shared between CPU cores, and cores might not synchronize automatically. So when you want to access the same memory address from multiple cores, you have to tell the CPU to synchronize the access with other cores.

The third thing has to do with modern CPUs doing out-of-order execution and speculative execution. That means even if the code checks a variable and then reads a second variable, the CPU might read the second variable first. If the first variable acts as a semaphore signaling that the second variable is ready to be read, then this can fail because the read happens before the data is ready. std::atomic adds barriers preventing the CPU from doing these reorderings, so reads and writes happen in a specific order in the hardware.

The fourth thing is much the same, but for the compiler: std::atomic prevents the compiler from reordering instructions across it, or from optimizing multiple reads or writes into just one.

All of this std::atomic does automatically for you if you just use it without specifying a memory order. The default memory order is the strongest one, std::memory_order_seq_cst.

But when you use

uint64_t a = x.load(std::memory_order_relaxed);

you tell the compiler to ignore most of these guarantees:

Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed

So you have instructed the compiler not to care about synchronizing with other threads or caches, or about preserving the order in which the instructions are written. All you care about is that reads and writes are not broken up into two or more parts where you could get mixed data. The load will return either the whole value from before the store or the whole value from after the store in the other thread; which of the two you get is unspecified. Since that is exactly what arm64 already gives you for aligned 64-bit loads and stores, the generated code is identical.

Note: if you have multiple atomics, then accessing one of them with a stronger memory order also orders the surrounding accesses to the others. So you may see code that does one load with a strong order together with other loads using a weak order, and the same for groups of writes. This can speed up access, but it's hard to get right.

Goswin von Brederlow

Whether or not an access is a data race in the sense of the C++ language standard is independent of the underlying hardware. The language has its own memory model, and even if a straightforward compilation to the target architecture would be free of problems, the compiler may still optimize based on the assumption that the program is free of data races in the sense of the C++ memory model.

Accessing a non-atomic in two threads without synchronization with one of them being a write is always a data race in the C++ model. So yes, the first program has a data race and therefore undefined behavior.

In the second program the object is an atomic, so there cannot be a data race.

user17732522