I created a simple demo to show that unaligned memory stores/loads are generally not atomic on the x86_64 and ARM64 architectures. The demo is a C++ program that creates two threads: the first calls a function named store one billion times, and the second does the same with a function named load. The source code of the program is here:
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>

extern "C" void store(void*);
extern "C" uint16_t load(void*);

alignas(64) char buf[65]; // 64-byte aligned; one extra byte so offset 63 stays in bounds
char* ptr;
static long n = 1'000'000'000L;

// Writer thread: alternates the 16-bit value at ptr between 0x0000 and 0x0101.
void f1()
{
    for (long i = 0; i < n; i++)
        store(ptr);
}

// Reader thread: counts which 16-bit values are observed at ptr.
void f2()
{
    long v0x0000 = 0;
    long v0x0101 = 0;
    long v0x0100 = 0;
    long v0x0001 = 0;
    long other = 0;
    for (long i = 0; i < n; i++)
    {
        uint16_t a = load(ptr);
        if (a == 0x0000) v0x0000++;
        else if (a == 0x0101) v0x0101++;
        else if (a == 0x0100) v0x0100++;
        else if (a == 0x0001) v0x0001++;
        else other++;
    }
    std::cout << "0x0000: " << v0x0000 << std::endl;
    std::cout << "0x0101: " << v0x0101 << std::endl;
    std::cout << "0x0100: " << v0x0100 << std::endl;
    std::cout << "0x0001: " << v0x0001 << std::endl;
    std::cout << "other: " << other << std::endl;
}

int main(int argc, char* argv[])
{
    if (argc < 2)
        return 1; // expects the offset into buf as the first argument
    int offset = std::atoi(argv[1]);
    ptr = buf + offset;
    std::thread t1(f1);
    std::thread t2(f2);
    t1.join();
    t2.join();
}
The store and load functions are defined separately in assembly source files. For x86_64 as follows:
.intel_syntax noprefix
.global store
.global load
.text

# Alternately store 0x0000 and 0x0101 as 16-bit writes to [rdi].
store:
    mov eax, 0
    mov WORD PTR [rdi], ax
    mov eax, 0x0101
    mov WORD PTR [rdi], ax
    ret

# Return the 16-bit value at [rdi], zero-extended.
load:
    movzx eax, WORD PTR [rdi]
    ret
And for ARM64 as follows:
.global store
.global load
.text

// Alternately store 0x0000 and 0x0101 as 16-bit writes to [x0].
store:
    mov w1, 0x0000
    strh w1, [x0]
    mov w1, 0x0101
    strh w1, [x0]
    ret

// Return the 16-bit value at [x0], zero-extended.
load:
    ldrh w0, [x0]
    ret
When I run the program, everything works as expected. When I pass offset 0, the stores/loads are aligned and only the values 0x0000 and 0x0101 are observed in the reading thread. When I pass offset 63, the stores/loads are unaligned and cross a cache-line boundary, and the torn values 0x0100 and 0x0001 are observed as well. This holds on both architectures.
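As an aside, the demo deliberately bypasses the C++ memory model by doing the accesses in assembly. For contrast, here is a minimal sketch (not part of the demo above) of the standard-conforming version using std::atomic<uint16_t>, which the implementation keeps suitably aligned so that its lock-free loads/stores can never tear:

#include <atomic>
#include <cstdint>

std::atomic<uint16_t> word{0}; // placed at a suitable alignment by the implementation

// Writer: same alternation as the asm store function.
void writer_step()
{
    word.store(0x0000, std::memory_order_relaxed);
    word.store(0x0101, std::memory_order_relaxed);
}

// Reader: only 0x0000 or 0x0101 can ever be observed here,
// assuming the atomic is lock-free (it is on both x86_64 and ARM64).
uint16_t reader_step()
{
    return word.load(std::memory_order_relaxed);
}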
However, I noticed that there is a big difference in the execution times of these test runs. Some typical times I observed:
- x86_64 + offset 0 (aligned): 6.9 [s]
- x86_64 + offset 63 (unaligned): 28.3 [s]
- ARM64 + offset 0 (aligned): 6.8 [s]
- ARM64 + offset 63 (unaligned): 9.2 [s]
On x86_64, when two cache lines are involved in the unaligned case, the runtime is several times longer, but on ARM64 it is only slightly longer. What makes this behavior differ between the two architectures? (I am not very familiar with cache coherency mechanisms.)
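To spell out why offset 63 involves two lines: buf is 64-byte aligned, so a 2-byte access at buf + 63 touches bytes 63 and 64, i.e. the last byte of one 64-byte line and the first byte of the next. A quick check of that arithmetic (the 64-byte line size is an assumption that matches both test machines):

#include <iostream>

int main()
{
    const int line = 64; // assumed cache-line size
    for (int offset : {0, 63})
    {
        int first = offset / line;      // line index of the first byte
        int last = (offset + 1) / line; // line index of the second byte
        std::cout << "offset " << offset
                  << (first == last ? ": within one line\n" : ": crosses a line boundary\n");
    }
}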
The particular processors used in the experiments were an Intel Xeon E5-2680 v3 and a Cortex-A72. The former was in a dual-socket server, but I restricted both threads to a single socket (with taskset or numactl). The latter was in a Raspberry Pi 4. Both systems run Linux, and I used GCC for the builds.
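For completeness, the single-socket restriction can also be done from inside the program instead of with taskset/numactl. A Linux-specific sketch (the core numbers are placeholders; pick two cores known to be on the same socket):

#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a std::thread to one logical core (Linux/glibc; build with g++).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// Usage in main(), right after creating the threads:
//   pin_to_core(t1, 0);
//   pin_to_core(t2, 1); // cores 0 and 1 assumed to be on the same socket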