
I created a simple demo to show that unaligned memory stores/loads are generally not atomic on the x86_64 and ARM64 architectures. The demo consists of a C++ program that creates two threads: the first calls a function named store one billion times, and the second does the same with a function named load. The source code of the program is as follows:

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>

extern "C" void store(void*);
extern "C" uint16_t load(void*);

alignas(64) char buf[65]; // 64-byte-aligned buffer; one extra byte so an access at offset 63 stays in bounds
char* ptr;                // buf + offset, shared by both threads

static long n = 1'000'000'000L;

void f1()
{
  for (long i = 0; i < n; i++)
    store(ptr);
}

void f2()
{
  long v0x0000 = 0;
  long v0x0101 = 0;
  long v0x0100 = 0;
  long v0x0001 = 0;
  long other = 0;

  for (long i = 0; i < n; i++)
  {
    uint16_t a = load(ptr);

    if (a == 0x0000) v0x0000++;
    else if (a == 0x0101) v0x0101++;
    else if (a == 0x0100) v0x0100++;
    else if (a == 0x0001) v0x0001++;
    else other++;
  }

  std::cout << "0x0000: " << v0x0000 << std::endl;
  std::cout << "0x0101: " << v0x0101 << std::endl;
  std::cout << "0x0100: " << v0x0100 << std::endl;
  std::cout << "0x0001: " << v0x0001 << std::endl;
  std::cout << "other: " << other << std::endl;
}

int main(int argc, char* argv[])
{
  int offset = std::atoi(argv[1]);
  ptr = buf + offset;

  std::thread t1(f1);
  std::thread t2(f2);

  t1.join();
  t2.join();
}

The store and load functions are defined separately in assembly source files. For x86_64, they are defined as follows:

    .intel_syntax noprefix 

    .global store
    .global load

    .text

store:
    mov eax, 0
    mov WORD PTR [rdi], ax      # 16-bit store of 0x0000
    mov eax, 0x0101
    mov WORD PTR [rdi], ax      # 16-bit store of 0x0101
    ret

load:
    movzx eax, WORD PTR [rdi]   # 16-bit load, zero-extended
    ret

And for ARM64, as follows:

    .global store
    .global load

    .text

store:
    mov w1, 0x0000
    strh w1, [x0]               // 16-bit store of 0x0000
    mov w1, 0x0101
    strh w1, [x0]               // 16-bit store of 0x0101
    ret

load:
    ldrh w0, [x0]               // 16-bit load, zero-extended
    ret

When I run the program, everything works as expected. When I pass offset 0, the stores/loads are aligned, and only the values 0x0000 and 0x0101 are observed in the reading thread. When I pass offset 63, the stores/loads are unaligned and cross a cache-line boundary, and the torn values 0x0100 and 0x0001 are observed as well. This holds for both architectures.
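
To make the boundary crossing explicit: since buf is 64-byte aligned, a 2-byte access at buf + 63 touches the last byte of one cache line and the first byte of the next (assuming 64-byte cache lines, which both test machines have). A small standalone check of this property (not part of the demo itself):

#include <cassert>
#include <cstdint>

alignas(64) char buf[65];

int main()
{
  // With 64-byte lines, the line index of an address is addr / 64.
  auto first  = reinterpret_cast<std::uintptr_t>(buf + 63) / 64;
  auto second = reinterpret_cast<std::uintptr_t>(buf + 64) / 64;
  assert(first != second); // the two bytes of the 16-bit access lie in different lines
}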

However, I noticed that there is a big difference in the execution times of these test runs. Some typical times I observed:

  • x86_64 + offset 0 (aligned): 6.9 [s]
  • x86_64 + offset 63 (unaligned): 28.3 [s]
  • ARM64 + offset 0 (aligned): 6.8 [s]
  • ARM64 + offset 63 (unaligned): 9.2 [s]

On x86_64, when two cache lines are involved in the unaligned case, the runtime is several times longer. On ARM64, however, it is only slightly longer. I wonder what causes this difference in behavior between the two architectures. (I am not very familiar with cache coherency mechanisms.)
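
For reference, a minimal way to time just the threaded part from inside the program with std::chrono (a sketch; the thread setup is negligible next to the billion-iteration loops, so this should match timing the whole run externally):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

// Provided by the listing above (or by the same translation unit).
void f1();
void f2();
extern char buf[];
extern char* ptr;

int main(int argc, char* argv[])
{
  int offset = std::atoi(argv[1]);
  ptr = buf + offset;

  auto start = std::chrono::steady_clock::now();

  std::thread t1(f1);
  std::thread t2(f2);
  t1.join();
  t2.join();

  auto stop = std::chrono::steady_clock::now();
  std::cout << "elapsed: "
            << std::chrono::duration<double>(stop - start).count() << " [s]\n";
}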

The particular processors used for the experiments were an Intel Xeon E5-2680 v3 and a Cortex-A72. The former was in a dual-socket server, but I restricted both threads to a single socket (with taskset or numactl). The latter was in a Raspberry Pi 4. Both systems run Linux, and I used GCC for the builds.
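
For completeness, the pinning could also be done from inside the program instead of with taskset/numactl; a minimal Linux-specific sketch using pthread_setaffinity_np (the CPU numbers 0 and 1 below are placeholders for two cores on the same socket, not the exact CPUs I used):

#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a std::thread to a single CPU (GNU/Linux only; g++ defines _GNU_SOURCE by default,
// which provides cpu_set_t and the CPU_* macros). Compile with -pthread.
void pin_to_cpu(std::thread& t, int cpu)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// Usage in main(), after creating the threads:
//   pin_to_cpu(t1, 0);
//   pin_to_cpu(t2, 1);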

Daniel Langr
  • Why make `ptr` itself a global that `f2` has to reload after calling the non-inline function `load`? Seems like unnecessary extra overhead, although possibly it's involved in reproducing this performance effect because of x86's strong memory ordering. (I'd look for memory order mis-speculation pipeline nukes: [Why flush the pipeline for Memory Order Violation caused by other logical processors?](https://stackoverflow.com/q/55563077) ) – Peter Cordes Dec 29 '20 at 11:57
  • You forgot to include a `main`. I was going to copy/paste this and profile it, but I don't feel like writing a main to call these functions. – Peter Cordes Dec 29 '20 at 12:22
  • @PeterCordes Sorry, my bad, added `main` in edit. You can also find the code [here](https://github.com/DanielLangr/ni-mcc-code/tree/main/01_unaligned_store) with `Makefile`. I will also try to make `ptr` a local variable. Didn't realize that it may even be stored in the same cache line as a part of `buf`. Good catch. – Daniel Langr Dec 29 '20 at 12:26
  • @PeterCordes I made `ptr` a parameter of the `f1` and `f2` functions. The times for 3 executions with offset 0: 5.5, 5.3, 5.4 seconds; with offset 63, they vary between 18 and 28 seconds. – Daniel Langr Dec 29 '20 at 12:33
  • Moreover, I tried to change `f2` so that there is no branching in the loop, by storing the counters in an array and incrementing the element at index `a` (returned by `load`), but I do not observe any noticeable difference (a sketch of this variant follows these comments). – Daniel Langr Dec 29 '20 at 12:39
  • Two major factors immediately come to mind. (1) The Cortex-A72 has a two-level cache hierarchy compared to 3-level in the Xeon E5-2680v3, so it takes a lot more time to basically send the line(s) back and forth between the two cores on the Xeon than on the Cortex, unless the two threads are running on sibling logical cores which I think is not the case here. (2) When running on different physical cores, the Xeon is going to be much slower in terms of core cycles for any offset. The reason that they appear to have about the same execution time... – Hadi Brais Dec 31 '20 at 23:34
  • ... for the case of zero offset is because the E5-2680 v3 can be, and probably is, running at much higher core and uncore frequencies than the Cortex-A72. It's easy to prove my two guesses using the relevant hardware performance monitoring events available on the two processors. – Hadi Brais Dec 31 '20 at 23:34
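
A minimal sketch of the branchless `f2` variant mentioned in the comments above (my reconstruction, not the exact code used; one counter per possible 16-bit value, so the array size 65536 and the name `f2_branchless` are implied rather than taken from the original):

#include <cstdint>
#include <iostream>

extern "C" uint16_t load(void*);

// Same iteration count as in the original program.
constexpr long n_iters = 1'000'000'000L;

void f2_branchless(char* p)
{
  static long counts[65536] = {}; // one counter per possible 16-bit value

  for (long i = 0; i < n_iters; i++)
    counts[load(p)]++;            // no data-dependent branch in the loop body

  // Any "other" value simply lands in its own slot.
  std::cout << "0x0000: " << counts[0x0000] << std::endl;
  std::cout << "0x0101: " << counts[0x0101] << std::endl;
  std::cout << "0x0100: " << counts[0x0100] << std::endl;
  std::cout << "0x0001: " << counts[0x0001] << std::endl;
}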

0 Answers