
Why do GCC and Clang generate such different asm for this code (x86_64, -O3 -std=c++17)?

#include <atomic>

int global_var = 0;

int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}

int foo_relaxed(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_relaxed);
    return ia.load(std::memory_order_relaxed);
}

GCC 9.1:

foo_seq_cst(int):
        add     edi, DWORD PTR global_var[rip]
        mov     DWORD PTR [rsp-4], edi
        mfence
        mov     eax, DWORD PTR [rsp-4]
        ret
foo_relaxed(int):
        add     edi, DWORD PTR global_var[rip]
        mov     DWORD PTR [rsp-4], edi
        mov     eax, DWORD PTR [rsp-4]
        ret

Clang 8.0:

foo_seq_cst(int):                       # @foo_seq_cst(int)
        mov     eax, edi
        add     eax, dword ptr [rip + global_var]
        ret
foo_relaxed(int):                       # @foo_relaxed(int)
        mov     eax, edi
        add     eax, dword ptr [rip + global_var]
        ret

I suspect that the mfence here is overkill. Am I right? Or does Clang generate code that can lead to bugs in some cases?

Peter Cordes
kpdev
  • godbolt comparison https://gcc.godbolt.org/z/GFCEY3 – kpdev May 19 '19 at 06:31
  • It would seem that since the atomic is a local variable, clang recognizes that only one thread has access to it and avoids generating code for the atomic at all. – Joachim Isaksson May 19 '19 at 06:54
  • So GCC does not optimize this well, and the mfence can be thrown away? – kpdev May 19 '19 at 07:01
  • Yes, if, as here, a single thread is the only one working with the variable, there is no need for an mfence. If you force clang to generate code for the variable anyway, it will correctly use a memory fence ("built into" the xchg instruction) https://gcc.godbolt.org/z/_-XLEs – Joachim Isaksson May 19 '19 at 07:07
  • GCC doesn't get atomics at the core language level, they are treated as library function calls, think `printf`, never removed. Clang generates expected code. – curiousguy May 20 '19 at 01:12
  • [tag:multithreading] doesn't seem relevant as there is only one thread of execution, the thread of execution. – curiousguy May 20 '19 at 01:20
  • See https://stackoverflow.com/q/56046501/963864 – curiousguy May 20 '19 at 01:25
  • Clang still optimizes [very poorly](https://gcc.godbolt.org/z/YaUuJf) redundant writes: `mov dword ptr [rip + ia], edi ; xchg dword ptr [rip + ia], edi` – curiousguy May 20 '19 at 01:30
  • Maybe if you could explain why you would want a meaningless pseudo release operation to produce a fence we could explain why the intuition is incorrect. Releasing is shouting to the whole world that you have accomplished something and that you set a flag to say so. Who are you shouting at and what flag are you setting? – curiousguy May 20 '19 at 01:45

1 Answer


A more realistic example:

#include <atomic>

std::atomic<int> a;

void foo_seq_cst(int b) {
    a = b;
}

void foo_relaxed(int b) {
    a.store(b, std::memory_order_relaxed);
}

gcc-9.1:

foo_seq_cst(int):
        mov     DWORD PTR a[rip], edi
        mfence
        ret
foo_relaxed(int):
        mov     DWORD PTR a[rip], edi
        ret

clang-8.0:

foo_seq_cst(int):                       # @foo_seq_cst(int)
        xchg    dword ptr [rip + a], edi
        ret
foo_relaxed(int):                       # @foo_relaxed(int)
        mov     dword ptr [rip + a], edi
        ret

gcc uses mfence, whereas clang uses xchg for std::memory_order_seq_cst.

xchg with a memory operand implies a lock prefix. Both a locked instruction and mfence satisfy the requirements of a std::memory_order_seq_cst store, which are no reordering across the store and a single total order.

From Intel 64 and IA-32 Architectures Software Developer’s Manual:

MFENCE—Memory Fence

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

8.2.3.8 Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned.

8.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions

The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later.

lock was benchmarked to be 2-3x faster than mfence and Linux switched from mfence to lock where possible.

Maxim Egorushkin