Why is GCC std::atomic increment generating inefficient non-atomic assembly?

Question

I've been using gcc's Intel-compatible builtins (like __sync_fetch_and_add) for quite some time, using my own atomic template. The "__sync" functions are now officially considered "legacy".

C++11 supports std::atomic<> and its descendants, so it seems reasonable to use that instead: it makes my code standard compliant, and the compiler will produce the best code either way, in a platform-independent manner. That is almost too good to be true.
Incidentally, I'd only have to text-replace atomic with std::atomic, too. There's a lot in std::atomic (re: memory models) that I don't really need, but default parameters take care of that.
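
For reference, the mechanical replacement described here can be sketched as follows. This is a minimal illustration, not the asker's own template: the function names (`legacy_increment`, `std_increment`, `relaxed_increment`, `demo_final_value`) are made up for the example, and `__sync_fetch_and_add` is a GCC/Clang extension rather than standard C++.

```cpp
#include <atomic>
#include <cassert>

// Legacy style: GCC's __sync builtin on a plain int (acts as a full barrier).
inline int legacy_increment(int& counter) {
    return __sync_fetch_and_add(&counter, 1);  // returns the previous value
}

// Standard style; omitting the ordering argument means memory_order_seq_cst.
inline int std_increment(std::atomic<int>& counter) {
    return counter.fetch_add(1);  // returns the previous value
}

// Standard style with an explicit, weaker ordering for a plain counter.
inline int relaxed_increment(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical helper: starts at 5, increments twice, returns the result.
inline int demo_final_value() {
    std::atomic<int> a(5);
    std_increment(a);      // 5 -> 6
    relaxed_increment(a);  // 6 -> 7
    return a.load();
}
```

Leaving out the ordering argument is the "default parameters take care of that" case. For a pure counter, `std::memory_order_relaxed` is often sufficient, but that depends on what the surrounding code relies on.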

Now for the bad news. As it turns out, the generated code is, from what I can tell, ... utter crap, and not even atomic at all. Even a minimal example that increments a single atomic variable and outputs it makes no fewer than 5 non-inlined function calls to ___atomic_flag_for_address, ___atomic_flag_wait_explicit, and _atomic_flag_clear_explicit (fully optimized), and there is not a single atomic instruction in the generated executable.

What gives? There is of course always the possibility of a compiler bug, but with the huge number of reviewers and users, such rather drastic things are generally unlikely to go unnoticed. Which means, this is probably not a bug, but intended behaviour.

What is the "rationale" behind so many function calls, and how is atomicity implemented without atomicity?

As-simple-as-it-can-get example:

#include <atomic>

int main()
{
    std::atomic_int a(5);
    ++a;
    __builtin_printf("%d", (int)a);
    return 0;
}

produces the following .s:

movl    $5, 28(%esp)     #, a._M_i
movl    %eax, (%esp)     # tmp64,
call    ___atomic_flag_for_address   #
movl    $5, 4(%esp)  #,
movl    %eax, %ebx   #, __g
movl    %eax, (%esp)     # __g,
call    ___atomic_flag_wait_explicit     #
movl    %ebx, (%esp)     # __g,
addl    $1, 28(%esp)     #, MEM[(__i_type *)&a]
movl    $5, 4(%esp)  #,
call    _atomic_flag_clear_explicit  #
movl    %ebx, (%esp)     # __g,
movl    $5, 4(%esp)  #,
call    ___atomic_flag_wait_explicit     #
movl    28(%esp), %esi   # MEM[(const __i_type *)&a], __r
movl    %ebx, (%esp)     # __g,
movl    $5, 4(%esp)  #,
call    _atomic_flag_clear_explicit  #
movl    $LC0, (%esp)     #,
movl    %esi, 4(%esp)    # __r,
call    _printf  #
(...)
.def    ___atomic_flag_for_address; .scl    2;  .type   32; .endef
.def    ___atomic_flag_wait_explicit;   .scl    2;  .type   32; .endef
.def    _atomic_flag_clear_explicit;    .scl    2;  .type   32; .endef

... and the mentioned functions look e.g. like this in objdump:

004013c4 <__atomic_flag_for_address>:
mov    0x4(%esp),%edx
mov    %edx,%ecx
shr    $0x2,%ecx
mov    %edx,%eax
shl    $0x4,%eax
add    %ecx,%eax
add    %edx,%eax
mov    %eax,%ecx
shr    $0x7,%ecx
mov    %eax,%edx
shl    $0x5,%edx
add    %ecx,%edx
add    %edx,%eax
mov    %eax,%edx
shr    $0x11,%edx
add    %edx,%eax
and    $0xf,%eax
add    $0x405020,%eax
ret

The others are somewhat simpler, but I can't find a single instruction that would really be atomic (other than some spurious xchg, which is atomic on x86, but these seem to be NOP padding, since it's xchg %ax,%ax following ret).

I'm absolutely not sure what such a rather complicated function is needed for, and how it's meant to make anything atomic.
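
Transcribed line by line into C++, the disassembly above appears to be nothing more than a hash of the address, reduced to a 4-bit index into a small table at a fixed address (0x405020 in this binary). The function name `flag_index_for_address` and the use of `std::uint32_t` for the 32-bit pointer value are assumptions made for this sketch; it mirrors the listed instructions, not the library's source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name; mirrors the arithmetic of the disassembly for a
// 32-bit address. Unsigned arithmetic wraps just like the 32-bit registers.
std::uint32_t flag_index_for_address(std::uint32_t p) {
    std::uint32_t h = p + (p << 4) + (p >> 2);  // shr $0x2, shl $0x4, two adds
    h = h + (h << 5) + (h >> 7);                // shr $0x7, shl $0x5, two adds
    h = h + (h >> 17);                          // shr $0x11, add
    return h & 0xf;                             // and $0xf: one of 16 slots
}
```

The value actually returned by the disassembled function is the table base plus this index, so two unrelated atomic objects can end up sharing the same flag.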

What version of GCC are you using? Can you show a small program that results in such poor code? I'm running a 4.7 snapshot from last month and it seems to produce decent code, with `lock` instructions in it. — R. Martinho Fernandes, Nov 14 '11 at 12:12
The memory model that you "don't need" comes to mind as a possible culprit. What does your code look like? Also what do you mean with the last sentence: "How is atomicity implemented without atomicity"? — jalf, Nov 14 '11 at 12:13
@R.MartinhoFernandes: Using gcc 4.6.1, `__sync_fetch_and_add` produces `LOCK XADD` or `LOCK INC` if you don't consume the output (just as expected), whereas something like `std::atomic_int a(5); ++a;` produces said 5 function calls. I'll edit and provide `.s` and `objdump` output. — Damon, Nov 14 '11 at 13:00
@R.MartinhoFernandes: Re: "memory models", indeed, my bad... wrong word. I meant "memory ordering". I just need atomic increment and decrement of counters, none of the complicated esoteric stuff. — Damon, Nov 14 '11 at 13:03
@jalf: What I mean is, how do you increment an integer _atomically_ just by calling a function that does not contain any atomic instructions (... and, why call a complicated function when the target CPU supports that kind of thing natively). — Damon, Nov 14 '11 at 13:21

chill · Accepted Answer · 2011-11-14T21:07:13.990

It is an inadequate compiler build.

Check your c++config.h. It should look like this, but it doesn't:

/* Define if builtin atomic operations for bool are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_1 1

/* Define if builtin atomic operations for short are supported on this host.
   */
#define _GLIBCXX_ATOMIC_BUILTINS_2 1

/* Define if builtin atomic operations for int are supported on this host. */
#define _GLIBCXX_ATOMIC_BUILTINS_4 1

/* Define if builtin atomic operations for long long are supported on this
   host. */
#define _GLIBCXX_ATOMIC_BUILTINS_8 1

These macros are defined or not depending on configure tests, which check host machine support for the __sync_XXX functions. These tests are in libstdc++-v3/acinclude.m4, AC_DEFUN([GLIBCXX_ENABLE_ATOMIC_BUILTINS] ....
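
A quick way to see which path a given toolchain takes is to ask it directly. The sketch below is an illustration, not part of the libstdc++ configure machinery: the function names are hypothetical, `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4` is a macro GCC predefines when the target has a native 4-byte compare-and-swap, and `_GLIBCXX_ATOMIC_BUILTINS_4` is the libstdc++ macro discussed above (newer libstdc++ releases may not define it under this name).

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>

// Does the library implement std::atomic<int> without locks?
bool int_atomic_is_lock_free() {
    std::atomic<int> probe(0);
    return probe.is_lock_free();
}

// Does the compiler itself claim a native 4-byte compare-and-swap?
bool compiler_has_native_cas4() {
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
    return true;
#else
    return false;
#endif
}

// Did the libstdc++ configure step record builtin support for int?
bool libstdcxx_int_builtins_macro_defined() {
#ifdef _GLIBCXX_ATOMIC_BUILTINS_4
    return true;
#else
    return false;
#endif
}

// Prints the three answers as 0 or 1, one per line.
void print_atomic_report() {
    std::printf("atomic<int> lock-free:       %d\n", int_atomic_is_lock_free());
    std::printf("compiler native 4-byte CAS:  %d\n", compiler_has_native_cas4());
    std::printf("_GLIBCXX_ATOMIC_BUILTINS_4:  %d\n",
                libstdcxx_int_builtins_macro_defined());
}
```

On a build like the one in the question, one would expect `compiler_has_native_cas4()` to be true while the library-side checks come out false, which would point at the library configuration and not at the hardware or the compiler proper.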

On your installation, it's evident from the MEM[(__i_type *)&a] put in the assembly file by -fverbose-asm that the compiler uses macros from atomic_0.h, for example:

#define _ATOMIC_LOAD_(__a, __x)                        \
  ({typedef __typeof__(_ATOMIC_MEMBER_) __i_type;                          \
    __i_type* __p = &_ATOMIC_MEMBER_;                      \
    __atomic_flag_base* __g = __atomic_flag_for_address(__p);          \
    __atomic_flag_wait_explicit(__g, __x);                 \
    __i_type __r = *__p;                           \
    atomic_flag_clear_explicit(__g, __x);                      \
    __r; })
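
In other words, the fallback guards each access with a flag chosen by the object's address. A minimal sketch of that pattern in standard C++ is shown below, assuming a 16-entry table (suggested by the `and $0xf` in the disassembly) and a deliberately simple placeholder hash; `Slot`, `flag_for`, `locked_fetch_add`, `locked_load`, and `hammer` are hypothetical names, not libstdc++ functions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

namespace sketch {

constexpr std::size_t kSlots = 16;  // assumption: 16 flags, as `and $0xf` suggests

struct Slot {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;  // starts in the clear state
};

Slot g_slots[kSlots];

// Placeholder hash (not the library's): map an address to one of the flags.
std::atomic_flag& flag_for(const void* p) {
    const auto bits = reinterpret_cast<std::uintptr_t>(p);
    return g_slots[(bits >> 4) & (kSlots - 1)].flag;
}

// Wait on the flag, do a plain read-modify-write, then clear the flag.
int locked_fetch_add(int* p, int delta) {
    std::atomic_flag& flag = flag_for(p);
    while (flag.test_and_set(std::memory_order_acquire)) {
        // spin until the current holder clears the flag
    }
    const int old = *p;
    *p = old + delta;
    flag.clear(std::memory_order_release);
    return old;
}

// Same protocol for a load, matching the shape of _ATOMIC_LOAD_.
int locked_load(const int* p) {
    std::atomic_flag& flag = flag_for(p);
    while (flag.test_and_set(std::memory_order_acquire)) {
        // spin
    }
    const int value = *p;
    flag.clear(std::memory_order_release);
    return value;
}

// Increment one shared int from several threads and return the final value.
int hammer(int threads, int iterations) {
    int value = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&value, iterations] {
            for (int i = 0; i < iterations; ++i) {
                locked_fetch_add(&value, 1);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return locked_load(&value);
}

}  // namespace sketch
```

The increment itself is an ordinary add, which matches the plain `addl $1, 28(%esp)` between the wait and clear calls in the question's listing; any atomicity has to come from the flag operations.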

With a properly built compiler, with your example program, c++ -m32 -std=c++0x -S -O2 -march=core2 -fverbose-asm should produce something like this:

movl    $5, 28(%esp)    #, a.D.5442._M_i
lock addl   $1, 28(%esp)    #,
mfence
movl    28(%esp), %eax  # MEM[(const struct __atomic_base *)&a].D.5442._M_i, __ret
mfence
movl    $.LC0, (%esp)   #,
movl    %eax, 4(%esp)   # __ret,
call    printf  #

And guess what, editing `c++config.h` to contain those defines fixes the issue, giving me exactly the `lock addl, mfence` sequence that you posted above, which is what I wanted, too. (I'll forward the issue to my compiler builder). Thank you very much. — Damon, Nov 15 '11 at 09:36

score 3 · Answer 2 · edited May 23 '17 at 10:29


There are two implementations: one that uses the __sync primitives and one that does not, plus a mixture of the two that uses only some of those primitives. Which one is selected depends on the macros _GLIBCXX_ATOMIC_BUILTINS_1, _GLIBCXX_ATOMIC_BUILTINS_2, _GLIBCXX_ATOMIC_BUILTINS_4 and _GLIBCXX_ATOMIC_BUILTINS_8.

At least the first one is needed for the mixed implementation; all are needed for the fully atomic one. Whether they are defined seems to depend on the target machine (they may not be defined for -march=i386 and should be defined for -march=i686).


answered Nov 14 '11 at 12:22

Jan Hudec


None of these are defined here, although atomic insns are certainly available (I'm compiling for `-march=core2`) and work without problem using the `__sync` functions. I've tried to define these macros before including `<atomic>` just to see if that makes a difference; it doesn't, though. So basically you're saying this is probably a kind of "poor man's fallback implementation"? In that case, how would I enable the real one (without compiling my own gcc)? – Damon Nov 14 '11 at 13:33
