
I am attempting to use a 64-bit integer as a bitmap, and to acquire/release ownership of individual bits atomically.

To this end, I have written the following lock-less code:

#include <cstdint>
#include <atomic>

static constexpr std::uint64_t NO_INDEX = ~std::uint64_t(0);

class AtomicBitMap {
public:
    static constexpr std::uint64_t occupied() noexcept {
        return ~std::uint64_t(0);
    }

    std::uint64_t acquire() noexcept {
        while (true) {
            auto map = mData.load(std::memory_order_relaxed);
            if (map == occupied()) {
                return NO_INDEX;
            }

            // Index of the lowest 0 bit in map.
            std::uint64_t index = __builtin_ctzl(~map);
            auto previous =
                mData.fetch_or(bit(index), std::memory_order_relaxed);
            // If the bit was still 0, we own it; otherwise another
            // thread won the race, so retry with a fresh snapshot.
            if ((previous & bit(index)) == 0) {
                return index;
            }
        }
    }

private:
    static constexpr std::uint64_t bit(std::uint64_t index) noexcept {
        return std::uint64_t(1) << index;
    }

    std::atomic_uint64_t mData{ 0 };
};

int main() {
    AtomicBitMap map;
    return map.acquire();
}

Which, on godbolt, yields the following assembly in isolation:

main:
  mov QWORD PTR [rsp-8], 0
  jmp .L3
.L10:
  not rax
  rep bsf rax, rax
  mov edx, eax
  mov eax, eax
  lock bts QWORD PTR [rsp-8], rax
  jnc .L9
.L3:
  mov rax, QWORD PTR [rsp-8]
  cmp rax, -1
  jne .L10
  ret
.L9:
  movsx rax, edx
  ret

Which is exactly what I expected¹.

@Jester has heroically managed to reduce my 97-line reproducer to a much simpler 44-line one, which I further reduced to 35 lines:

using u64 = unsigned long long;

struct Bucket {
    u64 mLeaves[16] = {};
};

struct BucketMap {
    u64 acquire() noexcept {
        while (true) {
            u64 map = mData;

            u64 index = (map & 1) ? 1 : 0;
            auto mask = u64(1) << index;

            auto previous =
                __atomic_fetch_or(&mData, mask, __ATOMIC_SEQ_CST);
            if ((previous & mask) == 0) {
                return index;
            }
        }
    }

    __attribute__((noinline)) Bucket acquireBucket() noexcept {
        acquire();
        return Bucket();
    }

    volatile u64 mData = 1;
};

int main() {
    BucketMap map;
    map.acquireBucket();
    return 0;
}

Which generates the following assembly:

BucketMap::acquireBucket():
  mov r8, rdi
  mov rdx, rsi

.L2:
  mov rax, QWORD PTR [rsi]
  xor eax, eax
  lock bts QWORD PTR [rdx], rax
  setc al
  jc .L2
  mov rdi, r8
  mov ecx, 16
  rep stosq
  mov rax, r8
  ret

main:
  sub rsp, 152
  lea rsi, [rsp+8]
  lea rdi, [rsp+16]
  mov QWORD PTR [rsp+8], 1
  call BucketMap::acquireBucket()
  xor eax, eax
  add rsp, 152
  ret

The xor eax,eax means that the assembly here always attempts to obtain index 0... resulting in an infinite loop: since mData is initialized to 1, bit 0 is always already set, so the lock bts always sets CF and the jc .L2 keeps looping forever.

I can only see two explanations for this assembly:

  1. I have somehow triggered Undefined Behavior.
  2. There is a code-generation bug in gcc.

And I have exhausted all my ideas as to what could trigger UB.

Can anyone explain why gcc would generate this xor eax,eax?

Note: tentatively reported to gcc as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86314.


Compiler version used:

$ gcc --version
gcc (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is 
NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.

Compiler flags:

-Wall -Wextra -Werror -Wduplicated-cond -Wnon-virtual-dtor -Wvla 
-rdynamic -Wno-deprecated-declarations -Wno-type-limits 
-Wno-unused-parameter -Wno-unused-local-typedefs -Wno-unused-value 
-Wno-aligned-new -Wno-implicit-fallthrough -Wno-deprecated 
-Wno-noexcept-type -Wno-register -ggdb -fno-strict-aliasing 
-std=c++17 -Wl,--no-undefined -Wno-sign-compare 
-g -O3 -mpopcnt

¹ Actually, it's better than I expected: that the compiler understands fetch_or(bit(index)) followed by previous & bit(index) to be the equivalent of using bts and checking the CF flag is pure gold.

Matthieu M.
  • Can you provide a minimal larger program? Sorry :) But that indeed looks strange since the result of the `tzcnt` is moved into `rcx` which is then not used. Have you tried with other compiler versions? – Jester Jun 25 '18 at 10:12
  • @Jester: The value in `rcx` is used down the road (at instruction `+288`), so it does make sense for it to be saved. I haven't managed to produce a MCVE yet, if I do I'll update the question. I don't have another version of gcc which can actually compile the code, which is why I am trying to reproduce it on godbolt. – Matthieu M. Jun 25 '18 at 10:21
  • @Jester: My best reduction is currently 180 lines (https://godbolt.org/g/KAbRLk). It is extremely fickle, though. Tweaking seemingly unrelated bits of code (even removing an unused data-member) flips the output from `xor eax,eax` to `mov eax,eax` at the drop of a hat. I still haven't identified whether this is due to UB or a compiler bug :/ – Matthieu M. Jun 25 '18 at 11:56
  • It is interesting that `clang-6.0.0` does not use `bts`, but rather a plain `cmpxchg`. And the assembly looks wrong, so there must be UB lurking in there. – Maxim Egorushkin Jun 25 '18 at 12:42
  • @Jester: After further trimming and removing templates, behold, a reproducer of "only" 97 lines (https://godbolt.org/g/sc6zgn). I am still hoping to reduce it further, but it's getting more and more difficult. – Matthieu M. Jun 25 '18 at 12:42
  • @MaximEgorushkin: I suppose this would work too. Maybe I am trying to be too clever here :( – Matthieu M. Jun 25 '18 at 12:46
  • @Jester: Okay, I give up on further reducing the reproducer as I am running out of ideas. Any further tiny change I make seems to flip the assembly to the expected `mov eax,eax`... I have edited the "best" assembly yielded by the MCVE into the question itself and linked to godbolt. – Matthieu M. Jun 25 '18 at 13:15
  • as far as I can ascertain, the `xor eax, eax` is linked to the `setc al` further down, but at first glance, I don't see why the compiler thinks it needs that. Except that it seems to be triggered by the `if (bucketIndex == NO_INDEX)` check in `FlatSlotMap::acquireBucket` (removing that `if` block also gets rid of the unexplained code). Odd. – Sander De Dycker Jun 25 '18 at 13:51
  • The problem is apparently with the `bit` function. Using `std::int64_t(1) << index` seems to work as does simply `1`, `1U` or `1L` but not `1UL`. – Jester Jun 25 '18 at 14:19
  • @Jester: It would be rather ironic, considering that shifting `1`, `1U` or `1L` by 63 is undefined behavior while it's defined for `1UL` :x Hard to say, though, as there are many tweaks which "solve" the issue... – Matthieu M. Jun 25 '18 at 14:23
  • I have reduced it to [44 lines](https://godbolt.org/g/pgNaig). If you 1) change line 7 to `= 1` or 2) change line 11 as previously noted or 3) remove the `while` or 4) change line 28 to not be an `array` or 5) change `acquireBucket` to return an integer constant then it works again. Still no idea where the undefined behavior is hiding. – Jester Jun 25 '18 at 14:37
  • Actually changing line 28 to `BitMap mLocalLeaves[x];` breaks for `x >= 11` which is when the zeroing switches to a `rep stosq` from separate SSE stores. – Jester Jun 25 '18 at 14:41
  • @Jester: Ah, that's interesting. It started happening when I went from 8 leaves to 16 leaves, which indeed passes the 11 threshold. I think this is the right track: `stosq` stores RAX at the address in RDI, and since we want to zero the leaves, it needs RAX to be 0. – Matthieu M. Jun 25 '18 at 14:43
  • As such we can even get rid of `BitMap` and just stick something like `int mLocalLeaves[128];` into the `LocalBucket`. Getting small enough for a gcc bug report :) – Jester Jun 25 '18 at 14:50
  • @Jester: Yes, I further reduced it to 35 lines, with no include at all. – Matthieu M. Jun 25 '18 at 15:03
  • @Jester: Logged GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86314 – Matthieu M. Jun 25 '18 at 15:22
  • @Jester: The bug has been confirmed by gcc developers, apparently a peephole optimization issue. May I encourage you to make an answer of it, as you were instrumental in narrowing down the issue? – Matthieu M. Jun 26 '18 at 08:55
  • The `xor eax, eax` is generated only as part of `return Bucket();`, but it is in the wrong place - it really must come after the loop, i.e. after the `jc .L2`. Also, if you change the initialization of `Bucket`, this can change as well - with `u64 mLeaves[16] ={1,1,1,1,1,1};` it is gone, while with `u64 mLeaves[16] ={1,1,1,1,1};` it is still there. – RbMm Jun 26 '18 at 15:06
  • Of course, if you remove `return Bucket();` and change the signature to `void acquireBucket()`, the `xor eax,eax` is gone again - the error is related to returning the structure. – RbMm Jun 26 '18 at 15:11
  • What really must be emitted here is `and eax, 1` - and `u64 index = map & 1;` is better than `u64 index = (map & 1) ? 1 : 0;` to begin with - https://godbolt.org/g/VjErNq – RbMm Jun 26 '18 at 15:23

3 Answers


This is a peephole optimization bug in gcc, see bug #86314, affecting versions 7.1, 7.2, 7.3 and 8.1. The fix is already in, and will be delivered in versions 7.4 and 8.2 respectively.


The short answer is that this particular code sequence (a fetch_or plus a check of its result) generates a setcc (set conditional, i.e. based on the status of the flags) followed by a movzbl (move and zero-extend). In 7.x an optimization was introduced which transforms a setcc followed by a movzbl into a xor followed by a setcc; however, this optimization was missing some checks, so the xor could clobber a register which was still needed (in this case, eax).


The longer answer is that fetch_or can be implemented either as a cmpxchg for full generality, or, if only setting one bit, as bts (bit test and set). As another optimization introduced in 7.x, gcc now generates a bts here (gcc 6.4 still generates a cmpxchg). bts sets the carry flag (CF) to the previous value of the bit.
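For illustration, here is roughly what the general cmpxchg-based lowering amounts to, written out in C++ (a sketch of the technique, not gcc's actual expansion):

#include <atomic>
#include <cstdint>

// Sketch: emulating fetch_or with a compare-exchange loop. On failure,
// compare_exchange_weak reloads `old` with the current value, so the loop
// simply retries with a fresh snapshot.
std::uint64_t fetch_or_via_cas(std::atomic<std::uint64_t>& data,
                               std::uint64_t mask) {
    auto old = data.load(std::memory_order_relaxed);
    while (!data.compare_exchange_weak(old, old | mask,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
        // another thread raced with us; retry
    }
    return old; // the value observed before the OR, like fetch_or
}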

That is, auto previous = a.fetch_or(bit); auto n = previous & bit; will generate:

  • lock bts QWORD PTR [<address of a>], <bit index> to set the bit, and capture its previous value,
  • setc <n>l to set the lower 8 bits of r<n>x to the value of the carry flag (CF),
  • movzx e<n>x, <n>l to zero-out the upper bits of r<n>x.

And then the peephole optimization will apply, which messes things up.
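Schematically, based on the two outputs shown in this question and answer (my annotation, not taken from the bug report):

; intended sequence (cf. the fixed trunk output below):
and   eax, 1                     ; compute the bit index in rax
lock bts QWORD PTR [rcx], rax    ; set the bit, previous value -> CF
setc  al                         ; al = CF
movzx eax, al                    ; zero-extend al into eax

; buggy peephole output: the xor-zero is hoisted above the flag-setting
; instruction without checking that eax is still live as an input:
xor   eax, eax                   ; clobbers the bit index in rax!
lock bts QWORD PTR [rdx], rax    ; now always tests and sets bit 0
setc  al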

gcc trunk now generates proper assembly:

BucketMap::acquireBucket():
    mov rdx, rdi
    mov rcx, rsi
.L2:
    mov rax, QWORD PTR [rsi]
    and eax, 1
    lock bts QWORD PTR [rcx], rax
    setc al
    movzx eax, al
    jc .L2
    mov rdi, rdx
    mov ecx, 16
    rep stosq
    mov rax, rdx
    ret
main:
    sub rsp, 152
    lea rsi, [rsp+8]
    lea rdi, [rsp+16]
    mov QWORD PTR [rsp+8], 1
    call BucketMap::acquireBucket()
    xor eax, eax
    add rsp, 152
    ret

Although unfortunately the optimization no longer applies, so we are left with setc + movzx instead of xor + setc... but at least it's correct!

Matthieu M.
  • But why is there a `setc` at all? It makes no sense for the `rep stosq` that fills `mLeaves[16]` to have a data dependency on the `lock bts`. (The `jc` on flags set by BTS can only fall through when `rax` is 0 so it's not actually wrong, but a normal `xor eax,eax` after the loop would be a much better way to get a zero.) The source code is just `acquire(); return Bucket();` but gcc is adding code inside the `acquire` loop to create a 0/1 register from CF, so `Bucket()` will have a `0`. – Peter Cordes Jun 28 '18 at 19:49
  • Perhaps the `lock bts` peephole ate up the code that would have left a 0 in a reg naturally. – Peter Cordes Jun 28 '18 at 19:51
  • Reported [Bug 86352](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352). I probably should have reported 5 separate bugs for all the stuff that's bad about this code-gen even though they all come from this one function. :/ – Peter Cordes Jun 28 '18 at 20:55
  • @PeterCordes: Thanks for following up on that; correctness was of more immediate concern, but there is indeed no reason not to improve performance now! – Matthieu M. Jun 29 '18 at 07:05
  • @PeterCordes: I'm pretty sad that your bug has not received more attention. I am glad that wrong codegens are immediately picked up (given how dangerous they are), but it would be nice to at least see some acknowledgement of the issue reported :( – Matthieu M. Jul 03 '18 at 13:34
  • I should probably split it up into multiple separate bugs that can be fixed separately. But yeah, getting gcc missed optimization bugs fixed is often slow or doesn't happen at all. Some of these look like pretty obvious bugs, though, that shouldn't be too hard for algorithms to find / avoid. – Peter Cordes Jul 03 '18 at 20:17

As a side note, you can find the lowest 0 bit with a straightforward bit manipulation:

template<class T>
T find_lowest_0_bit_mask(T value) {
    T t = value + 1;        // the carry flips the trailing 1s and the first 0
    return (t ^ value) & t; // keep only the bit that flipped from 0 to 1
}

It returns a bit mask rather than a bit index.

Preconditions: T must be unsigned, and value must contain at least one zero bit.
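For example, tracing the trick on an 8-bit value (an illustration of the formula above, not part of the original answer):

value           = 0b0101'0111
t = value + 1   = 0b0101'1000  // the carry ripples through the trailing 1s
t ^ value       = 0b0000'1111  // every bit the carry touched
(t ^ value) & t = 0b0000'1000  // the lowest 0 bit of value, isolated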


mData.load must synchronise with mData.fetch_or, so it should be

mData.load(std::memory_order_acquire)

and

mData.fetch_or(..., std::memory_order_release)

And, IMO, there is something about these bit intrinsics that makes clang generate wrong assembly as well: see the .LBB0_5 loop, which is clearly wrong because it keeps trying to set the same bit rather than recalculating another bit to set. A version that generates correct assembly:

#include <cstdint>
#include <atomic>

static constexpr int NO_INDEX = -1;

template<class T>
T find_lowest_0_bit_mask(T value) {
    T t = value + 1;
    return (t ^ value) & t;
}

class AtomicBitMap {
public:
    static constexpr std::uint64_t occupied() noexcept { return ~std::uint64_t(0); }

    int acquire() noexcept {
        auto map = mData.load(std::memory_order_acquire);
        while(map != occupied()) {
            std::uint64_t mask = find_lowest_0_bit_mask(map);
            // On failure, compare_exchange_weak reloads map with the current
            // value, so the next iteration recomputes the mask from it.
            if(mData.compare_exchange_weak(map, map | mask, std::memory_order_release))
                return __builtin_ffsl(mask) - 1;
        }
        return NO_INDEX;
    }

    void release(int i) noexcept {
        mData.fetch_and(~bit(i), std::memory_order_release);
    }

private:
    static constexpr std::uint64_t bit(int index) noexcept { 
        return std::uint64_t(1) << index; 
    }

    std::atomic_uint64_t mData{ 0 };
};
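For completeness, a minimal single-threaded usage sketch of this class (my addition, just to show the intended call pattern):

#include <cassert>

int main() {
    AtomicBitMap bits;
    int a = bits.acquire();      // lowest free index: 0
    int b = bits.acquire();      // next free index: 1
    assert(a == 0 && b == 1);
    bits.release(a);             // index 0 becomes available again
    assert(bits.acquire() == 0);
}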
Maxim Egorushkin
  • This formula doesn't appear to work for numbers of the form 2^n - 1. For example, for 7 (0b111) the expected answer would be 3 or 4 (depending on whether this is 0-based or 1-based). Instead, 7 ^ 8 yields 0b1111 which, and-ed with 8, yields 8. – Matthieu M. Jun 25 '18 at 12:17
  • @MatthieuM For `7` the expected answer is `8`. – Maxim Egorushkin Jun 25 '18 at 12:19
  • As for synchronization, acquire/release memory orders are only necessary to sequence reads/writes to other sections of memory; relaxed should be sufficient (AFAIK) when only the atomic in question is modified. – Matthieu M. Jun 25 '18 at 12:19
  • @MatthieuM. You are right. It is expected that there is a sequence of reads and writes between `AtomicBitMap::acquire` and `AtomicBitMap::release` (not shown) here. – Maxim Egorushkin Jun 25 '18 at 12:21
  • I finally understand the formula: it gives the mask, not the index. Clever. Unfortunately I need the index at the end, the function must return a number between 0 and 63 (or a sentinel value to test against). – Matthieu M. Jun 25 '18 at 12:24
  • @MatthieuM. Added a version that returns an index and generates expected assembly. – Maxim Egorushkin Jun 25 '18 at 12:51
  • Following your observations, I have tweaked the code to use a `cmpxchg` instead, which does not trigger the issue. I am still looking forward to understanding what is wrong with either my code or gcc, which leads to this mysterious assembly being emitted, since whether my code triggers UB or gcc is buggy, in either case it's a time bomb waiting to go off. – Matthieu M. Jun 25 '18 at 13:09
  • Wow that first Clang code is a mess, it even truncates `NO_INDEX` – harold Jun 25 '18 at 13:49
  • @MatthieuM.: `bts` is not great for efficiency: the crazy-CISC bit-string semantics that allow it to index outside the memory operand selected by the addressing mode require lots of uops. (Not sure how big a deal this is for the `lock` version. It's significant for the non-`lock`ed version with a memory operand). So it might be best to write a manual OR + cmpxchg loop to set the lowest bit with `desired = old | old+1;`, and `ctz` to find out *which* bit that was. x86 has no `fetch_or` in hardware; it can only do it with `lock cmpxchg` or `lock bts` if you use the result. – Peter Cordes Jun 26 '18 at 09:32
  • @PeterCordes: I was wondering about that, haven't checked Agner's guide yet. The main advantage for my usecase is that `lock bts` allows a concurrent "release" (set a bit to 0) whereas `lock cmpxchg` will fail in that case, which I thought was a nice property. Benchmark-wise, it was lost in the noise though, so it's not a big loss. – Matthieu M. Jun 26 '18 at 10:57
  • @MatthieuM. Both `lock bts` and `lock cmpxchg` turn the entire cache line into exclusive mode invalidating all other copies. There may be no observable benefit in using `lock bts` over `lock cmpxchg`. – Maxim Egorushkin Jun 26 '18 at 11:40
  • @MaximEgorushkin: Sure, but the line could change between `load` & `bts`/`cmpxchg`; `bts` has the advantage that it only cares about the state of 1 bit, whereas `cmpxchg` cares about the state of 64 bits. I expect the use of `cmpxchg` to result in more iterations of the loop. Of course, in practice, the algorithm I wrote is using this operation sparingly (because contention hurts performance) so the benefit of `fetch_or` was more for its elegance than performance and therefore I don't mind switching to CAS until the bug is fixed. – Matthieu M. Jun 26 '18 at 12:12
  • @MatthieuM.: According to BeeOnRope (who has more *practical* experience with this than I do), in real life contention is usually rare so `cmpxchg` very rarely has to retry. Or if there's so much contention that it does retry often, then you're pretty much screwed on your program running fast. Once the cache line arrives to satisfy the load, usually the `lock cmpxchg` can also complete without losing it. You could put a counter inside the retry loop to count retries as a profiling build, if you're curious. – Peter Cordes Jun 26 '18 at 16:12
  • @PeterCordes: I agree with BeeOnRope ;) I liked `bts` for its elegance and theoretical superiority, but the simplest performance advice in multi-threading is *avoid contention*. Since this acquire/release is not supposed to happen often, I do not expect much contention on this cache-line, and therefore `bts` and `cmpxchg` should perform similarly. My benchmarks certainly seem to agree. I am more worried about which other instructions (on top of `__atomic_fetch_or`) could be miscompiled. – Matthieu M. Jun 26 '18 at 17:15
  • @MatthieuM.: indeed, that's worrying. If it's something about BTS patterns that are broken, then probably fetch_and and fetch_xor with masks that clear a bit or flip a bit (`btr` / `btc`). – Peter Cordes Jun 26 '18 at 17:17
  • @PeterCordes: I asked for some clarifications from Jakub in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86314; unfortunately he may have missed my question. He pushed a fix for the bug, but I could not understand the scope of the issue; I hope it is specific to the `setcc + movzbl` to `xor + setcc` transformation. – Matthieu M. Jun 26 '18 at 17:53

xor-zero / set flags / setcc is usually the best way to create a 32-bit 0/1 integer.

Obviously it's only safe to do this if you have a spare register to xor-zero without destroying any inputs to the flag-setting instruction(s), so this is pretty clearly a bug.

(Otherwise you can setcc dl / movzx eax,dl. Separate regs are preferable so the movzx can be zero latency (mov-elimination) on some CPUs, but it's on the critical path on other CPUs so the xor/set-flags / setcc idiom is preferable because fewer instructions are on the critical path.)
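Concretely, the two idioms described above look like this (a sketch; the register choices are arbitrary):

; idiom 1: xor-zero a spare register before the flags are set
xor   ecx, ecx                   ; no flags live yet, ecx is spare
lock bts QWORD PTR [rdx], rax    ; sets CF; does not read ecx
setc  cl                         ; ecx = 0/1, no zero-extend needed

; idiom 2: setcc into one register, zero-extend into another
setc  dl
movzx eax, dl                    ; separate regs allow mov-elimination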

IDK why gcc creates the integer value of (previous & mask) == 0 in a register at all; that's probably part of the bug.

Peter Cordes