
Consider f0 and f1 below.

long long b;

void f0(int a) {
    a %= 10;
    if (a == 0) b += 11;
    else if (a == 1) b += 13;
    else if (a == 2) b += 17;
    else if (a == 3) b += 19;
    else if (a == 4) b += 23;
    else if (a == 5) b += 29;
    else if (a == 6) b += 31;
    else if (a == 7) b += 37;
    else if (a == 8) b += 41;
    else if (a == 9) b += 43;
}

void f1(int a) {
    a %= 10;
    if (a == 0) b += 11;
    else if (a == 1) b += 13;
    else if (a == 2) b += 17;
    else if (a == 3) b += 19;
    else if (a == 4) b += 23;
    else if (a == 5) b += 29;
    else if (a == 6) b += 31;
    else if (a == 7) b += 37;
    else if (a == 8) b += 41;
    else if (a == 9) b += 43;
    else __builtin_unreachable();
}

Assuming the argument a is always positive in the program, the compiler should produce better-optimized code for f1, because in f0, a can fall through the whole if-else chain when it is negative (C's % truncates toward zero, so a negative a stays negative after a %= 10), and the compiler therefore has to emit a default "do nothing and return" path. In f1, however, the possible range of a is stated explicitly with __builtin_unreachable, so the compiler doesn't have to handle the out-of-range case at all.
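To see why the negative case matters, here is a minimal standalone snippet (not part of the benchmark) showing C's remainder behavior for a negative input:

    #include <stdio.h>

    int main(void) {
        int a = -7;
        a %= 10;               /* C's % truncates toward zero */
        printf("%d\n", a);     /* prints -7, so none of the 0..9 cases match */
        return 0;
    }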

However, f1 actually runs slower, so I had a look at the disassembly. This is the control flow part of f0.

    jne .L2
    addq    $11, b(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L2:
    cmpl    $9, %eax
    ja  .L1
    movl    %eax, %eax
    jmp *.L5(,%rax,8)
    .section    .rodata
    .align 8
    .align 4
.L5:
    .quad   .L1
    .quad   .L13
    .quad   .L12
    .quad   .L11
    .quad   .L10
    .quad   .L9
    .quad   .L8
    .quad   .L7
    .quad   .L6
    .quad   .L4
    .text
    .p2align 4,,10
    .p2align 3
.L4:
    addq    $43, b(%rip)
.L1:
    ret
    .p2align 4,,10
    .p2align 3
.L6:
    addq    $41, b(%rip)
    ret
    .p2align 4,,10
    .p2align 3
...

gcc smartly turns the if-else chain into a jump table and places the default case .L1 right after .L4 so they share a ret, saving space.

Now have a look at the whole control-flow part of f1's disassembly.

    jne .L42
    movq    b(%rip), %rax
    addq    $11, %rax
.L43:
    movq    %rax, b(%rip)
    ret
    .p2align 4,,10
    .p2align 3
.L42:
    movl    %eax, %eax
    jmp *.L46(,%rax,8)
    .section    .rodata
    .align 8
    .align 4
.L46:
    .quad   .L45
    .quad   .L54
    .quad   .L53
    .quad   .L52
    .quad   .L51
    .quad   .L50
    .quad   .L49
    .quad   .L48
    .quad   .L47
    .quad   .L45
    .text
    .p2align 4,,10
    .p2align 3
.L47:
    movq    b(%rip), %rax
    addq    $41, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L48:
    movq    b(%rip), %rax
    addq    $37, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L49:
    movq    b(%rip), %rax
    addq    $31, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L50:
    movq    b(%rip), %rax
    addq    $29, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L51:
    movq    b(%rip), %rax
    addq    $23, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L52:
    movq    b(%rip), %rax
    addq    $19, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L53:
    movq    b(%rip), %rax
    addq    $17, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L54:
    movq    b(%rip), %rax
    addq    $13, %rax
    jmp .L43
    .p2align 4,,10
    .p2align 3
.L45:
    movq    b(%rip), %rax
    addq    $43, %rax
    jmp .L43

Yes, gcc did honor __builtin_unreachable, but for some reason there is an unnecessary jump before each return, and the jump table has a duplicate entry for .L45. Also, instead of a single addq $N, b(%rip), every case does movq b(%rip), %rax, then addq $N, %rax, then movq %rax, b(%rip) before returning.

What has made gcc produce apparently dumb code?

The binary was compiled with -O3 under Fedora Linux, and the gcc version I'm using is 11.2.1 20211203.

  • Probably should just report this on GCC's bugzilla, https://gcc.gnu.org/bugzilla/, with the keyword "missed-optimization" – Peter Cordes Dec 09 '21 at 10:17
  • 2
    What happens if you change `else if (a == 9)` to a simple `else`? – Ian Abbott Dec 09 '21 at 10:34
  • 1. The code is not the same. 2. This behavior changed in the 11.x versions; prior to 11 the generated code was exactly the same. 3. Did you actually test the execution time? The number of instructions does not have to be the best performance indicator. `-Os` makes the generated code identical. – 0___________ Dec 09 '21 at 11:26
  • @0___________ I did check the execution time and `f1` is consistently slower. `-Os` does not make the generated code identical in my gcc version; the unnecessary `jmp`, `ret` and the `movq`, `addq`, `movq` sequence still remain. –  Dec 09 '21 at 11:52
  • A simpler way to convince the compiler that the result of `a % 10` must be between 0 and 9 is just to declare `a` as `unsigned`. It doesn't give you quite what you want, but does give the equivalent of `else b += 43;` like in @IanAbbott's suggestion. – Nate Eldredge Dec 09 '21 at 15:00
  • clang is able to figure out that it can use a lookup table for the addend, and avoid the less efficient jump table: https://godbolt.org/z/vbv7vdWc9 Though even with `__builtin_unreachable()` it doesn't notice that it could skip the range check. (It'd be nice if it could figure out that the table entries could be 8 bits instead of 64 and loaded with zero-extension - would save 70 bytes.) Anyway, if you really need to optimize something like this, you might want to just write it as a lookup table in the first place. – Nate Eldredge Dec 09 '21 at 15:11
  • @NateEldredge Good point. I keep forgetting that the C `%` operator is a remainder operator and not a modulo operator. – Ian Abbott Dec 09 '21 at 15:11
  • @NateEldredge I agree clang outputs better code in this case. This question comes from mere curiosity, or from checking whether I missed something before reporting a bug, rather than from an actual practical problem. Still, gcc's optimized output for `f1` looks very wrong to me, given that it can at least do `f0`. –  Dec 09 '21 at 16:36
  • Just for the record, either way of branching is pretty crap compared to just looking up a value to add, as Nate pointed out that `clang` is able to do. You can manually use `static const uint8_t LUT[] = {[0] = 11, [1] = 13, ...};` and `b += LUT[a];` which should compile to a movzx and a memory-destination add, with no branching at all. – Peter Cordes Dec 09 '21 at 18:37
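For reference, a minimal standalone sketch of the lookup-table approach Nate Eldredge and Peter Cordes describe above (the name f2 is made up, and a is declared unsigned per Nate Eldredge's suggestion so that a % 10 is provably in 0..9):

    #include <stdint.h>

    long long b;

    /* same addend mapping as f0/f1, expressed as data instead of control flow */
    static const uint8_t LUT[10] = {11, 13, 17, 19, 23, 29, 31, 37, 41, 43};

    void f2(unsigned a) {
        b += LUT[a % 10];      /* one table load plus a memory-destination add */
    }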

1 Answer


Here's the best explanation I can come up with.

The compiler can evidently perform, at least to some extent, an optimization where code common to all branches of the if/else tree is factored out (hoisted or sunk as appropriate). But in the f0 version, this optimization can't be applied because the "default" case has no code at all, and in particular neither loads nor stores b. So the compiler just optimizes the cases individually as best it can, leaving each one as a single RMW add-memory instruction.

In the f1 version, your __builtin_unreachable has removed the default branch. So now every branch consists, conceptually, of a load of b, an add of some constant, and a store back to b. The compiler seems to notice that they all have the store in common, and thus sinks it out - the store instruction appears just once, and every case jumps to it. Unfortunately, this actually results in worse code overall, because now the individual cases can't use the RMW add; they have to do the load and add as separate instructions. Moreover, the cases can no longer just ret by themselves; they all have to jump to the factored-out store. And the compiler has somehow not realized that the load could be hoisted out, so it is needlessly duplicated across all the cases.
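In C terms, the code emitted for f1 behaves roughly like the following hand-written rendering (f1_sunk_store is a made-up name for illustration, not actual GCC output):

    long long b;

    void f1_sunk_store(int a) {
        long long tmp;
        /* every case does its own load of b and its own add ... */
        switch (a % 10) {
        case 0: tmp = b + 11; break;
        case 1: tmp = b + 13; break;
        case 2: tmp = b + 17; break;
        case 3: tmp = b + 19; break;
        case 4: tmp = b + 23; break;
        case 5: tmp = b + 29; break;
        case 6: tmp = b + 31; break;
        case 7: tmp = b + 37; break;
        case 8: tmp = b + 41; break;
        case 9: tmp = b + 43; break;
        default: __builtin_unreachable();
        }
        b = tmp;               /* ... but the store is sunk and shared */
    }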

I would guess part of the issue is that the hoisting/sinking is done in a target-independent pass that treats load, add, and store as independent operations. If they remain together, then some later target-specific peephole pass may combine them into the single add-memory instruction; but the earlier pass doesn't seem to consider that leaving them together might be advantageous, and thinks that any hoisting must be good. On a RISC-type load/store machine, where RMW always has to be three instructions, sinking just the store would still be somewhat helpful, maybe, but for x86 it definitely isn't.

So it's maybe two separate missed-optimization problems. The first is not noticing that the load can be hoisted (or maybe noticing but deciding not to do it), and that one seems like a clear bug. The second is not properly evaluating whether sinking the store is worth the cost of the extra jump, and that may be more a matter of applying heuristics that happen to be wrong in this case.

Nate Eldredge
  • This problem is new in the 11.x versions; previous versions were emitting the same code for both. They screwed up something. – 0___________ Dec 09 '21 at 17:24
  • @0___________: The 10.3 code is kind of interesting. First, it is not smart enough to use a jump table at all, having instead a chain of seven `cmp / je` (`a == 0` gets special handling for some reason, and 8 and 9 get combined into a conditional move and add.) So 11.x is an improvement in that regard. If you change `a` to `unsigned`, then 10.3 decides to hoist the load but not the store, which is at least a little better than the other way around - we still get to `ret` directly from every case instead of taking the extra jump. – Nate Eldredge Dec 09 '21 at 17:34
  • Terminology: the store is *after* the branching, so pulling it out is "sinking" it, not "hoisting" it. Even though in this case GCC happened to put that basic block earlier than the add blocks. – Peter Cordes Dec 09 '21 at 18:33
  • @PeterCordes: Thanks, I thought "hoisting" sounded wrong, but couldn't put my finger on what it should be instead. Will edit. – Nate Eldredge Dec 09 '21 at 22:03