
The GCC manual only shows examples where __builtin_expect() is placed around the entire condition of an 'if' statement.

I also noticed that GCC does not complain if I use it, for example, with a ternary operator, or in any arbitrary integral expression for that matter, even one that is not used in a branching context.

So, I wonder what the underlying constraints of its usage actually are.

Will it retain its effect when used in a ternary operation like this:

int foo(int i)
{
  return __builtin_expect(i == 7, 1) ? 100 : 200;
}

And what about this case:

int foo(int i)
{
  return __builtin_expect(i, 7) == 7 ? 100 : 200;
}

And this one:

int foo(int i)
{
  int j = __builtin_expect(i, 7);
  return j == 7 ? 100 : 200;
}
Kristian Spangsege

1 Answer


It apparently works for both ternary and regular if statements.

First, consider the following three code samples. Two of them use __builtin_expect, one in a regular if statement and one in a ternary expression; the third does not use it at all.

builtin.c:

#include <stdio.h>

int main(void)
{
    char c = getchar();
    const char *printVal;
    if (__builtin_expect(c == 'c', 1))
    {
        printVal = "Took expected branch!\n";
    }
    else
    {
        printVal = "Boo!\n";
    }

    printf(printVal);
}

ternary.c:

#include <stdio.h>

int main(void)
{
    char c = getchar();
    const char *printVal = __builtin_expect(c == 'c', 1) 
        ? "Took expected branch!\n"
        : "Boo!\n";

    printf(printVal);
}

nobuiltin.c:

#include <stdio.h>

int main(void)
{
    char c = getchar();
    const char *printVal;
    if (c == 'c')
    {
        printVal = "Took expected branch!\n";
    }
    else
    {
        printVal = "Boo!\n";
    }

    printf(printVal);
}

When compiled with -O3, all three result in the same assembly. However, when optimization is left off (no -O flag, GCC 4.7.2), ternary.c and builtin.c produce the same assembly listing where it matters:

builtin.s:

    .file   "builtin.c"
    .section    .rodata
.LC0:
    .string "Took expected branch!\n"
.LC1:
    .string "Boo!\n"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $32, %esp
    call    getchar
    movb    %al, 27(%esp)
    cmpb    $99, 27(%esp)
    sete    %al
    movzbl  %al, %eax
    testl   %eax, %eax
    je  .L2
    movl    $.LC0, 28(%esp)
    jmp .L3
.L2:
    movl    $.LC1, 28(%esp)
.L3:
    movl    28(%esp), %eax
    movl    %eax, (%esp)
    call    printf
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Debian 4.7.2-4) 4.7.2"
    .section    .note.GNU-stack,"",@progbits

ternary.s:

    .file   "ternary.c"
    .section    .rodata
.LC0:
    .string "Took expected branch!\n"
.LC1:
    .string "Boo!\n"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $32, %esp
    call    getchar
    movb    %al, 31(%esp)
    cmpb    $99, 31(%esp)
    sete    %al
    movzbl  %al, %eax
    testl   %eax, %eax
    je  .L2
    movl    $.LC0, %eax
    jmp .L3
.L2:
    movl    $.LC1, %eax
.L3:
    movl    %eax, 24(%esp)
    movl    24(%esp), %eax
    movl    %eax, (%esp)
    call    printf
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Debian 4.7.2-4) 4.7.2"
    .section    .note.GNU-stack,"",@progbits

Whereas nobuiltin.c does not (nobuiltin.s):

    .file   "nobuiltin.c"
    .section    .rodata
.LC0:
    .string "Took expected branch!\n"
.LC1:
    .string "Boo!\n"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $32, %esp
    call    getchar
    movb    %al, 27(%esp)
    cmpb    $99, 27(%esp)
    jne .L2
    movl    $.LC0, 28(%esp)
    jmp .L3
.L2:
    movl    $.LC1, 28(%esp)
.L3:
    movl    28(%esp), %eax
    movl    %eax, (%esp)
    call    printf
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Debian 4.7.2-4) 4.7.2"
    .section    .note.GNU-stack,"",@progbits

The relevant part:

[Image: diff of the assembly listings]

Basically, __builtin_expect causes extra code (sete %al, movzbl %al, %eax, testl %eax, %eax) to be emitted between the comparison and the branch, so the je .L2 depends on the outcome of testl %eax, %eax rather than directly on the comparison of the input char with 'c'. My naive assumption is that the CPU is more likely to predict that outcome as 1. In the nobuiltin.c case no such code exists, and the jne .L2 directly follows the comparison with 'c' (cmpb $99).

Remember that branch prediction is mainly done in the CPU. Here GCC appears to be "laying a trap" for the CPU's branch predictor to assume which path will be taken, via the extra code and the switch from jne to je. I do not have a source for this: Intel's official optimization manual does not mention treating first encounters with je and jne differently for branch prediction, so I can only assume the GCC team arrived at this via trial and error.
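Whatever the hint does to code generation, the GCC manual documents that __builtin_expect evaluates to its first argument, so the three forms from the question should at least be interchangeable as far as results go. A minimal sketch, assuming GCC or a compatible compiler such as Clang; the function names are made up for illustration:

```c
/* Hypothetical names; the three bodies mirror the three cases in the question. */

/* Hint wraps the whole condition: "i == 7 is expected to be true". */
int foo_condition(int i)
{
    return __builtin_expect(i == 7, 1) ? 100 : 200;
}

/* Hint wraps the operand: "i is expected to have the value 7". */
int foo_operand(int i)
{
    return __builtin_expect(i, 7) == 7 ? 100 : 200;
}

/* Result of the hint is stored in a variable before the comparison. */
int foo_variable(int i)
{
    int j = __builtin_expect(i, 7);
    return j == 7 ? 100 : 200;
}
```

This only confirms that the results agree. It says nothing about whether the hint still influences the branch in the second and third forms; that would have to be checked in the generated assembly.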

I am sure there are better test cases where GCC's own branch prediction can be seen more directly (instead of observing hints to the CPU), though I do not know how to construct such a case concisely. (Guess: it would likely involve loop unrolling during compilation.)
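One candidate for such a test case, along the lines suggested in the comments below, is a function whose unlikely path does noticeably more work than the likely one. The following is only a sketch under that assumption: the LIKELY/UNLIKELY macros and both function names are hypothetical, and the noinline attribute is there purely to keep the cold path easy to find in a listing.

```c
#include <stddef.h>
#include <stdio.h>

/* Conventional wrappers; !!(x) normalizes any truthy value to 0 or 1. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical cold path: kept out of line so that its placement
   relative to the loop is easy to spot in the assembly. */
__attribute__((noinline))
static long report_negative(size_t index, int value)
{
    fprintf(stderr, "negative value %d at index %zu\n", value, index);
    return -1;
}

/* Sums the array; returns -1 on the first negative element. */
long sum_nonnegative(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (UNLIKELY(values[i] < 0))
            return report_negative(i, values[i]);
        total += values[i];
    }
    return total;
}
```

To experiment, this could be compiled with gcc -O2 -S once as written and once with the UNLIKELY wrapper removed, and the position of the call to report_negative compared between the two listings. Whether they differ at all will depend on the compiler version and target.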

Mahmoud Al-Qudsi
  • Very nice analysis, and very nice presentation of results. Thank you for the effort. – Kristian Spangsege Feb 09 '13 at 12:19
  • This doesn't really show anything other than that `__builtin_expect` has no effect on optimized code for x86 (since you said they were the same with -O3). The only reason they are different before is that `__builtin_expect` is a function which returns the value given to it, and that return value cannot happen through flags. Otherwise, the difference would stay in the optimized code. – ughoavgfhw Feb 10 '13 at 00:48
  • @ughoavgfhw: What do you mean by "that return value cannot happen through flags"? – Kristian Spangsege Feb 10 '13 at 14:21
  • @Kristian The calling convention does not allow a return value to be indicated by bits in the flags register, which is why the unoptimized code needs to `sete %al`. It's the built in function returning the result of the comparison. – ughoavgfhw Feb 10 '13 at 17:34
  • `__builtin_expect` is likely (well, empirically, according to your code) a no-op on such a simple piece of code, especially on x86. You should try a piece of code where the unlikely codepath executes a lot of additional instructions, and see if the compiler is smart enough to move it out of the hot path. (On x86, the branch predictor is so good that the only reason to use `__builtin_expect` is to shrink the icache footprint of the hot path.) You could also try compiling for ARM or PPC, which would be more likely to have special compiler logic devoted to fooling the branch predictor. – Quuxplusone May 27 '16 at 17:48