
I have seen this kind of code in many SO answers:

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

The authors claim that this is branchless.

But is this really branchless on current architectures? (x86, ARM...) And is there a real standard guarantee that this is branchless?

mch
galinette
  • Why do you need such a guarantee? – Slava Dec 08 '15 at 15:04
  • The C++ standard guarantees nothing about the machine code (or even that there is machine code), except that it, or whatever is used instead, will reproduce the effects required by the semantics of the C++ statements. These effects are changes to memory contents and calls of library functions. – Cheers and hth. - Alf Dec 08 '15 at 15:09
  • It is not branchless on a number of embedded processors in 2015. – chux - Reinstate Monica Dec 08 '15 at 15:16
  • What Alf said. Note that there is also the variant where you use an array, `const T v[] = {b, a}; return v[a>b]`, which, depending on the predictability of the code, may run even faster. However, only do this kind of microoptimization if you have measured relevance. – Sebastian Mach Dec 08 '15 at 15:20
  • @Slava : Because I want to do heavy premature optimizations & over-engineering. – galinette Dec 08 '15 at 15:29
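The array variant mentioned in the comments above can be written out in full. This is only a minimal sketch: the names `imax_table` and `check_imax_table` are made up for illustration, and nothing here guarantees that a given compiler emits branch-free code for it.

```cpp
#include <cassert>

// Hypothetical name: lookup-table variant of the question's imax.
template <typename T>
inline T imax_table(T a, T b)
{
    // (a > b) converts to 0 or 1, so it can index a two-element array:
    // index 0 holds b (picked when a <= b), index 1 holds a.
    const T v[] = {b, a};
    return v[a > b];
}

// Small self-check (hypothetical helper); call it from main() or a test.
inline void check_imax_table()
{
    assert(imax_table(3, 7) == 7);
    assert(imax_table(7, 3) == 7);
    assert(imax_table(-1, -1) == -1);
    assert(imax_table(2.5, 1.5) == 2.5);
}
```

Whether this beats the arithmetic or the ternary form depends on the compiler and the target, so it is worth measuring before adopting it.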

2 Answers


x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.

If you use the “naïve” approach

int imax(int a, int b) {
    return a > b ? a : b;
}

The compiler would generate even more efficient branch-less code using the CMOVcc (conditional move) family of instructions.

ARM (in its classic 32-bit instruction set) can conditionally execute almost every instruction, which allows the compiler to compile both your implementation and the naïve one efficiently, with the naïve implementation being faster.
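To compare the variants yourself, a small test file along the following lines may help. It is a sketch only: the names `imax_arith`, `imax_ternary` and `check_variants_agree` are invented here, and the file merely checks that both forms return the same values as `std::max` for `int`; it says nothing about which instructions a compiler picks.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic variant from the question (renamed for this sketch).
template <typename T>
inline T imax_arith(T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

// The "naïve" ternary variant from this answer, as a template.
template <typename T>
inline T imax_ternary(T a, T b)
{
    return a > b ? a : b;
}

// Checks that both variants agree with std::max on a few int samples.
inline void check_variants_agree()
{
    const int samples[] = {-5, -1, 0, 1, 42};
    for (int a : samples) {
        for (int b : samples) {
            assert(imax_arith(a, b) == std::max(a, b));
            assert(imax_ternary(a, b) == std::max(a, b));
        }
    }
}
```

Note that for floating-point types the arithmetic form is not a drop-in replacement: if one argument is infinite, the multiplication by 0 yields NaN, whereas the ternary form simply returns one of its arguments.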

fuz
  • So in other words: by attempting to manually optimize the code, instead of writing as readable and simple code as possible, the code turned both uglier and slower. – Lundin Dec 08 '15 at 15:24
  • Which is not surprising at all. Compiler writers generally know chipset instructions much better than C++ programmers, by definition. – SergeyA Dec 08 '15 at 15:25
  • That's a good answer for the platform part, but is there really a guarantee to be branchless in the standard? And I suppose that using if/else will work like your "naïve" approach, which basically means that these pseudo-optimized answers I'm citing are degrading performance compared to std functions (min, max, ...) – galinette Dec 08 '15 at 15:34
  • @galinette I'm not familiar with the C++ standard but if it's anything like the C standard, it makes absolutely no guarantees about how something is implemented. From the perspective of the standard, even addition might be implemented with branching. – fuz Dec 08 '15 at 15:35
  • @galinette I removed the sentence you added to the answer as I'm not familiar with the C++ standard and don't want to make any claims about it. – fuz Dec 09 '15 at 09:19
  • I'm not sure the x86 solution would work on a non-integer version of this function. – Leeor Dec 23 '15 at 00:46
  • @Leeor When a 387 FPU is used, [it's possible](https://en.wikipedia.org/wiki/FCMOV). I'm not sure about MMX and SSE though. – fuz Dec 23 '15 at 00:48
  • @Leeor And for SSE, there's the `minsd` instruction which does that in one operation. – fuz Dec 23 '15 at 00:50

I stumbled upon this SO question because I was asking myself the same thing… turns out it’s not always branchless. For instance, the following code…

const struct op {
    const char *foo;
    int bar;
    int flags;
} ops[] = {
    { "foo", 5, 16 },
    { "bar", 9, 16 },
    { "baz", 13, 0 },
    { 0, 0, 0 }
};

extern int foo(const struct op *, int);

int
bar(void *a, void *b, int c, const struct op *d)
{
    c |= (a == b) && (d->flags & 16);
    return foo(d, c) + 1;
}

… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) at all optimisation levels. The output from 3.4.6 is easier to read by hand, so I’ll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:

[…]
    .text
    .p2align 2,,3
    .globl   bar
    .type    bar , @function
bar:
    push     %ebp
    mov      %ebp, %esp
    push     %ebx
    push     %eax
    mov      %eax, DWORD PTR [%ebp+12]
    xor      %ecx, %ecx
    cmp      DWORD PTR [%ebp+8], %eax
    mov      %edx, DWORD PTR [%ebp+16]
    mov      %ebx, DWORD PTR [%ebp+20]
    je       .L4
.L2:
    sub      %esp, 8
    or       %edx, %ecx
    push     %edx
    push     %ebx
    call     foo
    inc      %eax
    mov      %ebx, DWORD PTR [%ebp-4]
    leave
    ret
    .p2align 2,,3
.L4:
    test     BYTE PTR [%ebx+8], 16
    je       .L2
    mov      %cl, 1
    jmp      .L2
    .size    bar , . - bar

Turns out the pointer comparison compiles to a compare followed by a conditional jump, and the 1 is even inserted in a separate out-of-line block (.L4).

Not even using !!(a == b) makes a difference here.

tl;dr

Check the actual compiler output of the actual compilation (assembly with -S, or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible); compilers are unpredictable beasts.

mirabilos
  • Note that if you compile for 32 bit, you might need to specify a sufficiently recent architecture (e.g. `-march=i686`) because the required branch-free instructions were only added some 25 years ago. – fuz Sep 28 '20 at 07:50
  • And note that in this specific case, branch-free code cannot be used because the compiler cannot assume that it is allowed to dereference `d->flags` if `a != b`. Thus, it has to emit a branch, guarding this dereference. – fuz Sep 28 '20 at 07:53
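The point about the guarded dereference can be illustrated by comparing the short-circuit `&&` with a non-short-circuit `&`. The sketch below is hedged: the helper names `combine_short_circuit`, `combine_eager` and `check_combine` are hypothetical, the eager form is only valid under the assumption that `d` always points to a valid `op`, and it merely gives the compiler the freedom to avoid a branch without forcing it to.

```cpp
#include <cassert>

struct op {
    const char *foo;
    int bar;
    int flags;
};

// Mirrors the expression in bar(): d->flags is read only when a == b,
// so the compiler has to guard the load.
inline int combine_short_circuit(const void *a, const void *b, int c, const op *d)
{
    c |= (a == b) && (d->flags & 16);
    return c;
}

// Eager form: always reads d->flags (assumes d is always valid).
// Both operands are bools, so the bitwise & yields 0 or 1.
inline int combine_eager(const void *a, const void *b, int c, const op *d)
{
    c |= (a == b) & ((d->flags & 16) != 0);
    return c;
}

// Checks that both forms agree whenever d is valid.
inline void check_combine()
{
    const op with_flag = {"foo", 5, 16};
    const op without_flag = {"baz", 13, 0};
    int x = 0, y = 0;
    assert(combine_short_circuit(&x, &x, 0, &with_flag) == 1);
    assert(combine_eager(&x, &x, 0, &with_flag) == 1);
    assert(combine_short_circuit(&x, &x, 0, &without_flag) == 0);
    assert(combine_eager(&x, &x, 0, &without_flag) == 0);
    assert(combine_short_circuit(&x, &y, 0, &with_flag) == 0);
    assert(combine_eager(&x, &y, 0, &with_flag) == 0);
    assert(combine_eager(&x, &x, 2, &with_flag) == 3);
}
```

As with everything else in this thread, the generated assembly is the only reliable way to tell whether the eager form actually ends up branch-free on a given compiler and target.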