
I've been experimenting with the following and have noticed that the branchless “if” defined here (now with &-!! replacing *!!) can speed up certain bottleneck code by as much as (almost) 2x on 64-bit Intel targets with clang:

// Produces x if f is true, else 0 if f is false.
#define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))

// Produces x if f is true, else y if f is false.
#define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                     ((y) & -((typeof(y)) !(f))))

Note that f should be a reasonably simple expression with no side-effects, so that the compiler is able to do its best optimizations.

Performance is highly dependent on CPU and compiler. The branchless ‘if’ performance is excellent with clang; I haven't found any cases yet where the branchless ‘if/else’ is faster, though.

My question is: are these safe and portable as written (meaning guaranteed to give correct results on all targets), and can they be made faster?

Example usage of branchless if/else

These compute 64-bit minimum and maximum.

inline uint64_t uint64_min(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a <= b), a, b);
}

inline uint64_t uint64_max(uint64_t a, uint64_t b)
{
  return BRANCHLESS_IF_ELSE((a >= b), a, b);
}

Example usage of branchless if

This is 64-bit modular addition — it computes (a + b) % n. The branching version (not shown) suffers terribly from branch prediction failures, but the branchless version is very fast (at least with clang).

inline uint64_t uint64_add_mod(uint64_t a, uint64_t b, uint64_t n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  uint64_t c = a + b - BRANCHLESS_IF((a >= n - b), n);

  assert(c < n);
  return c;
}

Update: Full concrete working example of branchless if

Below is a full working C11 program that demonstrates the speed difference between the branching and branchless versions of a simple if conditional, if you would like to try it on your system. The program computes modular exponentiation, that is (a ** b) % n, for extremely large values.

To compile, use the following on the command line:

  • -O3 (or whatever high optimization level you prefer)
  • -DNDEBUG (to disable assertions, for speed)
  • Either -DBRANCHLESS=0 or -DBRANCHLESS=1 to specify branching or branchless behavior, respectively

On my system, here's what happens:

$ cc -DBRANCHLESS=0 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 0
CPU time:  21.83 seconds
foo = 10585369126512366091

$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
BRANCHLESS = 1
CPU time:  11.76 seconds
foo = 10585369126512366091

$ cc --version
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz Intel Core i7).

// SPEED TEST OF MODULAR MULTIPLICATION WITH BRANCHLESS CONDITIONALS

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>

typedef  uint64_t  uint64;

//------------------------------------------------------------------------------
#if BRANCHLESS
  // Actually branchless.
  #define  BRANCHLESS_IF(f,x)          ((x) & -((typeof(x))!!(f)))
  #define  BRANCHLESS_IF_ELSE(f,x,y)  (((x) & -((typeof(x))!!(f))) | \
                                       ((y) & -((typeof(y)) !(f))))
#else
  // Not actually branchless, but used for comparison.
  #define  BRANCHLESS_IF(f,x)          ((f)? (x) : 0)
  #define  BRANCHLESS_IF_ELSE(f,x,y)   ((f)? (x) : (y))
#endif

//------------------------------------------------------------------------------
// 64-bit modular multiplication.  Computes (a * b) % n without division.

static uint64 uint64_mul_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n); assert(b < n);

  if (a < b) { uint64 t = a; a = b; b = t; }  // Ensure that b <= a.

  uint64 c = 0;
  for (; b != 0; b /= 2)
  {
    // This computes c = (c + a) % n if (b & 1).
    c += BRANCHLESS_IF((b & 1), a - BRANCHLESS_IF((c >= n - a), n));
    assert(c < n);

    // This computes a = (a + a) % n.
    a += a - BRANCHLESS_IF((a >= n - a), n);
    assert(a < n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
// 64-bit modular exponentiation.  Computes (a ** b) % n using modular
// multiplication.

static
uint64 uint64_pow_mod(uint64 a, uint64 b, const uint64 n)
{
  assert(n > 1); assert(a < n);

  uint64 c = 1;

  for (; b > 0; b /= 2)
  {
    if (b & 1)
      c = uint64_mul_mod(c, a, n);

    a = uint64_mul_mod(a, a, n);
  }

  assert(c < n);
  return c;
}

//------------------------------------------------------------------------------
int main(const int argc, const char *const argv[const])
{
  printf("BRANCHLESS = %d\n", BRANCHLESS);

  clock_t clock_start = clock();

  #define SHOW_RESULTS 0

  uint64 foo = 0;  // Used in forcing compiler not to throw away results.

  uint64 n = 3, a = 1, b = 1;
  const uint64 iterations = 1000000;
  for (uint64 iteration = 0; iteration < iterations; iteration++)
  {
    uint64 c = uint64_pow_mod(a%n, b, n);

    if (SHOW_RESULTS)
    {
      printf("(%"PRIu64" ** %"PRIu64") %% %"PRIu64" = %"PRIu64"\n",
             a%n, b, n, c);
    }
    else
    {
      foo ^= c;
    }

    n = n * 3 + 1;
    a = a * 5 + 3;
    b = b * 7 + 5;
  }

  clock_t clock_end = clock();
  double elapsed = (double)(clock_end - clock_start) / CLOCKS_PER_SEC;
  printf("CPU time:  %.2f seconds\n", elapsed);

  printf("foo = %"PRIu64"\n", foo);

  return 0;
}

Second update: Intel vs. ARM performance

  • Testing on 32-bit ARM targets (iPhone 3GS/4S, iPad 1/2/3/4, as compiled by Xcode 6.1 with clang) reveals that the branchless “if” here is actually about 2–3 times slower than ternary ?: for the modular exponentiation code in those cases. So it seems that these branchless macros are not a good idea if maximum speed is needed, although they might be useful in rare cases where constant speed is needed.
  • On 64-bit ARM targets (iPhone 6+, iPad 5), the branchless “if” runs the same speed as ternary ?: — again as compiled by Xcode 6.1 with clang.
  • For both Intel and ARM (as compiled by clang), the branchless “if/else” was about twice as slow as ternary ?: for computing min/max.
Todd Lehman
  • You're saying that these are faster than `f ? a : b`? – Oliver Charlesworth Aug 08 '15 at 19:38
  • Also worth noting that `f` is evaluated twice in the second version, which may have undesirable side-effects. – Oliver Charlesworth Aug 08 '15 at 19:40
  • Also note that if `a` or `b` are NaN or infinity, weird things will happen. – Oliver Charlesworth Aug 08 '15 at 19:41
  • @OliverCharlesworth — (1) Yup! They can be much faster than `f ? a : b`. I've seen 2x speedups. (2) Ya, you have to be careful not to do anything with side-effects when passing `f`. `f` should be a simple expression, in which case it is not actually evaluated twice, because modern compilers are excellent about redundant subexpression elimination. (3) `a` and `b` can't be `NaN` because these are only intended to be used for integers. – Todd Lehman Aug 08 '15 at 19:59
  • It might help to know that the expression `(uint32_t)(x|(-x))>>31` is equivalent to `x==0? 0:1`. See [here](http://stackoverflow.com/a/25906424/1382251) for more details. – barak manos Aug 08 '15 at 20:17
  • I find that genuinely surprising; I would expect the compiler authors to have done something optimal for this. – Oliver Charlesworth Aug 08 '15 at 20:20
  • Another trick: `#define MIN(a,b) (a & (signed)((a-b)>>63)) | (b & ~(signed)((a-b)>>63))`. – barak manos Aug 08 '15 at 20:39
  • @OliverCharlesworth — Ya, it really is surprising! Hey, I just updated the question to include a full working example program that clocks itself, if you want to try it on your own system. (It's even got a nested branchless if. :-) – Todd Lehman Aug 08 '15 at 21:09
  • I see you hate typing the _t. – StackedCrooked Aug 08 '15 at 21:12
  • On my machine, with gcc 4.9.2, the branchless version (16.6s) is slightly **slower** than the version with branches (15.5s) – dyp Aug 08 '15 at 21:22
  • Though I can reproduce the OP's observations when using clang 3.6.0 (branchless 10.7s being almost 2x as fast as with branches 19.5s). – dyp Aug 08 '15 at 21:28
  • @dyp — On one system I tried this on with gcc, it was actually twice as bad with branchless. But now I've rewritten the macros to replace the multiplication (`*!!`) with a bitwise logical and (`&-!!`), and it now runs the same speed for me with gcc whether it's branching or branchless. – Todd Lehman Aug 09 '15 at 00:33
  • It's not portable because it depends on the representation of -1 being all 1's. There's nothing in the C standard that requires this. Specifically, it wouldn't work on a machine with 1's complement or sign-magnitude arithmetic. Having said that, I can't cite a modern machine that doesn't use 2's complement for integers. – Gene Aug 09 '15 at 02:24
  • @Gene, if this is applied to unsigned types this has nothing to do with the sign representation. `-1` converted to an unsigned type is always all-1's. – Jens Gustedt Aug 10 '15 at 06:24
  • @JensGustedt — Are you sure? `-1` in two's complement is `0xFFFFFFFFFFFFFFFF`, but in one's complement it is `0xFFFFFFFFFFFFFFFE`, isn't it? It seems to me, Gene is correct, in which case using `&` against `-1` would not be good if the target system used one's complement. – Todd Lehman Aug 10 '15 at 20:18
  • @JensGustedt Indeed, Todd Lehman is correct. And in sign-magnitude, -1's representation is 0x8000000000000001. So the macro's behavior is undefined wrt the C standard. It will work fine in Java, where 2's complement representation is required. – Gene Aug 11 '15 at 01:21
  • I think you both misread what I was saying. I was talking about applying the macro to any unsigned type. In C, conversion is done by value and not by representation. `-1` converted to any unsigned type is always the maximum value of that unsigned type, and that in turn is the value with all bits 1. This is not a question of the platform and even less of the representation of the signed types. – Jens Gustedt Aug 11 '15 at 06:28
  • Even if this was not faster, it could still be useful in crypto for reducing timing attacks. – technosaurus Aug 11 '15 at 21:01
  • @barakmanos, except that the subtraction in your macro `MIN` can overflow, e.g. with `MIN(INT_MAX, -1)`. There are branchless definitions of the minimum which avoid this subtraction (using e.g. the expression `a < b` as an integer value). – Maëlan Oct 02 '19 at 21:49

1 Answer


Sure, this is portable: the `!` operator is guaranteed to give either 0 or 1 as a result, and that value is then promoted to whatever type is needed by the other operand.

As others observed, your if/else version has the disadvantage of evaluating `f` twice, but you already know that, and if there are no side effects you are fine.

What surprises me is that you say that this is faster. I would have thought that modern compilers perform that sort of optimization themselves.

Edit: So I tested this with two compilers (gcc and clang) and both values of the BRANCHLESS configuration.

In fact, if you don't forget to set -DNDEBUG=1, the BRANCHLESS=0 version with ?: is much better for gcc and does what I would have expected it to do: it basically uses conditional moves to keep the loop branchless. In that case clang doesn't find this sort of optimization and emits conditional jumps.

For the arithmetic version, gcc's performance worsens. Seeing what it generates, this is not surprising: it really uses imul instructions, and these are slow. clang fares better here; it has optimized the multiplications out and replaced them with conditional moves.

So to summarize: yes, this is portable, but whether it brings a performance improvement or a regression will depend on your compiler, its version, the compile flags you are applying, the capabilities of your processor, and so on.

Jens Gustedt
  • I would have thought so too. What compiler(s) do you use? I've been using Clang. I just updated things above to include a full working C11 example program that clocks itself, if you'd like to try it on your own system. – Todd Lehman Aug 08 '15 at 21:20
  • Hey, cool, interesting results with gcc vs. clang. Yeah, wow, this really depends on the compiler and the target CPU. Benchmarking/profiling is critical here. In my case, the modular multiplication is a bottleneck in one part of my code, so I need it to be as fast as possible, so a 2x speedup is worth it...but this certainly won't always be the case. – Todd Lehman Aug 08 '15 at 22:04
  • BTW, I rewrote the macros to replace the multiplication (`*!!`) with a bitwise logical and (`&-!!`) and now the *if* seems to run the same speed for me with gcc whether it's branching or branchless. – Todd Lehman Aug 09 '15 at 00:29