is it safe to relegate __builtin_expect to an inline function?

Question

I'm working on some C++ code which defines

#define LIKELY(x)   (__builtin_expect((x), 1))

and I was wondering - why not an inline function? i.e. why not

template <typename T> inline T likely(T x) { return __builtin_expect((x), 1); }

(or maybe

inline int likely(int x) { return __builtin_expect((x), 1); }

since x is supposed to be the result of some condition check)

The macro and the function should do basically the same thing, right? But then I got to wondering: maybe it's because of __builtin_expect... could it be that it works differently when inside an inline helper function?

Note that `__builtin_expect` is only applicable to integral types so you can (and should) pass / return by-value. — 5gon12eder, Dec 18 '15 at 21:27

score 7 · Answer 1 · answered Dec 23 '15 at 02:41

Stay with the tried and trusted macros, even if we all know that macros are to be avoided in general. In my experiments, the inline function wrappers simply did not work: the hint was lost. Alternatively – especially if you are using GCC – forget __builtin_expect altogether and use profile-guided optimization (PGO) with actual profiling data instead.

__builtin_expect is quite special in that it doesn't actually “do” anything but merely tells the compiler which branch will most likely be taken. If you use the built-in in a context that is not a branching condition, the compiler has to propagate this information along with the value. Intuitively, I would have expected this to happen. Interestingly, the documentation of GCC and Clang is not very explicit about this. However, my experiments show that Clang does not propagate this information. As for GCC, I have yet to find a program where it actually pays attention to the built-in, so I cannot tell for sure. (Or, in other words, it doesn't matter anyway.)
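For reference, here is a minimal sketch of the built-in used directly in a branching condition, which is the context it was designed for. The function name count_above is hypothetical and not part of the benchmark; the code assumes GCC or Clang.

```cpp
#include <cstddef>

// Counts the elements of `data` that are greater than `threshold`.
// `count_above` is a hypothetical name used only for this illustration.
std::size_t
count_above(const int *const data, const std::size_t n, const int threshold)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < n; ++i)
    {
      // __builtin_expect(value, expected) returns `value` unchanged.  The
      // second argument only states which value we expect most of the time.
      if (__builtin_expect(data[i] > threshold, 1))
        ++count;
    }
  return count;
}
```

The result is the same with or without the hint; only the layout of the generated code may differ.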

I have tested the following function.

std::size_t
do_computation(std::vector<int>& numbers,
               const int base_threshold,
               const int margin,
               std::mt19937& rndeng,
               std::size_t *const hitsptr)
{
  assert(base_threshold >= margin && base_threshold <= INT_MAX - margin);
  assert(margin > 0);
  benchmark::clobber_memory(numbers.data());
  const auto jitter = make_jitter(margin - 1, rndeng);
  const auto threshold = base_threshold + jitter;
  auto count = std::size_t {};
  for (auto& x : numbers)
    {
      if (LIKELY(x > threshold))
        {
          ++count;
        }
      else
        {
          x += (1 - (x & 2));
        }
    }
  benchmark::clobber_memory(numbers.data());
  // My benchmarking framework swallows the return value so this trick with
  // the pointer was needed to get out the result.  It should have no effect
  // on the measurement.
  if (hitsptr != nullptr)
    *hitsptr += count;
  return count;
}

make_jitter simply returns a random integer in the range [−m, m] where m is its first argument.

int
make_jitter(const int margin, std::mt19937& rndeng)
{
  auto rnddist = std::uniform_int_distribution<int> {-margin, margin};
  return rnddist(rndeng);
}

benchmark::clobber_memory is a no-op that prevents the compiler from optimizing away the modifications of the vector's data. It is implemented like this.

inline void
clobber_memory(void *const p) noexcept
{
  asm volatile ("" : : "rm"(p) : "memory");
}

The declaration of do_computation was annotated with __attribute__ ((hot)). It turned out that this has a large influence on how many optimizations the compiler applies.

The code for do_computation was crafted such that either branch had comparable cost, giving slightly more cost to the case where the expectation was not met. It was also made sure that the compiler would not generate a vectorized loop for which branching would be immaterial.

For the benchmark, a vector numbers of 100 000 000 random integers from the range [0, INT_MAX] and a random base_threshold from the interval [0, INT_MAX − margin] (with margin set to 100) were generated with a non-deterministically seeded pseudo-random number generator. do_computation(numbers, base_threshold, margin, …) (compiled in a separate translation unit) was called four times and the execution time of each run was measured. The result of the first run was discarded to eliminate cold-cache effects. The average and standard deviation of the remaining runs were plotted against the hit-rate (the relative frequency with which the LIKELY annotation was correct). The “jitter” was added so that the four runs would not all produce the same outcome (otherwise, I'd be afraid of too-smart compilers) while still keeping the hit-rates essentially fixed. 100 data points were collected in this way.
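The measurement procedure described above (several runs, first one discarded, mean and standard deviation of the rest) could be sketched roughly as follows. This is not the benchmarking framework actually used; the names timing_result and time_runs are hypothetical, and the use of std::chrono::steady_clock (wall-clock time rather than CPU time) is an assumption.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: statistics over the runs that were kept.
struct timing_result
{
  double mean;    // average duration in seconds
  double stddev;  // sample standard deviation in seconds
};

// Calls `func` `runs` times, discards the first `warmup` runs and returns
// the mean and sample standard deviation of the remaining durations.
// Assumption: wall-clock time via steady_clock stands in for CPU time.
template <typename FuncT>
timing_result
time_runs(FuncT&& func, const std::size_t runs = 4, const std::size_t warmup = 1)
{
  using clock = std::chrono::steady_clock;
  std::vector<double> samples;
  for (std::size_t i = 0; i < runs; ++i)
    {
      const auto t0 = clock::now();
      func();
      const auto t1 = clock::now();
      if (i >= warmup)  // skip the cold-cache runs
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
  if (samples.empty())
    return {0.0, 0.0};
  double sum = 0.0;
  for (const double s : samples)
    sum += s;
  const double mean = sum / samples.size();
  double sqsum = 0.0;
  for (const double s : samples)
    sqsum += (s - mean) * (s - mean);
  const double stddev = (samples.size() > 1)
    ? std::sqrt(sqsum / (samples.size() - 1))
    : 0.0;
  return {mean, stddev};
}
```

With the defaults, a call such as time_runs([&]{ do_computation(numbers, base_threshold, margin, rndeng, &hits); }) would execute the function four times and report statistics over the last three, where hits is a std::size_t supplied by the caller.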

I have compiled three different versions of the program with both GCC 5.3.0 and Clang 3.7.0 passing them the -DNDEBUG, -O3 and -std=c++14 flags. The versions differ only in the way LIKELY is defined.

// 1st version
#define LIKELY(X) static_cast<bool>(X)

// 2nd version
#define LIKELY(X) __builtin_expect(static_cast<bool>(X), true)

// 3rd version
inline bool
LIKELY(const bool x) noexcept
{
  return __builtin_expect(x, true);
}

Although these are conceptually three different versions, I have compared 1st versus 2nd and 1st versus 3rd. The data for 1st was therefore essentially collected twice. 2nd and 3rd are referred to as “hinted” in the plots.

The horizontal axis of the following plots shows the hit-rate for the LIKELY annotation and the vertical axis shows the averaged CPU time per iteration of the loop.

Here is the plot for 1st versus 2nd.

As you can see, GCC effectively ignores the hint, producing equally performing code regardless of whether the hint was given or not. Clang, on the other hand, clearly pays attention to the hint. If the hit-rate drops low (i.e., the hint was wrong), the code is penalized, but for high hit-rates (i.e., the hint was good), the code outperforms the one generated by GCC.

In case you are wondering about the hill-shaped nature of the curve: that's the hardware branch predictor at work! It has nothing to do with the compiler. Also note how this effect completely dwarfs the effects of the __builtin_expect, which might be a reason for not worrying too much about it.

In contrast, here is the plot for 1st versus 3rd.

Both compilers produce code that performs essentially equally with and without the hint. For GCC, this doesn't say much, but as far as Clang is concerned, the __builtin_expect doesn't seem to be taken into account when wrapped in a function, which makes it lose against GCC for all hit-rates.

So, in conclusion, don't use functions as wrappers. If the macro is written correctly, it is not dangerous (apart from polluting the namespace). __builtin_expect already behaves (at least as far as evaluation of its arguments is concerned) like a function. Wrapping a function call in a macro has no surprising effects on the evaluation of its argument.
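As an illustration of what “written correctly” could look like, here is a sketch, not a definitive implementation. The names UNLIKELY and safe_div are hypothetical additions, and the fallback branch for other compilers is an assumption. The whole expansion is parenthesized, the argument is evaluated exactly once, and the static_cast<bool> normalizes pointers and arbitrary integers to 0 or 1 so that they compare correctly against the expected value.

```cpp
// Hypothetical wrapper macros; LIKELY follows the naming in the question.
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(X)   (__builtin_expect(static_cast<bool>(X), true))
#  define UNLIKELY(X) (__builtin_expect(static_cast<bool>(X), false))
#else
// Assumed fallback: without the built-in, just evaluate the condition.
#  define LIKELY(X)   (static_cast<bool>(X))
#  define UNLIKELY(X) (static_cast<bool>(X))
#endif

// Hypothetical example: division by zero is the rare case.
inline int
safe_div(const int a, const int b, const int fallback)
{
  if (UNLIKELY(b == 0))
    return fallback;
  return a / b;
}
```

Because the macro expands to a single call of the built-in, an argument with side effects, such as LIKELY(++i > 0), is still evaluated only once.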

I realize that this wasn't your question so I'll keep it short but in general, prefer collecting actual profiling data over guessing likely branches by hand. The data will be more accurate and GCC will pay more attention to it.

nwellnhof · Answer 2 · 2015-12-18T21:25:48.867

1

It's not guaranteed that a compiler inlines an inline function. Most modern compilers treat the inline keyword only as a hint. If you force inlining using __attribute__((always_inline)) with GCC (or __forceinline with MSVC), it shouldn't matter whether you use an inline function or a macro (but even __forceinline may not work). Otherwise, it's possible that the function won't be inlined. For example, GCC doesn't inline functions with optimization turned off. In this case, the resulting code will be considerably slower. I'd stick with the macro to be on the safe side.
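A minimal sketch of the forced-inline variant mentioned above, using the GCC/Clang attribute spelling. The names likely and clamp_nonneg are only illustrative. Forced inlining guarantees only that the call itself disappears; whether the hint still reaches the branch after inlining depends on the compiler, so it is worth measuring rather than assuming.

```cpp
// GCC/Clang only.  MSVC would use __forceinline instead, and it does not
// provide __builtin_expect at all.
__attribute__((always_inline)) inline bool
likely(const bool x) noexcept
{
  return __builtin_expect(x, true);
}

// Hypothetical caller: the wrapper is used directly as the branch condition.
inline int
clamp_nonneg(const int v)
{
  if (likely(v >= 0))
    return v;
  return 0;
}
```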

edited Dec 18 '15 at 21:25

answered Dec 18 '15 at 21:19

nwellnhof



Well, if optimization is turned off, then it doesn't really matter, I mean `LIKELY()` is just an optimization hint, after all. Still, point taken. – einpoklum Dec 18 '15 at 23:12
But you've not indicated GCC or clang might fail to inline functions with `__attribute__((always_inline))` - or even with just `inline` - when inlining makes sense. Does that actually happen? With optimizations turned on? – einpoklum Dec 18 '15 at 23:15
@einpoklum With optimizations turned on, such a short function will most likely always be inlined. – nwellnhof Dec 19 '15 at 15:53
So you're saying it is actually pretty safe to relegate this to a function? – einpoklum Dec 19 '15 at 16:13
