3

I'm profiling my code and have optimized everything I could; it comes down to a function which looks something like this:

double func(double a, double b, double c, double d, int i){
    if(i > 10 && a > b || i < 11 && a < b)
        return abs(a-b)/c;
    else
        return d/c;
}

It is called millions of times during the run of the program, and the profiler shows me that ~80% of all time is spent on calling abs().

  1. I replaced abs() with fabs() and it gave about a 10% speed-up, which doesn't make much sense to me, as I've heard multiple times that they are identical for floating-point numbers and that abs() should always be used. Is that untrue, or am I missing something?

  2. What would be the quickest way to evaluate the absolute value of a double, which could further improve the performance?

If that matters, I use g++ on Linux x86_64.

sashkello
  • 17,306
  • 24
  • 81
  • 109

4 Answers

6

Do all 3 computations. Stick the results in a 3-element array. Use non-branching arithmetic to find the correct array index. Return that result.

I.e.,

bool icheck = i > 10;
bool zero = icheck & (a > b);    // first case:  i > 10 and a > b
bool one  = !icheck & (b > a);   // second case: i < 11 and a < b
bool two  = !zero & !one;        // fallback case
int idx = one | (two << 1);      // maps the flags to index 0, 1 or 2
return val[idx];

Where val holds the results of the three computations. The use of & instead of && is important: && would short-circuit and reintroduce a branch.

This removes your branch prediction problems. Finally, make sure the looping code can see the implementation, so the call overhead can be eliminated.
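
For concreteness, here is one way the whole thing might look with val filled in. The array contents are my reading of the three cases in the original condition, so treat it as a sketch to benchmark rather than the definitive version:

inline double func(double a, double b, double c, double d, int i){
    // all three possible results, computed unconditionally
    double val[3] = { (a - b) / c,   // i > 10 and a > b
                      (b - a) / c,   // i < 11 and a < b
                      d / c };       // everything else
    bool icheck = i > 10;
    bool zero = icheck & (a > b);
    bool one  = !icheck & (b > a);
    bool two  = !zero & !one;
    int idx = one | (two << 1);      // 0, 1 or 2, without branching
    return val[idx];
}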

Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524
  • 1
    I am curious: branch prediction issues come up relatively often, and this "non-branching" selection trick is classic, yet it seems that compilers do not optimize into it even in this (admittedly) easy case. – Matthieu M. May 23 '13 at 06:21
  • To be fair, the non-optimized version could be faster if the branches are predictable. I would not trust the code above to be faster unless I tested it: I am doing 2-3 times as much work! – Yakk - Adam Nevraumont May 23 '13 at 08:58
  • Yes, possibly, but that's exactly where you'd like a compiler to step in and optimize according to the architecture targeted etc... – Matthieu M. May 23 '13 at 10:59
  • @MatthieuM. it's not just about the targeted architecture - the compiler may have no idea whether the inputs are being fed in such an order that `i > 10 && a > b || i < 11 && a < b` groups them into four (or even two) such that branch prediction works near-perfectly, or the inputs have effectively random relationships to the test. This is a case where run-time measurements and even code self-modification might help. – Tony Delroy May 26 '13 at 07:47
  • Thanks for this answer, it seems like a very interesting approach. I have one doubt, are you sure `int idx = one & (two << 1);` is correct? It's possible that I'm missing something but I believe this will always be 0 (my guess is that when both are true it would result in 01&10). It would work with an `or` or maybe even a `xor`. – llonesmiz Jun 23 '13 at 07:26
4

Interesting question.

double func(double a, double b, double c, double d, int i){
    if(i > 10 && a > b || i < 11 && a < b)
        return abs(a-b)/c;
    else
        return d/c;
}

First thoughts are that:

  • where's the "inline" qualifier?
  • there's lots of potential for branch misprediction, and
  • lots of short-circuit boolean evaluation.

I'm going to assume a is never equal to b - my gut instinct is that there's a 50% chance that's true of your data set, and it allows some interesting optimisations. If it's not true, then I've nothing to suggest that Yakk hasn't already.

double amb = a - b;
bool altb = a < b;                          // or signbit(amb) if it proves faster for you
double abs_amb = (1 - (altb << 1)) * amb;   // multiplies by +1 or -1, no branch
bool use_amb = (i > 10) != altb;            // the original condition, given a != b
return (use_amb * abs_amb + !use_amb * d) / c;

One of the aims I was mindful of when structuring the work was to permit some concurrency in a CPU execution pipeline; this could be illustrated like this:

amb    altb    i > 10
   \  /    \     /
  abs_amb  use_amb
        \  /      \
 use_amb*abs_amb  !use_amb*d
             \    /
              + /c
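
Pulled together as a complete function (adding the inline qualifier mentioned in the first bullet above; the a != b assumption still applies), this might look like the following sketch:

inline double func(double a, double b, double c, double d, int i){
    double amb = a - b;
    bool altb = a < b;
    double abs_amb = (1 - (altb << 1)) * amb;   // +amb or -amb, no branch
    bool use_amb = (i > 10) != altb;            // when true, use abs(a-b); otherwise use d
    return (use_amb * abs_amb + !use_amb * d) / c;
}
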
Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
  • +1 Thanks, that really helped! However, I'm accepting Yakk's answer as the idea is similar and he was first. – sashkello May 23 '13 at 04:48
  • Would be curious to know how much of an improvement this produced. Thanks. – c-urchin May 23 '13 at 17:08
  • @c-urchin, I didn't test the exact situation, my code is a bit more complex, but after using this idea the improvement is ~20% for me. – sashkello May 23 '13 at 23:01
1

Have you tried unrolling the if like so:

double func(double a, double b, double c, double d, int i){
    if(i > 10 && a > b)
        return (a-b)/c;
    if (i < 11 && a < b)
        return (b-a)/c;
    return d/c;
}
Robert McKee
  • 21,305
  • 1
  • 43
  • 57
  • I tried this. It seems to come out slower than either `std::abs` or `fabs`, in VS2010 =/ – paddy May 23 '13 at 02:22
  • Since in my case a > b happens more often than a < b, this is actually speeding things up (I don't see another reason). This is quite surprising anyway, because a comparison operator shouldn't be much different in terms of speed from abs(), should it? – sashkello May 23 '13 at 02:28
  • Well, you are already doing the comparison, so in the first case it eliminates the need to do the abs or fabs at all. – Robert McKee May 23 '13 at 06:01
0

I would look at the assembly produced by calling fabs(). It could be the overhead of a function call; if so, replace it with an inlined solution. If it's really the absolute-value computation itself that's expensive, try a bitwise AND (&) with a bitmask that is 1 everywhere except for the sign bit. I doubt that this would be cheaper than what the standard library vendor's fabs() generates, though.
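
A minimal sketch of that bitmask idea, assuming C++ (the name fabs_bits is mine, and whether it actually beats the library's fabs() is something you would have to measure):

#include <cstdint>
#include <cstring>

inline double fabs_bits(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);    // reinterpret the 64-bit pattern
    bits &= 0x7FFFFFFFFFFFFFFFull;          // clear the IEEE 754 sign bit
    std::memcpy(&x, &bits, sizeof x);
    return x;
}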