Helping the compiler to optimize branchy code sequences

Question

I have code sequences in C/C++ that contain lots of branches, something like this:

if( condition1 )
    return true;
if( condition2 )
    return true;
...
return false;

(which is equivalent to return condition1 || condition2 || ...;)

Evaluating each of the conditions requires several memory accesses (all read-only), but the compiler misses an important optimization opportunity: it never moves the memory accesses of a later condition ahead of the evaluation of an earlier one. The reason is that condition2's memory accesses might segfault when condition1 is true. I know that they don't, and I would like the compiler to do the sensible thing and interleave some of these code sequences where that helps performance, e.g. to exploit instruction-level parallelism. I also don't want to change the condition to a logical or (not short-circuit), because one of the branches will likely jump out.

Any ideas on how this can be accomplished (preferably using gcc)?

Thanks.

Maybe I'm misunderstanding something here, but why aren't you simply using `else if (condition2)`, if you don't want it to be evaluated if condition1 is true? — reko_t, Jun 09 '11 at 08:48
Rewrite your code to remove these branches and use a more Object-Oriented solution? — RvdK, Jun 09 '11 at 08:48
Actually, even I thought of if-else, but that would not give instruction-level parallelism. — Nik, Jun 09 '11 at 08:49
@reko_t: does not change anything, `return` already guarantees that. @PoweRoy: bold comment, it might worsen the situation (OO abstraction are not meant to increase performance), do you have facts to back it up ? — Matthieu M., Jun 09 '11 at 08:55
@hans: could you show us a bit of how the booleans come to be ? it would help us experiment with your situation. — Matthieu M., Jun 09 '11 at 08:56
@jalf: The moment I read his comment I knew you'd be all over that. — GManNickG, Jun 09 '11 at 09:11
@GMan: hey, I like OOP, in moderation ;) but it's rarely a good answer when looking for performance. — jalf, Jun 09 '11 at 09:14
@jalf: true, OOP used well typically answers other questions - maintainability, scalability, extensibility - without any impact on performance either way — Tony Delroy, Jun 09 '11 at 09:20
Considering the existence of speculative execution, I'm not so sure that the compiler is missing something. Could you be more explicit about how the conditions relate, for instance by giving an example? — AProgrammer, Jun 09 '11 at 09:21
How do you know GCC is "missing" the opportunity? Its optimizer does pick out many things to do with ordering: I've had such cases before, and what came out was quite different from what went in. — edA-qa mort-ora-y, Jun 09 '11 at 09:23
To optimize performance you should indeed not look to OOP in general, but maybe it could remove problems later on. If after some time it has to check for more conditions, that _could_ decrease performance. It just smells fishy currently --> "If condition1 is true it could segfault on condition2" — RvdK, Jun 09 '11 at 09:27
@PoweRoy: no, that's short circuiting. Here's another example: `if (vec.size() >= 3 && vec[2] == 14)` -- the language is *designed* to work like this. If that scares you, you really need to learn to use your tools correctly. — jalf, Jun 09 '11 at 09:30
How is that example related to the topic? The two checks in your if statement prevent the code from throwing an exception based on the same object (vec). If, in the topic, condition1 and condition2 are completely unrelated but one segfaults if the other is true, it just doesn't sound right. This is of course based on the example given, which doesn't show how the statements are used. — RvdK, Jun 09 '11 at 10:41
@PoweRoy: completely out of character, but to be pedantic (and pursue something of dubious relevance to the question), `vec[2]` where `size() < 3` would exhibit undefined behaviour - only `at()` is specified as throwing ;-P — Tony Delroy, Jun 09 '11 at 14:51
Note: *"Helping the compiler to optimize [...]"* will ***always*** be compiler dependent. And generally dependent on other features of the platform you're running on. — dmckee --- ex-moderator kitten, Jun 09 '11 at 15:31
@PoweRoy Yes, you're right, he should use object oriented code, preferably with virtual methods and dynamically allocated objects. That will give him a definite performance boost. — Christian Rau, Jun 10 '11 at 07:12

Tony Delroy · Answer 1 · 2011-06-10T06:44:42.687

"Evaluating each of the conditions requires several memory accesses"

Why don't you avoid short-circuit evaluation within individual conditions, but let it happen for the or-ing of conditions?

Using operators that are not short-circuit for built-in types

Exactly how you achieve the former depends on the nature of those conditions (i.e. condition1, condition2 in your code). Given that you show nothing about them, I can only talk in generalities: where they internally contain short-circuit operators, convert the boolean values to an integer representation instead and use e.g. bitwise-OR (or even '+' or '*' if it reads better and works in your particular usage). The bitwise operators are generally the safer choice, because they have lower precedence than the comparison operators, so an expression like a > 4 & b == 2 still groups as intended, whereas '+' and '*' bind more tightly than comparisons and need parentheses. You only have to be careful if your conditions already include bitwise operators.

To illustrate:

OLD: return (a > 4 && b == 2 && c < a) ||   // condition1
            (a == 3 && b != 2 && c == -a);  // condition2

NEW: return (a > 4 & b == 2 & c < a) ||
            (a == 3 & b != 2 & c == -a);

Also be careful if you used implicit conversion of numbers/pointers to bool before... you want to normalise them to bools so their least-significant bits reflect their boolean significance:

OLD: return my_int && my_point && !my_double;
NEW: return bool(my_int) & bool(my_point) & !my_double;  // ! normalises before bitwise-&
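As a minimal sketch of why the normalisation matters (the function names here are hypothetical and only for illustration): two values can both be "true" yet share no set bits, so a raw bitwise-& of them is 0.

```cpp
#include <cassert>

// Hypothetical helpers, not taken from the original post.

// Short-circuit version: true when both operands are nonzero / non-null.
bool both_set_logical(int my_int, const void* my_pointer) {
    return my_int && my_pointer;
}

// Bitwise version: each operand is first converted to bool (0 or 1),
// so & only ever combines the normalised truth values.
bool both_set_bitwise(int my_int, const void* my_pointer) {
    return bool(my_int) & bool(my_pointer);
}

// Without normalisation: 2 and 1 are both "true" as conditions,
// but 2 & 1 == 0, so the result is false.
bool both_set_unnormalised(int x, int y) {
    return x & y;
}

void self_check() {
    int target = 0;
    assert(both_set_logical(2, &target) == both_set_bitwise(2, &target));
    assert(!both_set_unnormalised(2, 1));
}
```

The normalised bitwise form agrees with the short-circuit form for every input; the unnormalised one does not.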

You might also want to benchmark with...

     bool condition1 = a > 4 & b == 2 & c < a;
     bool condition2 = a == 3 & b != 2 & c == -a;
     return condition1 || condition2;

...which might be faster - possibly only in the overall "return false" case and perhaps when the last conditionN or two are the deciding factor in a "return true".
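Before benchmarking, it may be worth confirming that the rewritten form really returns the same results. The sketch below wraps the two variants from this answer in functions (the function names and the brute-force range are my own assumptions) and compares them over a small input range.

```cpp
#include <cassert>

// Original form: short-circuit inside and between the conditions.
bool check_short_circuit(int a, int b, int c) {
    return (a > 4 && b == 2 && c < a) ||
           (a == 3 && b != 2 && c == -a);
}

// Bitwise & inside each condition, short-circuit || between them.
// Every comparison here is side-effect free, so evaluating all of
// them cannot change the result.
bool check_bitwise(int a, int b, int c) {
    bool condition1 = (a > 4) & (b == 2) & (c < a);
    bool condition2 = (a == 3) & (b != 2) & (c == -a);
    return condition1 || condition2;
}

// Brute-force comparison of both variants over [lo, hi]^3.
bool variants_agree(int lo, int hi) {
    for (int a = lo; a <= hi; ++a)
        for (int b = lo; b <= hi; ++b)
            for (int c = lo; c <= hi; ++c)
                if (check_short_circuit(a, b, c) != check_bitwise(a, b, c))
                    return false;
    return true;
}

void self_check() {
    assert(variants_agree(-6, 6));
}
```

This only checks equivalence, not speed; whether either variant is faster still has to be measured on the target compiler and CPU.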

User-defined operators avoid short-circuit evaluation

Separately, short-circuit evaluation is disabled for objects with overloaded logical operators, which provides another avenue for you to do your checks using the existing notation, but you'll have to change or enhance your data types.
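A small sketch of that effect, using a hypothetical wrapper type Flag and a call counter that are not part of the original post: an overloaded operator&& is an ordinary function call, so both operands are evaluated before it runs.

```cpp
#include <cassert>

// Hypothetical wrapper type, introduced only for illustration.
struct Flag {
    bool value;
};

// User-defined operator&&: both arguments are evaluated before the call.
Flag operator&&(Flag lhs, Flag rhs) {
    return Flag{lhs.value && rhs.value};
}

// Produces a Flag and counts the call, making evaluation observable.
Flag counted(bool v, int& calls) {
    ++calls;
    return Flag{v};
}

// Number of operands evaluated for "false && true" with the overload.
int operands_evaluated_overloaded() {
    int calls = 0;
    Flag r = counted(false, calls) && counted(true, calls);
    (void)r;
    return calls;
}

// Same expression with the built-in && on plain bools.
int operands_evaluated_builtin() {
    int calls = 0;
    bool r = counted(false, calls).value && counted(true, calls).value;
    (void)r;
    return calls;
}

void self_check() {
    assert(operands_evaluated_overloaded() == 2);  // no short-circuit
    assert(operands_evaluated_builtin() == 1);     // short-circuit
}
```

Whether wrapping conditions in such a type actually helps the optimiser is a separate question that only a benchmark can answer.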

Thoughts

More generally, you'll only benefit from this if you've a large number of assertions combined in each individual condition - more so if the function tends to run through to return false.

"AProgrammer" makes a great point too - with speculative execution available on modern CPUs, the CPU may already be ahead of the ordering implied by short-circuit evaluation (in some special mode that either avoids or suppresses any memory faults from dereferencing invalid pointers, dividing by 0 etc). So, it's possible that the entire attempt at optimisation may prove pointless or even counterproductive. Benchmarking of all alternatives is required.

score 5 · Answer 2 · answered Jun 09 '11 at 08:50

Could you just move the parts of the condition yourself?

ie

 const bool bCondition1Result = <condition1>;
 const bool bCondition2Result = <condition2>;

and so on

For better optimisation still ... re-jig the order of your conditions so that the most hit one is the first to be checked. This way it will early out more often than not (This may make precious little difference).
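A concrete sketch of the idea, assuming a hypothetical Record type whose two pointers are always safe to dereference (that assumption is what makes reading both up front legal):

```cpp
#include <cassert>

// Hypothetical data layout; both pointers are assumed to be valid
// for every Record passed to the functions below.
struct Record {
    const int* first;
    const int* second;
};

// Branchy original: *second is only read when the first test fails.
bool matches_branchy(const Record& r) {
    if (*r.first > 4)
        return true;
    if (*r.second == 3)
        return true;
    return false;
}

// Hand-hoisted version: both loads appear before any branch, so the
// compiler is free to schedule them together.
bool matches_hoisted(const Record& r) {
    const bool bCondition1Result = *r.first > 4;
    const bool bCondition2Result = *r.second == 3;
    return bCondition1Result || bCondition2Result;
}

void self_check() {
    const int a = 5, b = 3, c = 0;
    assert(matches_hoisted(Record{&a, &c}) == matches_branchy(Record{&a, &c}));
    assert(matches_hoisted(Record{&c, &b}) == matches_branchy(Record{&c, &b}));
    assert(matches_hoisted(Record{&c, &c}) == matches_branchy(Record{&c, &c}));
}
```

The hoisted version always performs both reads, so it trades the early exit for freedom in scheduling; which one wins depends on how often the first condition is true.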

Goz

+1: evaluating each condition separately like that invites parallelism across (but not within) conditionN terms, though if the optimiser's not really clued in it may result in evaluation of all the conditionN terms even though an early one is enough to determine the function's return value. Would be interesting to hear hans's feedback if he benchmarks it. – Tony Delroy Jun 09 '11 at 09:04

score 2 · Answer 3 · answered Jun 10 '11 at 04:59

Take a look at the __builtin_expect function provided by gcc. When one defines likely / unlikely macros as is done in the Linux kernel, these can be used intuitively with little impact on code readability.
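A minimal sketch of such macros, assuming GCC or Clang. The upper-case names are my own choice, made to avoid clashing with the C++20 [[likely]] / [[unlikely]] attributes, and the function and its conditions are hypothetical.

```cpp
#include <cassert>

// Same shape as the Linux kernel's likely()/unlikely() macros.
// !!(x) normalises any truthy value to exactly 1 before the hint.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical conditions: the first is expected to be true most of
// the time, the second rarely.
bool any_condition(int a, int b, int c) {
    if (LIKELY(a > 4))
        return true;
    if (UNLIKELY(b == 3))
        return true;
    return c < 0;
}

void self_check() {
    // The hints never change the result, only how the compiler is
    // encouraged to lay out the branches.
    assert(any_condition(5, 0, 0));
    assert(!any_condition(0, 0, 0));
}
```

Note that these hints tell the compiler which way a branch usually goes; they do not by themselves permit it to move memory accesses ahead of an earlier condition.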

jreitz

score 1 · Answer 4 · answered Jun 10 '11 at 07:18

Don't bother. The CPU already has out-of-order execution, speculative execution and branch prediction. It's pretty unlikely that a change at this level would make any difference whatsoever. Instruction-level parallelism is done implicitly by the CPU, not explicitly by the compiler. Perhaps GCC didn't do anything because there's nothing to be gained.

On that note, you'd have to have one hell of a condition to make a difference in the running time of a non-trivial application.

Oh, and logical or is Standard-guaranteed to be short circuit.
