Influencing branchiness when branch behaviour is known

Question

Before I begin, yes, I'm aware of the compiler built-ins __builtin_expect and __builtin_unpredictable (Clang). They do solve the issue to some extent, but my question is about something neither completely solves.

As a very simple example, suppose we have the following code.

void foo(unsigned int value); /* assumed signature; defined elsewhere */

void highly_contrived_example(unsigned int * numbers, unsigned int count) {
    unsigned int * const end = numbers + count;
    for (unsigned int * iterator = numbers; iterator != end; ++ iterator)
        foo(* iterator % 2 == 0 ? 420 : 69);
}

Nothing complicated at all. Just calls foo() with 420 whenever the current number is even, and with 69 when it isn't.

Suppose, however, that it is known ahead of time that the data is guaranteed to look a certain way. For example, if it were always random, then a conditional select (csel (ARM), cmov (x86), etc) possibly would be better than a branch.⁰ If it were always in highly predictable patterns (e.g. always a lengthy stream of evens/odds before a lengthy stream of the other, and so on), then a branch would be better.⁰ __builtin_expect would not really solve the issue if the number of evens/odds were about equal, and I'm not sure whether the absence of __builtin_unpredictable would influence branchiness (plus, it's Clang-only).
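To make the "no branch" side concrete, here is a minimal sketch of selecting the value through arithmetic instead of a conditional. The helper name `select_value` is hypothetical and not part of the original example. This only removes the branch at the source level: the compiler remains free to emit whatever it likes, including a branch, so the generated assembly still needs checking.

```c
/* Hypothetical helper: picks 420 for even inputs and 69 for odd inputs
 * without any conditional in the source. The compiler may still choose
 * a branch, a conditional select, or plain arithmetic. */
unsigned int select_value(unsigned int n) {
    unsigned int odd = n & 1u;            /* 1 if n is odd, 0 if n is even */
    return 420u - odd * (420u - 69u);     /* 420 when even, 69 when odd */
}
```

In the loop from the example, the call would become `foo(select_value(* iterator))`.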

My current "solution" is to lie to the compiler. In the predictable case, I use __builtin_expect to mark one side (whichever) as highly likely, which nudges the compiler towards generating a branch (for simple cases like this, all it seems to do is reorder the comparison to suit the expected outcome). In the unpredictable case, I use __builtin_unpredictable to nudge it towards not generating a branch, if possible.¹ Either that or inline assembly. That's always fun to use.
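For reference, a rough sketch of what the two hinted variants might look like. The macro names `LIKELY` and `UNPREDICTABLE`, the function names, and the stand-in `foo` are all assumptions made for illustration. The hints are only requests: neither guarantees a branch or a conditional select in the output.

```c
#include <stddef.h>

/* Fallback so the check below also compiles where __has_builtin is missing. */
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

/* Clang-only hint; expands to the bare condition elsewhere. */
#if __has_builtin(__builtin_unpredictable)
#  define UNPREDICTABLE(c) __builtin_unpredictable(!!(c))
#else
#  define UNPREDICTABLE(c) (!!(c))
#endif

/* GCC and Clang hint; expands to the bare condition elsewhere. */
#if defined(__GNUC__)
#  define LIKELY(c) __builtin_expect(!!(c), 1)
#else
#  define LIKELY(c) (!!(c))
#endif

/* Hypothetical stand-in for foo(): it only accumulates its arguments. */
static unsigned long foo_total = 0;
static void foo(unsigned int value) { foo_total += value; }

/* Predictable data: "lie" that the even side is likely, hoping for a branch. */
void predictable_variant(const unsigned int *numbers, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        if (LIKELY(numbers[i] % 2 == 0))
            foo(420);
        else
            foo(69);
    }
}

/* Random data: mark the condition unpredictable, hoping for csel/cmov. */
void unpredictable_variant(const unsigned int *numbers, size_t count) {
    for (size_t i = 0; i < count; ++i)
        foo(UNPREDICTABLE(numbers[i] % 2 == 0) ? 420 : 69);
}
```

Both variants compute the same result. Whether they actually differ can only be seen by comparing the generated assembly for the target compiler and optimization level.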

⁰ Although I have not actually done any benchmarks, I'm aware that even using a branch may not necessarily be faster than a conditional select for the given example. The example is only for illustrative purposes, and may not actually exhibit the problem described.
¹ Modern compilers are smart. More often than not, they can determine reasonably well which approach to actually use. My question is for the niche cases in which they cannot reasonably figure that out, and in which the performance difference actually matters.

I feel like this is impossible to answer in general. Modern branch predictors are pretty smart and will quickly pick up on patterns like alternating true and false and then a lot of true and then a lot of false... — Nelfeal, Oct 14 '22 at 17:35
I'm well aware of that. Half of my question is about the case where it is known that the input is totally random and patternless, and therefore inherently not very predictable. — Mona the Monad, Oct 14 '22 at 17:37
For guaranteed patterns, I would just unroll them. For unpredictable patterns, I would look at the generated assembly with and without `__builtin_unpredictable`, and if it doesn't do what I want, write my own assembly code or branchless C code if possible (and verify that it performs better). — Nelfeal, Oct 14 '22 at 17:46
That's about what I had in mind, yeah. I was only wondering whether there would be a more deterministic solution that isn't up to the whims of the compiler (even if it were mostly correct, anyway). — Mona the Monad, Oct 14 '22 at 17:52
When you write C code or pretty much anything other than assembly, you don't write a program. You write a spec to give the compiler so that it can write a program. If you want complete control (which is what you need in this case, if you really want a more deterministic solution), I'm afraid you either have to write your own compiler or write assembly. — Nelfeal, Oct 14 '22 at 18:01
Which case are you worried about which existing hints can't describe well? The case where it's predictable but you don't know which side to favour, because either one might dominate in any given run, or it might be a nearly even mix in an easy pattern? (e.g. alternating or all-even then all-odd.) So you don't want to tell it that it's predictable with one case, because that could deoptimize the other? Usually it's not too bad, although the "fast" path might end up with no taken branches (just 1 not-taken) vs. 2 taken for the other. Instead of 1 taken each, with one also including a not-taken. — Peter Cordes, Oct 16 '22 at 03:20
But IDK, a compiler might inline more aggressively in the expected path, leading to unequal treatment. You should probably edit your question to be more specific about exactly what situation you're thinking of, and what kind of asm code-gen decisions you're hoping to hint the compiler towards or away from. — Peter Cordes, Oct 16 '22 at 03:22
