How to make a per-frame branch optimization-friendly?

Question

Suppose I have a main loop that updates different things per frame:

int currentFrame = frame % n;
if ( currentFrame == 0 )
{
   someVar = frame;
}
else if ( currentFrame == 1 )
{
   someOtherVar = x;
}
...
else if ( currentFrame == n - 1 )
{
   someMethod();
}

Can I make it more friendly for the branch predictor? Can the branch predictor determine that each block will be executed once every n frames? Is there a branch-oblivious alternative (doubtful, assume the blocks have different enough logic in them)?

Note that with full optimizations on, a switch doesn't make much difference (if any).

Allow me to ask the obvious question: is this in a part of the code that will make a noticeable difference? — Mark Ransom, May 07 '14 at 13:36
@MarkRansom yes. If an alternative exists that is, which I doubt, but I'm hopeful for. — Luchian Grigore, May 07 '14 at 13:37
This seems like a subtle variation on the for/switch pattern, except that the frame "ends" in between. How does the main loop work? Would it be possible to unroll it by `n`? — harold, May 07 '14 at 13:39
Is `n` a known constant? If so, you can unroll the loop, although it would be ugly. — interjay, May 07 '14 at 13:39
@interjay cc. harold - can't unroll the loop, these need to happen per-frame (there's a main control loop running on a separate thread) — Luchian Grigore, May 07 '14 at 13:54
Interesting question. But without any source code example it's hard to give any answer, imo. What's the size of n? What code do you run in the different branches? Could you create an example that shows this behavior? Also, I hope you ran your code in a profiler and that it actually shows a high number of branch misses... — milianw, May 07 '14 at 14:54
I feel like it is quite prediction friendly. If n is small, the pattern is likely to be recognized. Note that the compiler might translate this code the same way it would have done for a 'switch'. — johan d, May 07 '14 at 15:12
branch-oblivious: write all of your logic in header/"guaranteed inline" void functions that take currentFrame as a parameter with an early return? should be okay if ratio of N Cases to M instructions/case is small. inline to avoid the frame push/pop. — im so confused, May 07 '14 at 15:14

milianw · Answer 1 · 2014-05-07T15:27:58.110

As I commented above, without any code example, I guess it will be hard to give any useful help here. Can you please post a code snippet that shows a huge number of branch misses?

I just tried something like this:

#include <cstdlib>

__attribute__ ((noinline)) void frame(const int frame) // noinline: keep the call from being inlined, so the loop in main cannot be unrolled across the branches
{
  const int n = 10;
  static int someVar = rand();
  static int someOtherVar = rand();

  const int currentFrame = frame % n;

  if (currentFrame == 0) {
    someVar = frame;
  } else if (currentFrame == 1) {
    someOtherVar += frame;
  } else if (currentFrame == 2) {
    someOtherVar -= someOtherVar;
    someVar = someOtherVar;
  } else if (currentFrame == 3) {
    someVar -= someOtherVar;
  } else if (currentFrame == 4) {
    someVar -= someOtherVar;
    someOtherVar *= someOtherVar;
  } else if (currentFrame == 5) {
    someOtherVar /= someVar + frame;
  } else if (currentFrame == 6) {
    someVar *= someOtherVar - frame;
  } else if (currentFrame == 7) {
    someOtherVar += someVar / (someOtherVar + 1);
  } else if (currentFrame == 8) {
    someVar -= someOtherVar * someVar;
  } else if (currentFrame == n - 1) {
    someOtherVar = frame;
    someVar = frame + 1;
  }
}

int main(int argc, char** argv)
{
  int iterations = 100000000;
  if (argc > 1) {
    iterations = std::atoi(argv[1]);
  }

  for (int i = 0; i < iterations; ++i) {
    frame(i);
  }

  return 0;
}

But that's not reproducing your findings:

Performance counter stats for './a.out 100000000':

        591.088374      task-clock (msec)         #    0.999 CPUs utilized          
                60      context-switches          #    0.102 K/sec                  
                5      cpu-migrations            #    0.008 K/sec                  
              272      page-faults               #    0.460 K/sec                  
    1,665,803,234      cycles                    #    2.818 GHz                     [50.25%]
  <not supported>      stalled-cycles-frontend  
  <not supported>      stalled-cycles-backend   
    3,741,605,478      instructions              #    2.25  insns per cycle         [75.14%]
    1,050,201,459      branches                  # 1776.725 M/sec                   [75.14%]
            11,115      branch-misses             #    0.00% of all branches         [74.64%]

      0.591689393 seconds time elapsed
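
One caveat with this benchmark: nothing ever reads the two static variables, so it is worth confirming that the compiler kept the per-branch work. Below is a minimal sketch of a variant whose results are observable. It assumes a hypothetical `State` struct and the helper names `frameStep` and `runFrames` (none of these come from the code above), and it uses n = 4 only to keep it short:

```cpp
// Hypothetical variant of the benchmark: state lives in a struct that the
// caller reads afterwards, so the stores in each branch cannot be discarded.
struct State {
  unsigned someVar;
  unsigned someOtherVar;
};

// noinline (GCC/Clang attribute) keeps the branch chain in its own function.
__attribute__ ((noinline)) void frameStep(State& s, const unsigned frame)
{
  const unsigned n = 4; // assumption: small n, chosen for illustration only
  const unsigned currentFrame = frame % n;

  if (currentFrame == 0) {
    s.someVar = frame;
  } else if (currentFrame == 1) {
    s.someOtherVar += frame;
  } else if (currentFrame == 2) {
    s.someVar ^= s.someOtherVar;
  } else { // currentFrame == n - 1
    s.someOtherVar = frame;
  }
}

// Runs the per-frame step for the given number of frames and returns the
// final state so the caller can print or check it.
State runFrames(const unsigned iterations)
{
  State s{1, 2};
  for (unsigned i = 0; i < iterations; ++i) {
    frameStep(s, i);
  }
  return s;
}
```

Calling `runFrames(iterations)` from `main` and printing both fields makes the computation part of the program's visible output. The branch counts reported by perf should then scale with the number of conditionals, as they would if the branches are really executed.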

For n=4 I think branch predictors will detect the pattern. You'd get more misses for larger n. — interjay, May 07 '14 at 15:13
if it can detect the pattern for 4, why should it not detect it for, say, 10? updated the code - same behavior. — milianw, May 07 '14 at 15:28
Check the assembly output. It could be that the compiler has optimized all the cases away, since none of the computation results are being used. — Mark Ransom, May 07 '14 at 15:35
Branch predictors have limited storage space to store pattern history for each branch, so there is an upper limit. I don't know how big it would be in modern processors. — interjay, May 07 '14 at 15:37
Increased it locally to 20, no difference. Shows again that we need some input from the OP on the size of n etc. @MarkRansom: Assembly shows that the branches are in there. Also the number of branches reported from perf depends on the number of conditionals I use. — milianw, May 07 '14 at 15:41
