Micro-optimizing a C++ comparison function

Question

I have a Compare() function that looks like this:

inline bool Compare(bool greater, int p1, int p2) {
  if (greater) return p1>=p2;
  else return p1<=p2;
}

I decided to optimize to avoid branching:

inline bool Compare2(bool greater, int p1, int p2) {
  bool ret[2] = {p1<=p2,p1>=p2};
  return ret[greater];
}

I then tested by doing this:

bool x = true;
const int M = 100000;
const int N = 100;  // const so that the arrays below have a constant size

bool a[N];
int b[N];
int c[N];

for (int i=0;i<N; ++i) {
  a[i] = rand()%2;
  b[i] = rand()%128;
  c[i] = rand()%128;
}

// Timed the below loop with both Compare() and Compare2()
for (int j=0; j<M; ++j) {
  for (int i=0; i<N; ++i) {
    x ^= Compare(a[i],b[i],c[i]);
  }
}

The results:

Compare(): 3.14ns avg
Compare2(): 1.61ns avg
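
The timing code itself is not shown above. As a rough sketch of how such a loop could be timed, the block below assumes `std::chrono::steady_clock` and introduces a hypothetical `Timing` struct and `time_compare` helper (neither is from the original post). The accumulated `x` is returned so the compiler cannot discard the loop.

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

inline bool Compare(bool greater, int p1, int p2) {
  if (greater) return p1 >= p2;
  else return p1 <= p2;
}

inline bool Compare2(bool greater, int p1, int p2) {
  bool ret[2] = {p1 <= p2, p1 >= p2};
  return ret[greater];
}

// Hypothetical result type: average time per call plus the XOR checksum.
struct Timing {
  double ns_per_call;
  bool checksum;
};

// Hypothetical helper: fills the inputs as in the question (with a fixed
// seed, which is an assumption) and times M*N calls of f.
template <typename F>
Timing time_compare(F f, int M = 100000, int N = 100) {
  std::vector<char> a(N);  // char instead of bool to avoid vector<bool>
  std::vector<int> b(N), c(N);
  std::srand(12345);
  for (int i = 0; i < N; ++i) {
    a[i] = static_cast<char>(std::rand() % 2);
    b[i] = std::rand() % 128;
    c[i] = std::rand() % 128;
  }

  bool x = true;
  const auto start = std::chrono::steady_clock::now();
  for (int j = 0; j < M; ++j) {
    for (int i = 0; i < N; ++i) {
      x ^= f(a[i] != 0, b[i], c[i]);
    }
  }
  const auto stop = std::chrono::steady_clock::now();

  const double total_ns =
      std::chrono::duration<double, std::nano>(stop - start).count();
  return Timing{total_ns / (static_cast<double>(M) * N), x};
}
```

Calling `time_compare(Compare)` and `time_compare(Compare2)` should yield the same checksum, since both see identical inputs. Passing a function pointer may keep the compiler from inlining the call, so wrapping the call in a lambda is probably closer to the inlined case measured here.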

I would say case-closed, avoid branching FTW. But for completeness, I replaced

a[i] = rand()%2;

with:

a[i] = true;

and got the exact same measurement of ~3.14ns. Presumably, there is no branching going on then, and the compiler is actually rewriting Compare() to avoid the if statement. But then, why is Compare2() faster?

Unfortunately, I am assembly-code-illiterate, otherwise I would have tried to answer this myself.

EDIT: Below is some assembly:

_Z7Comparebii:
.LFB4:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    pushq   %rbp
    .cfi_def_cfa_offset 16
    movq    %rsp, %rbp
    .cfi_offset 6, -16
    .cfi_def_cfa_register 6
    movl    %edi, %eax
    movl    %esi, -8(%rbp)
    movl    %edx, -12(%rbp)
    movb    %al, -4(%rbp)
    cmpb    $0, -4(%rbp)
    je      .L2
    movl    -8(%rbp), %eax
    cmpl    -12(%rbp), %eax
    setge   %al
    jmp     .L3
.L2:
    movl    -8(%rbp), %eax
    cmpl    -12(%rbp), %eax
    setle   %al
.L3:
    leave
    ret
    .cfi_endproc
.LFE4:
    .size   _Z7Comparebii, .-_Z7Comparebii
    .section        .text._Z8Compare2bii,"axG",@progbits,_Z8Compare2bii,comdat
    .weak   _Z8Compare2bii
    .type   _Z8Compare2bii, @function
_Z8Compare2bii:
.LFB5:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    pushq   %rbp
    .cfi_def_cfa_offset 16
    movq    %rsp, %rbp
    .cfi_offset 6, -16
    .cfi_def_cfa_register 6
    movl    %edi, %eax
    movl    %esi, -24(%rbp)
    movl    %edx, -28(%rbp)
    movb    %al, -20(%rbp)
    movw    $0, -16(%rbp)
    movl    -24(%rbp), %eax
    cmpl    -28(%rbp), %eax
    setle   %al
    movb    %al, -16(%rbp)
    movl    -24(%rbp), %eax
    cmpl    -28(%rbp), %eax
    setge   %al
    movb    %al, -15(%rbp)
    movzbl  -20(%rbp), %eax
    cltq
    movzbl  -16(%rbp,%rax), %eax
    leave
    ret
    .cfi_endproc
.LFE5:
    .size   _Z8Compare2bii, .-_Z8Compare2bii
    .text

Now, the actual code that performs the test might be using inlined versions of the above two functions, so this might be the wrong code to analyze. That said, I see a conditional jump (je) and a jmp instruction in Compare(), so I think that means it is branching. If so, I guess this question becomes: why does the branch predictor not improve the performance of Compare() when I change a[i] from rand()%2 to true (or false for that matter)?

EDIT2: I replaced "branch prediction" with "branching" to make my post more sensible.

`optimize to avoid branch prediction` Isn't this an oxymoron? — Lightness Races in Orbit, Apr 02 '13 at 17:07
You'll have to share the assembly code since what happens depends a lot on which compiler you're using and at what optimization level. — Raymond Chen, Apr 02 '13 at 17:07
You didn't set the seed. Maybe the compiler is smart enough to know what `rand()` returns in this case? Just a quick thought. Also you should really compare the assembly. Even though you're assembly-code-illiterate, you can still show the difference. — Zeta, Apr 02 '13 at 17:08
It's hard to tell without seeing what the compiler is doing. — Mysticial, Apr 02 '13 at 17:12
What is the timing for a[i] = false? Faster than the 1.61ns? — SinisterMJ, Apr 02 '13 at 17:22
@LukaRahne: I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. @ others: bear with me while I figure out how to post the assembly. — dshin, Apr 02 '13 at 17:25
@AntonRoth Compare() and Compare2() clocked in at ~3.14ns and ~1.61ns, respectively, in all 3 a[i] cases: rand(), true, and false. — dshin, Apr 02 '13 at 17:27
Wait, which is faster? Your last comment just said that `Compare()` is faster, but you have it the other way in your question. — Mysticial, Apr 02 '13 at 17:28
BTW what you want to avoid is branches, not branch prediction - branch prediction is a *good* thing. — Mark Ransom, Apr 02 '13 at 17:30
@Mysticial Compare2() is faster, I typed my comment to AntonRoth hastily and then edited it quickly thereafter. — dshin, Apr 02 '13 at 17:31
@MarkRansom: Unless it predicts incorrectly a large percentage of the time... — Oliver Charlesworth, Apr 02 '13 at 17:32
In which case the term would be to "avoid branch mispredictions" - which can also be done by avoiding branches completely. — Mysticial, Apr 02 '13 at 17:34
Yes. In this case the branch prediction will be wrong a large percentage of the time. — drescherjm, Apr 02 '13 at 17:34
@OliCharlesworth, even if it were wrong 100% of the time it wouldn't be any worse than stalling the pipeline would it? And realistically it shouldn't be wrong more than 50%. — Mark Ransom, Apr 02 '13 at 17:52
The question posed in the first comment on this question, happily ignored by everybody, is "is misprediction worse than no prediction"? — Lightness Races in Orbit, Apr 02 '13 at 17:59
@MarkRansom Well, if you were to compare two nearly identical processors, one with and one without branch prediction, then it's likely that the one without will be faster than the one that has it and mispredicts 100% of the time. That's because misprediction cleanup has significant overhead. — Mysticial, Apr 02 '13 at 17:59
@David notice the `setge` and `setle` instructions? They provide a boolean result from a comparison without doing any branches. Both versions use these instructions. Doesn't answer your question but I thought you would find it interesting. — Mark Ransom, Apr 02 '13 at 18:05
Could you try `(rand()/256)%2` instead of `rand()%2`? Least significant bit of rand() may be well predictable and just as good as `true` or `false` for branch predictor. — Evgeny Kluev, Apr 02 '13 at 18:09
@MarkRansom: That's true, for random data one would expect 50% misprediction rate (I think). It would have to be specific pathological input to cause 100% misprediction. — Oliver Charlesworth, Apr 02 '13 at 18:14
I tried `(rand()%256)<128` instead of `rand()%2`, same result. — dshin, Apr 02 '13 at 19:23
It's easy to tell why both functions perform equally for a random/constant first parameter: neither of them uses branch instructions (when inlined/optimized). But I've no idea why `Compare2` is faster. In fact the only difference is that `Compare` uses a `cmov` instruction while `Compare2` uses twice as many memory accesses. I would expect `Compare2` to be twice as slow as `Compare`... — Evgeny Kluev, Apr 02 '13 at 20:25
@Evgeny There's clearly a branch going on in `Compare()` - at least when not inlined (and I don't see why inlining would help if the compiler can't deduce the value of greater). g++ 4.5.3 under cygwin creates the expected code for me without surprises (way too many moves going on in the given assembly for me) — Voo, Apr 03 '13 at 00:02

dshin · Accepted Answer · 2013-04-03T16:02:30.790

I think I figured most of this out.

When I posted the assembly for the functions in my OP edit, I noted that the inlined version might be different. I hadn't examined or posted the timing code because it was hairier, and because I thought that the process of inlining would not change whether or not branching takes place in Compare().

When I un-inlined the function and repeated my measurements, I got the following results:

Compare(): 7.18ns avg
Compare2(): 3.15ns avg

Then, when I replaced a[i]=rand()%2 with a[i]=false, I got the following:

Compare(): 2.59ns avg
Compare2(): 3.16ns avg

This demonstrates the gain from branch prediction. The fact that the a[i] substitution yielded no improvement originally shows that inlining removed the branch.
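
For anyone wanting to repeat the un-inlined measurement, one option is to forbid inlining explicitly. The sketch below assumes GCC or Clang, where `__attribute__((noinline))` is available; the `_noinline` function names are hypothetical and only serve to illustrate the idea.

```cpp
// Assumption: GCC/Clang attribute syntax; other compilers spell this differently.
__attribute__((noinline)) bool Compare_noinline(bool greater, int p1, int p2) {
  // Branching version: the call is kept out of line, so the branch on
  // `greater` stays visible to the branch predictor.
  if (greater) return p1 >= p2;
  return p1 <= p2;
}

__attribute__((noinline)) bool Compare2_noinline(bool greater, int p1, int p2) {
  // Table version: both comparisons are evaluated, then one is selected.
  bool ret[2] = {p1 <= p2, p1 >= p2};
  return ret[greater];
}
```

Timing these two with a[i] random versus constant is what separates the cost of branch misprediction from the cost of the comparison itself.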

So the last piece of the mystery is why the inlined Compare2() outperforms the inlined Compare(). I suppose I could post the assembly for the timing code. It seems plausible enough that some quirk in how functions get inlined might lead to this, so I'm content to end my investigation here. I will be replacing Compare() with Compare2() in my application.

Thanks for the many helpful comments.

EDIT: I should add that the probable reason that Compare2 beats all others is that the processor is able to perform both comparisons in parallel. This was the intuition which led me to write the function the way I did. All other variants essentially require two logically serial operations.

DigitalInBlue · Answer 2 · 2013-04-03T01:26:33.590

I wrote a C++ library called Celero designed to test just such optimizations and alternatives. (Shameless self promotion: https://github.com/DigitalInBlue/Celero)

I ran your cases using the following code:

class StackOverflowFixture : public celero::TestFixture
{
  public:
    StackOverflowFixture()
    {
    }

    inline bool NoOp(bool greater, int p1, int p2) 
    {
      return true;
    }

    inline bool Compare(bool greater, int p1, int p2) 
    {
      if(greater == true)
      {
        return p1>=p2;
      }

      return p1<=p2;
    }

    inline bool Compare2(bool greater, int p1, int p2)
    {
      bool ret[2] = {p1<=p2,p1>=p2};
      return ret[greater];
    }

    inline bool Compare3(bool greater, int p1, int p2) 
    {
      return (!greater != !(p1 <= p2)) | (p1 == p2);
    }

    inline bool Compare4(bool greater, int p1, int p2) 
    {
      return (greater ^ (p1 <= p2)) | (p1 == p2);
    }
};

BASELINE_F(StackOverflow, Baseline, StackOverflowFixture, 100, 5000000)
{
  celero::DoNotOptimizeAway(NoOp(rand()%2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare, StackOverflowFixture, 100, 5000000)
{
  celero::DoNotOptimizeAway(Compare(rand()%2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare2, StackOverflowFixture, 100, 5000000)
{
  celero::DoNotOptimizeAway(Compare2(rand()%2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare3, StackOverflowFixture, 100, 5000000)
{
  celero::DoNotOptimizeAway(Compare3(rand()%2, rand(), rand()));
}

BENCHMARK_F(StackOverflow, Compare4, StackOverflowFixture, 100, 5000000)
{
  celero::DoNotOptimizeAway(Compare4(rand()%2, rand(), rand()));
}

The results are shown below:

[==========]
[  CELERO  ]
[==========]
[ STAGE    ] Baselining
[==========]
[ RUN      ] StackOverflow.Baseline -- 100 samples, 5000000 calls per run.
[     DONE ] StackOverflow.Baseline  (0.690499 sec) [5000000 calls in 690499 usec] [0.138100 us/call] [7241140.103027 calls/sec]
[==========]
[ STAGE    ] Benchmarking
[==========]
[ RUN      ] StackOverflow.Compare -- 100 samples, 5000000 calls per run.
[     DONE ] StackOverflow.Compare  (0.782818 sec) [5000000 calls in 782818 usec] [0.156564 us/call] [6387180.672902 calls/sec]
[ BASELINE ] StackOverflow.Compare 1.133699
[ RUN      ] StackOverflow.Compare2 -- 100 samples, 5000000 calls per run.
[     DONE ] StackOverflow.Compare2  (0.700767 sec) [5000000 calls in 700767 usec] [0.140153 us/call] [7135039.178500 calls/sec]
[ BASELINE ] StackOverflow.Compare2 1.014870
[ RUN      ] StackOverflow.Compare3 -- 100 samples, 5000000 calls per run.
[     DONE ] StackOverflow.Compare3  (0.709471 sec) [5000000 calls in 709471 usec] [0.141894 us/call] [7047504.408214 calls/sec]
[ BASELINE ] StackOverflow.Compare3 1.027476
[ RUN      ] StackOverflow.Compare4 -- 100 samples, 5000000 calls per run.
[     DONE ] StackOverflow.Compare4  (0.712940 sec) [5000000 calls in 712940 usec] [0.142588 us/call] [7013212.893091 calls/sec]
[ BASELINE ] StackOverflow.Compare4 1.032500
[==========]
[ COMPLETE ]
[==========]

Given this test, it looks like Compare2 is the best option for this micro-optimization.

EDIT:

Compare2 Assembly (The best case):

cmp r8d, r9d
movzx   eax, dl
setle   BYTE PTR ret$[rsp]
cmp r8d, r9d
setge   BYTE PTR ret$[rsp+1]
movzx   eax, BYTE PTR ret$[rsp+rax]

Compare3 Assembly (The next-best case):

xor r11d, r11d
cmp r8d, r9d
mov r10d, r11d
setg    r10b
test    dl, dl
mov ecx, r11d
sete    cl
mov eax, r11d
cmp ecx, r10d
setne   al
cmp r8d, r9d
sete    r11b
or  eax, r11d

I'm not a fan of how you did the benchmarking. The measured times are dominated by the cost of `rand()`, masking the true performance difference between the variants. — dshin, Apr 03 '13 at 02:44
True that rand() is expensive, but the cost is identical for each test, therefore it can be factored out. What should be compared is a baselined (relative) time. That shows what is truly faster and by how much. Measuring average execution time is actually incorrect. Reference: http://www.codeproject.com/Articles/525576/Celero-A-Cplusplus-Benchmark-Authoring-Library — DigitalInBlue, Apr 03 '13 at 10:47
Given the baseline, Compare2 takes 1.014870 times as long as the baseline measurement and Compare3 takes 1.027476 times as long. — DigitalInBlue, Apr 03 '13 at 10:55
Oh, I see. I guess I would have just rather seen the statement, "Compare2 is 1.85x faster than Compare3" (since (1.027476-1)/(1.014870-1) = 1.85). The way your numbers are reported makes the magnitude of the improvement not obvious, in my opinion. — dshin, Apr 03 '13 at 13:09
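
The arithmetic in the last comment can be written out as a small sketch. It subtracts the baseline from each baseline-relative time and compares what is left; the helper name `overhead_ratio` is hypothetical and not part of Celero.

```cpp
// Hypothetical helper: given two baseline-relative times (baseline == 1.0),
// returns how many times larger the overhead of `slower` is than `faster`.
double overhead_ratio(double slower, double faster) {
  return (slower - 1.0) / (faster - 1.0);
}

// The same comparison from raw timings, with the baseline time subtracted.
double overhead_ratio_raw(double slower_sec, double faster_sec,
                          double baseline_sec) {
  return (slower_sec - baseline_sec) / (faster_sec - baseline_sec);
}
```

With the numbers reported above, `overhead_ratio(1.027476, 1.014870)` and `overhead_ratio_raw(0.709471, 0.700767, 0.690499)` both come out at roughly 1.85.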

Sharath · Answer 3 · 2013-04-02T18:41:42.473

How about this...

inline bool Compare3(bool greater, int p1, int p2) 
{
  return (!greater != !(p1 <= p2)) | (p1 == p2);
}

or

inline bool Compare4(bool greater, int p1, int p2) 
{
  return (greater ^ (p1 <= p2)) | (p1 == p2);
}
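
Since these variants rely on bit tricks, a quick exhaustive check over a small range is a cheap way to confirm they agree with the original Compare(), including the p1 == p2 case. This is only a sketch; the helper `all_agree` is hypothetical.

```cpp
inline bool Compare(bool greater, int p1, int p2) {
  if (greater) return p1 >= p2;
  else return p1 <= p2;
}

inline bool Compare3(bool greater, int p1, int p2) {
  return (!greater != !(p1 <= p2)) | (p1 == p2);
}

inline bool Compare4(bool greater, int p1, int p2) {
  return (greater ^ (p1 <= p2)) | (p1 == p2);
}

// Hypothetical helper: returns true if Compare3 and Compare4 match Compare
// for both values of `greater` and every pair (p1, p2) in [lo, hi].
bool all_agree(int lo, int hi) {
  for (int g = 0; g < 2; ++g) {
    const bool greater = (g != 0);
    for (int p1 = lo; p1 <= hi; ++p1) {
      for (int p2 = lo; p2 <= hi; ++p2) {
        const bool expected = Compare(greater, p1, p2);
        if (Compare3(greater, p1, p2) != expected) return false;
        if (Compare4(greater, p1, p2) != expected) return false;
      }
    }
  }
  return true;
}
```

A small range such as `all_agree(-8, 8)` already covers the less-than, equal and greater-than cases for both settings of `greater`.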

edited Apr 02 '13 at 18:41

answered Apr 02 '13 at 17:52


It seems to me that `Compare3(true,1,1)!=Compare3(false,1,1)`, which would make the function incorrect. Same for `Compare4()`. – dshin Apr 02 '13 at 18:01
Add `| (p1 == p2)` and be happy. – Griwes Apr 02 '13 at 18:06
Hmm, I didn't test the code. No compiler in my home machine. Will check now. – Sharath Apr 02 '13 at 18:13
Damn, I missed that condition. Fixed it now. Thanks. – Sharath Apr 02 '13 at 18:42
Looks like Compare4 is the fastest. – Sharath Apr 02 '13 at 19:03
Actually, by my tests, I see that Compare2 is fastest. I get Compare2 at about 1.65ns and Compare4 at about 2.55ns. – dshin Apr 02 '13 at 19:36
This doesn't really address the question (i.e. "why the difference between Compare() and Compare2()?") – Oliver Charlesworth Apr 02 '13 at 20:38
@OliCharlesworth that's an understatement ;) – J.N. Apr 03 '13 at 01:10
