
I wonder whether it is faster to replace branching with 2 multiplications or not (due to the cache-miss penalty)?
Here is my case:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

if(dot<0){
    dir.x = -dir.x;
    dir.y = -dir.y;
}

And I'm trying to replace it with:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

int sgn = 1 - 2*(dot < 0.0); // -1 if dot < 0, else +1 (no branching here)
dir.x *= sgn;
dir.y *= sgn;
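For reference, here are the two variants side by side in self-contained form (the `Vec2` struct is just a stand-in for whatever vector type `dir` and `rib1` actually are):

```cpp
#include <cassert>

// Stand-in 2-D vector type; the real code presumably has its own.
struct Vec2 { float x, y; };

// Variant 1: branch on the sign of the perpendicular dot product.
void flip_branch(const Vec2& rib1, Vec2& dir) {
    float dot = rib1.x * -dir.y + rib1.y * dir.x;
    if (dot < 0) {
        dir.x = -dir.x;
        dir.y = -dir.y;
    }
}

// Variant 2: branchless, multiply both components by -1 or +1.
void flip_branchless(const Vec2& rib1, Vec2& dir) {
    float dot = rib1.x * -dir.y + rib1.y * dir.x;
    int sgn = 1 - 2 * (dot < 0.0f);  // -1 if dot < 0, else +1
    dir.x *= sgn;
    dir.y *= sgn;
}
```

Both variants leave `dir` unchanged when `dot >= 0` and negate both components when `dot < 0`.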
tower120
  • Why don't you benchmark it and tell us what you found? – Pod Mar 22 '14 at 23:21
  • I fear that on my i7 with 8Mb cache I'll never get a cache miss in this test. – tower120 Mar 22 '14 at 23:22
  • If it's not going to happen, why does it matter? ;) I assume you want to prove this against cores with smaller caches? Why not simply make a test with a massive data set, one even bigger than your i7 could handle? – Pod Mar 22 '14 at 23:28
  • The problem with branches is not about cache misses, it's about interrupting the [instruction pipeline](http://en.wikipedia.org/wiki/Instruction_pipeline#Branches). And, btw, when it says "8Mb" of cache, that's the L3 cache, and it's only quoting the total capacity, while cache misses pertain to *cache lines* which are usually around 64 bytes (at least, on i7 it is). – Mikael Persson Mar 22 '14 at 23:32
  • Where exactly did you introduce a potential cache miss? – Leeor Mar 22 '14 at 23:45
  • @Leeor - dot is computed from the "rib1" and "dir" variables. They are unknown at compile time and can be anything. – tower120 Mar 22 '14 at 23:48
  • They are likely to be assigned a register, there's no real memory access in the code above aside from possible register spilling (and you don't have that many variables). By the way - by `x` you mean `dot`, right? – Leeor Mar 22 '14 at 23:50
  • @Leeor - "By the way - by x you mean dot, right?" - Yes – tower120 Mar 22 '14 at 23:52
  • If branch misprediction is a significant problem, that type of code can be converted into negate and conditional move (conditional select for some ISAs) instructions. Ideally, the compiler would perform such optimizations, but without predictability information the compiler cannot evaluate the cost of a branch versus the cost of using conditional move. I doubt there is a way to tell the compiler to use conditional move in standard C++, but your compiler may have extensions. –  Mar 22 '14 at 23:53
  • @Paul A. Clayton - the probability of entering the branch is unknown. Assume 50%. – tower120 Mar 22 '14 at 23:56
  • Incidentally, a global 50% probability does not provide predictability information. Twenty taken followed by twenty not-taken would be predicted fairly well (90% typically). With a "loop" predictor, if the branch consistently alternates between taken and not taken (i.e., T,NT,T,NT,T,NT,...), prediction would approach 100%. I rather suspect that FP conditional moves would be faster than your integer evaluation and FP multiply. Some SIMD instruction sets also provide comparisons that set all bits in a data element if true, left shifting 32 bits and xoring would (I believe) conditionally negate. –  Mar 23 '14 at 00:17
  • @Paul A. Clayton - I just wonder how that works? At university we only learned about 2-bit prediction (the current and previous state), and that could not predict periodic patterns. How predictors with longer "memory" work is vague to me. Can you give me some links on this? – tower120 Mar 23 '14 at 13:45
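The conditional negate that Paul A. Clayton alludes to can be sketched in scalar code with a sign-bit XOR (this is an illustration of the idea, not code from the thread): extract the sign bit of `dot` and XOR it into the value, so the sign flips exactly when `dot` has its sign bit set.

```cpp
#include <cstdint>
#include <cstring>

// Flip the sign of `value` iff `dot` has its sign bit set.
// Note: unlike `dot < 0`, this also treats -0.0f as negative.
float conditional_negate(float value, float dot) {
    uint32_t v, d;
    std::memcpy(&v, &value, sizeof v);   // bit-level view of the floats
    std::memcpy(&d, &dot, sizeof d);
    v ^= d & 0x80000000u;                // XOR in dot's sign bit
    std::memcpy(&value, &v, sizeof value);
    return value;
}
```

A SIMD version would do the same thing on whole vectors: a compare producing an all-ones mask, a shift to isolate the sign bits, and an XOR.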

2 Answers


Branching does not imply a cache miss: only instruction prefetching/pipelining is disturbed, and the branch may also block some SSE vectorization at compile time.

On one hand, if plain x86 instructions are used, speculative execution will let the processor start executing the most frequently taken branch ahead of time.

On the other hand, if you enter the if 50% of the time, you are in the worst case: here I'd look into SSE pipelining and try to have the execution optimized with SSE, probably getting some hints from this post, in line with your second block of code.

However: benchmark your code and check the produced assembler, in order to find the best solution for this optimization and get proper insight. And eventually keep us updated :)

Sigi
  • We're preaching the same thing here: measure twice, cut once. – blockchaindev Mar 22 '14 at 23:47
  • yeah! - if his code can make proficient use of SSE, I think he will be able to get something more out of the second one. But it really depends a lot on the amount of data, the use of caches... too many factors are in play on today's architectures! – Sigi Mar 22 '14 at 23:52
  • Assume that I (and my compiler) do not use SSE. Assume that the branch is entered 50% of the time. In the worst case it will only do "dir.x = -dir.x; dir.y = -dir.y;" when this is unnecessary (2-4 cycles wasted)? Or not? – tower120 Mar 22 '14 at 23:59
  • I think that in this case you *should* find that in case 1 the branch predictor predicts correctly half of the time and pipelines fewer instructions, while the other half of the time the pipeline is broken by the misprediction - in case 2 the pipeline is never broken by mispredictions, but there are a few more instructions to execute. The two situations are really similar in terms of efficiency. What matters is which stages of the pipeline are free, whether there is sufficient "data pressure" (that is, data already in the L1 cache),... so again, you need to test it. I wouldn't be surprised if they turned out to be equivalent. – Sigi Mar 23 '14 at 01:41

The cost of the multiplication depends on several factors: whether you use 32-bit or 64-bit floats, and whether you enable SSE. The cost of two float multiplications is about 10 cycles according to this source: http://www.agner.org/optimize/instruction_tables.pdf

The cost of the branch also depends on several factors. As a rule of thumb, do not worry about branches in your code. The exact behaviour of the CPU's branch predictor will determine the performance, but in this case the branch will probably be unpredictable, which is likely to lead to many branch mispredictions. The cost of a branch misprediction is 10-30 cycles according to this source: http://valgrind.org/docs/manual/cg-manual.html

The best advice anyone can give here is to profile and test. I would guess that on a modern Core i7 the two multiplications should be faster than the branch, provided the input varies enough to cause enough branch mispredictions to outweigh the cost of the additional multiplications.

Assuming a 50% misprediction rate, the cost of the branch averages 15 cycles (30 * 0.5), while the cost of the two float multiplications is 10 cycles.


EDIT: Added links, updated estimated instruction cost.

blockchaindev
  • Assuming no SSE and 50% branch misprediction rate. A branch misprediction is of the order of 18 cycles. A float multiplication is of the order of 10 cycles. – blockchaindev Mar 23 '14 at 00:09
  • @fixxer - According to this http://valgrind.org/docs/manual/cg-manual.html a branch misprediction is 10-30 cycles. And according to this http://www.agner.org/optimize/instruction_tables.pdf 2 float muls take around 10 cycles. Anyway, 30*.5 = 15 (branch) vs 10 (mul). In case it's not 50%... I'll stay with branching. Thanks. Make an answer with this, I'll accept it. – tower120 Mar 23 '14 at 00:21
  • Single precision FP multiply generally takes 4 cycles (DP, 5 cycles), the two multiplies are not dependent, so could be completed in 5 cycles (6 for DP). The two integer compares could execute in parallel and take only 1 cycle, the integer subtraction would add another cycle, but a conversion of `dot` from float to integer and `sgn` from integer to float would probably kill performance. –  Mar 23 '14 at 14:19