
I read this article:

FPGA or GPU? - The evolution continues

And someone added a comment in which he wrote:

Since GPUs are SIMD, any code with an "if-else" statement will cut your performance in half. Half of the cores will execute the if part of the statement while the other half sit idle, and then the other half of the cores will do the else calculations while the first half remain idle.

I can't understand why.

Why, when using a GPU (e.g. with OpenCL), does an if-else cut the performance in half?

user3668129
  • A GPU's design by nature doesn't favor branch operations; you should consider adapting your code to this characteristic. – tibetty Aug 17 '17 at 11:44

1 Answer


Branches in general do not affect performance, but branch divergence does. That is, two threads taking different paths (e.g. one fulfills the if condition, the other does not). Because all threads of a GPU execute the same "line of code", some threads have to wait while the code that is not part of their path is executed.
Well, that is not strictly true, as only the threads within one warp (NVIDIA) or wavefront (AMD) execute the same "line of code". (Currently, the warp size of NVIDIA GPUs is 32 and the wavefront size of AMD GPUs is 64.)
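
To make this concrete, here is a minimal CUDA sketch (the kernel and the even/odd split are my own illustration, not from the question): even and odd lanes of the same warp take different paths, so the warp executes both bodies one after the other, each time with half of its lanes masked off.

    // Illustrative kernel: even and odd lanes inside one warp diverge,
    // so the hardware serializes the if-body and the else-body,
    // masking off the inactive lanes each time -> roughly half throughput.
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (i % 2 == 0)          // even lanes active, odd lanes wait
            out[i] = in[i] * 2.0f;
        else                     // odd lanes active, even lanes wait
            out[i] = in[i] + 1.0f;
    }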

So if there is an if-else block in your kernel, the worst-case scenario is indeed a 50% performance drop. And it can get worse: if there are n possible branches, performance can drop to 1/n of the divergence-free performance (that is, no branches, or all threads in a warp/wavefront taking the same path). Of course, for such a scenario your whole kernel must be embedded in an if-else (or switch) construct.
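
As an illustration of the n-way case (again my own sketch), a four-way switch where consecutive threads take different cases: every warp then contains lanes for all four cases, the bodies are serialized, and throughput in the branched region drops toward 1/4.

    // Illustrative worst case for n = 4: the thread index selects the
    // case, so every warp contains lanes for all four cases and the
    // four bodies run one after another -> up to a 4x slowdown here.
    __global__ void four_way(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        switch (i % 4) {
            case 0:  out[i] = in[i] + 1.0f; break;
            case 1:  out[i] = in[i] - 1.0f; break;
            case 2:  out[i] = in[i] * 2.0f; break;
            default: out[i] = in[i] * 0.5f; break;
        }
    }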

But as written above, this only happens if the threads taking different paths are in the same warp/wavefront. So it is up to you to write your code, rearrange your data, choose your algorithm, etc. to avoid branch divergence as far as possible.
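
One such rearrangement (my own sketch, assuming the block size is a multiple of the warp size): make the condition uniform per warp. Below, all 32 lanes of a warp see the same value of (i / warpSize) % 2, so each warp takes exactly one path and nothing is serialized, even though the kernel still branches.

    // Divergence-free variant: the condition is identical for all lanes
    // of a warp (warpSize is CUDA's built-in variable, 32 on current
    // NVIDIA GPUs), so each warp takes exactly one path and no lanes
    // sit idle. Assumes blockDim.x is a multiple of warpSize.
    __global__ void warp_uniform(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if ((i / warpSize) % 2 == 0)   // same result for a whole warp
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }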

TL;DR: There can be branches, but if different threads take different branches, they have to be in different warps/wavefronts to avoid divergence and the resulting performance loss.

BlameTheBits
  • Most of the time, people become too afraid of branches on a GPU as soon as they read about divergence. However, most of the time kernels run nowhere near peak performance anyway, but are limited in throughput by memory bandwidth, shared memory, or a lack of available warps. In that case divergence has only a very limited impact on performance. – Jan Lucas Aug 18 '17 at 06:04
  • @JanLucas: You are right that peak performance is almost never reached, but branch divergence makes it even worse. Of course, halving the performance of a kernel that could run near peak performance would mean a bigger performance loss, but only in absolute numbers, not relative ones. What I think is a better argument not to worry about divergence too much is that often the kernel splits into branches in only a small part of the kernel (e.g. only 10% is affected by divergence, and thus the performance drop is quite low). – BlameTheBits Aug 19 '17 at 18:35
  • @JanLucas: But it is never bad to think about divergence, as it often goes hand in hand with a bad data layout, which leads to uncoalesced memory accesses and therefore a waste of precious memory bandwidth. – BlameTheBits Aug 19 '17 at 18:39
  • Sometimes a little bit of divergence is just what you pay for a better data layout. If you have the choice between a good data layout with some divergence and a bad data layout without divergence, almost always go for the good data layout. And branch divergence often does not make things worse, e.g. if you are limited by memory bandwidth, then introducing some extra divergence often does not reduce performance at all: it costs nothing to replace full stall cycles, where nothing is executed, with cycles where a partially active warp is executed. – Jan Lucas Jun 05 '18 at 06:47
  • Do I understand this correctly: if I have, say, (pseudo)code that reads if (A==true) {//do something} else {//do something else}, the threads doing //something must wait for the ones doing //something else? Meaning that there is NO crazy performance loss if I just have simple if statements, for example setting some variable with if (A==true) {a=1} else {a=2}. The performance loss only matters if there is a significant amount of code INSIDE the if statements? – Jonathan Lindgren Aug 06 '18 at 11:03
  • @JonathanLindgren Sorry for the late answer. Yes, that's correct; see the sketch below. – BlameTheBits Oct 19 '18 at 08:01
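
For bodies as small as the a=1/a=2 example above, the compiler typically does not even emit a real branch: it uses predicated or select instructions, so all lanes stay active and there is effectively no divergence penalty. A sketch (kernel name and signature are mine, not from the thread):

    // Tiny if-else bodies like this are usually compiled to predicated
    // or select instructions instead of an actual branch, so all lanes
    // stay active; the ternary form makes that intent explicit.
    __global__ void tiny_branch(int *a, const int *A, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = A[i] ? 1 : 2;   // same as: if (A[i]) a[i]=1; else a[i]=2;
    }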