I am writing GPU code and trying to avoid branching, because I know branching is costly on the GPU:

     w = some_number;
     h = some_number;
     // start from 80% of w (w must be assigned before this line)
     new_w = w * 0.8;
     for (int i = 0; i < num_rectangles; i++)
     {
         // for each 2D vector, take the absolute x and y
         cx = abs(array_of_2d_points[i * 2]);
         cy = abs(array_of_2d_points[i * 2 + 1]);
         // check whether the 2D vector lies inside the w-by-h rectangle centred at the origin
         if (cx < w / 2 && cy < h / 2)
         {
            // inside the rectangle: grow new_w if this point needs a wider one
            new_w = max(cx * 2, new_w);
         }
     }

Is there a more clever way to avoid the branch here?

One way that I can think of is to cast the bool into an int:

     int is_inside = cx < w / 2 && cy < h / 2;
     new_w = max(cx * 2 * is_inside, new_w);

But does this actually avoid branching? Does merely using `<` cause the GPU to branch?

I tried the bool-to-int approach above, and the speed was roughly the same.
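For comparison, here is a minimal host-side C++ sketch of the two forms, written as ordinary functions rather than a kernel so it can be checked on the CPU. The function names `new_width_branch` and `new_width_select` and the `std::vector` input are assumptions for illustration, not part of my actual code. The second form replaces the `if` with a conditional select; a compiler may lower that to a predicated or select instruction, but this is not guaranteed and would need to be confirmed in the disassembly. Note also that the multiply-by-`is_inside` version only matches the original when `new_w` is non-negative, since a point outside the rectangle contributes `0` to the `max`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Branchy form, mirroring the loop in the question.
// pts holds interleaved coordinates: x0, y0, x1, y1, ...
float new_width_branch(const std::vector<float>& pts, float w, float h) {
    float new_w = w * 0.8f;
    for (std::size_t i = 0; i + 1 < pts.size(); i += 2) {
        float cx = std::fabs(pts[i]);
        float cy = std::fabs(pts[i + 1]);
        if (cx < w / 2 && cy < h / 2) {
            new_w = std::max(cx * 2, new_w);
        }
    }
    return new_w;
}

// Select form: every iteration runs the same statements, and the
// condition only chooses which value is fed into max().
float new_width_select(const std::vector<float>& pts, float w, float h) {
    float new_w = w * 0.8f;
    for (std::size_t i = 0; i + 1 < pts.size(); i += 2) {
        float cx = std::fabs(pts[i]);
        float cy = std::fabs(pts[i + 1]);
        // Non-short-circuit & so both comparisons are always evaluated.
        bool inside = (cx < w / 2) & (cy < h / 2);
        // Outside points feed new_w back in, so max() leaves it unchanged
        // regardless of the sign of new_w.
        float candidate = inside ? cx * 2 : new_w;
        new_w = std::max(candidate, new_w);
    }
    return new_w;
}
```

Both functions should return the same value for any input; whether the second is any faster on a GPU depends on what the compiler emits for each.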

  • The compiler may well translate your original code into a branchless sequence of machine instructions. Did you check whether that is the case? Use `cuobjdump --dump-sass` to disassemble the object code or executable. Rule of thumb: don't worry about this kind of local branching. – njuffa Mar 21 '19 at 06:02
  • Besides that^, this part of the kernel looks memory bound anyway. Also giving every thread some work but throwing away half of it without using the results does not make the code faster either. – BlameTheBits Mar 21 '19 at 13:50
