opencl branching vs memory redundancy

Question

I'm processing items in a grid, and depending on each item's type a different computation/function needs to be performed. But I've read that branching between work-items that otherwise run the same code is a very bad thing to do. To circumvent this I could split the grid into one grid per type (I would only need two in this particular case)...

What would be better in this case: leaving the branching in, or making two grids, one for each type? I understand this depends on what happens inside the branch (compute bound) versus how big the grids will be (memory/latency bound).

Are there some ground rules to follow for these kinds of decisions or is there consensus which one is better in general?

Edit: The (spatial) grid is not sparse, as is usual with spatial grids, but a dense array (no empty elements) of structs (~200 bytes per struct) which will hold up to about 500,000 elements.

I fill this array from another source; using that source I put either triangles or line segments in there.

Then, using this grid, I'll need to do either line segment/line segment or line segment/triangle collision detection. So the question is whether it will be more efficient to fill two separate arrays (for the sake of argument, let's say 250,000 elements x 200 bytes each) and have work-items do batch computations for only line/line or only line/triangle, or to have one big array of 500,000 x 200 bytes and have each work-item figure out which computation to perform based on a type field.

Use a reduction algorithm to regroup the items that need the same function into separate arrays, then run one function over one whole array and a different function over the other. That makes it pipeline-bubble-free. But the reduction needs some memory operations, so it could be slower or faster depending on the size of the branch body. There's some hardware dependency too. — huseyin tugrul buyukisik, Nov 06 '13 at 14:56

score 1 · Answer 1 · answered Nov 06 '13 at 14:57


There is no general rule for this; it depends on the case. If the branches contain a lot of code, rearranging the memory is obviously better. However, if your branch is just 2 instructions, do not reshape the memory.

I would first classify how many items you have of each type (on the CPU side or with a simple kernel), and then run a specific kernel for each type of item. However, this may not be good for your case.
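
A minimal host-side sketch of that classification step, in plain C. The `Item` layout, the `ItemType` tag and `split_by_type` are made up for illustration (the real struct is ~200 bytes and was not posted); the only assumption is that each element carries a type field. Each output array could then be uploaded to its own buffer and processed by its own kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical item layout: only the type tag matters for this sketch. */
typedef enum { ITEM_LINE = 0, ITEM_TRIANGLE = 1 } ItemType;

typedef struct {
    ItemType type;
    float data[12];   /* placeholder payload, not the real geometry */
} Item;

/* Copies every item of `src` into `lines` or `triangles` according to its
   type, preserving the original order within each output array.
   Both outputs must have room for `n` items (worst case: all one type).
   The number of items written to each array is returned through
   `n_lines` and `n_triangles`. */
void split_by_type(const Item *src, size_t n,
                   Item *lines, size_t *n_lines,
                   Item *triangles, size_t *n_triangles)
{
    assert(src && lines && triangles && n_lines && n_triangles);

    size_t nl = 0, nt = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i].type == ITEM_LINE)
            lines[nl++] = src[i];
        else
            triangles[nt++] = src[i];
    }
    *n_lines = nl;
    *n_triangles = nt;
}
```

Since the array is filled from another source anyway, the same test could probably be done while filling, writing straight into two arrays and skipping the extra copy.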

If you can post some code, maybe we can point you in the right direction.

answered Nov 06 '13 at 14:57

DarkZeros


Thanks, I was leaning toward that conclusion as well (kernel per type), but I'm not sure, as it's my first attempt at OpenCL. I'm still designing my solution so I don't have actual code; I updated my description to include a bit more detail. – Martijnh Nov 06 '13 at 15:41

score 1 · Accepted Answer · answered Nov 06 '13 at 15:07

It depends on the structure of your new grids, and also your old.

Let's take the worst case: a normal rectangular grid (like an image) where every odd item is of type 1 and every even item is of type 2. Now basically half of your threads will sit idle on the GPU (while the type 1 work is being computed, the type 2 threads idle, and vice versa). That's because the work-items that are scheduled together (a warp or wavefront, typically a slice of a workgroup) generally share their program counter.

If your new grids are 2 kernel calls and a simple "not of type2? return" then it's worse than the first case: each call still launches over every item and pays for the divergence, and you now read the data twice. However if you manage to make 2 grids on which every item is of the correct type then it's far better to split it.

If your original grid is an image with exactly 2 halves, one per type, it probably doesn't matter: only the groups that straddle the boundary will perform extra work.
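
To put rough numbers on these cases, here is a toy cost model in plain C. It is only a sketch under simplifying assumptions: items are scheduled in lockstep groups of a fixed `width`, a group pays the full cost of a branch as soon as one of its items takes it, and memory traffic is ignored. The function name `lockstep_cost` and the cost units are invented for illustration and do not describe any particular GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lockstep model. `types[i]` is 0 or 1 and selects the branch item i
   needs. Items are processed in consecutive groups of `width` (a stand-in
   for the hardware scheduling unit). A group is charged `cost0` if any of
   its items needs branch 0 and `cost1` if any needs branch 1, so a mixed
   group pays for both branches. Returns the summed cost over all groups. */
unsigned long lockstep_cost(const unsigned char *types, size_t n,
                            size_t width,
                            unsigned long cost0, unsigned long cost1)
{
    assert(types != NULL && width > 0);

    unsigned long total = 0;
    for (size_t base = 0; base < n; base += width) {
        size_t end = (n - base < width) ? n : base + width;
        int saw0 = 0, saw1 = 0;
        for (size_t i = base; i < end; ++i) {
            if (types[i])
                saw1 = 1;
            else
                saw0 = 1;
        }
        if (saw0) total += cost0;
        if (saw1) total += cost1;
    }
    return total;
}
```

With 128 items, a width of 32 and branch costs of 10 and 30 units, the alternating layout costs 160 units in this model (every group pays 10 + 30), while the same items arranged as two clean halves cost 80 (two groups pay 10, two pay 30). Moving the boundary into the middle of a group only adds one branch cost for that single group.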

Branches are not that evil. Just think of it this way: whenever you have a branch and even a single thread within a workgroup (or whatever the unit of scheduling is in your HW) takes a different direction from the others, all of the code in both branches will be executed by the whole group.

That is also the reason why optimizations such as skipping an expensive computation when some special condition applies do not work in general on a GPU: if the other threads don't fulfill the condition, you will still effectively calculate it in every thread.

"If your new grids are 2 kernel calls and simple "not of type2? return" then it's worse than the first case. However if you manage to make 2 grids on which every item is of the correct type then it's far better to split it." I guess this is the solution I should be aiming for then (as Darkzeros also mentioned it), because I should be able to split them completely based on type and process them independently in two kernels.. — Martijnh, Nov 06 '13 at 15:47
Based on your edited description it's better to separate them. Not only for performance but also for conceptual clarity. Monster kernels are horrible to maintain and to understand. — sharpneli, Nov 06 '13 at 15:58
