Manual loop unrolling with known maximum size

Question

Please take a look at this code in an OpenCL kernel:

uint point_color = 4278190080;
float point_percent = 1.0f;
float near_pixel_size = (...);
float far_pixel_size = (...);
float delta_pixel_size = far_pixel_size - near_pixel_size;
float3 near = (...);
float3 far = (...);
float3 direction = normalize(far - near);

point_position = (...) + 10;
for (size_t p = 0; p < point_count; p++, point_position += 4)
{
    float3 point = (float3)(point_list[point_position], point_list[point_position + 1], point_list[point_position + 2]);
    float projection = dot(point - near, direction);
    float3 projected = near + direction * projection;
    float rejection_length = distance(point, projected);
    float percent = projection / segment_length;
    float pixel_size = near_pixel_size + percent * delta_pixel_size;
    bool is_candidate = (pixel_size > rejection_length && point_percent > percent);
    point_color = (is_candidate ? (uint)point_list[point_position + 3] | 4278190080 : point_color);
    point_percent = (is_candidate ? percent : point_percent);
}

This code attempts to find the point in a list that is nearest to the line segment between near and far, assigning its color to point_color and its fractional distance along the segment (its "percent") to point_percent. (Incidentally, the code seems to work correctly.)

The number of elements specified by point_count is variable, so I cannot assume much about it, save for one thing: point_count will always be less than or equal to 8. That is guaranteed by my code and data.

I would like to unroll this loop manually, and I'm afraid I will need to use lots of

value = (point_count < constant ? new_value : value)

for all lines in it. In your experience, will such a strategy increase performance in my kernel?

And yes, I know, I should be performing some benchmarking by myself; I just wanted to ask someone with lots of experience in OpenCL before actually attempting this on my own.

apetranzilla · Answer 1 · 2018-04-11T01:07:01.140

Most OpenCL drivers (that I'm familiar with, at least) support the use of #pragma unroll to unroll loops at compile time. Simply use it like so:

#pragma unroll
for (int i = 0; i < 4; i++) {
    /* ... */
}

It's effectively the same as unrolling it manually, with none of the effort. The compiler needs a trip count known at compile time to unroll a loop fully, so in your case this would probably look more like:

if (pointCount == 1) {
    /* ... */
} else if (pointCount == 2) {
    #pragma unroll
    for (int i = 0; i < 2; i++) { /* ... */ }
} else if (pointCount == 3) { 
    #pragma unroll
    for (int i = 0; i < 3; i++) { /* ... */ }
}

I can't say for certain whether there will be an improvement; benchmarking is the only way to find out. If pointCount is the same for every work-item in a work-group, the branches are uniform and this might improve performance. If it varies from work-item to work-item, the branches diverge and this might actually make things worse.

You can read more about it here.
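
An alternative to branching on every possible count is to always loop to the known maximum of 8 and mask off the iterations past the real count, which is the select-style pattern the question describes. Below is a minimal sketch of that idea in plain C rather than OpenCL C, so it can be compiled and checked on the host. The names nearest_point_color, MAX_POINTS, OPAQUE_ALPHA and the vec3 helpers are hypothetical. The sketch assumes segment_length in the question is the distance between near and far, and it compares squared distances instead of calling distance(), which should give the same decisions apart from rounding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POINTS   8            /* upper bound stated in the question */
#define OPAQUE_ALPHA 0xFF000000u  /* 4278190080 in the question */

typedef struct { float x, y, z; } vec3;

static vec3 vsub(vec3 a, vec3 b) { return (vec3){a.x - b.x, a.y - b.y, a.z - b.z}; }
static float vdot(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static vec3 vmad(vec3 a, vec3 b, float s) { return (vec3){a.x + b.x * s, a.y + b.y * s, a.z + b.z * s}; }

/* Hypothetical helper. point_list holds x, y, z, color per point (stride 4),
   as in the question. Returns the color and writes the percent to *out_percent. */
uint32_t nearest_point_color(const float *point_list, size_t point_count,
                             vec3 near_pt, vec3 far_pt,
                             float near_pixel_size, float far_pixel_size,
                             float *out_percent)
{
    uint32_t point_color = OPAQUE_ALPHA;
    float point_percent = 1.0f;

    /* With no points there is nothing safe to read, so return the defaults. */
    if (point_count == 0) {
        *out_percent = point_percent;
        return point_color;
    }

    vec3 segment = vsub(far_pt, near_pt);
    float segment_sq = vdot(segment, segment);  /* assumed: segment_length squared */
    float delta_pixel_size = far_pixel_size - near_pixel_size;

    /* Constant trip count, so a compiler is free to unroll this fully.
       In an OpenCL kernel, "#pragma unroll" could precede the loop. */
    for (size_t p = 0; p < MAX_POINTS; p++) {
        bool active = p < point_count;
        /* Inactive iterations re-read point 0 so they never index past the data. */
        size_t base = active ? p * 4 : 0;

        vec3 point = { point_list[base], point_list[base + 1], point_list[base + 2] };
        float percent = vdot(vsub(point, near_pt), segment) / segment_sq;
        vec3 offset = vsub(point, vmad(near_pt, segment, percent));
        float rejection_sq = vdot(offset, offset);
        float pixel_size = near_pixel_size + percent * delta_pixel_size;

        /* Same test as the question, with "active" masking the padding iterations. */
        bool is_candidate = active && pixel_size > 0.0f
                         && pixel_size * pixel_size > rejection_sq
                         && point_percent > percent;
        point_color = is_candidate ? ((uint32_t)point_list[base + 3] | OPAQUE_ALPHA) : point_color;
        point_percent = is_candidate ? percent : point_percent;
    }

    *out_percent = point_percent;
    return point_color;
}
```

Note that this version always does the work of 8 iterations, even when point_count is 1, so whether it beats the plain variable-length loop depends on the device and on how point_count is distributed. It is something to benchmark, not a guaranteed win.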
