Is there a way to unroll loops in an AMD OpenCL kernel with the compiler?

Question

I'm trying to assess the performance differences between OpenCL on AMD and Nvidia GPUs. I have a kernel which performs matrix-vector multiplication. I'm running the kernel on two different systems at the moment: my laptop, which has an Nvidia GT525m with Ubuntu 12.04 and CUDA 4.0 (which contains the OpenCL libraries and headers), and a desktop with an AMD Radeon HD7970, again with Ubuntu 12.04 and the latest Catalyst drivers.

In the kernel I have two #pragma unroll statements which produce a large speed-up for the Nvidia OpenCL implementation (~6x). However, the AMD OpenCL version does not produce any speed-up. Looking at the kernel with the AMD APP Kernel Analyzer gives the error that the unroll is not used because the trip count is not known. So my question is: does #pragma unroll work with AMD OpenCL, or is there an alternative (perhaps a compiler flag that I'm unaware of)? I've included the kernel below.

__kernel void mvKernel(__global float* a, const __global float* x, __global float* y, int m, int n)
{
    float sum = 0.0f;
    __global float* A;
    int i;
    int j = 0;
    int indx = get_global_id(0);
    __local float xs[12000];
#pragma unroll 
    for(i = get_local_id(0); i < n; i+= get_local_size(0)) {
        xs[i] = x[i];
    } 
    barrier(CLK_LOCAL_MEM_FENCE);
    A = &a[indx];
#pragma unroll 256
    for(i = 0; i < n; i++) {
        sum += xs[i] * A[j];
        j += m;
    }
    y[indx] = sum;
}

This same kernel produces correct results in both implementations, but the #pragma unroll directives don't do anything on the AMD card (checked by commenting them out).

Anteru · Accepted Answer · 2012-11-19T21:04:58.180

9

It's not documented, but #pragma unroll should actually work. Can you check the compiler log to see if the unroll is applied? I'm not sure whether the kernel analyzer uses the same compiler as the OpenCL runtime, so you might want to check that as well.

Otherwise, if you know that n is always a multiple of 256, you can unroll manually: use an outer loop over blocks of 256 elements and an inner loop with a fixed trip count of 256, which should be easier for the compiler to unroll. This removes the problem that the trip count is not known statically.
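A minimal sketch of the blocked loop described above, written in plain C so it can be compiled and checked on the host. The names `BLOCK`, `dot_blocked`, and `dot_simple` are hypothetical, and the code assumes n is a multiple of 256. The same loop structure would replace the second loop of mvKernel; whether it actually helps on a given device still has to be measured.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 256  /* assumed unroll width; n must be a multiple of it */

/* Blocked form of the second loop in mvKernel: accumulates
 * xs[i] * A[i * m] for i in [0, n). */
float dot_blocked(const float *xs, const float *A, int m, int n)
{
    assert(n % BLOCK == 0);  /* precondition of this sketch */
    float sum = 0.0f;
    size_t j = 0;
    for (int base = 0; base < n; base += BLOCK) {
        /* Trip count is the compile-time constant BLOCK, so the
         * compiler knows it statically and may unroll this loop. */
        for (int k = 0; k < BLOCK; k++) {
            sum += xs[base + k] * A[j];
            j += (size_t)m;
        }
    }
    return sum;
}

/* Original form with a runtime trip count, kept for comparison. */
float dot_simple(const float *xs, const float *A, int m, int n)
{
    float sum = 0.0f;
    size_t j = 0;
    for (int i = 0; i < n; i++) {
        sum += xs[i] * A[j];
        j += (size_t)m;
    }
    return sum;
}
```

Both functions add the terms in the same order, so they return the same value. If n is not guaranteed to be a multiple of 256, a short remainder loop after the blocked part would be needed.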

However, keep in mind that unrolling a loop is usually not that big of a win anyway, as you don't have many registers to cache your computation. The increased register pressure from the loop unrolling might lead to register spilling, which is even slower. You should check how fast the kernel actually is on the AMD card. A newer NVIDIA OpenCL compiler might also no longer benefit from the unroll pragma.

edited Nov 19 '12 at 21:04

answered Nov 19 '12 at 20:19

Anteru

I don't have access to the AMD machine at the moment, but from what I can remember the kernel was taking around 3.7 ms on the AMD card with or without the unrolls, whereas the Nvidia takes ~0.7 ms with the unroll, ~1.17 ms without the unroll, and 2.88 ms if I compile the kernel with the flag '-cl-opt-disable', which turns off all compiler optimisation. So it looks like a lot of the speed-up isn't actually coming from the unroll. I'll look at the compiler log tomorrow and see what that gives. – andymr Nov 19 '12 at 21:32
The unroll is being applied. I guess I just need to optimise my code better for the AMD architecture. – andymr Nov 20 '12 at 14:09
Loop unrolling can cause code size expansion, and thus i-cache misses. But how would it cause an increase in register pressure? I didn't follow that. I mean, the live ranges of the registers don't change because the loop was unrolled. I guess the number of simultaneously live registers could increase if a later compiler optimization pass does some code motion that overlaps the live ranges. Is that what you mean? But you'd think compilers would keep track of that when doing code motion to avoid spill-fills. – Hashman Dec 02 '19 at 18:06
It can unroll, then try to move loads forwards, and that would increase register pressure. – Anteru Dec 07 '19 at 18:57

score 0 · Answer 2 · answered Jul 19 '23 at 10:25

Answering for the next person stumbling on this question (like me :-)

Since OpenCL 2.0 there is the attribute "opencl_unroll_hint" you can use. https://man.opencl.org/attributes-loopUnroll.html

__attribute__((opencl_unroll_hint(2)))
while (*s != 0)
    *p++ = *s++;
