I'm trying to assess the performance differences between OpenCL for AMD and Nvidia GPUs. I have a kernel which performs matrix-vector multiplication. I'm running the kernel on two different systems at the moments, my laptop which has an NVidia GT525m with Ubuntu 12.04 and CUDA 4.0 (which contains the OpenCL libraries and headers) and the other is a desktop with an AMD Radeon HD7970 again with Ubuntu 12.04 and the latest Catalyst drivers.
In the kernel I have two #pragma unroll
statements which produce a large speed-up for the Nvidia OpenCL implementation (~6x). However the AMD OpenCL version does not produce any speedup. Looking at the kernel with the AMD APP kernel analyzer gives the error that the unroll is not used because the trip count is not known.
So my question is, does #pragma unroll
work with AMD OpenCL or is there an alternative (perhaps a compiler flag that i'm unaware of). I've included the kernel below
__kernel void mvKernel(__global float* a, const __global float* x, __global float* y, int m, int n)
{
float sum = 0.0f;
__global float* A;
int i;
int j = 0;
int indx = get_global_id(0);
__local float xs[12000];
#pragma unroll
for(i = get_local_id(0); i < n; i+= get_local_size(0)) {
xs[i] = x[i];
}
barrier(CLK_LOCAL_MEM_FENCE);
A = &a[indx];
#pragma unroll 256
for(i = 0; i < n; i++) {
sum += xs[i] * A[j];
j += m;
}
y[indx] = sum;
}
This same kernel produces correct results in both implementations but the #pragma unroll commands don't do anything for the AMD (checked by commenting them out).