I tried compiling your program with the following command line using PGI 13.10 compiler tools:
pgcc -acc -ta=nvidia,cc20,cuda5.0 -o t1 t1.c -Minfo
And got this output:
scaled:
10, Generating present_or_copy(v1[0:n])
Generating present_or_copyin(v2[0:n])
Generating NVIDIA code
Generating compute capability 2.0 binary
11, Complex loop carried dependence of '*(v2)' prevents parallelization
Loop carried dependence of '*(v1)' prevents parallelization
Loop carried backward dependence of '*(v1)' prevents vectorization
Accelerator scalar kernel generated
14, Sum reduction generated for sum
While that does indicate the compile was "successful", the messages about "prevents parallelization" indicate that the compiler was not able to take real advantage of the accelerator. When you see the message "Accelerator scalar kernel generated", you are often going to be dissatisfied with the results: the loop is moved to the device, but it runs there essentially serially, in a single thread.
When I run the program as compiled above, I do get a runtime error:
call to cuLaunchKernel returned error 701: Launch out of resources
This is an error coming out of the CUDA runtime subsystem. You may or may not see this kind of error depending on what kind of accelerator device you are trying to run on. We could delve into how to solve this, but it's really beside the point, because your program is not properly structured to take advantage of the accelerator.
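If you want to rule out a device problem before chasing this further, one quick check is to ask the OpenACC runtime what it can see. Here is a minimal standalone sketch using the standard OpenACC runtime routines from openacc.h (acc_device_nvidia is the device type constant the PGI headers provide for NVIDIA targets); it prints how many devices are visible and fails early if the runtime can't initialize one:

#include <stdio.h>
#include <openacc.h>

int main(void)
{
  /* ask the OpenACC runtime how many NVIDIA devices it can see */
  int ndev = acc_get_num_devices(acc_device_nvidia);
  printf("NVIDIA devices visible: %d\n", ndev);
  if (ndev > 0) {
    /* initializes the runtime for this device type; aborts with a
       runtime diagnostic if the device is unusable */
    acc_init(acc_device_nvidia);
    printf("device initialized OK\n");
  }
  return 0;
}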
The compiler is issuing the "prevents parallelization" messages because it is being strict about the pointers vector1 (or v1) and vector2 (or v2). It must assume these pointers could alias each other, in which case it cannot generate a correct parallel version of the loop. Since that is probably not your intent (you probably intend v1 and v2 to refer to separate storage), you can "reassure" the compiler by qualifying the scaled function parameters with the C99 restrict keyword. This allows the compiler to do its job the way you probably intend.
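To see why the compiler is cautious, consider what a hypothetical aliased call would do to the original (non-restrict) version of scaled:

float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
scaled(a, a + 1, 2.0f, 3);  /* v1 and v2 overlap: v1[i] is a[i], v2[i] is a[i+1] */

Here iteration 0 reads a[1] while iteration 1 writes a[1], so the answer depends on the order in which iterations run, and a parallel schedule could legally produce a different result than the serial loop. Adding restrict is your promise that no call will ever look like this, which is what frees the compiler to parallelize.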
Here's a modified code and results:
$ cat t1.c
#include <stdio.h>
#include <stdlib.h>

/* restrict promises the compiler that v1 and v2 never overlap,
   so the iterations of the loop below are independent */
float scaled(float *restrict v1, float *restrict v2, float a, int n)
{
  int i;
  float sum = 0.0f;
#pragma acc kernels
  for (i = 0; i < n; i++)
  {
    v1[i] += a * v2[i];
    sum += v1[i];
  }
  return sum;
}

int main(int argc, char *argv[])
{
  int n;
  float *vector1;
  float *vector2;
  if (argc > 1)
    n = atoi(argv[1]);
  else
    n = 100000;
  if (n <= 0) n = 100000;
  vector1 = (float *)malloc(n * sizeof(float));
  vector2 = (float *)malloc(n * sizeof(float));
  scaled(vector1, vector2, 3.3, n);
  printf("programming done\n");
  return 0;
}
$ pgcc -acc -ta=nvidia,cc20,cuda5.0 -o t1 t1.c -Minfo
scaled:
10, Generating present_or_copy(v1[0:n])
Generating present_or_copyin(v2[0:n])
Generating NVIDIA code
Generating compute capability 2.0 binary
11, Loop is parallelizable
Accelerator kernel generated
11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
14, Sum reduction generated for sum
$ ./t1
programming done
$
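As an aside, this test program never initializes the vectors and discards the value scaled returns, which is fine for checking compilation and kernel launch. If you also want to sanity-check the numerics, a small variation on main (my addition, not part of the original program) could initialize the inputs and compare the device reduction against the expected value:

  int i;
  float sum;
  /* known inputs: every element of v1 becomes 1.0 + 3.3*2.0 = 7.6 */
  for (i = 0; i < n; i++) { vector1[i] = 1.0f; vector2[i] = 2.0f; }
  sum = scaled(vector1, vector2, 3.3f, n);
  printf("sum = %f, expected ~%f\n", sum, 7.6f * n);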
If you're not able to get results like these, there may be a problem with the machine or toolchain you are trying to run on.