I tried compiling your program with the following command line using PGI 13.10 compiler tools:
pgcc -acc -ta=nvidia,cc20,cuda5.0 -o t1 t1.c -Minfo
And got this output:
scaled:
10, Generating present_or_copy(v1[0:n])
Generating present_or_copyin(v2[0:n])
Generating NVIDIA code
Generating compute capability 2.0 binary
11, Complex loop carried dependence of '*(v2)' prevents parallelization
Loop carried dependence of '*(v1)' prevents parallelization
Loop carried backward dependence of '*(v1)' prevents vectorization
Accelerator scalar kernel generated
14, Sum reduction generated for sum
While that does indicate the compile was "successful", the messages about "prevents parallelization" indicate that the compiler was not able to take real advantage of the accelerator. When you see the message "Accelerator scalar kernel generated", you are often going to be dissatisfied with the results: the loop is moved to the device, but it runs there essentially serially, in a single thread.
When I run the program as compiled above, I do get a runtime error:
call to cuLaunchKernel returned error 701: Launch out of resources
This is an error coming out of the CUDA runtime subsystem. You may or may not see this kind of error depending on what kind of accelerator device you are trying to run on. We could delve into how to solve this, but it's really beside the point, because your program is not properly structured to take advantage of the accelerator.
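If you want to rule out a device problem before chasing this further, one quick check is to ask the OpenACC runtime what it can see. Here is a minimal standalone sketch using the standard OpenACC runtime routines from openacc.h (acc_device_nvidia is the device type constant the PGI headers provide for NVIDIA targets); it prints how many devices are visible and fails early if the runtime can't initialize one:

#include <stdio.h>
#include <openacc.h>

int main(void)
{
  /* ask the OpenACC runtime how many NVIDIA devices it can see */
  int ndev = acc_get_num_devices(acc_device_nvidia);
  printf("NVIDIA devices visible: %d\n", ndev);
  if (ndev > 0) {
    /* initializes the runtime for this device type; aborts with a
       runtime diagnostic if the device is unusable */
    acc_init(acc_device_nvidia);
    printf("device initialized OK\n");
  }
  return 0;
}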
The compiler is issuing the "prevents parallelization" messages because it is being strict about the pointers vector1 (or v1) and vector2 (or v2). It must assume these pointers could alias each other, in which case it cannot generate a correct parallel version of the loop. Since that is probably not your intent (you probably intend v1 and v2 to refer to separate storage), you can "reassure" the compiler by qualifying the scaled function parameters with the C99 restrict keyword. This allows the compiler to do its job the way you probably intend.
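To see why the compiler is cautious, consider what a hypothetical aliased call would do to the original (non-restrict) version of scaled:

float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
scaled(a, a + 1, 2.0f, 3);  /* v1 and v2 overlap: v1[i] is a[i], v2[i] is a[i+1] */

Here iteration 0 reads a[1] while iteration 1 writes a[1], so the answer depends on the order in which iterations run, and a parallel schedule could legally produce a different result than the serial loop. Adding restrict is your promise that no call will ever look like this, which is what frees the compiler to parallelize.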
Here's a modified code and results:
$ cat t1.c
#include <stdio.h>
#include <stdlib.h>

/* restrict promises the compiler that v1 and v2 never overlap,
   so the iterations of the loop below are independent */
float scaled(float *restrict v1, float *restrict v2, float a, int n)
{
  int i;
  float sum = 0.0f;
#pragma acc kernels
  for (i = 0; i < n; i++)
  {
    v1[i] += a * v2[i];
    sum += v1[i];
  }
  return sum;
}

int main(int argc, char *argv[])
{
  int n;
  float *vector1;
  float *vector2;
  if (argc > 1)
    n = atoi(argv[1]);
  else
    n = 100000;
  if (n <= 0) n = 100000;
  vector1 = (float *)malloc(n * sizeof(float));
  vector2 = (float *)malloc(n * sizeof(float));
  scaled(vector1, vector2, 3.3, n);
  printf("programming done\n");
  return 0;
}
$ pgcc -acc -ta=nvidia,cc20,cuda5.0 -o t1 t1.c -Minfo
scaled:
10, Generating present_or_copy(v1[0:n])
Generating present_or_copyin(v2[0:n])
Generating NVIDIA code
Generating compute capability 2.0 binary
11, Loop is parallelizable
Accelerator kernel generated
11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
14, Sum reduction generated for sum
$ ./t1
programming done
$
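As an aside, this test program never initializes the vectors and discards the value scaled returns, which is fine for checking compilation and kernel launch. If you also want to sanity-check the numerics, a small variation on main (my addition, not part of the original program) could initialize the inputs and compare the device reduction against the expected value:

  int i;
  float sum;
  /* known inputs: every element of v1 becomes 1.0 + 3.3*2.0 = 7.6 */
  for (i = 0; i < n; i++) { vector1[i] = 1.0f; vector2[i] = 2.0f; }
  sum = scaled(vector1, vector2, 3.3f, n);
  printf("sum = %f, expected ~%f\n", sum, 7.6f * n);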
If you're not able to get results like these, there may be a problem with the machine or toolchain you are trying to run on.