
I am very new to OpenACC and have only high-level knowledge of it, so any help and an explanation of what I am doing wrong would be appreciated.

I am trying to accelerate (parallelize) a not-so-straightforward nested loop that updates a flattened (3D to 1D) array using OpenACC directives. I have posted simplified sample code below that, when compiled with

pgcc -acc -Minfo=accel test.c

gives the following error:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Code:

#include <stdio.h>
#include <stdlib.h>

#define min(a,b) (a > b) ? b : a
#define max(a,b) (a < b) ? b : a

#define NX 10
#define NY 10
#define NZ 10

struct phiType {
  double dx, dy, dz;
  double * distance;
};

typedef struct phiType Phi;

#pragma acc routine seq
double solve(Phi *p, int index) {
  // for simplicity just returning a value
  return 2;
}

void fast_sweep(Phi *p) {

  // removing boundaries
  int x = NX - 2; 
  int y = NY - 2;
  int z = NZ - 2;

  int startLevel = 3;
  int endLevel   = x + y + z;

  #pragma acc data copy(p->distance[0:NX*NY*NZ])
  for(int level = startLevel; level <= endLevel; level++){
    int ks = max(1, level-(y + z));
    int ke = min(x, level-2);

    int js = max(1, level-(x + z));
    int je = min(y, level-2);

    #pragma acc region
    {
      #pragma acc loop independent
      for(int k = ks; k <= ke; k++){
        #pragma acc loop independent
        for(int j = js; j <= je; j++){
          int i = level - (k + j);
          if(i > 0 && i <= z){
            int index = i * NX * NY + j * NX + k;
            p->distance[index] = solve(p, index);
          }
        }
      }
    }
  }
}


void create_phi(Phi *p){

  p->dx = 1;
  p->dy = 1;
  p->dz = 1;

  p->distance = (double *) malloc(sizeof(double) * NX * NY * NZ);
  for(int i = 0; i < NZ; i++){
    for(int j = 0; j < NY; j++){
      for(int k = 0; k < NX; k++){
        int index = i * NX * NY + j * NX + k;
        p->distance[index] = (i*j*k == 0) ? 0 : 1;
      }
    }
  }

}


int main()
{
  printf("start \n");

  Phi *p = (Phi *) malloc(sizeof(Phi));
  create_phi(p);

  printf("calling fast sweep \n");
  fast_sweep(p);

  printf(" print the results \n");
  for(int i = 0; i < NZ; i++){
    for(int j = 0; j < NY; j++){
      for(int k = 0; k < NX; k++){
        int index = i * NX * NY + j * NX + k;
        printf("%f ", p->distance[index]);
      }
      printf("\n");
    }
    printf("\n");
  }

  return 0;
}

Instead of using the region and loop directives, using

#pragma acc kernels

produces the following compiler feedback (the loops are not parallelized):

solve:
     19, Generating acc routine seq
fast_sweep:
     34, Generating copy(p->distance[:1000])
     42, Generating copy(p[:1])
     45, Loop carried dependence due to exposed use of p[:1] prevents parallelization
         Accelerator scalar kernel generated
     47, Loop carried dependence due to exposed use of p[:i1+1] prevents parallelization

I am running this code on

GNU/Linux
CentOS release 6.7 (Final)
GeForce GTX Titan
pgcc 15.7-0 64-bit target on x86-64 Linux -tp sandybridge 
anupshrestha
  • I'm working on providing you with a full fix, but one thing you should change right away is to replace `region` with `parallel`. The former was from the old PGI Accelerator model (a precursor to OpenACC) and the latter is the OpenACC syntax. Personally, I'd also add `copyin(p[0:1])` to the data clauses to make clear that `p` is just a pointer to a single instance of `Phi` rather than an array (the compiler has no way to determine this). Even with those changes, you'll probably get errors, which is why I'm only commenting and not posting an answer yet. – jefflarkin Aug 24 '15 at 20:29
  • Thanks for the heads up @jefflarkin I will make the replacement right away. – anupshrestha Aug 24 '15 at 20:34
  • What's almost certainly going on is that the compiler is not correctly mapping some part of your data structure to the device, so the kernel is dereferencing host memory from the device. If you happen to have enabled the "unified memory evaluation package" during compiler installation, then there's a really easy solution. You can use `-ta=tesla:managed` to put all of your pointers into managed memory, which is visible from the host and GPU. You should look at the documentation to see the restrictions that come along with this, but it should at least get you moving again. – jefflarkin Aug 24 '15 at 21:19

2 Answers


The error is coming from the compute kernel on the GPU dereferencing a CPU pointer. This is a pretty common problem and something that the OpenACC committee is working on solving. Dynamic data structures like these can really cause a lot of problems, so we want to fix it. Here are two possible workarounds for you.

1) Use "managed memory" via the PGI "unified memory evaluation package" option during compiler installation. This is a beta feature, but it will put all of your data into a special type of memory that is visible to both the CPU and GPU. There's a lot of caveats that you should read about in the documentation, most namely that you're limited to the amount of memory available on the GPU and that you cannot access the memory from the CPU while it's being used on the GPU, but it's one possible workaround. Assuming you enabled that option during installation, just add -ta=tesla:managed to your compiler flags to turn it on. I tried this with your code and it worked.

2) Add a local pointer to your code so that you're not accessing distance through p but are accessing it directly, like so:

double *distance = p->distance;
#pragma acc data copy(p[0:1],distance[0:NX*NY*NZ])
  for(int level = startLevel; level <= endLevel; level++){
    int ks = max(1, level-(y + z));
    int ke = min(x, level-2);

    int js = max(1, level-(x + z));
    int je = min(y, level-2);

    #pragma acc parallel
    {
      #pragma acc loop independent
      for(int k = ks; k <= ke; k++){
        #pragma acc loop independent
        for(int j = js; j <= je; j++){
          int i = level - (k + j);
          if(i > 0 && i <= z){
            int index = i * NX * NY + j * NX + k;
            distance[index] = solve(p, index);
          }
        }
      }
    }
  }

I know this can be a pain when there are a lot of data arrays to do this to, but it's a workaround that I've used successfully in a lot of codes.
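For illustration, here is a minimal sketch of that pattern with more than one member array. The second member (speed) and the size N are invented for the example and are not part of the question's struct:

/* Sketch only: "speed" and N are hypothetical, added to show the pattern. */
double *distance = p->distance;   /* hoist each dynamic member into a local pointer */
double *speed    = p->speed;

#pragma acc data copy(p[0:1], distance[0:N], speed[0:N])
{
  #pragma acc parallel loop
  for (int idx = 0; idx < N; idx++) {
    /* access the arrays through the local pointers, not through p->... */
    distance[idx] = solve(p, idx) * speed[idx];
  }
}

The important part is that every dereference inside the compute region goes through a pointer the compiler has explicitly mapped, rather than through the struct member.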

I hope this helps! If I can come up with a solution that doesn't require the extra pointer, I'll update this answer.

jefflarkin
  • Thank you @jefflarkin, this helps me a lot in understanding OpenACC. I tried both of your workarounds and they work and give correct results. I also noticed that using managed memory was faster in performance than using the extra pointer in my original code. So, to follow up on this, how portable would the code be when relying on unified managed memory? Will this only work on NVIDIA accelerators that support UVA or unified memory? – anupshrestha Aug 25 '15 at 16:48
  • Relying on managed memory will only work on NVIDIA GPUs. Additionally, unless other compilers have similar flags, it'll only work with PGI as well. – jefflarkin Aug 25 '15 at 16:50
  • That's what I thought, I will keep that in mind. Thanks for all your help. – anupshrestha Aug 25 '15 at 19:02
  • @jefflarkin thank you very much for the fix; is this problem documented somewhere? I tried to use the C++ features as described in http://www.pgroup.com/lit/articles/insider/v6n2a1.htm which is a very nice and clean approach. But in which cases does it need the extra pointer workaround? – dwn Jul 14 '16 at 14:11

Jeff is correct that the OpenACC committee is still working on how to standardize support for aggregate data types with dynamic data members. However, with PGI version 14.9 or later we have added better support for structs as well as C++ classes, so in this case you can simplify the code by just adding create(p[0:1]). What will happen is that the compiler will create a device copy of p with memory allocated for just the data members. Then, when you do the copy of p->distance, memory will be allocated for "distance" and attached to p (i.e., the runtime will fill in the device pointer in the struct).

There are caveats. First, this behavior has not been standardized, so other compilers such as Cray, PathScale, and GCC may behave differently. Second, order matters: p needs to be created before distance can be attached. Third, more complex data structures become very difficult to manage. As Jeff suggests, using CUDA Unified Memory is a good alternative for managing complex data structures.
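To make the ordering rule concrete, here is a sketch (not from the answer) of the same deep-copy pattern written with OpenACC's unstructured data directives; the automatic attachment of the member pointer still relies on the compiler support described above:

/* Sketch: create the parent struct first, then the member array; reverse the order on exit. */
#pragma acc enter data create(p[0:1])                    /* 1. create p on the device       */
#pragma acc enter data copyin(p->distance[0:NX*NY*NZ])   /* 2. copy the member, attach to p */

/* ... compute regions that use p->distance ... */

#pragma acc exit data copyout(p->distance[0:NX*NY*NZ])   /* 3. bring the member back        */
#pragma acc exit data delete(p[0:1])                     /* 4. then delete the parent       */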

If you're interested, much of my GTC2015 presentation discusses this topic (link). The focus of the talk is on C++ class data management, but it is applicable to C structs as well.

Hope this helps, Mat

% cat test1.c
#include <stdio.h>
#include <stdlib.h>

#define min(a,b) (a > b) ? b : a
#define max(a,b) (a < b) ? b : a

#define NX 10
#define NY 10
#define NZ 10

struct phiType {
  double dx, dy, dz;
  double * distance;
};

typedef struct phiType Phi;

#pragma acc routine seq
double solve(Phi *p, int index) {
  // for simplicity just returning a value
  return 2;
}

void fast_sweep(Phi *p) {

  // removing boundaries
  int x = NX - 2;
  int y = NY - 2;
  int z = NZ - 2;

  int startLevel = 3;
  int endLevel   = x + y + z;

  #pragma acc data create(p[0:1]) copy(p->distance[0:NX*NY*NZ])
  for(int level = startLevel; level <= endLevel; level++){
    int ks = max(1, level-(y + z));
    int ke = min(x, level-2);

    int js = max(1, level-(x + z));
    int je = min(y, level-2);

    #pragma acc region
    {
      #pragma acc loop independent
      for(int k = ks; k <= ke; k++){
        #pragma acc loop independent
        for(int j = js; j <= je; j++){
          int i = level - (k + j);
          if(i > 0 && i <= z){
            int index = i * NX * NY + j * NX + k;
            p->distance[index] = solve(p, index);
          }
        }
      }
    }
  }
}


void create_phi(Phi *p){

  p->dx = 1;
  p->dy = 1;
  p->dz = 1;

  p->distance = (double *) malloc(sizeof(double) * NX * NY * NZ);
  for(int i = 0; i < NZ; i++){
    for(int j = 0; j < NY; j++){
      for(int k = 0; k < NX; k++){
        int index = i * NX * NY + j * NX + k;
        p->distance[index] = (i*j*k == 0) ? 0 : 1;
      }
    }
  }

}


int main()
{
  printf("start \n");

  Phi *p = (Phi *) malloc(sizeof(Phi));
  create_phi(p);

  printf("calling fast sweep \n");
  fast_sweep(p);

  printf(" print the results \n");
  for(int i = 0; i < NZ; i++){
    for(int j = 0; j < NY; j++){
      for(int k = 0; k < NX; k++){
        int index = i * NX * NY + j * NX + k;
        printf("%f ", p->distance[index]);
      }
      printf("\n");
    }
    printf("\n");
  }

  return 0;
}

% pgcc -acc -ta=tesla:cc35 -Minfo=accel test1.c -V15.7 ; a.out
solve:
     19, Generating acc routine seq
fast_sweep:
     34, Generating create(p[:1])
         Generating copy(p->distance[:1000])
     45, Loop is parallelizable
     47, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         45, #pragma acc loop gang /* blockIdx.y */
         47, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
start
calling fast sweep
 print the results
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
Mat Colgrove
  • Thanks @Mat for your explanation. Your solution works in this case but when I tried to implement it into the original code although it ran I couldn't get correct results. – anupshrestha Aug 25 '15 at 16:51
  • Hi Anup, the incorrect answers could be occurring for other reasons unrelated to the data structure. Do you have a reproducing test case or more information? – Mat Colgrove Aug 25 '15 at 17:23
  • Mat I don't have another test case for that. I would have to share the whole source code that generates .vti files for which I use paraview to check the results. Jeff's answer does work for me but I will try to investigate why the results are different when I use this. Thanks for the help. – anupshrestha Aug 25 '15 at 19:01
  • No problem. I'm glad Jeff's solution works for you. – Mat Colgrove Aug 25 '15 at 21:03
  • @MatColgrove thank you for the pointer to the GTC video. Unfortunately the tar archive link is not working any more: http://www.pgroup.com/lit/samples/gtc15_S5233.tar. Could you put it online again, please? – dwn Jul 14 '16 at 15:15
  • @dawn You have a typo in the link. It's a lower case "s" in "s5233". www.pgroup.com/lit/samples/gtc15_s5233.tar – Mat Colgrove Jul 15 '16 at 17:04
  • Our web master just added a symlink so the upper case version will work as well. – Mat Colgrove Jul 15 '16 at 17:07
  • @MatColgrove thank you, i took the link directly from the gtc slides. – dwn Jul 16 '16 at 05:57
  • @MatColgrove: I don't see `#pragma acc region` in the OpenACC 2.6 standard, but I see it here. What is it? – Richard Dec 28 '17 at 06:09
  • "region" is from the PGI Accelerator model which was one of the three models that was combined to create OpenACC. "region" got turned into OpenACC's "kernels" directive. – Mat Colgrove Dec 28 '17 at 15:44