1

I'm trying to use the nested parallelism feature of OpenACC to activate dynamic parallelism on my GPU. I have a Tesla K40c, and my OpenACC compiler is PGI version 15.7.

My code is fairly simple. When I try to compile the following code, the compiler returns these messages:

    PGCC-S-0155-Illegal context for pragma: acc  parallel loop (test.cpp: 158)
    PGCC/x86 Linux 15.7-0: compilation completed with severe errors

My code structure:

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[i].start;
  int ee = A[i].end;

  #pragma acc parallel loop
  for(j = ss; j < (ee + ss); j++)
  {
    // << computation >>
  }
}

I've also tried changing my code to use the routine directive, but I still couldn't compile:

#pragma acc routine worker
void foo(...)
{

  #pragma acc parallel loop
  for(j = ss; j< ( ee + ss); j++)
  {
    // << computation >>
  }
}

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>

  int ss = A[i].start;
  int ee = A[i].end;

  foo(...);

}

Of course, I've also tried routine alone (seq, worker, gang) without the inner parallel loop directive. That compiles, but dynamic parallelism isn't activated:

    37, Generating acc routine worker
         Generating Tesla code
         42, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
         Loop is parallelizable

How am I supposed to use dynamic parallelism in OpenACC?

grypp

2 Answers

3

How am I supposed to use dynamic parallelism in OpenACC?

Although nested regions (which would presumably use dynamic parallelism) are a new feature in the OpenACC 2.0 specification, I don't believe it is implemented yet in PGI 15.7. PGI 15.7 represents a partial implementation of the OpenACC 2.0 specification.

This limitation is documented in the PGI 15.7 release notes that should ship with your PGI 15.7 compiler (pgirn157.pdf) in section 2.7 (those release notes are currently available here):

OpenACC 2.0 Missing Features

‣ The declare link directive for global data is not implemented.

‣ Nested parallelism (parallel and kernels constructs within a parallel or kernels region) is not implemented.

Based on the comments, there is some concern about #pragma acc routine worker, so here is a fully worked example with PGI 15.7 of that:

$ cat t1.c
#include <stdio.h>
#include <stdlib.h>
#define D1 4096
#define D2 4096
#define OFFS 2

#pragma acc routine worker
void my_set(int *d, int len, int val){
  int i;
  for (i = 0; i < len; i++) d[i] += val+OFFS;
}

int main(){


  int i,*data;
  data = (int *)malloc(D1*D2*sizeof(int));
  for (i = 0; i < D1*D2; i++) data[i] = 1;

#pragma acc kernels copy(data[0:D1*D2])
  for (i = 0; i < D1; i++)
    my_set(data+(i*D2), D2, 1);

  printf("%d\n", data[0]);

  return 0;
}
$ pgcc -acc -ta=tesla -Minfo=accel t1.c -o t1
my_set:
      8, Generating acc routine worker
         Generating Tesla code
         10, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
         Loop is parallelizable
main:
     20, Generating copy(data[:16777216])
     21, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         21, #pragma acc loop gang /* blockIdx.x */
$ ./t1
4
$

Note that the gang parallelism has been performed at the outer loop, and the worker parallelism has been performed in the inner (routine) loop.

This method does not depend on dynamic parallelism (instead, it relies on a partitioning of parallelism between worker at the routine level and gang at the caller level) and will not invoke dynamic parallelism.

Native use of dynamic parallelism (CDP) is not currently supported in PGI 15.7. It should be possible to call (i.e. interoperate with) other functions (e.g. CUDA, or libraries) that make use of CDP from OpenACC code, but PGI 15.7 does not natively generate or support it.
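For illustration, an interop approach might look like the following untested sketch. This is my own hypothetical example, not a PGI-documented recipe: the CDP work is done entirely in a separate CUDA translation unit (compiled with `nvcc -arch=sm_35 -rdc=true`), and `launch_parent` is a made-up wrapper name that OpenACC host code could call with a device pointer obtained via `host_data use_device`.

```cuda
// cdp_part.cu -- hypothetical CUDA side; compile with:
//   nvcc -arch=sm_35 -rdc=true -c cdp_part.cu
#include <cuda_runtime.h>

__global__ void child(int *d, int n, int val){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) d[i] += val;
}

__global__ void parent(int *d, int rows, int cols){
  int r = blockIdx.x*blockDim.x + threadIdx.x;
  if (r < rows)
    // device-side launch: this is the dynamic parallelism (CDP)
    child<<<(cols + 255)/256, 256>>>(d + r*cols, cols, 1);
}

extern "C" void launch_parent(int *d_data, int rows, int cols){
  parent<<<(rows + 63)/64, 64>>>(d_data, rows, cols);
  cudaDeviceSynchronize();
}
```

On the OpenACC side, the device pointer would be handed to that wrapper:

```c
/* OpenACC side (sketch): pass the device copy of data to the CUDA wrapper */
#pragma acc data copy(data[0:rows*cols])
{
  #pragma acc host_data use_device(data)
  launch_parent(data, rows, cols);
}
```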

Robert Crovella
  • Thank you for the quick answer. OK, the nested directives make sense now. But I still haven't understood why I can't compile with the routine directive. According to the document (section 4.6), it is a supported directive for NVIDIA GPUs – grypp Aug 12 '15 at 15:17
  • `routine` will work. Try `routine seq` and get that working first. If you want help with a specific issue, you are supposed to provide a specific failing example (i.e. an [MCVE](http://stackoverflow.com/help/mcve) ) which you have not. Provide a short, simple code that attempts to get `routine` working, and I will help you with that. Note that using the `workers` clause has some [specific requirements about the calling environment for that routine](http://104.239.134.127/sites/default/files/213462%2010_OpenACC_API_QRG_HiRes.pdf). In any event I would not expect `routine` to invoke CDP. – Robert Crovella Aug 12 '15 at 15:27
  • I extended the question. I mean, yes, routine may work. But my issue is that I'd like to activate dynamic parallelism in some way while using PGI OpenACC. – grypp Aug 12 '15 at 15:47
  • 1
    It's not supported, currently, natively. – Robert Crovella Aug 12 '15 at 15:58
  • But how is it not supported? rdc support is mentioned in the documentation. I haven't found any simple real example of CUDA dynamic parallelism from PGI. – grypp Aug 12 '15 at 16:26
  • support for rdc (relocatable device code) doesn't imply support for cdp (cuda dynamic parallelism). I don't know what you mean by "how it's not supported". I don't know how I can be any clearer than what I've already covered in my answer. It's not currently supported. – Robert Crovella Aug 12 '15 at 16:53
  • Uh totally my mistake. I don't know why, but I always associate rdc and cdp. Yes, that's right. Thank you very much Robert Crovella. Your comments were really really helpful and useful for me. – grypp Aug 13 '15 at 11:42
0

Try replacing the inner "#pragma acc parallel loop" with "#pragma acc loop".

JimBamFeng