
If I have

#pragma acc parallel loop gang num_gangs(4) \
 num_workers(5) vector_length(6) private(arrayB)
 {
  for(j=0; j<len; j++)
  {
   ...
  }
 }

region, I assume each of the 4 gangs will have a separate copy of arrayB (for this example, you can assume that arrayB is an integer array with 5 elements).

  1. Am I right in assuming that, in the above case, each of the 4 gangs has a private copy of arrayB, and that the 5 workers in a gang (and similarly the vector lanes) see that single private copy as shared among them? Also, would
#pragma acc parallel loop num_gangs(4) \
 num_workers(5) vector_length(6)
 {
  for(j=0; j<len; j++)
  {
   ...
  }
 }  

be the same as the one above in terms of private copies of arrayB?

  2. Now assume,
#pragma acc parallel loop gang worker \
 num_gangs(4) num_workers(5) \
 vector_length(6) private(arrayB)
 {
  for(j=0; j<len; j++)
  {
   ...
  }
 }

Then, who has a private copy of arrayB, and who shares a single private copy of arrayB? How many private copies of arrayB are there in total?

  3. Now assume,
#pragma acc parallel loop gang vector \
 num_gangs(4) num_workers(5) \
 vector_length(6) private(arrayB)
 {
  for(j=0; j<len; j++)
  {
   ...
  }
 }

Then, who has a private copy of arrayB, and who shares a single private copy of arrayB? How many private copies of arrayB are there in total?

Also, please let me know if I am missing any other possible combinations.

E_net4
mr02

1 Answer


The "private" clause applies to the lowest schedule level (gang, worker, vector) used on the loop to which it is applied.

So a "loop gang private(arr.." will have a private array for each gang that is shared among the workers and vectors within that gang.

A "loop gang worker private(arr.." will have a private array for each worker that is shared among the vectors within that worker.

A "loop gang worker vector private(arr.." will have a private array for each vector that is not shared.

For case #1, the number of private arrays created will depend on the loop schedule applied by the compiler. If you're using the PGI compiler, look at the compiler feedback messages (-Minfo=accel) to see how the loop was scheduled. If this was a typo and you meant to include a "gang" here, then the number of private arrays would equal the number of gangs.
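As a minimal sketch of the explicit "gang" variant of case #1, the function below uses arrayB as a per-gang scratch buffer. The function name `fill_with_gang_private`, the `out` array, the `copyout` clause, and the loop body are all assumptions made for illustration; the question elides the real loop body. The inner loops are marked `seq` so that the only parallel level on the outer loop is "gang".

```c
#define B_LEN 5  /* arrayB has 5 integer elements, as in the question */

/* Hypothetical example: each iteration fills the scratch array and
 * stores the sum of its elements in out[j]. */
void fill_with_gang_private(int *out, int len)
{
    int arrayB[B_LEN];
    int j;

    /* One private copy of arrayB per gang; the copy is reused by every
     * iteration that the gang executes. */
    #pragma acc parallel loop gang num_gangs(4) private(arrayB) copyout(out[0:len])
    for (j = 0; j < len; j++) {
        int sum = 0;

        /* Keep the inner loops sequential so no worker or vector
         * parallelism is applied inside the gang loop. */
        #pragma acc loop seq
        for (int k = 0; k < B_LEN; k++) {
            arrayB[k] = j + k;
        }

        #pragma acc loop seq
        for (int k = 0; k < B_LEN; k++) {
            sum += arrayB[k];
        }

        out[j] = sum;
    }
}
```

Compiling this with something like `pgcc -acc -Minfo=accel` should report the schedule that was actually applied to the outer loop. Without OpenACC enabled, the pragmas are ignored and the function runs serially with the same result.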

For #2, you have a "gang worker" schedule so the number of private arrays would be the product of the number of gangs and number of workers.

For #3, you have a "gang worker vector" schedule so the number of private arrays would be the product of the number of gangs, number of workers, and vector length.
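The counting rule above can be written out as a small helper. This is only a sketch of the arithmetic for the three schedules named in this answer ("gang", "gang worker", "gang worker vector"); the enum and the function `private_copies` are hypothetical names, and the real count still depends on the schedule the compiler actually applies.

```c
#include <assert.h>

/* Lowest schedule level used on the loop that carries private(...). */
enum lowest_level { LOWEST_GANG, LOWEST_WORKER, LOWEST_VECTOR };

/* Hypothetical helper: number of private copies implied by the rule
 * "one copy per unit of the lowest schedule level". */
long private_copies(enum lowest_level lowest,
                    long num_gangs, long num_workers, long vector_length)
{
    assert(num_gangs > 0 && num_workers > 0 && vector_length > 0);

    long copies = num_gangs;              /* "gang": one copy per gang */
    if (lowest >= LOWEST_WORKER) {
        copies *= num_workers;            /* "gang worker": one per worker */
    }
    if (lowest >= LOWEST_VECTOR) {
        copies *= vector_length;          /* "gang worker vector": one per vector lane */
    }
    return copies;
}
```

With the values from the question (4 gangs, 5 workers, vector length 6), this gives 4 copies for "gang", 20 for "gang worker", and 120 for "gang worker vector".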

Note that, in general, I don't recommend using "num_workers" or "vector_length" except for more advanced performance tuning, when the size of the inner loops is known to be smaller than the default size, or when adjusting for register usage. Otherwise, you're limiting the parallelism of the code.

I also use "num_gangs" only very infrequently. It only makes sense when you have a very large number (or size) of private arrays and limiting the number of gangs allows the private arrays to fit into the GPU's memory, or, on very rare occasions, when the number of gangs needs to be fixed for an algorithm (such as an RNG).

Mat Colgrove