I am porting a code to the GPU. One of its kernels uses a private array, meaning the array is declared inside the kernel loop.
When I port the code to OpenACC I get buggy results. It looks to me as if the array is shared between GPU vector threads, and this causes race conditions.
I organized the following example with external calls as well, because that is how my original code is structured.
header.h:
#define N 100000
#define K 16
#pragma acc routine
void assign_i_to_privj(int *priv, int j, int i);
#pragma acc routine
void add_privi_to_sum(int *priv, int j, int *sum);
main.c:
#include <stdio.h>
#include "header.h"

int main(void){
    int A[N];
    #pragma acc data copy(A)
    {
        #pragma acc parallel loop
        for(int i=0; i<N; i++){
            int priv[K];
            int sum = 0;
            int j = 0;
            while(1){
                if(j >= K) break;
                assign_i_to_privj(priv, j, i);
                j++;
            }
            j = 0;
            while(1){
                if(j >= K) break;
                add_privi_to_sum(priv, j, &sum);
                j++;
            }
            sum /= K; // now sum == i
            A[i] = sum;
        }
    }
    // now A[i] == i
    for(int i=0; i<123; i++) printf("A[%d]=%d ", i, A[i]);
    printf("\n");
    return 0;
}
f.c:
#include "header.h"

void assign_i_to_privj(int *priv, int j, int i){
    priv[j] = i;
}

void add_privi_to_sum(int *priv, int j, int *sum){
    (*sum) += priv[j];
}
The compiler version, as reported by cc -v, is PGI 17.5.0 (export PGI=/opt/pgi/17.5.0).
To compile:
cc -g -lnvToolsExt -O2 -acc -ta=tesla:cc60 -c11 -mp -Minfo -Mlarge_arrays -c main.c &&
cc -g -lnvToolsExt -O2 -acc -ta=tesla:cc60 -c11 -mp -Minfo -Mlarge_arrays -c f.c &&
cc -g -lnvToolsExt -O2 -acc -ta=tesla:cc60 -c11 -mp -Minfo -Mlarge_arrays f.o main.o -o acc.exe &&
srun -n 1 acc.exe
The code should set every element A[i] equal to i. When I run it with OpenACC support, I get completely wrong results; my guess is a race condition.
The version compiled without OpenACC builds and runs correctly: at the end of the run, A[i] == i.
So, my question is: how can I make a small array private to each GPU thread with OpenACC?