I'm trying to parallelize iterative version of Karatsuba algorithm using OpenACC in C++. I would like to ask how can I vectorize inner for loop
. My compiler shows my this message about that loop:
526, Complex loop carried dependence of result-> prevents parallelization
Loop carried dependence of result-> prevents parallelization
Loop carried backward dependence of result-> prevents vectorization
and here the code of two nested loops:
#pragma acc kernels num_gangs(1024) num_workers(32) copy (result[0:2*size-1]) copyin(A[0:size],$
{
#pragma acc loop gang
for (TYPE position = 1; position < 2 * (size - 1); position++) {
// for even coefficient add Di/2
if (position % 2 == 0)
result[position] += D[position / 2];
TYPE start = (position >= size) ? (position % size ) + 1 : 0;
TYPE end = (position + 1) / 2;
// inner loop: sum (Dst) - sum (Ds + Dt) where s+t=i
#pragma acc loop worker
for(TYPE inner = start; inner < end; inner++){
result[position] += (A[inner] + A[position - inner]) * (B[inner] + B[position - inn$
result[position] -= (D[inner] + D[position - inner]);
}
}
}
Actually, I'm not sure if it is possible to vectorize it. But if It is, I can't realize what I'm doing wrong. Thank you