Iterative Karatsuba algorithm parallelized and vectorized using OpenACC in C++

Question

I'm trying to parallelize iterative version of Karatsuba algorithm using OpenACC in C++. I would like to ask how can I vectorize inner for loop. My compiler shows my this message about that loop:

526, Complex loop carried dependence of result-> prevents parallelization
     Loop carried dependence of result-> prevents parallelization
     Loop carried backward dependence of result-> prevents vectorization

and here the code of two nested loops:

#pragma acc kernels num_gangs(1024) num_workers(32) copy (result[0:2*size-1]) copyin(A[0:size],$
{
    #pragma acc loop gang 
    for (TYPE position = 1; position < 2 * (size - 1); position++) {
        // for even coefficient add Di/2
        if (position % 2 == 0)
            result[position] += D[position / 2];

        TYPE start = (position >= size) ? (position % size ) + 1  : 0;
        TYPE end = (position + 1) / 2;

        // inner loop: sum (Dst) - sum (Ds + Dt) where s+t=i
        #pragma acc loop worker 
        for(TYPE inner = start; inner < end; inner++){
            result[position] += (A[inner] + A[position - inner]) * (B[inner] + B[position - inn$
            result[position] -= (D[inner] + D[position - inner]);
        }
    }
}

Actually, I'm not sure if it is possible to vectorize it. But if It is, I can't realize what I'm doing wrong. Thank you

score 1 · Answer 1 · answered Apr 16 '18 at 19:05

The "Complex loop carried dependence of result" problem is due to pointer aliasing. The compiler can't tell if the object that "result" points to overlaps with one of the other pointer's objects.

As a C++ extension, you can add the C99 "restrict" keyword to the declaration of your arrays. This will assert to the compiler that pointers don't alias.

Alternatively, you can add the OpenACC "independent" clause on your loop directives to tell the compiler that the loops do not have any dependencies.

Note that OpenACC does not support array reductions, so you wont be able to parallelize the "inner" loop unless you modify the code to use a scalar. Something like:

rtmp = result[position];
#pragma acc loop vector reduction(+:rtmp) 
    for(TYPE inner = start; inner < end; inner++){
        rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inn$
        rtmp -= (D[inner] + D[position - inner]);
    }
result[position] = rtmp;

Iterative Karatsuba algorithm parallelized and vectorized using OpenACC in C++

1 Answers1