OpenACC bitonic sort is much slower on GPU than on CPU

Question

I have the following bit of code to sort double values on my GPU:

void bitonic_sort(double *data, int length) {
  #pragma acc data copy(data[0:length], length)
  {
    int i,j,k;
    for (k = 2; k <= length; k *= 2) {
      for (j=k >> 1; j > 0; j = j >> 1) {
        #pragma acc parallel loop gang worker vector independent
        for (i = 0; i < length; i++) {
          int ixj = i ^ j;
          if ((ixj) > i) {
            if ((i & k) == 0 && data[i] > data[ixj]) {
              double buffer = data[i];
              data[i] = data[ixj];
              data[ixj] = buffer;
            }
            if ((i & k) != 0 && data[i] < data[ixj]) {
              double buffer = data[i];
              data[i] = data[ixj];
              data[ixj] = buffer;
            }
          }
        }
      }
    }
  }
}

This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1. I can't figure out how to run the whole code on my GPU. So far, only the parallel loop is executed on the GPU, and execution switches between CPU and GPU on every iteration of the outer loops.

I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem for me now is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?

I've tried putting a kernels construct on top of the first loop but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense. The two outer loops need to be executed sequentially for the algorithm to work.

If you have any other suggestions on how I could improve performance, I would be grateful as well.
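
For reference, the constraint described above (the two outer loops run sequentially, only the innermost loop is parallel) can be sketched as follows. This is a minimal sketch, not a tested fix: the name `bitonic_sort_sketch` is hypothetical, the length is assumed to be a power of two, and the `present` clause is added only to state explicitly that the array already lives on the device inside the data region. Whether this changes the timing under GCC 6.1 has not been measured. Compiled without OpenACC support, the pragmas are ignored and the function runs serially.

```c
#include <assert.h>

/* Sketch: sorts data[0:length] in ascending order.
   Assumption: length is a power of two (required by this bitonic network). */
void bitonic_sort_sketch(double *data, int length) {
  assert(length > 0 && (length & (length - 1)) == 0);

  /* One data region around everything, so the array is copied to the
     device once and copied back once. */
  #pragma acc data copy(data[0:length])
  {
    /* The two outer loops stay on the host and run sequentially:
       each (k, j) step depends on the result of the previous one. */
    for (int k = 2; k <= length; k *= 2) {
      for (int j = k >> 1; j > 0; j >>= 1) {
        /* Only the compare-and-swap pass is offloaded. Each pair (i, i^j)
           is handled by the lower index alone, so iterations are independent. */
        #pragma acc parallel loop independent present(data[0:length])
        for (int i = 0; i < length; i++) {
          int ixj = i ^ j;
          if (ixj > i) {
            int ascending = (i & k) == 0;  /* sort direction of this block */
            double a = data[i];
            double b = data[ixj];
            if (ascending ? (a > b) : (a < b)) {
              data[i] = b;
              data[ixj] = a;
            }
          }
        }
      }
    }
  }
}
```

The two swap branches of the original loop are merged into a single comparison whose direction depends on `(i & k)`; the behavior is intended to be the same as in the code above.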

Try adding the compile flag "-fopenacc-dim=1024:1:128". GCC 6.1 defaults to using a single gang. This flag sets the default to 1024 gangs, 1 worker, and 128 vectors. — Mat Colgrove, Sep 08 '16 at 19:38
I get the following message: `lto1: warning: using vector_length (32), ignoring 128` I assume the other two values are ok though. — Christopher Thonfeld-Guckes, Sep 13 '16 at 07:58
All in all the proposed values slow the algorithm down though. The default values are running about 14 seconds for 1,000,000 elements, while the modified values take about 17 seconds. — Christopher Thonfeld-Guckes, Sep 13 '16 at 08:04
