I have the following bit of code to sort double values on my GPU:
void bitonic_sort(double *data, int length) {
#pragma acc data copy(data[0:length], length)
{
int i,j,k;
for (k = 2; k <= length; k *= 2) {
for (j=k >> 1; j > 0; j = j >> 1) {
#pragma acc parallel loop gang worker vector independent
for (i = 0; i < length; i++) {
int ixj = i ^ j;
if ((ixj) > i) {
if ((i & k) == 0 && data[i] > data[ixj]) {
_ValueType buffer = data[i];
data[i] = data[ixj];
data[ixj] = buffer;
}
if ((i & k) != 0 && data[i] < data[ixj]) {
_ValueType buffer = data[i];
data[i] = data[ixj];
data[ixj] = buffer;
}
}
}
}
}
}
}
This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1. I can't figure out, how to run the whole code on my GPU. So far, only the parallel loop is executed on the cpu and it switches between CPU and GPU for each one of the outer loops.
I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem for me now is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?
I've tried putting a kernels construct on top of the first loop but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense. The two outer loops need to be executed sequentially for the algorithm to work.
If you have any other suggestions on how I could improve performance, I would be grateful as well.