CUDA kernel for determining primes slower than OpenMP code - how can I optimize it?

Question

To practice programming with CUDA in C++. I did an exercise which consists in displaying the prime numbers less than N. For each code I commented out the last display loop to compare only the calculation times.

The Makefile :

all: sum sum_cpu

nothing: 
    g++ -O3 -std=c++17 -o premier.exe premier.cpp -Wall
cpu:
    g++ -O3 -std=c++17 -o cpu_premier.exe premier.cpp -Wall -fopenmp  
gpu:
    nvcc  --compile --compiler-options -O3 -o gpu_premier.o gpu_premier.cu -gencode  arch=compute_50,code=sm_50
    nvcc --link --compiler-options -O3 -o gpu_premier.exe gpu_premier.o
clear: 
    rm *.exe *.o

Here is my code to parallelize with openMP which runs in 1,306s :

#include <math.h>
#include <iostream>

const int N = 2<<22;
bool * premiers;

bool est_premier(int nbr){

    if ( nbr==1 || nbr == 0) return false;
    else if (nbr == 2) return true;
    else if (nbr % 2 == 0) return false;
    else {
        for (int i=3;i<=sqrt(nbr);++i){
            if (nbr % i == 0){
                return false;
            }
        }
    }
    return true;
}


int main(){
    premiers = new bool[N+1];

    # pragma omp parallel for
    for (int i = 0;i<N;++i){
        premiers[i] = est_premier(i);
    }

    /*
    for (int i = 0;i<N;++i){
        if (premiers[i])
            std::cout<<i<<",";
       
    } std::cout<<std::endl;
    */


    delete[] premiers;
}

Here is the corresponding cuda code which runs in 1,613s:


#include <cuda.h>
#include <iostream>
const int N = 2<<22;
bool * premiers_cpu;
bool * premiers_gpu;



__device__
bool est_premier(int nbr){

    if ( nbr==1 || nbr == 0) return false;
    else if (nbr == 2) return true;
    else if (nbr % 2 == 0) return false;
    else {
        for (int i=3;i * i <= nbr ;++i){
            if (nbr % i == 0){
                return false;
            }
        }
    }
    return true;
}


__global__ void kernel_premier(bool * premiers, int size){
    int gtid =  blockIdx.x * blockDim.x  + threadIdx.x ;

    while(gtid < size){
        premiers[gtid] = est_premier(gtid);
        gtid += blockDim.x * gridDim.x;
    }


}




int main(){

    bool * premiers_cpu = new bool[N];
   

    dim3 block (256,1);
    dim3 grid (2048,1,1);

    cudaMalloc(( void **) &premiers_gpu, N * sizeof(bool));
    cudaMemcpy(premiers_gpu,premiers_cpu,N * sizeof(bool),cudaMemcpyHostToDevice);

    kernel_premier<<<grid,block>>>(premiers_gpu,N);

    cudaMemcpy(premiers_cpu,premiers_gpu,N * sizeof(bool),cudaMemcpyDeviceToHost);

   
    /*
    for (int i = 0;i<N;++i){
        if (premiers_cpu[i])
            std::cout<<i<<",";
       
    } std::cout<<std::endl;
    */

    delete[] premiers_cpu;
    cudaFree(premiers_gpu);
    


}

Intuitively, I thought that lowering the size of the grid and increasing the number of blocks per grid would make the program more efficient, but it's the opposite. Here my program in cuda is less efficient than my program with OpenMP how to explain it and how to fix it?

Note the GPU version recompute `i * i` for each iteration while `int(sqrt(n))` can be precomputed (like in the first code). Computing `sqrt(n)` is expensive but it should worth it for relatively large numbers. It can even be precomputed once for the whole block. Consider using `i+=2`. This will not only make things much faster, but also make the GPU implementation more efficient compared to the CPU one due to a lower *warp divergence* (which is an issue here). Testing only even numbers (except 2) should be even better on GPUs. — Jérôme Richard, Jan 29 '23 at 19:14
By the way, for a faster algorithm, consider the [sieve of Atkin](https://en.wikipedia.org/wiki/Sieve_of_Atkin). Alternatively, a naive sieve of Eratosthenes should be relatively good on GPUs (though it is less efficient in general). — Jérôme Richard, Jan 29 '23 at 19:17

score 3 · Accepted Answer · answered Jan 31 '23 at 18:10

Well, first, the trivial observation of: You shouldn't even compute this at run-time. Just set the result at compile-time, for a run-time which is nothing but printing.

Also, a second semi-trivial observation: Make sure you don't time the printing! Terminal I/O is quite slow. Maybe that's what's taking most of your time, with both alternatives?

Finally, and more to the point: Divergence of threads kills your performance. Half of the threads don't even go into the loop, another 1/6 only do a single iteration - while on average, a thread iterates Theta(log((N)) times.

But addressing this is quite untrivial. Your best "choice" would be switching your algorithm altogether (perhaps even on the CPU), and using something like Erathostenes' sieve: Start with factors, not potential primes, and mark the non-primes. That's not a trivial parallelization challenge either, actually.

Anyway, with the algorith, as-is, consider a more careful separation. Say... instead of returning early for even nbrs - don't iterate them to begin with: Don't check gtid for primality, but rather 2*gtid + 1; and halve the size of your grid. Of course, you can do the same for the CPU - but the GPU will benefit more.

More generally, consider setting your block size to a value for which you can rule out some non-primes to begin with and only check the suspects. For example, if the block size is 5 (not a good idea! just an illustration) then you can check primality for length-10 runs of numbers. How come? Because within the sequence 10k,10k+1,10k+2,10k+3,10k+4,10k+5,10k+6,10k+7,10k+8,10k+9 - the 1'st, 3'rd, 5'th, 7'th and 9th numbers are multiples of 2; and the 6'th (and 1'th) is a multiple of 5. So, half the threads, and some reduction in the divergence. Take this further (runs of 235 values, or 235*7) with a small constant lookup table for choosing the value to check, and you will see some improvement.

Caveat: This might still not be your worst enemy. Always use profiling (e.g. using nsight-compute) to see where you're spending most of your time and how.

CUDA kernel for determining primes slower than OpenMP code - how can I optimize it?

1 Answers1