
I am new to CUDA and am trying to use it to carry out the Sieve of Eratosthenes. The code works for primes below 1000000. Above that I get an unknown kernel launch error. Now I understand this is because I am trying to launch a grid with too many blocks. However, if I set the blocks to 1000 I do not get all the prime numbers. I think there may be an issue with the indexing in the kernel, but I am not sure.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <fstream>
#include <stdio.h>

using namespace std;

__global__ static void Sieve(long * sieve, long sieve_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx > 1) {
        for (int i = idx + idx; i < sieve_size; i += idx) {
            sieve[i] = 1;
        }
    }
}

int main()
{
    long *device_sieve;
    long *host_sieve = new long[4000000];
    ofstream data("data.csv", ofstream::out);

    double sieve_size = 4000000 / sizeof(long);

    cudaSetDevice(0);
    cudaDeviceSynchronize();
    cudaThreadSynchronize();
    cudaMalloc((void**)&device_sieve, sizeof(long) * sieve_size);

    cudaError_t error1 = cudaGetLastError();
    cout << "1" << cudaGetErrorString(error1) << endl;

    int block = sqrt(sieve_size);

    Sieve << <1, block >> >(device_sieve, sieve_size);
    cudaThreadSynchronize();
    cudaMemcpy(host_sieve, device_sieve, sizeof(long) * sieve_size, cudaMemcpyDeviceToHost);

    cudaError_t error = cudaGetLastError();
    cout << "2" << cudaGetErrorString(error) << endl;
    cudaFree(device_sieve);

    for (int i = 2; i < sieve_size; ++i)
        if (host_sieve[i] == 0)
            data << i << endl;

    getchar();
    cout << "DONE" << endl;
    return 0;
}
brad119

1 Answer

I see a couple of issues with your code. First of all, the line double sieve_size = 4000000 / sizeof(long); doesn't make sense, since it leads to allocating an insufficient amount of memory on the GPU.

Take the example of having 8 numbers, each 8 bytes long (a common size for long, though it is platform-dependent), so putting them into an array takes up 64 bytes of memory. That would mean your double sieve_size = 8 / sizeof(long) would hold the value 1. You then allocate sieve_size * sizeof(long) bytes of memory on the GPU, which in this case means you're allocating 8 bytes, while you need 64 as in the initial array. So drop the division.

Another thing is that I don't think it is a good idea to use a double to represent an amount of memory. Even though it might not generate a compilation error, it can lead to rounding and precision issues, where you allocate a couple of bytes too little and thus invoke a segmentation fault (if you need to represent discrete quantities such as the size of a block of memory or the length of an array, integer types are the logical choice). So I would change the line to:

long sieve_size = 4000000;
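
The corresponding allocation then becomes (same variable names as in your code; the cudaMemset line is my addition, on the assumption that you want every entry to start out as 0 before the kernel marks composites with 1):

long *device_sieve = nullptr;

// allocate one long per sieve entry, i.e. sieve_size * sizeof(long) bytes in total
cudaMalloc((void**)&device_sieve, sizeof(long) * sieve_size);

// assumption: clear the buffer so untouched entries read as 0 ("prime")
cudaMemset(device_sieve, 0, sizeof(long) * sieve_size);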

Now another thing is that there are limits on the size of a block that you can launch. The limit varies by architecture, and you can query the specifics of your device with cudaGetDeviceProperties. As far as I know, on most modern cards the limit is 1024 threads per block, so you can launch a block of size 32x32x1 or 1024x1x1, etc. If you need to process more data, you will have to use a bigger grid than just 1x1x1.
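
As a rough sketch (the square-root limit, the rounded-up block count, and the names prop, threads, limit and blocks are my additions, not something your code already does), querying the limit and launching a grid with enough blocks could look like this:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);             // properties of device 0
int threads = prop.maxThreadsPerBlock;         // typically 1024 on current GPUs

// factors up to sqrt(sieve_size) are enough to mark every composite
long limit = (long)sqrt((double)sieve_size) + 1;
int blocks = (int)((limit + threads - 1) / threads);   // round up

Sieve<<<blocks, threads>>>(device_sieve, sieve_size);
cudaDeviceSynchronize();                       // cudaThreadSynchronize is deprecated

With sieve_size = 4000000 that works out to roughly 2000 threads, i.e. two blocks of 1024, which is well within the limits.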

Hope this helps!

Addy