Generating Random Numbers with CUDA via rejection method. Performance problems

Question

I'm running a Monte Carlo code for particle simulation, written in CUDA. In each step I calculate the velocity of each particle and update its position. The velocity is directly proportional to the path length. For a given material, the path length follows a known distribution: I know its probability density function. I am now trying to sample random numbers according to this function via the rejection method. I would describe my CUDA knowledge as limited. I understand that it is preferable to generate random numbers in one large chunk rather than in many small chunks. However, for the rejection method I generate only two random numbers, check a condition, and repeat the procedure if the check fails. Therefore I generate my random numbers inside the kernel.

Using the profiler (nvvp), I noticed that roughly 50% of my time is spent in the rejection method.

Here is my question: are there any ways to optimize the rejection method?

I appreciate every answer.

CODE

Here is the rejection method.

__global__ void rejectSamplePathlength(float* P, curandState* globalState,
        int numParticles, float sigma, int timestep, curandState state) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles) {
        bool success = false;
        float p;
        float rho1, rho2;
        float a, b;
        a = 0.0;
        b = 10.0;
        curand_init(i, 0, 0, &state);
        while (!success) {
            rho1 = curand_uniform(&globalState[i]);
            rho2 = curand_uniform(&globalState[i]);
            if (rho2 < pathlength(a, b, rho1, sigma)) {
                p = a + rho1 * (b - a);
                success = true;
            }
        }
        P[i] = abs(p);
    }
}

The pathlength function in the if statement computes a value y = f(x) inside the kernel. I'm pretty sure that curand_init is problematic in terms of time, but without this statement, wouldn't each kernel generate the same numbers?

These two posts should contain the answer you are looking for: [Random generator & CUDA](http://stackoverflow.com/questions/15297168/random-generator-cuda) and [Cuda Random Number Generation](http://stackoverflow.com/questions/15247522/cuda-random-number-generation/15252202#15252202). – Vitality Mar 10 '14 at 10:33

score 1 · Answer 1 · answered Mar 10 '14 at 11:36

Maybe you could create a pool of uniformly distributed random numbers in a previous kernel, then draw your uniforms from that pool, cycling over it. The pool should be large enough to avoid an infinite loop, though.
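A minimal sketch of the pool idea, under stated assumptions: the pool is filled with the cuRAND host API (`curandGenerateUniform`) rather than a hand-written kernel, and instead of cycling over the pool each particle gets a fixed budget of trials, with `-1.0f` written as a sentinel if none is accepted. That bounds the loop by construction. `pathlengthSketch`, `rejectFromPool`, `samplePathlengthsFromPool`, and `trialsPerParticle` are hypothetical names, and the exponential density is only a placeholder for the real `pathlength` function, which the question does not show.

```cuda
#include <cuda_runtime.h>
#include <curand.h>

// Hypothetical stand-in for the question's pathlength(); assumed to return a
// density value scaled into [0, 1] at x = a + rho * (b - a).
__device__ float pathlengthSketch(float a, float b, float rho, float sigma) {
    float x = a + rho * (b - a);
    return expf(-x / sigma);
}

// Consumes pre-generated uniforms instead of calling the generator in the loop.
// Trial t of particle i reads the pair at index 2 * (t * numParticles + i), so
// neighbouring threads read neighbouring pairs.
__global__ void rejectFromPool(float* P, const float* pool, int trialsPerParticle,
                               int numParticles, float sigma) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= numParticles) return;
    const float a = 0.0f, b = 10.0f;
    float p = -1.0f;  // sentinel: trial budget used up without an accepted sample
    for (int t = 0; t < trialsPerParticle; ++t) {
        size_t k = 2 * ((size_t)t * numParticles + i);
        float rho1 = pool[k];
        float rho2 = pool[k + 1];
        if (rho2 < pathlengthSketch(a, b, rho1, sigma)) {
            p = a + rho1 * (b - a);
            break;
        }
    }
    P[i] = p;
}

// Host helper (hypothetical name): fill the pool in one bulk call, then sample.
void samplePathlengthsFromPool(float* dP, int numParticles, float sigma,
                               int trialsPerParticle, unsigned long long seed) {
    size_t poolSize = 2 * (size_t)numParticles * trialsPerParticle;
    float* dPool = nullptr;
    cudaMalloc(&dPool, poolSize * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, dPool, poolSize);  // uniforms in (0, 1]

    int threads = 256;
    int blocks = (numParticles + threads - 1) / threads;
    rejectFromPool<<<blocks, threads>>>(dP, dPool, trialsPerParticle,
                                        numParticles, sigma);
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(dPool);
}
```

The trade-off is memory: the pool holds two floats per trial per particle, so the budget has to be chosen from the acceptance rate of the actual density, and particles that come back with the sentinel need a second pass.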

user2076694


The problem doesn't seem to be the selection of the realization, which the OP seems to have already solved, but the call of `curand_init` by each thread. – Vitality Mar 10 '14 at 12:53
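The point about `curand_init` can be sketched as follows, assuming the usual cuRAND device API pattern: the states are initialised once in a separate setup kernel, and the sampling kernel only loads, advances, and stores them. This is a sketch under assumptions, not the OP's final code. `setupStates`, `runSteps`, and `pathlengthSketch` are hypothetical names, the exponential density is a placeholder for the real `pathlength` function, and the unused `timestep` and `state` parameters are dropped.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical stand-in for the question's pathlength(); assumed to return a
// density value scaled into [0, 1] at x = a + rho * (b - a).
__device__ float pathlengthSketch(float a, float b, float rho, float sigma) {
    float x = a + rho * (b - a);
    return expf(-x / sigma);
}

// Run once before the time loop: same seed, a different sequence per thread,
// so threads do not produce the same numbers.
__global__ void setupStates(curandState* globalState, int numParticles,
                            unsigned long long seed) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles) curand_init(seed, i, 0, &globalState[i]);
}

// Sampling kernel without curand_init: it continues each thread's sequence.
__global__ void rejectSamplePathlength(float* P, curandState* globalState,
                                       int numParticles, float sigma) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= numParticles) return;
    const float a = 0.0f, b = 10.0f;
    curandState local = globalState[i];  // work on a local copy of the state
    float rho1, rho2;
    do {
        rho1 = curand_uniform(&local);
        rho2 = curand_uniform(&local);
    } while (rho2 >= pathlengthSketch(a, b, rho1, sigma));
    P[i] = a + rho1 * (b - a);
    globalState[i] = local;  // store back so the next step gets fresh numbers
}

// Host helper (hypothetical name): initialise once, reuse the states each step.
void runSteps(float* dP, int numParticles, float sigma, int numSteps) {
    curandState* dStates = nullptr;
    cudaMalloc(&dStates, numParticles * sizeof(curandState));
    int threads = 256;
    int blocks = (numParticles + threads - 1) / threads;
    setupStates<<<blocks, threads>>>(dStates, numParticles, 1234ULL);
    for (int step = 0; step < numSteps; ++step) {
        rejectSamplePathlength<<<blocks, threads>>>(dP, dStates,
                                                    numParticles, sigma);
        // ... velocity and position update kernels would go here ...
    }
    cudaDeviceSynchronize();
    cudaFree(dStates);
}
```

Giving every thread the same seed and its own sequence number makes the one-time setup more expensive than using distinct seeds with sequence 0, but that cost is paid once instead of in every step.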
