I'm running a Monte Carlo code for particle simulation, written in CUDA. Basically, in each step I calculate the velocity of each particle and update its position. The velocity is directly proportional to the path length. For a given material, the path length has a certain distribution. I know the probability density function of this path length. I now try to sample random numbers according to this function via rejection method. I would describe my CUDA knowledge as limited. I understood, that it is preferable to create large chunks of random numbers at once instead of multiple small chunks. However, for the rejection method, I generate only two random numbers, check a certain condition and repeat this procedure in the case of failure. Therefore I generate my random numbers on the kernel.
Using the profiler / nvvp I noticed, that basically 50% of my time is spend during the rejection method.
Here is my question: Are there any ways to optimize the rejection methods?
I appreciate every answer.
CODE
Here is the rejection method.
__global__ void rejectSamplePathlength(float* P, curandState* globalState,
int numParticles, float sigma, int timestep,curandState state) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numParticles) {
bool success = false;
float p;
float rho1, rho2;
float a, b;
a = 0.0;
b = 10.0;
curand_init(i, 0, 0, &state);
while (!success) {
rho1 = curand_uniform(&globalState[i]);
rho2 = curand_uniform(&globalState[i]);
if (rho2 < pathlength(a, b, rho1, sigma)) {
p = a + rho1 * (b - a);
success = true;
}
}
P[i] = abs(p);
}
}
The pathlength function in the if statement computes a value y=f(x) on the kernel. I"m pretty sure, that curand_init is problematic in terms of time, but without this statement, each kernel would generate the same numbers?