CUDA: curand_uniform Global Memory Alignment

Question

While profiling a project, I noticed that calls to curand_uniform show poor global memory access patterns. For example, the random number generator states are initialized with a kernel as follows:

__device__ curandState randGPU_d_state[200000];

__global__ void
initCurand(const unsigned long seed)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 200000)
    curand_init(seed, i, 0, &randGPU_d_state[i]);
}

The states are then accessed in a subsequent kernel via something like the following, where threadIdx.x < 200000:

float temp = curand_uniform(&randGPU_d_state[threadIdx.x]);

When profiling 'Global Memory Access Pattern', the NVIDIA Visual Profiler flags this line with 'Global Load L2 Transactions/Access = 31.8, Ideal Transactions/Access = 8 [ 12000 L2 transactions for 377 total executions ]'.

In fact I get 7 such warnings for the exact same line.

In addition, if I use curand_normal instead, the NVIDIA Visual Profiler also warns about lines 310, 312, 313, 315 and 316 of curand_normal.h, with similarly bad ratios against an Ideal Transactions/Access of 4 or 8.

I believe I am accessing the state array in a coalesced manner (although I do not know the details of the memory layout inside each state), so why are these bad ratios present?

Could you kind of add a more explicit question somewhere in there? — , Dec 14 '15 at 10:23

score 3 · Accepted Answer · answered Dec 14 '15 at 14:45

Your assumption about coalesced memory access is incorrect. If you run something like this:

#include <stdio.h> 
#include <stdlib.h> 
#include <cuda.h> 
#include <curand_kernel.h> 

__device__ curandState randGPU_d_state[200000];

__global__ void
initCurand()
{
    // Cast to int so the format specifier matches on every platform.
    printf("%d\n", (int)sizeof(randGPU_d_state[0]));
}

int main()
{
    initCurand<<<1,1>>>();
    cudaDeviceReset();
    return 0;
}

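you will see that it prints 48 for sizeof(curandState). Consecutive threads therefore read state words that sit 48 bytes apart, so a warp's load is spread over far more memory segments than a packed access would need. There is no way I am aware of to have fully coalesced access to an array of elements that large.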

talonmies


I see, so it is inevitable? I can't get rid of this when using the curand libraries? – dogAwakeCat Dec 15 '15 at 05:40
@James: well yes, it is inevitable if you want to store and access a huge number of random generator states like that in global memory. But it isn't obvious *why* you would ever need to do that. I can't see a use case where that would be necessary – talonmies Dec 15 '15 at 09:12
Perhaps I am going about it the wrong way then. I believe both [this](http://stackoverflow.com/questions/22425283/how-could-we-generate-random-numbers-in-cuda-c-with-different-seed-on-each-run) and [this](http://stackoverflow.com/questions/15247522/cuda-random-number-generation) use a similar approach although instead of declaring `randGPU_d_state[200000]` they declare as a pointer and then `cudaMalloc((void**) &randGPU_d_state_Synapses, 2000 * sizeof(curandState));` – dogAwakeCat Dec 16 '15 at 04:03
