While profiling a project, I noticed that calls to curand_uniform appear to have problems with global memory access. For example, take a random number generator whose states are initialised with a kernel like this:
#include <curand_kernel.h>

__device__ curandState randGPU_d_state[200000];

// One curandState per thread; thread i initialises sequence i.
__global__ void initCurand(const unsigned long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 200000)
        curand_init(seed, i, 0, &randGPU_d_state[i]);
}
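For completeness, a host-side launch for this kernel might look like the sketch below. The block size of 256 and the names N_STATES, blocksFor and launchInitCurand are assumptions made for illustration, not taken from my project; the kernel is repeated only so that the snippet compiles on its own.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define N_STATES 200000  // same size as the state array in the question

__device__ curandState randGPU_d_state[N_STATES];

__global__ void initCurand(const unsigned long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N_STATES)
        curand_init(seed, i, 0, &randGPU_d_state[i]);
}

// Hypothetical helper: number of blocks needed to cover n threads
// (ceiling division, so the last block may be partially idle).
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: launch the init kernel and report any error.
static cudaError_t launchInitCurand(unsigned long seed)
{
    const int threads = 256;  // assumed block size
    initCurand<<<blocksFor(N_STATES, threads), threads>>>(seed);
    cudaError_t err = cudaGetLastError();  // launch-time errors
    if (err != cudaSuccess)
        return err;
    return cudaDeviceSynchronize();        // execution-time errors
}
```

With these assumed values the grid has 782 blocks, and the bounds check in the kernel keeps the surplus threads in the last block from touching the array.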
Accessing the state later in a subsequent kernel with something like the following, where threadIdx.x < 200000:
float temp = curand_uniform(&randGPU_d_state[threadIdx.x]);
leads the NVIDIA Visual Profiler to flag this line under 'Global Memory Access Pattern' with the message 'Global Load L2 Transactions/Access = 31.8, Ideal Transactions/Access = 8 [ 12000 L2 transactions for 377 total executions ]'.
In fact I get 7 such warnings for the exact same line.
In addition, if I use curand_normal instead, the NVIDIA Visual Profiler also warns of problems on lines 310, 312, 313, 315 and 316 of curand_normal.h, with similarly bad ratios against an Ideal Transactions/Access of 4 or 8.
I believe I am accessing the state array in a coalesced way (I do not know the details of the memory layout inside each state, but consecutive threads do access consecutive state elements). Why, then, are these bad ratios present?
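To make the question concrete, here is a minimal sketch of the access pattern as I understand it. It assumes the default generator behind curandState; the names N_STATES, strideBytes and drawUniform and the output array out are hypothetical. Because the array holds whole curandState structs, the states used by neighbouring threads are sizeof(curandState) bytes apart rather than one 4-byte word apart, which may be what the profiler is counting. The kernel also copies the state into a local variable before drawing and writes it back afterwards, which is the pattern shown in the cuRAND device API examples.

```cuda
#include <cstddef>
#include <curand_kernel.h>

#define N_STATES 200000  // same size as the state array in the question

__device__ curandState randGPU_d_state[N_STATES];

// Byte distance between the states read by two neighbouring threads.
// This is the size of the whole struct, not the size of a single word.
static size_t strideBytes()
{
    return sizeof(curandState);
}

// Hypothetical kernel: each thread draws one uniform float.
// n is assumed to be at most N_STATES.
__global__ void drawUniform(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState local = randGPU_d_state[i];  // one read of the state
        out[i] = curand_uniform(&local);         // generator works on the local copy
        randGPU_d_state[i] = local;              // one write-back of the updated state
    }
}
```

Printing strideBytes() on the host shows how far apart the per-thread states actually are on a given toolkit version, which is the number to compare against the ideal figure in the profiler output.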