
I am using PyCUDA for CUDA programming, and I need to generate random numbers inside a kernel function. The CURAND library doesn't seem to work inside it (PyCUDA). Since there is a lot of work to be done on the GPU, generating the random numbers on the CPU and then transferring them to the GPU won't work; it would defeat the purpose of using the GPU.

Supplementary Questions:

  1. Is there a way to allocate memory on the GPU using 1 block and 1 thread?
  2. I am using more than one kernel. Do I need to use multiple SourceModule blocks?
Bhaskar Dhariyal
  • I don't understand this question. PyCUDA has an interface to curand and can directly fill device memory with random values. And the device side code *can* be used in kernels with a little effort. – talonmies Sep 12 '17 at 07:38
  • I know. The interface you are talking about is an alias for #include <curand.h> in CUDA. But the random numbers I need can only be generated if there is something corresponding to #include <curand_kernel.h>. I didn't get your second part, "And the device side code can be used in kernels with a little effort." Are you talking about the host? – Bhaskar Dhariyal Sep 12 '17 at 08:04
  • No, I am talking about the device side interface – talonmies Sep 12 '17 at 08:25
  • Can you show me how you are generating random numbers inside a kernel? – Bhaskar Dhariyal Sep 12 '17 at 10:13

2 Answers


Despite what you assert in your question, PyCUDA has pretty comprehensive support for cuRAND. The GPUArray module has a direct interface (pycuda.curandom) for filling device memory using the host side API (note that the random generators still run on the GPU in this case).
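
As a minimal sketch of that host-side route, assuming a working PyCUDA install and a CUDA-capable GPU: `pycuda.curandom.rand` returns a GPUArray of uniform random numbers that were generated on the device. The helper names `uniform_on_gpu` and `demo` are hypothetical, and the imports sit inside the function so that the definitions can be loaded on a machine without a GPU.

```python
def uniform_on_gpu(n):
    """Return a GPUArray of n float32 uniform samples generated on the GPU."""
    # Imported lazily: these need a CUDA device and create a context.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    from pycuda import curandom

    # The values are produced in device memory; nothing is generated on the host.
    return curandom.rand((n,), dtype=np.float32)


def demo():
    """Hypothetical usage; call this only on a machine with a CUDA GPU."""
    gdata = uniform_on_gpu(10 * 1024)
    host = gdata.get()  # copy back only if the host needs to inspect the values
    print(host.min(), host.max())
```

The returned array can be passed straight to another kernel as an argument, so no host-to-device transfer of random numbers is involved.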

It is also perfectly possible to use the device side API from cuRAND in PyCUDA kernel code. In this use case the trickiest part is allocating memory for the thread generator states. There are three choices: statically in code, dynamically using host-side memory allocation, and dynamically using device-side memory allocation. The following (very lightly tested) example illustrates the last of these, seeing as you asked about it in your question:

import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

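code = """
    #include <curand_kernel.h>

    // One generator state per thread; the count is substituted in from Python.
    const int nstates = %(NGENERATORS)s;
    __device__ curandState_t* states[nstates];

    // Run once: allocate each thread's state on the device heap and seed it.
    __global__ void initkernel(int seed)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curandState_t* s = new curandState_t;
            if (s != 0) {
                // Same seed, different sequence number per thread.
                curand_init(seed, tidx, 0, s);
            }

            states[tidx] = s;
        }
    }

    // Fill values[0..N) with uniform floats using a grid-stride loop.
    __global__ void randfillkernel(float *values, int N)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            // Work on a local copy of the state, then write it back.
            curandState_t s = *states[tidx];
            for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                values[i] = curand_uniform(&s);
            }
            *states[tidx] = s;
        }
    }
"""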

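N = 1024
# no_extern_c=True keeps C++ linkage, so the kernels are looked up by their mangled names.
# arch must match your GPU's compute capability; leave it out to let PyCUDA use the current device's.
mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True, arch="sm_52")
init_func = mod.get_function("_Z10initkerneli")
fill_func = mod.get_function("_Z14randfillkernelPfi")

seed = np.int32(123456789)
nvalues = 10 * N
# Run once to allocate and seed the per-thread generator states.
init_func(seed, block=(N,1,1), grid=(1,1,1))
gdata = gpuarray.zeros(nvalues, dtype=np.float32)
# Each of the N threads writes nvalues / N entries of gdata.
fill_func(gdata, np.int32(nvalues), block=(N,1,1), grid=(1,1,1))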

The initialization kernel needs to be run once to allocate memory for the generator states and initialize them with the seed; the second kernel then uses those states. If you want to run a lot of threads, you will need to be mindful of the device malloc heap size limit, but that can be changed via the PyCUDA driver API interface.
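
A rough sketch of how the heap limit might be raised, assuming PyCUDA's `Context.get_limit` / `Context.set_limit` and `limit.MALLOC_HEAP_SIZE` from `pycuda.driver`. The 48-byte state size and the slack factor are assumptions for illustration (check `sizeof(curandState_t)` on your toolkit), and the helper names `heap_bytes_needed` and `ensure_heap` are hypothetical.

```python
DEFAULT_HEAP_BYTES = 8 * 1024 * 1024  # CUDA's documented default device malloc heap (8 MB)
ASSUMED_STATE_BYTES = 48              # assumption: sizeof(curandState_t); verify on your toolkit


def heap_bytes_needed(n_states, state_bytes=ASSUMED_STATE_BYTES, slack=2.0):
    """Rough heap size for n_states device-side allocations, padded by `slack`
    to leave room for allocator bookkeeping (the factor is a guess)."""
    return int(n_states * state_bytes * slack)


def ensure_heap(n_states):
    """Raise the malloc heap limit if the estimate exceeds the current limit.
    Call this before launching any kernel that uses device-side new/malloc."""
    # Imported lazily: these need a CUDA device and create a context.
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as cuda

    needed = heap_bytes_needed(n_states)
    current = cuda.Context.get_limit(cuda.limit.MALLOC_HEAP_SIZE)
    if needed > current:
        cuda.Context.set_limit(cuda.limit.MALLOC_HEAP_SIZE, needed)
    return max(needed, current)
```

With these assumed numbers, the 1024 generators used above fit comfortably in the default heap, while a million generators would not.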

talonmies
  • I'm getting this error when running it: `LogicError: cuModuleLoadDataEx failed: device kernel image is invalid` – Bhaskar Dhariyal Sep 13 '17 at 07:06
  • @BhaskarDhariyal: Obviously you need to set the build architecture in the `SourceModule` instance to match your GPU model. – talonmies Sep 13 '17 at 08:25
  • The `values` array of random numbers is what I need for the variables `r1` and `r2`, which are part of the statement `velocity[i] = X*(velocity[i]+c1*r1*(pBestPos[i] - x[i]) + c2*r2*(lBestIdx[i%d] - x[i]))`. With the program above I can't access it directly, because that statement is in a different kernel. How do I access the `values` array from the statement's kernel? – Bhaskar Dhariyal Sep 15 '17 at 10:40
  • Perhaps you misunderstand how SO works. You are supposed to ask a single question. When that question is answered, move on. This particular question/answer is not your own personal chat room or help desk. Your question about random numbers was answered. You're now asking a completely different question. Have a new question? Ask a new question. – Robert Crovella Sep 16 '17 at 21:49
  • What's up with the names `_Z10initkerneli` and `_Z14randfillkernelPfi`? Where did you get them from, and why don't they match the original names? – grabantot Nov 26 '18 at 16:57
  • @grabantot To get the mangled name of a `__global__` function, just compile the function (or an empty function with the same prototype) with `nvcc` using the `-ptx` option to emit a `ptx` file, and read the name from there. – Vitality Feb 19 '19 at 08:17
  • @talonmies Using this approach, what kind of generator is used? `Sobol`? `XORWOW`? – Vitality Mar 20 '19 at 18:10

I have one problem with the accepted answer: the mangled kernel names (_Z10initkerneli and _Z14randfillkernelPfi) are rather nasty. To avoid them, we can wrap the kernels in an extern "C" {...} block manually.

code = """
    #include <curand_kernel.h>

    const int nstates = %(NGENERATORS)s;
    __device__ curandState_t* states[nstates];
    extern "C" {

    __global__ void initkernel(int seed)
    { .... }

    __global__ void randfillkernel(float *values, int N)
    { .... }
    }
"""

Then the code is still compiled with no_extern_c=True:

mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True)

and this should work with

init_func = mod.get_function("initkernel")
fill_func = mod.get_function("randfillkernel")
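
For anyone wondering where the mangled names in the accepted answer come from, here is a small sketch of the Itanium C++ ABI scheme as it applies to simple free functions. The helper `mangle` is hypothetical and only handles built-in parameter types with no repeated pointer types, namespaces, or templates (those involve substitution rules not covered here); for anything else, read the name out of the PTX that `nvcc` emits.

```python
def mangle(name, param_codes):
    """Build the Itanium-ABI mangled name of a free function taking only
    built-in types. param_codes are ABI type codes, e.g. "i" for int,
    "f" for float, "Pf" for float* (P = pointer to)."""
    if not param_codes:
        param_codes = ["v"]  # an empty parameter list is encoded as void
    # "_Z" prefix, then the length-prefixed function name, then the parameters.
    return "_Z" + str(len(name)) + name + "".join(param_codes)


print(mangle("initkernel", ["i"]))            # _Z10initkerneli
print(mangle("randfillkernel", ["Pf", "i"]))  # _Z14randfillkernelPfi
```

With the extern "C" wrapper shown above, none of this is needed, since the kernels keep their plain names.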

Hope that helps.

LemurPwned