CUDA supports dynamically sized shared memory allocation at kernel launch time, but the mechanism works a bit differently from OpenCL. In the CUDA runtime API, a kernel using dynamically sized shared memory, and the launch that sizes that memory, use the following syntax:
__global__ void kernel(...)
{
    extern __shared__ typename buffer[];
    ....
}
....
kernel<<< griddim, blockdim, sharedmem, streamID >>>(...)
where sharedmem is the total number of bytes per block that will be allocated to buffer.
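For concreteness, here is a minimal runnable sketch of that pattern (the kernel name reverse_block, the int element type, and the array size are illustrative choices, not part of the syntax above): each block reverses its slice of the input array by staging it through the dynamically sized buffer.

#include <cstdio>

// Each block reverses its slice of the input, staging the elements
// through dynamically sized shared memory.
__global__ void reverse_block(const int *in, int *out)
{
    extern __shared__ int buffer[];   // sized by the launch, not here

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buffer[blockDim.x - 1 - tid] = in[gid];
    __syncthreads();
    out[gid] = buffer[tid];
}

int main()
{
    const int n = 256, nbytes = n * sizeof(int);
    int h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, nbytes);
    cudaMalloc(&d_out, nbytes);
    cudaMemcpy(d_in, h_in, nbytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t sharedmem = n * sizeof(int);
    reverse_block<<< 1, n, sharedmem >>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d (expect %d)\n", h_out[0], n - 1);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

This compiles with nvcc as usual; no special flags are needed for dynamic shared memory.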
In PyCUDA, the same mechanism works something like this:
mod = SourceModule("""
__global__ void kernel(...)
{
    extern __shared__ typename buffer[];
    ....
}
""")
func = mod.get_function("kernel")
func.prepare(..., shared=sharedmem)
func.prepared_call(griddim, blockdim, ...)
with the shared memory allocation size, in bytes per block, passed to the prepare method.
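Again as an illustrative sketch (the kernel and variable names are ours; the shared= keyword to prepare follows the description above, although newer PyCUDA releases instead take a shared_size keyword in prepared_call), the same reversal kernel driven from PyCUDA might look like this:

import numpy as np
import pycuda.autoinit  # creates a context on the first available GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Each block reverses its slice of the input through dynamically
# sized shared memory, as in the CUDA example above.
mod = SourceModule("""
__global__ void reverse_block(const int *in, int *out)
{
    extern __shared__ int buffer[];

    int tid = threadIdx.x;
    buffer[blockDim.x - 1 - tid] = in[tid];
    __syncthreads();
    out[tid] = buffer[tid];
}
""")
func = mod.get_function("reverse_block")

n = 256
a = np.arange(n, dtype=np.int32)
out = np.empty_like(a)

d_a = drv.mem_alloc(a.nbytes)
d_out = drv.mem_alloc(out.nbytes)
drv.memcpy_htod(d_a, a)

sharedmem = a.nbytes  # bytes of dynamic shared memory per block

# "PP" describes the two pointer arguments. Depending on the PyCUDA
# version, the shared memory size goes either to prepare() via
# shared= (as here) or to prepared_call() via shared_size=.
func.prepare("PP", shared=sharedmem)
func.prepared_call((1, 1), (n, 1, 1), d_a, d_out)

drv.memcpy_dtoh(out, d_out)
assert (out == a[::-1]).all()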