Arrays in CUDA Kernels using Python with numba-pro

Question

I'm currently writing code that can be heavily parallelized using GPUs. My code structure essentially looks like this:

Create two arrays, let's call them A and B of length N. (CPU)
Perform NxN calculations that eventually return a scalar. These calculations only depend on A and B and can therefore be parallelized. (GPU)
Gather all these scalars in a list and take the smallest one. (CPU)
Modify A and B with this scalar (CPU)
Go back to step 2 and repeat until a certain condition is met.

Most examples are very illustrative but they all seem to work like this: Execute the major part of the code on the CPU and only perform intermediate matrix multiplications etc. on the GPU. In particular the host usually knows all the variables the kernel is going to use.

For me its exactly vice versa, I want to perform the major part of the code on the GPU and only a very small amount of steps on the CPU itself. My host knows literally nothing about whats going on inside my individual threads. Its only managing the list of scalars as well as my arrays A and B.

My questions are therefore:

How do I properly define variables inside a kernel? In particular, how do I define and initialize arrays/lists?
How do I write a device function that returns an array? (s. below MatrixMultiVector doesn't work)
Why can I not use numpy and other libraries inside CUDA Kernels? What alternatives do I have?

An example of what I currently have looks like this:

from __future__ import division
import numpy as np
from numbapro import *


# Device Functions
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Works and can be called corrently from TestKernel Scalar
@cuda.jit('float32(float32, float32)', device=True)
def myfuncScalar(a, b):
    return a+b;


# Works and can be called correctly from TestKernel Array
@cuda.jit('float32[:](float32[:])', device=True)
def myfuncArray(A):
    for k in xrange(4):
        A[k] += 2*k;
    return A


# Takes Matrix A and Vector v, multiplies them and returns a vector of shape v. Does not even compile.
# Failed at nopython (nopython frontend), Only accept returning of array passed into the function as argument
# But v is passed to the function as argument...

@cuda.jit('float32[:](float32[:,:], float32[:])', device=True)
def MatrixMultiVector(A,v):
    tmp = cuda.local.array(shape=4, dtype=float32); # is that thing even empty? It could technically be anything, right?
    for i in xrange(A[0].size):
        for j in xrange(A[1].size):
            tmp[i] += A[i][j]*v[j];
    v = tmp;
    return v;



# Kernels
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# TestKernel Scalar - Works
@cuda.jit(void(float32[:,:]))
def TestKernelScalar(InputArray):
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i,j] = myfuncScalar(5,7);


# TestKernel Array
@cuda.jit(void(float32[:,:]))
def TestKernelArray(InputArray):

    # Defining arrays this way seems super tedious, there has to be a better way.
    M = cuda.local.array(shape=4, dtype=float32);
    M[0] = 1; M[1] = 0; M[2] = 0; M[3] = 0;

    tmp = myfuncArray(M);
    #tmp = MatrixMultiVector(A,M); -> we still have to define a 4x4 matrix for that.

    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i,j] += tmp[j];

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Main
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------

N = 4;

C = np.zeros((N,N), dtype=np.float32);
TestKernelArray[1,N](C);

print(C)

I have not used Numba yet, so just a few thoughts from the "traditional" CUDA C perspective: (1) your "code structure" is not clear to me, do you want to generate NxN scalars and then select the smallest of those? why perform this min operation on the CPU instead parallelized on the GPU? (2) Have you ever written "traditional" CUDA C code? Usually on device allocation is kept to a minimum, as well as passing non-scalar data between device functions. (3) I would also minimize looping in kernels — m.s., Oct 30 '15 at 13:55
Yes, I first want to generate NxN scalars and then select the smallest of those. This selection can of course also be done parallely by the GPU. Unfortunately I have not yet had the pleasure of working directly with CUDA C. Though, I can assure you that the allocation on device is minimal as well as the data transfer between Host and Device is absolutely negligible. The only thing thats going to be permanently stored in memory is those arrays A and B as well as the list of scalars (the latter one shrinks drastically each iteration). — ICST, Oct 30 '15 at 14:09

talonmies · Accepted Answer · 2015-11-02T10:30:08.687

The short answer is you can't define dynamic lists or arrays in CUDA Python. You can have statically defined local or shared memory arrays (see cuda.local.array() and cuda.shared.array in the documentation), but those have thread or block scope and can't be reused after their associated thread or block is retired. But that is about all that is supported. You can pass externally defined arrays to kernels, but their attributes are read-only.
As per your myfuncArray you can return an externally defined array. You can't return a dynamically defined array, because dynamically defined arrays (or any objects for that matter) are not supported in kernels.
You can read the CUDA Python specification for yourself, but the really short answer is that CUDA Python is a superset of Numba's No Python Mode, and while there are elementary scalar functions available, there is no Python object model support. That excludes much Python functionality, including objects and numpy.

@ICST: It would be helpful if you accepted the answer so that this question gets off the unanswered question list. It will also make the question and answer set easy to find by search in the future — talonmies, Nov 02 '15 at 12:25

Arrays in CUDA Kernels using Python with numba-pro

1 Answers1