
NumPy provides well-defined C APIs, so it is easy to handle NumPy arrays from C/C++. For example, if I have a C function that takes C arrays (pointers) as arguments, I can just #include <numpy/arrayobject.h> and pass it a NumPy array by accessing the array's data member (or by using the C API PyArray_DATA).
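
For what it's worth, the pattern I am used to on the CPU side can also be done from Python via ctypes instead of the C API; the shared-library name libvecadd.so and the vec_add symbol below are placeholders, shown only to illustrate handing a NumPy data pointer to C:

# ctypes_numpy_example.py -- sketch of the CPU-side pattern (placeholder library)
# Assumes some compiled library exposing: void vec_add(float *a, float *b, float *c, int n);
import ctypes
import numpy as np

lib = ctypes.CDLL("./libvecadd.so")
lib.vec_add.restype = None
lib.vec_add.argtypes = [ctypes.POINTER(ctypes.c_float)] * 3 + [ctypes.c_int]

a = np.ones(5, dtype=np.float32)
b = np.arange(5, dtype=np.float32)
c = np.empty_like(a)

# a.ctypes exposes the underlying buffer, much like PyArray_DATA does in C
lib.vec_add(a.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            b.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            c.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            len(a))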

Recently I have been trying to achieve the same for CuPy, but I cannot find a header file that I could include. To be specific, my goal is as follows:

  • I have some CUDA kernels and their callers written in C/C++. The callers run on the host but take handles to device memory buffers as arguments. The results computed by the callers are also stored on the device.
  • I want to wrap the callers into Python functions so that I can control, from Python, when to transfer data from device to host. That means I have to wrap the resulting device memory pointers in Python objects. CuPy's ndarray is the best choice I can think of.

I can't use CuPy's user-defined-kernel mechanism because the functions I want to wrap are not CUDA kernels themselves; they also contain host code.
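
For reference, the user-defined-kernel route I am ruling out (e.g. cp.RawKernel) looks roughly like the sketch below; it only accepts the device code itself, so the host-side logic of my callers would have to be rewritten in Python rather than reused:

# rawkernel_example.py -- what the cp.RawKernel route looks like (for contrast)
import cupy as cp

vec_sum_kernel = cp.RawKernel(r'''
extern "C" __global__
void vecSumKernel(const float *A, const float *B, float *C, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
''', 'vecSumKernel')

n = 5
a = cp.ones(n, dtype=cp.float32) * 3
b = cp.arange(n, dtype=cp.float32)
c = cp.empty_like(a)

threads_per_block = min(512, n)
n_blocks = (n + threads_per_block - 1) // threads_per_block
# The grid/block configuration -- i.e. the host-side part -- now lives in Python.
# The size is passed as an explicit 32-bit scalar to match the int parameter.
vec_sum_kernel((n_blocks,), (threads_per_block,), (a, b, c, cp.int32(n)))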

Currently, I've found a workaround. I write the Python functions in Cython; they take CuPy arrays as inputs and return CuPy arrays. Inside, each array's .data.ptr attribute (a Python int) is passed as C's size_t type and then cast to whatever pointer type I need. Example code follows.

Example Code

//kernel.cu

#include <math.h>

__global__ void vecSumKernel(float *A, float *B, float *C, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < n)
        C[i] = A[i] + B[i];
}

// This is the C function I want to wrap into Python.
// Notice it does not allocate any memory on device. I want that to be done by cupy.
extern "C" void vecSum(float *A_d, float *B_d, float *C_d, int n) {
    int threadsPerBlock = 512;
    if (threadsPerBlock > n) threadsPerBlock = n;
    int nBlocks = (int)ceilf((float)n / (float)threadsPerBlock);

    vecSumKernel<<<nBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
}
//kernel.h

#ifndef KERNEL_H_
#define KERNEL_H_

#ifdef __cplusplus
extern "C"
#endif
void vecSum(float *A_d, float *B_d, float *C_d, int n);

#endif
# test_module.pyx

import cupy as cp
import numpy as np


cdef extern from "kernel.h":

    void vecSum(float *A_d, float *B_d, float *C_d, int n)


cdef vecSum_wrapper(size_t aPtr, size_t bPtr, size_t cPtr, int n):
    # The Python ints (cp.ndarray.data.ptr) are converted to size_t by the
    # argument types above, and then cast to (float *).

    vecSum(<float*>aPtr, <float*>bPtr, <float*>cPtr, n)


# This is the Python function I want to use
# a, b are cupy arrays
def vec_sum(a, b):
    a_ptr = a.data.ptr
    b_ptr = b.data.ptr

    n = a.shape[0]

    output = cp.empty(shape=(n,), dtype=a.dtype)
    c_ptr = output.data.ptr

    vecSum_wrapper(a_ptr, b_ptr, c_ptr, n)
    return output
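
The raw-pointer route silently trusts the caller about dtype, shape, contiguity, and device, so on top of vec_sum I would probably add a thin checking layer. A sketch (pure Python, wrapping the compiled module from below) of what I have in mind:

# checked_vec_sum.py -- sketch of the validation the pointer route seems to need
import cupy as cp
from cupyExt import test_module  # the extension module built below

def vec_sum_checked(a, b):
    # The C side blindly assumes float32, 1-D, contiguous buffers of equal
    # length on the same device, so enforce that before handing over pointers.
    if a.dtype != cp.float32 or b.dtype != cp.float32:
        raise TypeError("vecSum expects float32 arrays")
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("vecSum expects two 1-D arrays of equal length")
    if a.device.id != b.device.id:
        raise ValueError("inputs must live on the same device")
    a = cp.ascontiguousarray(a)
    b = cp.ascontiguousarray(b)
    with a.device:
        return test_module.vec_sum(a, b)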

Compile and Run

To compile, one can first compile kernel.cu into a static library, say, libVecSum. Then use Cython to compile test_module.pyx into test_module.c, and build the Python extension as usual.

# setup.py

from setuptools import Extension, setup

ext_module = Extension(
    "cupyExt.test_module",
    sources=["cupyExt/test_module.c"],
    library_dirs=["cupyExt/"],
    libraries=["libVecSum", "cudart"])

setup(
    name="cupyExt",
    version="0.0.0",
    ext_modules=[ext_module],
)

It seems to work:

>>> import cupy as cp
>>> from cupyExt import test_module
>>> a = cp.ones(5, dtype=cp.float32) * 3
>>> b = cp.arange(5, dtype=cp.float32)
>>> c = test_module.vec_sum(a, b)
>>> print(c.device)
<CUDA Device 0>
>>> print(c)
[3. 4. 5. 6. 7.]
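
A quick way to sanity-check the result against CuPy's own arithmetic (continuing the same session):

>>> import numpy as np
>>> np.testing.assert_allclose(cp.asnumpy(c), cp.asnumpy(a + b))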

Any better ways?

I am not sure whether this approach is memory safe. I also feel that casting .data.ptr to C pointers like this is not great. I would like to hear people's thoughts and comments on this.
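
One specific worry is ordering: CuPy issues its own kernels on the current CuPy stream, while vecSum launches on the default CUDA stream. If I ever switch to a non-default stream, I would probably bracket the call with explicit synchronization; a sketch of what I mean (whether this is actually sufficient is part of my question):

# synced_vec_sum.py -- sketch of making ordering explicit around the foreign launch
import cupy as cp
from cupyExt import test_module

def vec_sum_synced(a, b):
    # Finish any CuPy work queued on the current stream before vecSum reads a and b.
    cp.cuda.get_current_stream().synchronize()
    c = test_module.vec_sum(a, b)
    # Make sure vecSum's launch on the default stream has finished before
    # CuPy kernels (possibly on another stream) consume c.
    cp.cuda.Device().synchronize()
    return c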

Chen