Transferring a Pointer From C++ To Python Compatible with Host and Device Memory

Question

I have a Python function (named apply_filter) whose execution may involve either the CPU (using NumPy) or the GPU (using CuPy). The function takes an input-buffer object representing a pointer to data that lives either in system memory or in the GPU's global device memory.

I want to invoke this function from C++ code using the Python C API. To do so, the C++ side has to supply something to pass as the input-buffer object, which in my case corresponds to a raw pointer. I'm not sure how to do this.

Here is a simplified version of my code:

The invoking code, in C++:

#include <Python.h>

void PythonObjectWrapper::applyFilter(float* image, std::array<int, 3> dim) {
    PyObject* python_method = PyObject_GetAttrString(class_object_, method_name_);
    PyObject* py_image = ??? // convert C-array to PyObject
    PyObject* method_args = PyTuple_New(2);
    PyTuple_SetItem(method_args, 0, py_image);
    PyTuple_SetItem(method_args, 1, ...); // transfer dim
    PyObject* py_filtered_image = PyObject_CallObject(python_method, method_args);
    float* filtered_image = ??? // convert PyObject to C-array
}

The invoked function, in Python:

class Filter:
    def __init__(self, gpu):
        self.gpu_ = gpu

    def apply_filter(self, image_ptr, dim):
        image_array = ???  # convert image_ptr PyObject to NumPy / CuPy array
        apply_filter_(image_array)
        filtered_image_ptr = ???  # convert image_array to ptr
        return filtered_image_ptr

How do I complete the 4 lines marked with ????

Bonus points for a solution that avoids unnecessary copies (especially between Host and Device, in either direction), does everything efficiently, and supports both run modes (CPU/GPU) in a robust manner.

score 4 · Answer 1 · edited Aug 24 '23 at 11:37

This solution may not be optimal or the most efficient, but it does work:

Each of the 4 ??? marks spread throughout your code needs some care. Let's go over them in order:

Convert C-ptr to PyObject on Host

A convenient way to do so is to use PyByteArray (note that this copies the data into a new Python object):

PyObject* py_image = PyByteArray_FromStringAndSize(
    reinterpret_cast<char *>(image),
    sizeof(float) * dim[0] * dim[1] * dim[2]);

Convert C-ptr to PyObject on Device

In this case, PyByteArray won't deliver the goods, since it is only suitable for contiguous memory on the Host. A convenient way to wrap a raw pointer as a PyObject is PyCapsule, which can be initialized as follows:

PyCapsule_New(reinterpret_cast<void *>(image), "image", NULL);

Note that no destructor is needed here (NULL is passed), since the C++ code remains in charge of the allocated device memory.

Convert PyObject to NumPy Array

The PyByteArray points to contiguous memory on the Host, and can thus be read as a simple buffer by NumPy using:

image_buffer = np.frombuffer(
    image_ptr,
    dtype=np.float32,
    count=dims[0] * dims[1] * dims[2])
# Parentheses are required for a method chain that spans several lines.
image_array = (
    np.asarray(image_buffer, dtype=np.float32)
    .reshape(dims[2], dims[1], dims[0])
    .transpose(1, 2, 0))

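The reshape and transpose operations are needed to map the flat buffer onto the axis order the filter expects. NumPy arrays are C-ordered (row-major) by default, so the last axis varies fastest. Assuming dims[0] is the fastest-varying dimension of the C++ buffer, reshaping to (dims[2], dims[1], dims[0]) matches that layout, and the transpose then only reorders the axes of the view without copying.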
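The index mapping behind the reshape and transpose can be checked without NumPy. The sketch below is only a stand-in that uses the standard library (struct and memoryview); the sizes in dims, the sample values, and the helper names flat_index and transposed_at are made up for illustration, and it assumes dims[0] is the fastest-varying dimension of the C++ buffer.

```python
import struct

dims = (4, 3, 2)  # hypothetical (d0, d1, d2); d0 is assumed to vary fastest
count = dims[0] * dims[1] * dims[2]

# Stand-in for the bytearray built by PyByteArray_FromStringAndSize:
# element i of the flat buffer simply holds the value i.
payload = bytearray(struct.pack(f"{count}f", *range(count)))

# View the bytes as a C-ordered (d2, d1, d0) block of floats without copying,
# analogous to np.frombuffer(...).reshape(dims[2], dims[1], dims[0]).
view = memoryview(payload).cast("f", shape=[dims[2], dims[1], dims[0]])


def flat_index(x, y, z):
    """Offset of element (x, y, z) in the flat C++ buffer."""
    return x + dims[0] * (y + dims[1] * z)


def transposed_at(i, j, k):
    """Element [i, j, k] of the view as seen after .transpose(1, 2, 0)."""
    # transpose(1, 2, 0) maps new axes (i, j, k) onto old axes (k, i, j).
    return view[k, i, j]
```

With this layout, view[z, y, x] holds the element at flat_index(x, y, z), and the transposed array is indexed as [y, x, z].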

Convert PyObject to CuPy Array

This is probably the trickiest one. You need to use the Python C API directly (through ctypes.pythonapi) in order to unpack the pointer, and then some CuPy utilities to transform it into an array. By default, ctypes assumes that a foreign function returns a C int and knows nothing about its argument types, so calling PyCapsule_GetPointer as-is would truncate a 64-bit pointer. Its restype and argtypes therefore have to be declared manually.

First, we need to open the PyCapsule to obtain the raw pointer on the device:

import ctypes

ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [
    ctypes.py_object, ctypes.c_char_p]
# The name must match the one passed to PyCapsule_New on the C++ side.
raw_address = ctypes.pythonapi.PyCapsule_GetPointer(
    image_ptr, self.pycapsule_name_.encode('utf-8'))
raw_ptr = ctypes.c_void_p(raw_address)
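The unpacking step can be tried on the host, without a GPU. The following is a minimal sketch, not the answer's exact code: it wraps the address of a ctypes float array (a stand-in for the device pointer) in a capsule and reads it back. CAPSULE_NAME and host_buffer are hypothetical names chosen for the illustration.

```python
import ctypes

# Declare the signatures explicitly: ctypes otherwise assumes a C int return
# value, which would truncate a 64-bit pointer.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# The capsule stores the name pointer without copying it, so the bytes object
# is kept alive at module level.
CAPSULE_NAME = b"image"

# Host array standing in for the buffer owned by the C++ side.
host_buffer = (ctypes.c_float * 6)(0, 1, 2, 3, 4, 5)
address = ctypes.addressof(host_buffer)

# Wrap the raw address (no destructor: the owner frees the memory itself).
capsule = PyCapsule_New(address, CAPSULE_NAME, None)

# Unwrap it again; the name must match the one used at creation.
recovered = PyCapsule_GetPointer(capsule, CAPSULE_NAME)
values = ctypes.cast(recovered, ctypes.POINTER(ctypes.c_float))
```

Here recovered is a plain Python integer holding the same address that went in, which is the value later passed on to CuPy.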

Now, we need to define the CuPy array based on this raw_ptr with the appropriate size:

mem = cp.cuda.MemoryPointer(
    cp.cuda.UnownedMemory(
        raw_ptr.value,
        dims[0] * dims[1] * dims[2] * cp.dtype(cp.float32).itemsize,
        None),
    0)
cupy_array = cp.ndarray(dims, dtype=cp.float32, memptr=mem)
# Parentheses are required for a method chain that spans several lines.
cupy_array = (
    cp.asarray(cupy_array, dtype=cp.float32)
    .reshape(dims[2], dims[1], dims[0]))
image_array = cp.transpose(cupy_array, axes=(1, 2, 0))

And that's it (for the input)! Now you can write your code robustly (using either the np or the cp prefix through an appropriate wrapper) so that it works on both CPU and GPU.
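One possible way to let a single apply_filter serve both modes is to branch on the type of the object received from C++: a bytearray means host memory, a PyCapsule means a device pointer. The helper below is a sketch under that assumption; memory_kind is a hypothetical name, and the capsule in the usage lines wraps a dummy host address purely for demonstration.

```python
import ctypes


def memory_kind(buffer_object):
    """Classify the object handed over from C++ as 'host' or 'device'."""
    if isinstance(buffer_object, (bytearray, bytes, memoryview)):
        return "host"    # readable through the buffer protocol (NumPy path)
    if type(buffer_object).__name__ == "PyCapsule":
        return "device"  # opaque pointer, to be unpacked for CuPy
    raise TypeError(f"unsupported buffer object: {type(buffer_object)!r}")


# Usage: build one object of each kind.
ctypes.pythonapi.PyCapsule_New.restype = ctypes.py_object
ctypes.pythonapi.PyCapsule_New.argtypes = [
    ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
dummy = ctypes.c_float(0.0)  # only its address is used
capsule = ctypes.pythonapi.PyCapsule_New(
    ctypes.addressof(dummy), b"image", None)

host_kind = memory_kind(bytearray(16))
device_kind = memory_kind(capsule)
```

The result can then select the array module (np for "host", cp for "device"), so the filter body itself stays identical for both modes.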

Oh, you also want to return this array as a raw pointer back to C++? This raises some more complications:

Convert NumPy Array to PyObject

That's easy, simply

filtered_image_ptr = image_array.copy(order='C').data

Convert CuPy Array to PyObject

Here you need to again wrap your raw pointer as a PyCapsule. Again you need to redefine the restype and argtypes of the Python C-API methods.

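ctypes.pythonapi.PyCapsule_New.restype = ctypes.py_object
PyCapsule_Destructor = ctypes.CFUNCTYPE(None, ctypes.py_object)
ctypes.pythonapi.PyCapsule_New.argtypes = [
    ctypes.c_void_p, ctypes.c_char_p, PyCapsule_Destructor]

# image_array must stay alive while C++ uses the pointer it owns.
image_raw_ptr = ctypes.c_void_p(image_array.data.ptr)
# c_char_p needs bytes; the capsule stores this pointer without copying,
# so the bytes object must stay alive as long as the capsule does.
name = ctypes.c_char_p(self.pycapsule_name_.encode('utf-8'))
filtered_image_ptr = ctypes.pythonapi.PyCapsule_New(
    image_raw_ptr, name, PyCapsule_Destructor(0))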
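A detail worth checking when building the capsule name: ctypes.c_char_p accepts bytes, not str, and an f-string that merely looks like a bytes literal is still a str. The short sketch below shows the difference; capsule_name is a hypothetical stand-in for self.pycapsule_name_.

```python
import ctypes

capsule_name = "image"  # hypothetical stand-in for self.pycapsule_name_

as_text = f"b'{capsule_name}'"           # still a str: "b'image'"
as_bytes = capsule_name.encode("utf-8")  # real bytes: b'image'

# c_char_p rejects str objects with a TypeError.
try:
    ctypes.c_char_p(as_text)
    text_accepted = True
except TypeError:
    text_accepted = False

# Encoded bytes are accepted and round-trip unchanged.
name = ctypes.c_char_p(as_bytes)
```

Encoding the name once and keeping the resulting bytes object referenced (for example as an attribute) also keeps the pointer stored in the capsule valid.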

Convert PyObject to C-ptr on Host

You can unpack the value returned from the NumPy array as a Py_buffer.

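Py_buffer buffer;
// PyObject_GetBuffer returns 0 on success and -1 on failure.
if (PyObject_GetBuffer(py_filtered_image, &buffer, PyBUF_FORMAT) == 0) {
    memcpy(
        filtered_image,
        buffer.buf, dim[0] * dim[1] * dim[2] * sizeof(float));
    PyBuffer_Release(&buffer);  // every successful GetBuffer needs a release
}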

Convert PyObject to C-ptr on Device

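Simply unpack the PyCapsule. No redefinition of restype and argtypes is needed here, because the C++ compiler takes the prototype of PyCapsule_GetPointer from Python.h.

// The name must match the one given to PyCapsule_New on the Python side.
auto* filtered_image_ptr = reinterpret_cast<float*>(
    PyCapsule_GetPointer(py_filtered_image, "slices"));
// Both pointers refer to device memory, so the copy is device-to-device.
cudaMemcpy(
    filtered_image,
    filtered_image_ptr, dim[0] * dim[1] * dim[2] * sizeof(float),
    cudaMemcpyDeviceToDevice);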

Try and fit your code to the answer width, even if it means breaking lines. — einpoklum, Aug 24 '23 at 11:32
