This solution may not be optimal or the most efficient, but it does work.
Each of the 4 ??? placeholders you spread throughout your code needs slightly delicate handling. Let's go over them in order -
Convert C-ptr to PyObject on Host
A convenient way to do so is to use PyByteArray:
PyByteArray_FromStringAndSize(
reinterpret_cast<char *>(image),
sizeof(float) * dim[0] * dim[1] * dim[2]);
Convert C-ptr to PyObject on Device
In this case, PyByteArray won't deliver the goods, since it is only suitable for contiguous memory on the Host. A convenient way to wrap a raw pointer as a PyObject is PyCapsule, which can be initialized as follows -
PyCapsule_New(reinterpret_cast<void *>(image), "image", NULL);
Note that no destructor is needed here (NULL is passed), since the C code remains in charge of the allocated device memory.
Convert PyObject to Numpy Array
The PyByteArray points to contiguous memory on the Host, and can thus be read as a simple buffer by NumPy using -
image_buffer = np.frombuffer(
image_ptr,
dtype=np.float32,
count=dims[0] * dims[1] * dims[2])
image_array = (
    np.asarray(image_buffer, dtype=np.float32)
    .reshape(dims[2], dims[1], dims[0])
    .transpose(1, 2, 0))
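The reshape and transpose operations are needed because the flat buffer is read with dims[0] varying fastest: the reshape recovers that layout, and the transpose then reorders the axes into the order the Python code expects. Note that NumPy itself defaults to C-order, so this is a matter of axis convention rather than a NumPy requirement.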
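To make the axis bookkeeping concrete, here is a small sketch that runs without any C++ side. It assumes, as the reshape above implies, that dims[0] varies fastest in the C++ buffer; the dimensions, the stand-in bytearray and the helper flat_index are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical dimensions; dims[0] is assumed to vary fastest in the C++ buffer.
dims = (4, 3, 2)

# Stand-in for the bytearray that PyByteArray_FromStringAndSize would hand over.
flat = np.arange(dims[0] * dims[1] * dims[2], dtype=np.float32)
image_ptr = bytearray(flat.tobytes())

image_buffer = np.frombuffer(
    image_ptr, dtype=np.float32, count=dims[0] * dims[1] * dims[2])
image_array = image_buffer.reshape(dims[2], dims[1], dims[0]).transpose(1, 2, 0)


def flat_index(i0, i1, i2):
    """Offset of element (i0, i1, i2) in the C++ buffer, with dim 0 fastest."""
    return i0 + dims[0] * (i1 + dims[1] * i2)
```

With this layout, image_array[i1, i0, i2] reads the element stored at flat_index(i0, i1, i2), and both the reshape and the transpose are views, so no data is copied.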
Convert PyObject to CuPy Array
This is probably the trickiest one. You need to use the Python C-API directly (through ctypes.pythonapi) in order to unpack the pointer, and then some CuPy utilities to turn it into an array. Out of the box, PyCapsule_GetPointer cannot be called correctly through ctypes: unless told otherwise, ctypes assumes an int return value and knows nothing about the argument types, so a 64-bit pointer would be truncated. It therefore requires manual re-definition of the expected restype and argtypes.
First, we need to open the PyCapsule, obtaining the raw pointer on the device -
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [
    ctypes.py_object, ctypes.c_char_p]
raw_address = ctypes.pythonapi.PyCapsule_GetPointer(
    image_ptr, self.pycapsule_name_.encode('utf-8'))
raw_ptr = ctypes.c_void_p(raw_address)
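The capsule-unpacking step can be tried without a GPU. The following is a minimal sketch, assuming CPython and using a host-side ctypes array as a stand-in for the device allocation; the names payload and CAPSULE_NAME are hypothetical. It creates a capsule from Python only so that the round trip is self-contained.

```python
import ctypes

api = ctypes.pythonapi

# Declare the real C signatures; ctypes would otherwise assume an int return.
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Host-side stand-in for the device allocation (hypothetical data).
payload = (ctypes.c_float * 6)(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
CAPSULE_NAME = b"image"  # must stay alive for as long as the capsule does

# None is passed as the destructor, i.e. NULL, as on the C++ side.
capsule = api.PyCapsule_New(ctypes.addressof(payload), CAPSULE_NAME, None)
raw_address = api.PyCapsule_GetPointer(capsule, CAPSULE_NAME)

# View the same memory again through the recovered address.
recovered = (ctypes.c_float * 6).from_address(raw_address)
```

The recovered address equals the original one only because restype is declared as c_void_p; the name passed to PyCapsule_GetPointer must match the name the capsule was created with.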
Now, we need to define the CuPy array based on this raw_ptr with the appropriate size -
mem = cp.cuda.MemoryPointer(
cp.cuda.UnownedMemory(
raw_ptr.value,
dims[0] * dims[1] * dims[2] * cp.dtype(cp.float32).itemsize,
None),
0)
cupy_array = cp.ndarray(dims, dtype=cp.float32, memptr=mem)
cupy_array = (
    cp.asarray(cupy_array, dtype=cp.float32)
    .reshape(dims[2], dims[1], dims[0]))
image_array = cp.transpose(cupy_array, axes=(1, 2, 0))
And that's it (for the input...)! Now you can write your code once, using either the np or the cp prefix through an appropriate wrapper, so that it works on both CPU and GPU.
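One possible shape for such a wrapper is sketched below. The helper array_module and the example filter normalize are hypothetical names, and CuPy is treated as optional so that the same file also runs on a CPU-only machine. CuPy ships a similar helper of its own, cupy.get_array_module, which can be used instead when CuPy is always installed.

```python
import numpy as np

try:
    import cupy as cp  # optional; only present on machines with CUDA + CuPy
except ImportError:
    cp = None


def array_module(array):
    """Return cupy for CuPy arrays and numpy for everything else."""
    if cp is not None and isinstance(array, cp.ndarray):
        return cp
    return np


def normalize(image):
    """Hypothetical filter: rescale an image to the [0, 1] range."""
    xp = array_module(image)  # xp is either np or cp
    lo, hi = xp.min(image), xp.max(image)
    return (image - lo) / (hi - lo)
```

Because the filter only goes through xp, the same function body serves NumPy arrays on the Host and CuPy arrays on the Device.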
Oh, you also want to return this array as a raw pointer back to C++? This raises some more complications:
Convert NumPy Array to PyObject
That's easy; simply take the buffer of a C-contiguous copy -
filtered_image_ptr = image_array.copy(order='C').data
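The copy is what makes this safe: after the earlier transpose the array is a non-contiguous view, and only a C-contiguous buffer can be read back as a flat block. A small sketch with a hypothetical stand-in array shows the difference.

```python
import numpy as np

# Hypothetical filtered result; the transpose leaves it as a non-contiguous view.
image_array = np.arange(24, dtype=np.float32).reshape(2, 3, 4).transpose(1, 2, 0)

# copy(order='C') materialises the elements in row-major order in a new buffer.
contiguous = image_array.copy(order='C')

# .data is a memoryview over that buffer; this is the object handed back to C++.
filtered_image_ptr = contiguous.data
```

The memoryview keeps the copied array alive, so the buffer stays valid for as long as the returned object is referenced.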
Convert CuPy Array to PyObject
Here you again need to wrap your raw pointer as a PyCapsule, and again you need to redefine the restype and argtypes of the Python C-API methods.
ctypes.pythonapi.PyCapsule_New.restype = ctypes.py_object
PyCapsule_Destructor = ctypes.CFUNCTYPE(None, ctypes.py_object)
ctypes.pythonapi.PyCapsule_New.argtypes = [
    ctypes.c_void_p, ctypes.c_char_p, PyCapsule_Destructor]
image_raw_ptr = ctypes.c_void_p(image_array.data.ptr)
# PyCapsule_New stores only the pointer to the name, so keep these bytes alive.
name = self.pycapsule_name_.encode('utf-8')
filtered_image_ptr = ctypes.pythonapi.PyCapsule_New(
    image_raw_ptr, name, PyCapsule_Destructor(0))
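One detail worth guarding against: PyCapsule_New does not copy the name, so the bytes passed in must outlive the capsule. The sketch below assumes CPython; the helper wrap_pointer and the module-level cache _capsule_names are hypothetical, and a host buffer stands in for the CuPy device pointer when trying it out.

```python
import ctypes

api = ctypes.pythonapi
PyCapsule_Destructor = ctypes.CFUNCTYPE(None, ctypes.py_object)

api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [
    ctypes.c_void_p, ctypes.c_char_p, PyCapsule_Destructor]
api.PyCapsule_GetName.restype = ctypes.c_char_p
api.PyCapsule_GetName.argtypes = [ctypes.py_object]
api.PyCapsule_IsValid.restype = ctypes.c_int
api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Encoded names are cached here so they are never garbage collected.
_capsule_names = {}


def wrap_pointer(address, name):
    """Wrap a raw address in a PyCapsule whose name stays alive."""
    name_bytes = _capsule_names.setdefault(name, name.encode('utf-8'))
    # PyCapsule_Destructor(0) is a NULL function pointer: no destructor.
    return api.PyCapsule_New(address, name_bytes, PyCapsule_Destructor(0))
```

For a CuPy array the call would look like wrap_pointer(image_array.data.ptr, "slices"), with the array itself kept referenced on the Python side until C++ has finished reading from it.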
Convert PyObject to C-ptr on Host
You can unpack the value returned from the NumPy array as a Py_buffer.
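Py_buffer buffer;
// PyObject_GetBuffer returns 0 on success; on failure a Python exception is set.
if (PyObject_GetBuffer(py_filtered_image, &buffer, PyBUF_FORMAT) == 0) {
  memcpy(
      filtered_image,
      buffer.buf, dim[0] * dim[1] * dim[2] * sizeof(float));
  PyBuffer_Release(&buffer);  // release the view once the data is copied
}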
Convert PyObject to C-ptr on Device
Simply unpack the PyCapsule. No redefinition of restype and argtypes is needed here, since compiled C++ calls PyCapsule_GetPointer through its real prototype and ctypes is not involved.
auto* filtered_image_ptr = reinterpret_cast<float*>(
    PyCapsule_GetPointer(py_filtered_image, "slices"));
// Assuming filtered_image also lives on the device, this is a device-to-device copy.
cudaMemcpy(
    filtered_image,
    filtered_image_ptr, dim_.Volume() * sizeof(float),
    cudaMemcpyDeviceToDevice);