
Is there a way to run functions concurrently on the CPU and GPU (using Python)? I'm already using Numba for thread-level scheduling of compute-intensive functions on the GPU, but I now also need parallelism between the CPU and GPU. Once we ensure that GPU memory holds all the data it needs to start processing, I need to trigger the GPU and then, in parallel, run some functions on the host using the CPU.

I'm sure that the time taken by the GPU to return its data is much greater than the time the CPU needs to finish its task, so that once the GPU has finished processing, the CPU is already waiting to fetch the data back to the host. Is there a standard library/way to achieve this? I'd appreciate any pointers in this regard.
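For reference, the GPU side currently looks roughly like this; the kernel body, array size, and names are illustrative placeholders, not my actual code:

    from numba import cuda
    import numpy as np

    @cuda.jit
    def gpu_function(arr):
        # One thread per element; the real body is compute-intensive
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] *= 2.0

    data = np.arange(1_000_000, dtype=np.float64)  # stand-in for the real frame data
    d_data = cuda.to_device(data)                  # ensure GPU memory holds the data first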

  • While this is doable if you were to write the CUDA code yourself, I'm not sure it is with Numba. You could maybe add an OpenMPI layer to this. – Ander Biguri Oct 29 '19 at 09:48
  • I'm writing CUDA code using Numba. It is exactly the same as calling CUDA C kernels from Python, but with the added advantage that one doesn't need to bother with the C-Python interface via PyCUDA. With Numba, everything happens in Python only. Could you also elaborate a bit more on OpenMPI? – Ankit_85 Oct 29 '19 at 10:24
  • If you can asynchronously call a kernel execution, then the CPU script will continue running; it will not wait until the kernel is finished (that is the meaning of asynchronous calls, which all kernel launches in CUDA are). So if Numba allows you to do that, then there is nothing extra to code: simply 1) call the kernel, 2) call the CPU code, 3) synchronize/memcpy from GPU to CPU, and that will already do what you want. If you cannot do that, then you will need to make a multi-threaded application. – Ander Biguri Oct 29 '19 at 11:08
  • Launch your kernel in Numba, then run your CPU code after launching the kernel. That CPU code will run concurrently with the GPU kernel. This is the same asynchronous behavior as CUDA C++. – Robert Crovella Oct 29 '19 at 13:40
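In Numba terms, the three-step pattern these comments describe might look like the following sketch, where kernel, blocks, threads, d_data, and cpu_work are hypothetical placeholders:

    from numba import cuda

    kernel[blocks, threads](d_data)   # 1) asynchronous launch; returns immediately
    result = cpu_work()               # 2) CPU code runs concurrently with the kernel
    cuda.synchronize()                # 3) block until all queued GPU work is done
    out = d_data.copy_to_host()       # device-to-host copy of the kernel's output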

1 Answer


Thanks Robert and Ander. I was thinking along similar lines but wasn't very sure. I verified that until I add some synchronization for task completion between the host and the device (for example cp.cuda.Device().synchronize() when using CuPy), the GPU and CPU effectively run in parallel. Thanks again. A general flow with Numba that makes gpu_function and cpu_function run in parallel looks something like the following:

    """ GPU has buffer full to start processing Frame N-1 """
    tmp_gpu = cp.asarray(tmp_cpu)
    gpu_function(tmp_gpu)
    """ CPU receives Frame N over TCP socket """
    tmp_cpu = cpu_function()
    """ For instance we know cpu_function takes [a little] longer than gpu_function """
     cp.cuda.Device().synchronize()

Of course, we could even hide the time spent transferring tmp_cpu to tmp_gpu by employing a ping-pong (double) buffer and an initial one-frame delay.
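A rough sketch of that ping-pong scheme, assuming two preallocated device buffers; FRAME_SHAPE, NUM_FRAMES, gpu_function, and cpu_function are hypothetical placeholders, and the host frame should sit in pinned memory for the copy to be genuinely asynchronous:

    import cupy as cp

    bufs = [cp.empty(FRAME_SHAPE, dtype=cp.float32) for _ in range(2)]
    copy_stream = cp.cuda.Stream(non_blocking=True)   # host-to-device copies get their own stream

    frame = cpu_function()          # initial one-frame delay: fetch Frame 0 before the loop
    bufs[0].set(frame)              # synchronous upload of Frame 0
    for n in range(1, NUM_FRAMES):
        gpu_function(bufs[(n - 1) % 2])               # process Frame N-1 (asynchronous launch)
        frame = cpu_function()                        # receive Frame N over TCP in the meantime
        bufs[n % 2].set(frame, stream=copy_stream)    # upload Frame N while the GPU is still busy
        cp.cuda.Device().synchronize()                # wait for both the kernel and the copy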
