You can hide some of the copy latency. While you are uploading one patch of the input image, you can concurrently copy back the result of a patch the GPU has already processed. On top of the two overlapped copies, the kernel for a third patch can be running at the same time. This works both within a single image (shortening the latency of processing that one image) and across multiple images (hiding the latency of whole-image transfers behind the processing of other images).
For very simple processing, only the reads and writes can hide each other; the computation itself has no meaningful latency to hide anything behind. So pipelining can improve performance by up to 100% (assuming one image in, one equal-sized image out, and that the PCIe bus and driver perform the same in both directions).
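As a rough sanity check on that 100% figure (my own back-of-the-envelope numbers, not a measurement): assume uploading and downloading a chunk each take time t, the kernel time is negligible, and there are M chunks in total. Done serially this costs about 2·M·t; overlapping each chunk's download with the next chunk's upload costs about (M + 1)·t, which for large M approaches half the time, i.e. roughly double the throughput.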
If each pixel is just multiplied by a value, the work is embarrassingly parallel and you can hide latency by pipelining with chunks of arbitrary size (a CUDA sketch of this loop follows the list). For example:
- copy N rows of pixels to VRAM
- compute those N rows and concurrently copy the next N rows to VRAM
- copy N rows of results back to RAM, (concurrently) compute the next N rows, and (concurrently/asynchronously) copy the newest N rows to VRAM
- ...
- copy the last N rows of results back to RAM
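Here is a minimal CUDA sketch of that loop, using the first of the two stream organizations discussed next (one stream per in-flight chunk of N rows). The image size, chunk size, stream count and kernel (ROWS, COLS, N, NUM_STREAMS, scaleKernel) are illustrative assumptions, not from the original; asynchronous copies need pinned host memory, and overlapping both copy directions at once needs a GPU with at least two copy engines.

```cuda
// Minimal sketch of the chunked pipeline above, assuming a float image,
// pinned host buffers and a hypothetical multiply-by-constant kernel.
#include <cuda_runtime.h>

__global__ void scaleKernel(const float* in, float* out, int count, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        out[i] = in[i] * factor;   // "each pixel is just multiplied by a value"
}

int main()
{
    const int ROWS = 4096, COLS = 4096;  // image size (assumed)
    const int N = 256;                   // rows per chunk (assumed)
    const int NUM_STREAMS = 4;           // chunks kept in flight
    const int chunkElems = N * COLS;
    const int numChunks  = ROWS / N;     // assumes ROWS is a multiple of N

    float *hIn, *hOut, *dIn, *dOut;
    // Pinned host memory is required for the copies to be truly asynchronous.
    cudaMallocHost((void**)&hIn,  (size_t)ROWS * COLS * sizeof(float));
    cudaMallocHost((void**)&hOut, (size_t)ROWS * COLS * sizeof(float));
    cudaMalloc((void**)&dIn,  (size_t)ROWS * COLS * sizeof(float));
    cudaMalloc((void**)&dOut, (size_t)ROWS * COLS * sizeof(float));
    for (size_t i = 0; i < (size_t)ROWS * COLS; ++i) hIn[i] = 1.0f;  // dummy input

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    // One stream per in-flight chunk of N rows: within a stream the upload,
    // kernel and download run in order; different streams overlap each other.
    for (int chunk = 0; chunk < numChunks; ++chunk) {
        cudaStream_t stream = streams[chunk % NUM_STREAMS];
        size_t offset = (size_t)chunk * chunkElems;

        cudaMemcpyAsync(dIn + offset, hIn + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, stream);

        int threads = 256;
        int blocks  = (chunkElems + threads - 1) / threads;
        scaleKernel<<<blocks, threads, 0, stream>>>(dIn + offset, dOut + offset,
                                                    chunkElems, 2.0f);

        cudaMemcpyAsync(hOut + offset, dOut + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    cudaDeviceSynchronize();  // "copy the last N rows back to RAM" has finished here

    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    cudaFree(dIn);     cudaFree(dOut);
    return 0;
}
```

With NUM_STREAMS chunks in flight, the driver is free to overlap chunk k's download with chunk k+1's kernel and chunk k+2's upload, which is exactly the steady state described in the list above.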
You can either use one stream per in-flight chunk of N scanlines (each stream does its own upload + compute + download) and let the driver choose the best overlap between chunks, or use one stream per operation type (one for all uploads, one for all downloads, one for all computations) and use events to maintain the overlapping behavior explicitly.
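Here is a sketch of that second, explicit organization, continuing from the previous sketch (it assumes the same buffers, chunk layout and scaleKernel; the function and event names are my own): one stream per operation type, with a pair of events per chunk so that each stage waits only on the stage it actually depends on rather than on everything issued before it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The multiply-by-constant kernel from the previous sketch
// (compile in the same translation unit, or with -rdc=true).
__global__ void scaleKernel(const float* in, float* out, int count, float factor);

void runChunksOnPerOpStreams(const float* hIn, float* hOut,
                             float* dIn, float* dOut,
                             int numChunks, int chunkElems)
{
    // One stream per operation type: all uploads, all kernels, all downloads.
    cudaStream_t upStream, computeStream, downStream;
    cudaStreamCreate(&upStream);
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&downStream);

    // One pair of events per chunk: "upload done" and "kernel done".
    std::vector<cudaEvent_t> uploaded(numChunks), computed(numChunks);
    for (int c = 0; c < numChunks; ++c) {
        cudaEventCreateWithFlags(&uploaded[c], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&computed[c], cudaEventDisableTiming);
    }

    for (int c = 0; c < numChunks; ++c) {
        size_t offset = (size_t)c * chunkElems;

        // Upload chunk c on the dedicated upload stream.
        cudaMemcpyAsync(dIn + offset, hIn + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, upStream);
        cudaEventRecord(uploaded[c], upStream);

        // The kernel for chunk c waits only on its own upload,
        // so it overlaps with uploads of later chunks.
        cudaStreamWaitEvent(computeStream, uploaded[c], 0);
        int threads = 256;
        int blocks  = (chunkElems + threads - 1) / threads;
        scaleKernel<<<blocks, threads, 0, computeStream>>>(
            dIn + offset, dOut + offset, chunkElems, 2.0f);
        cudaEventRecord(computed[c], computeStream);

        // The download of chunk c waits only on its own kernel,
        // so it overlaps with uploads and kernels of later chunks.
        cudaStreamWaitEvent(downStream, computed[c], 0);
        cudaMemcpyAsync(hOut + offset, dOut + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, downStream);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < numChunks; ++c) {
        cudaEventDestroy(uploaded[c]);
        cudaEventDestroy(computed[c]);
    }
    cudaStreamDestroy(upStream);
    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(downStream);
}
```

The trade-off is more bookkeeping in exchange for not relying on the driver to discover the overlap on its own.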
If you do more computation per pixel, say enough that the kernel time roughly equals the latency of each copy, then pipelining can give you up to 3x performance (at any moment, two of the three operations are hidden behind the third).
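A similar rough estimate for the 3x figure (again assuming M chunks, with upload, kernel and download each taking the same time t per chunk): the serial version costs 3·M·t, while the pipelined version costs roughly (M + 2)·t, because after a two-chunk ramp-up one chunk finishes every t. For large M that approaches the 3x speedup.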