You can hide some of the copy latency. While you are uploading one patch of the input image, you can concurrently copy back the result of a patch the GPU has already processed. On top of the two overlapped copies, the kernel for a third patch can be running at the same time. This works both within a single image (shortening the latency of processing that one image) and across multiple images (hiding the latency of whole-image transfers behind the processing of other images).
For very simple processing, only the reads and writes can hide each other; the computation itself has no meaningful latency to hide anything behind. So pipelining can improve performance by up to 100% (assuming one image in, one equal-sized image out, and that the PCIe bus and driver perform the same in both directions).
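As a rough sanity check on that 100% figure (my own back-of-the-envelope numbers, not a measurement): assume uploading and downloading a chunk each take time t, the kernel time is negligible, and there are M chunks in total. Done serially this costs about 2·M·t; overlapping each chunk's download with the next chunk's upload costs about (M + 1)·t, which for large M approaches half the time, i.e. roughly double the throughput.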
If each pixel is just multiplied by a value, the work is embarrassingly parallel and you can hide latency by pipelining with chunks of arbitrary size (a CUDA sketch of this loop follows the list). For example:
- copy N rows of pixels to VRAM
- compute those N rows and concurrently copy the next N rows to VRAM
- copy N rows of results back to RAM, (concurrently) compute the next N rows, and (concurrently/asynchronously) copy the newest N rows to VRAM
- ...
- copy the last N rows of results back to RAM
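Here is a minimal CUDA sketch of that loop, using the first of the two stream organizations discussed next (one stream per in-flight chunk of N rows). The image size, chunk size, stream count and kernel (ROWS, COLS, N, NUM_STREAMS, scaleKernel) are illustrative assumptions, not from the original; asynchronous copies need pinned host memory, and overlapping both copy directions at once needs a GPU with at least two copy engines.

```cuda
// Minimal sketch of the chunked pipeline above, assuming a float image,
// pinned host buffers and a hypothetical multiply-by-constant kernel.
#include <cuda_runtime.h>

__global__ void scaleKernel(const float* in, float* out, int count, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        out[i] = in[i] * factor;   // "each pixel is just multiplied by a value"
}

int main()
{
    const int ROWS = 4096, COLS = 4096;  // image size (assumed)
    const int N = 256;                   // rows per chunk (assumed)
    const int NUM_STREAMS = 4;           // chunks kept in flight
    const int chunkElems = N * COLS;
    const int numChunks  = ROWS / N;     // assumes ROWS is a multiple of N

    float *hIn, *hOut, *dIn, *dOut;
    // Pinned host memory is required for the copies to be truly asynchronous.
    cudaMallocHost((void**)&hIn,  (size_t)ROWS * COLS * sizeof(float));
    cudaMallocHost((void**)&hOut, (size_t)ROWS * COLS * sizeof(float));
    cudaMalloc((void**)&dIn,  (size_t)ROWS * COLS * sizeof(float));
    cudaMalloc((void**)&dOut, (size_t)ROWS * COLS * sizeof(float));
    for (size_t i = 0; i < (size_t)ROWS * COLS; ++i) hIn[i] = 1.0f;  // dummy input

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    // One stream per in-flight chunk of N rows: within a stream the upload,
    // kernel and download run in order; different streams overlap each other.
    for (int chunk = 0; chunk < numChunks; ++chunk) {
        cudaStream_t stream = streams[chunk % NUM_STREAMS];
        size_t offset = (size_t)chunk * chunkElems;

        cudaMemcpyAsync(dIn + offset, hIn + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, stream);

        int threads = 256;
        int blocks  = (chunkElems + threads - 1) / threads;
        scaleKernel<<<blocks, threads, 0, stream>>>(dIn + offset, dOut + offset,
                                                    chunkElems, 2.0f);

        cudaMemcpyAsync(hOut + offset, dOut + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    cudaDeviceSynchronize();  // "copy the last N rows back to RAM" has finished here

    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    cudaFree(dIn);     cudaFree(dOut);
    return 0;
}
```

With NUM_STREAMS chunks in flight, the driver is free to overlap chunk k's download with chunk k+1's kernel and chunk k+2's upload, which is exactly the steady state described in the list above.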
You can either use one stream per in-flight chunk of N scanlines (each stream does its own upload + compute + download) and let the driver choose the best overlap between chunks, or use one stream per operation type (one for all uploads, one for all downloads, one for all computations) and use events to maintain the overlapping behavior explicitly.
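Here is a sketch of that second, explicit organization, continuing from the previous sketch (it assumes the same buffers, chunk layout and scaleKernel; the function and event names are my own): one stream per operation type, with a pair of events per chunk so that each stage waits only on the stage it actually depends on rather than on everything issued before it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The multiply-by-constant kernel from the previous sketch
// (compile in the same translation unit, or with -rdc=true).
__global__ void scaleKernel(const float* in, float* out, int count, float factor);

void runChunksOnPerOpStreams(const float* hIn, float* hOut,
                             float* dIn, float* dOut,
                             int numChunks, int chunkElems)
{
    // One stream per operation type: all uploads, all kernels, all downloads.
    cudaStream_t upStream, computeStream, downStream;
    cudaStreamCreate(&upStream);
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&downStream);

    // One pair of events per chunk: "upload done" and "kernel done".
    std::vector<cudaEvent_t> uploaded(numChunks), computed(numChunks);
    for (int c = 0; c < numChunks; ++c) {
        cudaEventCreateWithFlags(&uploaded[c], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&computed[c], cudaEventDisableTiming);
    }

    for (int c = 0; c < numChunks; ++c) {
        size_t offset = (size_t)c * chunkElems;

        // Upload chunk c on the dedicated upload stream.
        cudaMemcpyAsync(dIn + offset, hIn + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, upStream);
        cudaEventRecord(uploaded[c], upStream);

        // The kernel for chunk c waits only on its own upload,
        // so it overlaps with uploads of later chunks.
        cudaStreamWaitEvent(computeStream, uploaded[c], 0);
        int threads = 256;
        int blocks  = (chunkElems + threads - 1) / threads;
        scaleKernel<<<blocks, threads, 0, computeStream>>>(
            dIn + offset, dOut + offset, chunkElems, 2.0f);
        cudaEventRecord(computed[c], computeStream);

        // The download of chunk c waits only on its own kernel,
        // so it overlaps with uploads and kernels of later chunks.
        cudaStreamWaitEvent(downStream, computed[c], 0);
        cudaMemcpyAsync(hOut + offset, dOut + offset,
                        chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, downStream);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < numChunks; ++c) {
        cudaEventDestroy(uploaded[c]);
        cudaEventDestroy(computed[c]);
    }
    cudaStreamDestroy(upStream);
    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(downStream);
}
```

The trade-off is more bookkeeping in exchange for not relying on the driver to discover the overlap on its own.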
If you do more computation per pixel, say enough that the kernel time roughly equals the latency of each copy, then pipelining can give you up to 3x performance (at any moment, two of the three operations are hidden behind the third).
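A similar rough estimate for the 3x figure (again assuming M chunks, with upload, kernel and download each taking the same time t per chunk): the serial version costs 3·M·t, while the pipelined version costs roughly (M + 2)·t, because after a two-chunk ramp-up one chunk finishes every t. For large M that approaches the 3x speedup.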