
CUDA allows you to overlap computation and data transfer using the asynchronous cuMemcpy functions and streams. But is this possible with NPP (NVIDIA Performance Primitives)?

A little background: I am trying to utilize the GPU with the NPP image resize functions (in our case nppiResize_8u_C3R). I am using pinned memory and successfully transfer data to the GPU using cuMemcpy2DAsync_v2 and a per-thread stream. The problem is that nppiResize_8u_C3R and all the other computation functions do not accept streams.
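
For reference, here is a rough sketch of that transfer setup, using the runtime-API cudaMemcpy2DAsync rather than the driver-API cuMemcpy2DAsync_v2 that I actually call; the image dimensions are placeholders:

```cpp
#include <cuda_runtime.h>

int main()
{
    // Placeholder 8u C3 image: bytes per row and number of rows.
    const size_t widthBytes = 1920 * 3;
    const size_t height     = 1080;

    // Pinned host buffer: required for the copy to be truly asynchronous.
    unsigned char *hSrc = nullptr;
    cudaHostAlloc((void**)&hSrc, widthBytes * height, cudaHostAllocDefault);

    // Pitched device buffer.
    unsigned char *dSrc = nullptr;
    size_t dPitch = 0;
    cudaMallocPitch((void**)&dSrc, &dPitch, widthBytes, height);

    // Asynchronous 2D copy on the per-thread default stream; with pinned
    // memory this can overlap with kernels running in other streams.
    cudaMemcpy2DAsync(dSrc, dPitch, hSrc, widthBytes, widthBytes, height,
                      cudaMemcpyHostToDevice, cudaStreamPerThread);

    cudaStreamSynchronize(cudaStreamPerThread);

    cudaFree(dSrc);
    cudaFreeHost(hSrc);
    return 0;
}
```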

When I run NVIDIA Visual Profiler I see the following:

  1. Pinned memory allows me to transfer data faster, at ~6.524 GB/s.
  2. The percentage of time when memcpy is being performed in parallel with compute is 0%.



> The problem is that nppiResize_8u_C3R and all the other computation functions do not accept streams.

NPP can use streams. It is fundamentally a stateless API, but you select the stream used by subsequent NPP operations by calling nppSetStream. There are several caveats noted on page 2 of the NPP documentation about using NPP with streams, including the recommended synchronization practice when switching streams.
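
A minimal sketch of that pattern is below. It assumes the CUDA 8-era nppiResize_8u_C3R parameter list (the signature changed around CUDA 8, so check your NPP version); the stream, image sizes and interpolation mode are placeholders:

```cpp
#include <cuda_runtime.h>
#include <nppi.h>

int main()
{
    cudaStream_t nppStream;
    cudaStreamCreate(&nppStream);

    // Placeholder source/destination geometry for the resize.
    NppiSize srcSize = {1920, 1080};
    NppiSize dstSize = {960, 540};
    NppiRect srcROI  = {0, 0, srcSize.width, srcSize.height};
    NppiRect dstROI  = {0, 0, dstSize.width, dstSize.height};

    // Device images allocated with NPP's pitched allocator
    // (contents left uninitialized for brevity).
    int srcStep = 0, dstStep = 0;
    Npp8u *dSrc = nppiMalloc_8u_C3(srcSize.width, srcSize.height, &srcStep);
    Npp8u *dDst = nppiMalloc_8u_C3(dstSize.width, dstSize.height, &dstStep);

    // All NPP calls after this point are launched into nppStream rather than
    // the default stream. Synchronize before switching NPP to another stream.
    nppSetStream(nppStream);

    nppiResize_8u_C3R(dSrc, srcStep, srcSize, srcROI,
                      dDst, dstStep, dstSize, dstROI,
                      NPPI_INTER_LINEAR);

    cudaStreamSynchronize(nppStream);

    nppiFree(dSrc);
    nppiFree(dDst);
    cudaStreamDestroy(nppStream);
    return 0;
}
```

With the NPP work launched into its own stream, copies issued with cuMemcpy2DAsync_v2 on a different stream can overlap with the resize, provided the buffers involved are independent.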

talonmies
    Is it possible to get a performance boost using *nppSetStream*? I tried it, but had no success. I have a feeling that NPP is not designed for concurrent or overlapping scenarios, and that to utilize the GPU I need to use the CUDA driver API directly. – Artur Ispiriants Jan 22 '17 at 17:16