I'm trying to perform multiple async 2D convolutions on a single image with multiple filters using NVIDIA's NPP library method nppiFilterBorder_32f_C1R_Ctx
. However, even after creating multiple streams and assigning them to NPPI's method, the overlapping isn't happening; NVIDIA's nvvp
informs the same:
That said, I'm confused if NPP supports overlapping context operations.
Below is a simplification of my code, only showing the async method calls and related variables:
std::vector<NppStreamContext> streams(n_filters);
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
// fill up NppStreamContext remaining fields
// malloc image and filter pointers
}
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
cudaMemcpyAsync(..., streams[stream_idx].hStream);
nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
cudaStreamSynchronize(streams[stream_idx].hStream);
cudaStreamDestroy(streams[stream_idx].hStream);
}
Note: All the device pointers of the output images and input filters are stored in a std::vector
, where I access them via the current stream index (e.g., float *ptr_filter_d = filters[stream_idx]
)