
I'm trying to run multiple asynchronous 2D convolutions on a single image with multiple filters using the NPP function `nppiFilterBorder_32f_C1R_Ctx` from NVIDIA's NPP library. However, even after creating multiple streams and assigning one to each NPP call, the kernels don't overlap; NVIDIA's Visual Profiler (nvvp) reports the same:

Non-overlapping streams

Given that, I'm unsure whether NPP supports overlapping stream-context operations at all.

Below is a simplification of my code, only showing the async method calls and related variables:

std::vector<NppStreamContext> streams(n_filters);

for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
  cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
  streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
  // fill up NppStreamContext remaining fields 
  // malloc image and filter pointers
}

for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
  cudaMemcpyAsync(..., streams[stream_idx].hStream);
  nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
  cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}

for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
  cudaStreamSynchronize(streams[stream_idx].hStream);
  cudaStreamDestroy(streams[stream_idx].hStream);
}

Note: all the device pointers for the output images and the input filters are stored in `std::vector`s, which I index with the current stream index (e.g., `float *ptr_filter_d = filters[stream_idx];`).
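
For reference, the "fill up NppStreamContext remaining fields" step above looks roughly as follows. This is only a sketch: the field names are the ones the `NppStreamContext` struct declares in `nppdefs.h`, filled from the corresponding CUDA device attributes.

// Sketch: populate the remaining NppStreamContext fields from the
// device attributes; ctx stands for one entry of the streams vector.
NppStreamContext &ctx = streams[stream_idx];
cudaGetDevice(&ctx.nCudaDeviceId);
cudaDeviceGetAttribute(&ctx.nMultiProcessorCount,
                       cudaDevAttrMultiProcessorCount, ctx.nCudaDeviceId);
cudaDeviceGetAttribute(&ctx.nMaxThreadsPerMultiProcessor,
                       cudaDevAttrMaxThreadsPerMultiProcessor, ctx.nCudaDeviceId);
cudaDeviceGetAttribute(&ctx.nMaxThreadsPerBlock,
                       cudaDevAttrMaxThreadsPerBlock, ctx.nCudaDeviceId);
int shared_mem_per_block = 0;
cudaDeviceGetAttribute(&shared_mem_per_block,
                       cudaDevAttrMaxSharedMemoryPerBlock, ctx.nCudaDeviceId);
ctx.nSharedMemPerBlock = shared_mem_per_block; // the struct field is a size_t
cudaDeviceGetAttribute(&ctx.nCudaDevAttrComputeCapabilityMajor,
                       cudaDevAttrComputeCapabilityMajor, ctx.nCudaDeviceId);
cudaDeviceGetAttribute(&ctx.nCudaDevAttrComputeCapabilityMinor,
                       cudaDevAttrComputeCapabilityMinor, ctx.nCudaDeviceId);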

Gustavo Stahl
  • If you look closely it looks like there is a slight overlap, right? What's the occupation on those kernels? If they are 100%, then you wouldn't see an overlap, right? Try with kernels that are too small to fill the GPU. – Homer512 Nov 17 '22 at 18:55
  • @Homer512 The occupancy reported in `nvprof` is ~95% for each NPPI kernel. I tried limiting the number of resources per stream, but it didn't change the overlap. – Gustavo Stahl Nov 17 '22 at 19:26
  • What do you mean by "limiting the number of resources"? Try shrinking the input arrays until the reported theoretical occupancy is below 50%. – Homer512 Nov 17 '22 at 19:42
  • By limiting the number of resources per stream, I meant reducing the amount of shared memory and threads per block, among the other resources available in the `NppStreamContext` struct, since they contribute to the occupancy. By shrinking the input size, do you mean the size of the input image? If so, I'm not sure the output of the 2D convolution will be the same. At the moment I'm working with an input image of 512 x 512 and 7 x 7 filters. – Gustavo Stahl Nov 17 '22 at 20:06
  • I think @Homer512 meant utilization (i.e. how many SMs are utilized). Occupancy is about efficiency in using every SM to its fullest extent in terms of shared memory, registers and threads. – paleonix Nov 17 '22 at 20:16
  • Remember that we are trying to figure out whether parallel kernel execution works, not getting accurate results. 512 x 512 is 262,144 pixels, and convolution can be fully parallelized. If we assume one thread per pixel, that's an order of magnitude more pixels than GPU threads on your device (2048 resident threads per SM on compute capability 6.1; 5 SMs in a GTX 1050). There is not really much point in executing such large kernels in parallel. Why should the GPU schedule them that way? – Homer512 Nov 17 '22 at 20:16
  • @paleonix in this case the number of SMs being used is 5 out of 5, as reported by `nvprof`. – Gustavo Stahl Nov 17 '22 at 20:28
  • @Homer512 I tried OpenCV's `cv::filter2D` for the 2D convolution, which runs on the CPU. Each call took ~12 ms, and it is done 24 times. Since my application must work in real time, I thought this wasn't the way to go, and I decided to rely on the GPU for the computations. NPP's convolution takes around ~4 ms, which is excellent for my purpose. So, given the nature of the GPU, are you suggesting I rely on CPU parallelization rather than the GPU's? – Gustavo Stahl Nov 17 '22 at 20:37
  • You might be able to get more overlap by using higher priority streams for later launches. I know that trick from [this multi-GPU tutorial](https://github.com/FZJ-JSC/tutorial-multi-gpu/blob/main/06-H_Overlap_Communication_and_Computation_MPI/solutions/jacobi.cpp#L219-L243) where one wants overlap between the big compute kernel and the small communication kernels. But the big compute kernel is launched first because it takes the longest. – paleonix Nov 17 '22 at 20:39
  • He means that the GPU will only overlap kernels if it has unused resources. Overlapping the kernels in this case would cause the results of all kernels but the last one to be ready later, while without overlap you at least get the results of the first one as fast as possible, and so on. – paleonix Nov 17 '22 at 20:43
  • I don't mind the results being ready later, as long as the total time elapsed across kernels is less than what I currently have. Sorry if I misunderstood, but how can I free up resources on the GPU without compromising the input/output variables? The only way I knew was to limit the stream resources in the `NppStreamContext` struct. I'll give the priority streams you mentioned a try, @paleonix. – Gustavo Stahl Nov 17 '22 at 20:59
  • But the total time **will not be less**. Your GPU is already doing as much as it can. With a bigger GPU you might see more overlap. Forcing it via the priority trick doesn't seem to make sense in your context. As said before: there **is** clear overlap visible in your screenshot. The priority trick will just cause the GPU to schedule a wave of blocks from the higher-priority grid when the current wave of blocks is done. This means that each kernel, as seen in the profiler, will take **longer**. – paleonix Nov 17 '22 at 21:02
  • Oh, I see. Any idea whether writing my own kernel that handles all these filters at once could help with this task? Other than that, I don't think there's much else for me to do. – Gustavo Stahl Nov 17 '22 at 21:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249713/discussion-between-paleonix-and-gustavo-stahl). – paleonix Nov 17 '22 at 21:13

1 Answer


To summarize and add to the comments: The profile does show small overlaps, so the answer to the title question is clearly yes.

The reason the overlap is so small is just that each NPP kernel already needs all the resources of the GPU for most of its runtime: at one thread per pixel, a 512 x 512 image needs 262,144 threads, while (as established in the comments) a GTX 1050 can only keep 5 SMs × 2048 = 10,240 threads resident at a time. What is visible at the end of each kernel is probably the tail effect (the grid's number of blocks is not a multiple of the number of blocks that can be resident on the SMs at any moment in time), so blocks from the next kernel are getting scheduled during the tail and there is some overlap.

It can sometimes be useful (i.e. an optimization) to force overlap between a big kernel that was started first and uses the full device, and a later, small kernel that needs only a few resources. In that case one can use stream priorities via cudaStreamCreateWithPriority to hint the scheduler to schedule blocks from the second kernel before blocks from the first one. An example of this can be found in this multi-GPU example (permalink).
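
A minimal sketch of that trick applied to this question's setup (the per-stream priority offset and the clamping are my own illustration; `n_filters` is the variable from the question):

// Sketch: create the streams with increasing priority so that blocks of
// later launches are preferred by the scheduler. In CUDA a *smaller*
// number means a *higher* priority, so greatestPriority is the
// numerically smallest allowed value.
int leastPriority = 0, greatestPriority = 0;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

std::vector<cudaStream_t> streams(n_filters);
for (size_t i = 0; i < streams.size(); i++)
{
  // later streams get higher priority, clamped to the supported range
  int priority = std::max(greatestPriority, leastPriority - static_cast<int>(i));
  cudaStreamCreateWithPriority(&streams[i], cudaStreamNonBlocking, priority);
}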

In this case, however, the kernels are all the same size and there is no reason to prioritize any of them over the others, so forcing an overlap like this would not decrease the total runtime: the compute resources are the limiting factor. In the profiler view the kernels might then show more overlap, but each one would also take longer. That is the reason why the scheduler does not overlap the kernels even though you allow it to do so by using multiple streams (see asynchronous vs. parallel).

To still increase performance, one could write a custom CUDA kernel that applies all the filters in a single launch. The main reason this could do better than NPP in this case is that all the NPP kernels take the same input image. A single kernel can therefore significantly decrease the number of accesses to global memory by reading each tile of the input image only once (into shared memory, although L1 caching might suffice), then applying all the filters sequentially or in parallel (by splitting the thread block up into smaller units) and writing out the results.
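
A minimal sketch of such a kernel. This is my own illustration, not NPP code: it assumes 7 x 7 filters (as mentioned in the comments), all filter coefficients packed back-to-back in `filters`, one output image per filter packed back-to-back in `dst`, and replicate border handling (only one of the border types NPP offers). Note that it applies the filters without flipping them (cross-correlation); reverse the inner loops for a true convolution.

#define RADIUS 3  // 7x7 filter -> halo of 3 pixels on each side
#define TILE   16 // each block computes a TILE x TILE output tile

__global__ void multiFilter7x7(const float* __restrict__ src,
                               int width, int height,
                               const float* __restrict__ filters, // n_filters * 7 * 7 coefficients
                               int n_filters,
                               float* __restrict__ dst)           // n_filters output images
{
  __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

  // cooperatively load the tile plus halo ONCE, replicating the border
  for (int ty = threadIdx.y; ty < TILE + 2 * RADIUS; ty += blockDim.y)
    for (int tx = threadIdx.x; tx < TILE + 2 * RADIUS; tx += blockDim.x)
    {
      int ix = min(max((int)blockIdx.x * TILE + tx - RADIUS, 0), width - 1);
      int iy = min(max((int)blockIdx.y * TILE + ty - RADIUS, 0), height - 1);
      tile[ty][tx] = src[iy * width + ix];
    }
  __syncthreads();

  const int out_x = blockIdx.x * TILE + threadIdx.x;
  const int out_y = blockIdx.y * TILE + threadIdx.y;
  if (out_x >= width || out_y >= height)
    return;

  // the tile is now reused by every filter instead of being re-read
  // from global memory n_filters times
  for (int f = 0; f < n_filters; f++)
  {
    const float* flt = filters + f * (2 * RADIUS + 1) * (2 * RADIUS + 1);
    float acc = 0.0f;
    for (int dy = 0; dy < 2 * RADIUS + 1; dy++)
      for (int dx = 0; dx < 2 * RADIUS + 1; dx++)
        acc += flt[dy * (2 * RADIUS + 1) + dx]
             * tile[threadIdx.y + dy][threadIdx.x + dx];
    dst[f * width * height + out_y * width + out_x] = acc;
  }
}

A launch covering the whole image could then look like this (`d_image`, `d_filters` and `d_outputs` are hypothetical device pointers):

dim3 block(TILE, TILE);
dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
multiFilter7x7<<<grid, block>>>(d_image, width, height,
                                d_filters, n_filters, d_outputs);

Compared to n_filters separate NPP launches, this reads each input pixel from global memory roughly once instead of once per filter, at the cost of some extra registers and shared memory per block.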

paleonix