Why could OpenCV wait for a stream-ed CUDA operation instead of proceeding asynchronously?

Question

I'm trying to perform image dilation using OpenCV and CUDA. I make two calls to filter->apply(...), one after the other, each with a different filter object, a different GpuMat, and a different stream. They DO get executed in different streams, as can be seen from the attached nvvp profiling info, but they run sequentially instead of in parallel. This seems to be caused by the CPU waiting for the stream (cudaStreamSynchronize). Why would OpenCV do that? I'm not explicitly waiting on the stream anywhere, so what else could be wrong?

Here's the actual code:

    cv::Mat hIm1, hIm2;
    cv::imread("/path/im1.png", cv::IMREAD_GRAYSCALE).convertTo(hIm1, CV_32FC1);
    cv::imread("/path/im2.png", cv::IMREAD_GRAYSCALE).convertTo(hIm2, CV_32FC1);
    cv::cuda::GpuMat dIm1(hIm1);
    cv::cuda::GpuMat dIm2(hIm2);

    cv::cuda::Stream stream1, stream2;

    const cv::Mat strel1 = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(41, 41));
    cv::Ptr<cv::cuda::Filter> filter1 = cv::cuda::createMorphologyFilter(cv::MORPH_DILATE, dIm1.type(), strel1);
    const cv::Mat strel2 = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(41, 41));
    cv::Ptr<cv::cuda::Filter> filter2 = cv::cuda::createMorphologyFilter(cv::MORPH_DILATE, dIm2.type(), strel2);
    cudaDeviceSynchronize();
    filter1->apply(dIm1, dIm1, stream1);
    filter2->apply(dIm2, dIm2, stream2);
    cudaDeviceSynchronize();

The images are sized 512×512; I tried it with smaller ones (down to 64×64) but to no avail!

I don't mind that you guys downvote the question if you think it's bad, but you could add a comment saying how to improve it! — BIOStheZerg, May 22 '20 at 07:50

score 0 · Answer 1 · answered Jun 22 '20 at 17:35

It is the user's responsibility to structure the application so that the work does not end up running sequentially.

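A few best practices:

Pipeline your code so that the CPU and the GPU are both utilized at the same time, and make the GPU calls asynchronous.
Kernels can only run concurrently if the GPU has spare resources; otherwise they run sequentially. If the kernel launched by filter1->apply(...) utilizes 100% of the GPU, the kernel launched by filter2->apply(...) will wait in the pipeline until the first one completes.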

Check the GPU utilization data in the profiler for more details.
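As a rough back-of-the-envelope check of the resource argument above, the sketch below estimates whether a single launch could fill the device on its own. Everything in it is an assumption for illustration: `estimateLaunch`, the 32×8 block shape, and the figures of 20 multiprocessors with 8 resident blocks each are hypothetical, and this is not how OpenCV sizes its kernels. On real hardware the multiprocessor count comes from `cudaGetDeviceProperties` and the resident block count from `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cpp
#include <cassert>

// Number of blocks needed to cover `extent` pixels with blocks of `block` threads.
int ceilDiv(int extent, int block) {
    assert(extent > 0 && block > 0);
    return (extent + block - 1) / block;
}

struct LaunchEstimate {
    int blocksNeeded;    // blocks in the grid for one image
    int residentBlocks;  // blocks the device can hold at once (assumed)
    bool fillsDevice;    // true if this launch alone occupies every slot
};

// Hypothetical helper: compares the grid size for a width x height image
// against the number of blocks the device can keep resident at once.
// smCount and blocksPerSm are assumed inputs, not values queried from CUDA.
LaunchEstimate estimateLaunch(int width, int height,
                              int blockW, int blockH,
                              int smCount, int blocksPerSm) {
    assert(smCount > 0 && blocksPerSm > 0);
    const int blocksNeeded = ceilDiv(width, blockW) * ceilDiv(height, blockH);
    const int residentBlocks = smCount * blocksPerSm;
    return {blocksNeeded, residentBlocks, blocksNeeded >= residentBlocks};
}
```

Under these assumed numbers, `estimateLaunch(512, 512, 32, 8, 20, 8)` needs 1024 blocks against 160 resident slots, so a second kernel would have to wait for slots to free up. `estimateLaunch(64, 64, 32, 8, 20, 8)` needs only 16 blocks, so at that size resource exhaustion alone would not explain the serialization, and the profiler timeline is the place to look for an explicit synchronization instead.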
