
I am trying to iterate over a cv::cuda::GpuMat with the following code:

#include <iostream>
#include <opencv2/core/cuda.hpp>
#include <opencv2/highgui.hpp>

__global__ void kernel(uchar* src, int rows, int cols, size_t step)
{
    int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
    int colInd = blockIdx.x * blockDim.x + threadIdx.x;

    if ((rowInd < rows) && (colInd < cols))
    {
        uchar * rowptr = src + (rowInd * step);
        rowptr[colInd] = 255;
    }

}

void invoke_kernel(cv::cuda::GpuMat _img)
{
    dim3 tpb(50, 50);
    dim3 bpg(((_img.cols + 49) / 50), ((_img.rows + 49)/ 50));
    kernel<<<bpg, tpb>>> (_img.data, _img.rows, _img.cols, _img.step);

}

int main()
{


    cv::cuda::GpuMat mat;
    mat.create(cv::Size(500, 500), CV_8UC1);
    std::cout << mat.rows << " " << mat.cols << std::endl;
    invoke_kernel(mat);

    cv::Mat img;
    mat.download(img);

    cv::namedWindow("test");
    cv::imshow("test", img);
    cv::waitKey(0);

    return 0;
}

As you can see, it's just supposed to set the entire (originally black) image to white in the kernel.

The image just stays black, except for the first column, which is white. It feels like I'm doing something really stupid somewhere, but I just can't figure it out :/

I checked that the kernel is launched with the correct dimensions (10x10 blocks, 50x50 threads per block). Another problem comes when I try to use the NVIDIA debugger: after some googling I've found that the NVCC compiler seems to optimize away my rowInd and colInd variables, so the debugger does not display their values.

Thanks for your time.

ARentalTV
  • Well, turns out the kernel isn't even starting. I don't know why, but even if I just put an infinite loop in there, nothing changes – ARentalTV Apr 27 '17 at 17:23
  • any time you are having trouble with CUDA code, it's good practice to use [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) and run your code with `cuda-memcheck`. – Robert Crovella Apr 27 '17 at 18:27
  • @RobertCrovella yeah, that's actually how I figured it out: cudaGetLastError returned cudaErrorInvalidConfiguration. I googled a bit and found out that the block size was too big – ARentalTV Apr 27 '17 at 19:46
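For reference, the check the comments describe is a short addition right after the launch in invoke_kernel (a minimal sketch; the macro-based version from the linked answer is more convenient in practice). With the original 50x50 block it reports cudaErrorInvalidConfiguration:

void invoke_kernel(cv::cuda::GpuMat _img)
{
    dim3 tpb(50, 50);
    dim3 bpg((_img.cols + 49) / 50, (_img.rows + 49) / 50);
    kernel<<<bpg, tpb>>>(_img.data, _img.rows, _img.cols, _img.step);

    // catches invalid launch configurations (e.g. too many threads per block)
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
        std::cerr << "launch failed: " << cudaGetErrorString(launchErr) << std::endl;

    // catches errors raised while the kernel itself is running
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess)
        std::cerr << "kernel failed: " << cudaGetErrorString(syncErr) << std::endl;
}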

1 Answer


Well, it turns out the issue was that the block size of 50x50 = 2500 threads was a bit too much. There is a limit that I hadn't figured out yet, but, as stated in the CUDA docs, 16x16 is fine.

So, just to clarify:

dim3 tpb(16, 16);
dim3 bpg(((_img.cols + 15) / 16), ((_img.rows + 15)/ 16));

in invoke_kernel does the job.
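Put together, a sketch of the adjusted invoke_kernel (the kernel from the question is unchanged; writing the rounding in terms of tpb.x and tpb.y keeps it in sync with the block size):

void invoke_kernel(cv::cuda::GpuMat _img)
{
    // 16 x 16 = 256 threads per block, safely under the 1024-thread limit
    dim3 tpb(16, 16);
    // round up so the grid covers every pixel
    dim3 bpg((_img.cols + tpb.x - 1) / tpb.x, (_img.rows + tpb.y - 1) / tpb.y);
    kernel<<<bpg, tpb>>>(_img.data, _img.rows, _img.cols, _img.step);
}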

Always read the docs, kids.

ARentalTV
  • The limit is that CUDA kernels are limited to 1024 threads per block, where that limit applies to the product of the block dimensions x, y, z. This is documented in the CUDA programming guide, and also reported by the CUDA sample code deviceQuery, and it is the source of dozens of questions here on the CUDA tag. – Robert Crovella Apr 27 '17 at 18:20
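For anyone who wants to confirm those numbers on their own GPU, they can be read at runtime from cudaDeviceProp (a minimal standalone sketch using the CUDA runtime API):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // the product of the block dimensions x*y*z must not exceed maxThreadsPerBlock
    std::cout << "maxThreadsPerBlock: " << prop.maxThreadsPerBlock << std::endl;
    std::cout << "maxThreadsDim: "
              << prop.maxThreadsDim[0] << " x "
              << prop.maxThreadsDim[1] << " x "
              << prop.maxThreadsDim[2] << std::endl;
    return 0;
}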