
I am trying to convolve an image using CUDA, but I cannot get a result. cuda-gdb does not work properly on my system so I cannot tell what is happening inside the CUDA kernel. The CUDA kernel I am using is the following:

__global__ void
convolve_component_EXTEND_kern(const JSAMPLE *data,  // image data
                               ssize_t data_width,   // image width
                               ssize_t data_height,  // image height
                               const float *kern,    // convolution kernel data
                               ssize_t kern_w_f,     // convolution kernel has a width of 2 * kern_w_f + 1
                               ssize_t kern_h_f,     // convolution kernel has a height of 2 * kern_h_f + 1
                               JSAMPLE *res)         // array to store the result
{
    ssize_t i = ::blockIdx.x * ::blockDim.x + ::threadIdx.x;
    ssize_t j = ::blockIdx.y * ::blockDim.y + ::threadIdx.y;

    float value = 0;

    for (ssize_t m = 0; m < 2 * kern_w_f + 1; m++) {
        for (ssize_t n = 0; n < 2 * kern_h_f + 1; n++) {
            ssize_t x = i + m - kern_w_f; // column index for this contribution to the convolution sum for (i, j)
            ssize_t y = j + n - kern_h_f; // row index for this contribution
            x = x < 0 ? 0 : (x >= data_width ? data_width - 1 : x);
            y = y < 0 ? 0 : (y >= data_height ? data_height - 1 : y);
            value += ((float) data[data_width * y + x]) * kern[(2 * kern_w_f + 1) * n + m];
        }
    }

    res[data_width * j + i] = (JSAMPLE) value;
}

and I am invoking it in this function

void
convolve_component_EXTEND_cuda(const JSAMPLE *data,
                               ssize_t data_width,
                               ssize_t data_height,
                               const float *kern,
                               ssize_t kern_w_f,
                               ssize_t kern_h_f,
                               JSAMPLE *res)
{
    JSAMPLE *d_data;
    cudaMallocManaged(&d_data,
                      data_width * data_height * sizeof(JSAMPLE));
    cudaMemcpy(d_data,
               data,
               data_width * data_height * sizeof(JSAMPLE),
               cudaMemcpyHostToDevice);

    float *d_kern;
    cudaMallocManaged(&d_kern,
                      (2 * kern_w_f + 1) * (2 * kern_h_f + 1) * sizeof(float));
    cudaMemcpy(d_kern,
               kern,
               (2 * kern_w_f + 1) * (2 * kern_h_f + 1) * sizeof(float),
               cudaMemcpyHostToDevice);

    JSAMPLE *d_res;
    cudaMallocManaged(&d_res,
                      data_width * data_height * sizeof(JSAMPLE));

    dim3 threadsPerBlock(16, 16);  // can be adjusted up to 32 x 32 (1024 threads per block is the maximum)
    dim3 numBlocks(data_width / threadsPerBlock.x,
                   data_height / threadsPerBlock.y);
    convolve_component_EXTEND_kern<<<numBlocks, threadsPerBlock>>>(d_data,
                                                                   data_width,
                                                                   data_height,
                                                                   d_kern,
                                                                   kern_w_f,
                                                                   kern_h_f,
                                                                   d_res);

    cudaDeviceSynchronize();

    cudaMemcpy(d_res,
               res,
               data_width * data_height * sizeof(JSAMPLE),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_kern);
    cudaFree(d_res);
}

In this context, the image data is contained in the array called data, in such a way that the pixel at (i, j) is accessed by indexing into the array at data_width * j + i. The kernel data is in the array called kern; it has a width of 2 * kern_w_f + 1 and a height of 2 * kern_h_f + 1, and its element at (i, j) is accessed by indexing into the kern array at (2 * kern_w_f + 1) * j + i, just like the data array. The array res is used to store the result of the convolution, and is allocated using calloc() before being passed to the function.
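To pin down whether the problem is in the kernel logic or in the CUDA plumbing, the same clamp-to-edge convolution can be run on the CPU and its output compared against the GPU's. This is a minimal host-side sketch of that check (the function name is mine; JSAMPLE is assumed to be libjpeg's unsigned char, and std::ptrdiff_t stands in for the non-standard ssize_t):

```cpp
#include <cstddef>

// CPU reference for the EXTEND (clamp-to-edge) convolution described above.
// Indexing mirrors the CUDA kernel: pixel (i, j) lives at data[data_width * j + i].
using JSAMPLE = unsigned char;

void convolve_component_EXTEND_cpu(const JSAMPLE *data,
                                   std::ptrdiff_t data_width,
                                   std::ptrdiff_t data_height,
                                   const float *kern,
                                   std::ptrdiff_t kern_w_f,
                                   std::ptrdiff_t kern_h_f,
                                   JSAMPLE *res)
{
    for (std::ptrdiff_t j = 0; j < data_height; j++) {
        for (std::ptrdiff_t i = 0; i < data_width; i++) {
            float value = 0;
            for (std::ptrdiff_t n = 0; n < 2 * kern_h_f + 1; n++) {
                for (std::ptrdiff_t m = 0; m < 2 * kern_w_f + 1; m++) {
                    std::ptrdiff_t x = i + m - kern_w_f;
                    std::ptrdiff_t y = j + n - kern_h_f;
                    // clamp out-of-range samples to the nearest edge pixel
                    x = x < 0 ? 0 : (x >= data_width ? data_width - 1 : x);
                    y = y < 0 ? 0 : (y >= data_height ? data_height - 1 : y);
                    value += float(data[data_width * y + x])
                           * kern[(2 * kern_w_f + 1) * n + m];
                }
            }
            res[data_width * j + i] = JSAMPLE(value);
        }
    }
}
```

Running this on the same input and diffing against the GPU result quickly shows whether the kernel math or the host-side CUDA calls are at fault.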

When I invoke the second function on an image's data, all the image's pixels are converted to 0 instead of the convolution being applied. Can anyone please point out the problem?

  • Are you sure CUDA is actually working on your system? What CUDA version and what GPU are you using? Why are you using cudaMallocManaged() with cudaMemcpy()? – geebert Jan 11 '21 at 12:09
  • I am using CUDA V10.1.243. The GPU is a GTX 1060 mobile version. As for whether or not CUDA is actually working, I'm not sure. When I run it, it does not indicate to me in any way that something has gone wrong. Can CUDA silently fail? – Mark Mizzi Jan 11 '21 at 12:18
  • If your CUDA version and driver do not work together properly, it is perfectly possible that you will not get a warning about not finding a CUDA device (as you are not checking for CUDA errors), and then the kernel does nothing. So this is more about whether you can, for example, run the CUDA samples and obtain results there. – geebert Jan 11 '21 at 12:23
  • One more observation: after working on the data with the GPU, you call cudaMemcpy(d_res, res, ...). This should be cudaMemcpy(res, d_res, ...), as you have to flip dest and src. – geebert Jan 11 '21 at 12:33

1 Answer


Just after calling the kernel and performing the convolution, you try to copy your data back to the res array:

cudaDeviceSynchronize();

cudaMemcpy(d_res,
           res,
           data_width * data_height * sizeof(JSAMPLE),
           cudaMemcpyDeviceToHost);

This should be

cudaDeviceSynchronize();

cudaMemcpy(res,
           d_res,
           data_width * data_height * sizeof(JSAMPLE),
           cudaMemcpyDeviceToHost);

as the first argument of cudaMemcpy is the destination pointer:

cudaError_t cudaMemcpy  ( void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
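As noted in the comments, silent failures like this are much easier to catch if every CUDA runtime call is checked. A minimal sketch of that pattern (the CUDA_CHECK macro name is my own, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; abort with a diagnostic on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After the kernel launch in convolve_component_EXTEND_cuda:
// CUDA_CHECK(cudaGetLastError());        // reports launch failures
// CUDA_CHECK(cudaDeviceSynchronize());   // reports errors during execution
// CUDA_CHECK(cudaMemcpy(res, d_res,
//                       data_width * data_height * sizeof(JSAMPLE),
//                       cudaMemcpyDeviceToHost));
```

With this in place, a driver/toolkit mismatch or a bad launch configuration shows up as an explicit error instead of an all-zero result.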
geebert