Getting the imageDenoising CUDA example to work using a MATLAB CUDAKernel

Question

TL;DR

I'm looking for a way to extract a part of an existing CUDA Toolkit example and turn it into a CUDAKernel executable in MATLAB.

The Story

In an attempt to obtain a short-runtime implementation of the non-local means (NLM) 2D filter, I stumbled upon the imageDenoising example provided with the CUDA Toolkit which implements two variants of this filter, called NLM & NLM2 (or "quick NLM").

Having no previous experience with CUDA coding, I initially attempted to follow MATLAB's documentation on the subject, which resulted in several strange errors involving PTX compilation, multiple entry points, and a wrong number of inputs in the C prototype. At this point I realized that this isn't going to be a "just works" case and that some tinkering is required.

So I decided to eliminate the multiple entry point issue by simply deleting parts of the imageDenoising.cu file and consolidating the relevant .cuh (either ..._nlm_kernel.cuh or ..._nlm2_kernel.cuh) into the .cu, so as to obtain a single entry point at any given time. To my surprise this actually compiled, and I was finally able to create a CUDAKernel without an error (using the command k = parallel.gpu.CUDAKernel('imageDenoising.ptx', 'uint8_T *, int, int, float, float');).

This, however, was not enough: I had mistakenly concluded that the first argument is the unprocessed image in the form of an RGB matrix (i.e. X*Y*3 uint8), and so the result I was getting back was exactly the input, but with zeros in the first four elements.
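One reason an X*Y*3 uint8 array cannot simply stand in for a pixel buffer is memory layout: MATLAB stores such an array in column-major order with the three colour planes back to back, whereas image kernels often work on one packed value per pixel. The sketch below illustrates the difference in plain C++. It assumes a row-major buffer of 32-bit pixels with R in the lowest byte; that byte order and the helper name `packRgba` are assumptions made for illustration, not something taken from the Toolkit sample.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert a MATLAB-style height x width x 3 uint8 array
// (column-major, colour planes stored back to back) into row-major packed pixels.
// Assumed byte order: R in the lowest byte, then G, then B, with alpha = 255.
std::vector<std::uint32_t> packRgba(const std::uint8_t* planar, int height, int width)
{
    const std::size_t planeSize = static_cast<std::size_t>(height) * width;
    std::vector<std::uint32_t> packed(planeSize);
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            // Column-major index of element (r, c) within one colour plane.
            const std::size_t in = r + static_cast<std::size_t>(c) * height;
            const std::uint32_t red   = planar[in];
            const std::uint32_t green = planar[in + planeSize];
            const std::uint32_t blue  = planar[in + 2 * planeSize];
            // Row-major index of the same pixel in the packed buffer.
            packed[static_cast<std::size_t>(r) * width + c] =
                red | (green << 8) | (blue << 16) | (0xFFu << 24);
        }
    }
    return packed;
}
```

A buffer passed in the wrong layout would still be accepted by a pointer argument, which is why this kind of mismatch tends to show up as wrong output rather than as an error.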

After searching a bit more, I realized that such a conversion involves additional, critical aspects I'm entirely unaware of (like the need to initialize __device__ variables), at which stage I decided to ask for help.

The Problem

I'm currently wondering how to efficiently continue from here. While I'd love to hear if this kind of approach can generally bear fruit (and whether a complete example of this process is available somewhere), which other pitfalls I should look out for, and what alternative courses of action I can take (considering my very limited knowledge in CUDA and the fact I won't hire anybody else to do this for me), I keep in mind that this is SO and so I must have a specific programming problem, so here goes:

How do I modify imageDenoising.cu such that the MATLAB CUDAKernel constructed from it will also accept the unprocessed image as an input?

Note: in my application, the input is a 2D, grayscale matrix of type double.
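For a grayscale double input, the image would have to reach the filter as an ordinary pointer argument in MATLAB's column-major layout. The sketch below is a plain C++ CPU reference of a basic non-local means filter written against such a pointer. It is not the Toolkit sample's code: the function name `nlmReference`, its parameters, and the simple exponential weighting are assumptions chosen for illustration. The per-pixel body is roughly the logic a kernel taking `const double *src` would run for one output pixel.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical CPU reference for a basic non-local means filter.
// src and dst are height x width matrices in column-major order
// (element (r, c) lives at index r + c * height), as in MATLAB.
void nlmReference(const double* src, double* dst, int height, int width,
                  int patchRadius, int searchRadius, double h)
{
    // Clamp coordinates so patches near the border reuse edge pixels.
    auto at = [&](int r, int c) {
        r = std::min(std::max(r, 0), height - 1);
        c = std::min(std::max(c, 0), width - 1);
        return src[r + c * height];
    };
    const int patchSize = (2 * patchRadius + 1) * (2 * patchRadius + 1);

    for (int c = 0; c < width; ++c) {
        for (int r = 0; r < height; ++r) {
            double weightSum = 0.0, valueSum = 0.0;
            // Visit every candidate pixel in the search window.
            for (int dc = -searchRadius; dc <= searchRadius; ++dc) {
                for (int dr = -searchRadius; dr <= searchRadius; ++dr) {
                    // Mean squared difference between the two patches.
                    double dist = 0.0;
                    for (int pc = -patchRadius; pc <= patchRadius; ++pc) {
                        for (int pr = -patchRadius; pr <= patchRadius; ++pr) {
                            const double d = at(r + pr, c + pc)
                                           - at(r + dr + pr, c + dc + pc);
                            dist += d * d;
                        }
                    }
                    dist /= patchSize;
                    // Similar patches get weights close to 1, dissimilar ones close to 0.
                    const double w = std::exp(-dist / (h * h));
                    weightSum += w;
                    valueSum += w * at(r + dr, c + dc);
                }
            }
            // weightSum is at least 1 because the centre patch matches itself.
            dst[r + c * height] = valueSum / weightSum;
        }
    }
}
```

In a GPU version, the two outer loops would typically be replaced by one thread per output pixel, with `r` and `c` derived from the thread and block indices.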

Related: How CudaMalloc work?

P.S.

A working piece of code would obviously be welcomed, but I'd really rather "learn to fish".

Dev-iL · Accepted Answer · 2015-01-23T20:56:25.950

I ended up taking an alternative approach to CUDAKernel, using a MEX file, by doing the following:

Setting up the external libraries OpenCV v2.4.10 (not v3!) and mexopencv.
Writing a small wrapper function for OpenCV's fastNlMeansDenoising using the guidelines of mexopencv for unimplemented functions, as seen below (excluding the documentation):

#include "mexopencv.hpp"
using namespace cv;

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
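    // Check arguments: at most one output, and an input image followed by
    // name-value pairs (so the number of inputs must be odd)
    if (nlhs > 1 || nrhs<1 || ((nrhs % 2) != 1) )
        mexErrMsgIdAndTxt("fastNLM:invalidArgs", "Wrong number of arguments");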

    // Argument vector  
    vector<MxArray> rhs(prhs, prhs + nrhs);

    // Option processing
      // Defaults:
    double h = 3;
    int templateWindowSize = 7;
    int searchWindowSize = 21;
      // Parsing input name-value pairs:
    for (int i = 1; i<nrhs; i += 2) {
        string key = rhs[i].toString();
        if (key == "h")
            h = rhs[i + 1].toDouble();
        else if (key == "templateWindowSize")
            templateWindowSize = rhs[i + 1].toInt();
        else if (key == "searchWindowSize")
            searchWindowSize = rhs[i + 1].toInt();
        else
            mexErrMsgIdAndTxt("mexopencv:error", "Unrecognized option");
    }

    // Process
    Mat src(rhs[0].toMat()), dst;
    fastNlMeansDenoising(src, dst, h, templateWindowSize, searchWindowSize);

    // Convert cv::Mat back to mxArray*
    plhs[0] = MxArray(dst);
}

Compiling it... and voilà - a working CUDA-accelerated NLM filter.

The answer to my question itself can be found by comparing opencv\sources\modules\photo\src\cuda\nlm.cu (this is the opencv2 path) with imageDenoising_nlm2_kernel.cuh.

This solution worked well for me because it was more important for me to get an NLM filter running, rather than using CUDAKernel.

The main lesson I learned from this (and I'd like to pass on to others) is:

Running CUDA code in MATLAB can also be done in ways other than CUDAKernel, such as using MEX wrappers as shown above.
