TL;DR
I'm looking for a way to extract a part of an existing CUDA Toolkit example and turn it into a CUDAKernel executable in MATLAB.
The Story
In an attempt to obtain a short-runtime implementation of the non-local means (NLM) 2D filter, I stumbled upon the imageDenoising example provided with the CUDA Toolkit which implements two variants of this filter, called NLM & NLM2 (or "quick NLM").
Having no previous experience with CUDA coding, I initially attempted to follow MATLAB's documentation on the subject, which resulted in several strange errors including: ptx compilation, multiple entry points and wrong number of inputs in the C prototype. At this point I realized that this isn't going to be a "just works" case and that some tinkering is required.
So I decided to eliminate the multiple entry point issue by simply deleting parts of the imageDenoising.cu file and consolidating the relevant .cuh (either ..._nlm_kernel.cuh or ..._nlm2_kernel.cuh) into the .cu, so as to obtain a single entry point at any given time.
To my surprise this actually managed to compile and I was finally able to create a CUDAKernel without an error (using the command k = parallel.gpu.CUDAKernel('imageDenoising.ptx', 'uint8_T *, int, int, float, float');).
This however was not enough, because I mistakenly concluded that the 1st argument is the unprocessed image in the form of an RGB matrix (i.e. X*Y*3 uint8), and so the result I was getting back was exactly the input but with 0 in the 1st 4 elements.
After searching a bit more, I realized that such a conversion process involves additional, critical aspects I'm entirely unaware of (like the need to initialize __device__ variables), at which stage I decided to ask for help.
The Problem
I'm currently wondering how to efficiently continue from here. I'd love to hear whether this kind of approach can generally bear fruit (and whether a complete example of this process is available somewhere), which other pitfalls I should look out for, and what alternative courses of action I can take (considering my very limited knowledge of CUDA and the fact that I won't hire anybody else to do this for me). However, I keep in mind that this is SO and so I must have a specific programming problem, so here goes:
How do I modify imageDenoising.cu such that the MATLAB CUDAKernel constructed from it will also accept the unprocessed image as an input?
Note: in my application, the input matrix is a 2d, grayscale, double matrix.
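To make the goal concrete, here is a minimal sketch of the kind of entry point I'm after. Everything in it is an assumption for illustration: the name denoiseSketch, the parameter order, and the body (a plain element-wise copy standing in for the actual NLM computation) are hypothetical and are not taken from imageDenoising.cu.

```cuda
// Hypothetical placeholder kernel, NOT the actual NLM code from the toolkit example.
// It only illustrates a single entry point that takes the unprocessed image
// (src) as an explicit input and writes the result to a separate buffer (dst).
// extern "C" keeps the entry point name unmangled in the generated PTX.
extern "C" __global__ void denoiseSketch(double *dst,        // output image
                                         const double *src,  // unprocessed input image
                                         int imageW,         // extent along the x thread index
                                         int imageH)         // extent along the y thread index
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    // Guard against threads that fall outside the image when the grid
    // does not divide the image size evenly.
    if (ix < imageW && iy < imageH) {
        // Plain linear index. MATLAB stores matrices column-major, so a real
        // filter has to decide which of imageW / imageH maps to rows.
        const int idx = iy * imageW + ix;

        // Stand-in for the real filter: copy the pixel through unchanged.
        dst[idx] = src[idx];
    }
}
```

If I understand MATLAB's documentation correctly, the matching C prototype string for such a kernel would be 'double *, const double *, int, int', with the const pointer treated as input only and the non-const pointer returned as the output.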
Related: How CudaMalloc work?
P.S.
A working piece of code would obviously be welcomed, but I'd really rather "learn to fish".