GPU library that implements Image Convolution using cuFFT?

Question

I've been using the image convolution function from Nvidia Performance Primitives (NPP). However, my kernel is fairly large with respect to the image size, and I've heard rumors that NPP's convolution is a direct convolution instead of an FFT-based convolution. (I don't think the NPP source code is available, so I'm not sure how it's implemented.)
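As a rough illustration of why kernel size matters here, the sketch below compares textbook operation counts for the two approaches. The function names and the cost model (about 5·N·log2(N) flops per complex FFT of N points, 6 flops per complex multiply) are assumptions made for illustration. They say nothing about how NPP or cuFFT are actually implemented, and real GPU timings also depend on memory bandwidth and transfer costs.

```cpp
#include <cassert>
#include <cmath>

// Direct convolution: one multiply and one add per (image pixel, kernel tap).
double directFlops(double h, double w, double kh, double kw) {
    assert(h > 0 && w > 0 && kh > 0 && kw > 0);
    return 2.0 * h * w * kh * kw;
}

// FFT-based convolution on a grid zero-padded to (h+kh-1) x (w+kw-1):
// three 2D transforms (image, kernel, inverse) at an assumed
// 5*N*log2(N) flops each, plus N complex multiplies at 6 flops each.
double fftFlops(double h, double w, double kh, double kw) {
    assert(h > 0 && w > 0 && kh > 0 && kw > 0);
    const double n = (h + kh - 1.0) * (w + kw - 1.0);
    return 3.0 * 5.0 * n * std::log2(n) + 6.0 * n;
}

// True when the rough model above favors the FFT route.
bool fftLooksCheaper(double h, double w, double kh, double kw) {
    return fftFlops(h, w, kh, kw) < directFlops(h, w, kh, kw);
}
```

Under this model, `fftLooksCheaper(9000, 9000, 3, 3)` is false while `fftLooksCheaper(9000, 9000, 200, 200)` is true, which is why a large kernel makes the FFT route worth testing.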

I'd like to see how fast a cuFFT-based convolution function could run in the image processing application that I'm working on.

You might say "hey, just put your image into cuFFT and see how fast it is!" And if I were using Matlab, you'd be right--it's a one-line call in Matlab:

%assuming the images are padded
convolved = ifft2(fft2(image).* fft2(filter));
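To make the steps hidden behind that one-liner explicit, here is a minimal CPU-only C++ sketch of the same pipeline: zero-pad both inputs, forward transform, pointwise multiply, inverse transform, scale. It is not cuFFT code. The recursive radix-2 FFT and the helper names (`fftConvolve`, `directConvolve`, `padded`, `nextPow2`) are assumptions chosen for illustration, and rounding the padded size up to a power of two is only needed by this simple FFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;
using Matrix = std::vector<std::vector<double>>;
using Grid = std::vector<std::vector<cd>>;

// Recursive radix-2 FFT, in place. invert=true gives the unscaled inverse.
void fft(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double angle = (invert ? 2.0 : -2.0) * std::acos(-1.0) / static_cast<double>(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const cd w = std::polar(1.0, angle * static_cast<double>(k));
        a[k] = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
}

// 2D transform: 1D FFT over every row, then over every column.
void fft2(Grid& g, bool invert) {
    for (auto& row : g) fft(row, invert);
    std::vector<cd> col(g.size());
    for (std::size_t c = 0; c < g[0].size(); ++c) {
        for (std::size_t r = 0; r < g.size(); ++r) col[r] = g[r][c];
        fft(col, invert);
        for (std::size_t r = 0; r < g.size(); ++r) g[r][c] = col[r];
    }
}

std::size_t nextPow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Copy m into the top-left corner of a zero-filled rows x cols complex grid.
Grid padded(const Matrix& m, std::size_t rows, std::size_t cols) {
    Grid g(rows, std::vector<cd>(cols, cd(0.0, 0.0)));
    for (std::size_t r = 0; r < m.size(); ++r)
        for (std::size_t c = 0; c < m[r].size(); ++c) g[r][c] = cd(m[r][c], 0.0);
    return g;
}

// Full linear convolution via the convolution theorem.
Matrix fftConvolve(const Matrix& image, const Matrix& filter) {
    const std::size_t outR = image.size() + filter.size() - 1;
    const std::size_t outC = image[0].size() + filter[0].size() - 1;
    const std::size_t R = nextPow2(outR), C = nextPow2(outC);
    Grid a = padded(image, R, C), b = padded(filter, R, C);
    fft2(a, false);
    fft2(b, false);
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c) a[r][c] *= b[r][c];
    fft2(a, true);
    const double scale = 1.0 / static_cast<double>(R * C);  // inverse is unscaled
    Matrix out(outR, std::vector<double>(outC));
    for (std::size_t r = 0; r < outR; ++r)
        for (std::size_t c = 0; c < outC; ++c) out[r][c] = a[r][c].real() * scale;
    return out;
}

// Reference implementation: direct full convolution.
Matrix directConvolve(const Matrix& image, const Matrix& filter) {
    Matrix out(image.size() + filter.size() - 1,
               std::vector<double>(image[0].size() + filter[0].size() - 1, 0.0));
    for (std::size_t i = 0; i < image.size(); ++i)
        for (std::size_t j = 0; j < image[0].size(); ++j)
            for (std::size_t u = 0; u < filter.size(); ++u)
                for (std::size_t v = 0; v < filter[0].size(); ++v)
                    out[i + u][j + v] += image[i][j] * filter[u][v];
    return out;
}
```

Calling `fftConvolve(image, filter)` and `directConvolve(image, filter)` on the same small inputs should agree to within floating-point rounding. Without the zero-padding, the pointwise product would yield a circular convolution rather than a linear one.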

However, there's a lot of boilerplate needed to get cuFFT to do image convolution. So, I'm looking for code that does a cuFFT-based convolution and abstracts away the implementation. And, indeed, I did find a few things:

This GitHub repo has a file called cufft_sample.cu. I thought the code looked promising, but I found another file in the repo containing comments that say the convolution implementation produces incorrect results:

    WARNING: GpuFFTConvOp currently don't return the good answer
    TODO: extend to cover more case, as in many case we will crash!

I had it in my head that the Kitware VTK/ITK codebase provided cuFFT-based image convolution. Alas, it turns out that (at best) doing cuFFT-based routines is planned for future releases.
I found some code on the Matlab File Exchange that does 2D convolution. The important parts are implemented in C/CUDA, but there's a Matlab wrapper. I'm working on stripping away the Matlab wrapper in favor of pure C/C++/CUDA, but I'm still curious whether there are any solutions that are more elegant and/or proven.

Any recommendations among these three options?

What else is out there in terms of pre-built code that does cuFFT-based image convolution?

score 3 · Accepted Answer · answered Nov 26 '12 at 07:32


You could try ArrayFire.

In ArrayFire, you can do the following.

array image(rows, columns, h_image);
array filter(frows, fcols, h_filter);
array res = convolve(image, filter);

Depending on the size of the filter, the convolve command uses either cuFFT or a faster hand-tuned kernel. If you prefer to use fft2, you could do the following:

array res = ifft2(fft2(image) * fft2(filter));

But I highly recommend you use convolve instead, because it has been optimized to get the best performance out of cuFFT.


Disclaimer:

ArrayFire is not open source. However, it has a free-to-use version.
I work at AccelerEyes and develop ArrayFire. I am linking to our product because @solvingPuzzles specifically asked for a library similar to what ArrayFire provides.


Pavan Yalamanchili


Thanks! I just looked up the [convolution benchmarking results](http://www.accelereyes.com/products/benchmarks_arrayfire) for Arrayfire on your website. The page I linked has 1D and 2D separable convolution benchmarks. Do you know of any benchmark results on Arrayfire 2D nonseparable convolution? – solvingPuzzles Nov 26 '12 at 07:48
Also, having optional zero-padding arguments in the Arrayfire `fft2` and `ifft2` is pure genius. I wish the vanilla `cuFFT` code from Nvidia had this too! – solvingPuzzles Nov 26 '12 at 07:49
@solvingPuzzles I don't have access to benchmarks for 2D convolution. However, if you can specify sizes for the image and filter, I can do a quick test and let you know how fast it is. – Pavan Yalamanchili Nov 26 '12 at 07:52
Let's try a couple of extremes: 9000x9000 image, 3x3 filter. 9000x9000 image, 5x5 filter. 9000x9000 image, 200x200 filter. Thanks for doing this! If results look good, I'll install Arrayfire and play around with it tomorrow. – solvingPuzzles Nov 26 '12 at 08:01
Also, while we're talking about ArrayFire... Does the Linear Algebra portion of ArrayFire allow a user to plug custom functions into BLAS? For example, matrix-matrix multiply typically does a `dot product` of every pair of vectors in 2 matrices. Could I replace `dot product` with `L2 norm` so that I get a big batched L2-norm to get distances between vectors in 2 huge matrices? In the same vein, could I replace `dot product` with `min` so I get the element-wise min of vectors in the huge matrices? – solvingPuzzles Nov 26 '12 at 08:02

24, 64 and 512 ms respectively. For the rest, can you email me (you can find my address on my profile) so that we can talk about it in more detail? – Pavan Yalamanchili Nov 26 '12 at 08:22
