Equivalent of Matlab's find command in CUDA

Question

I'd like to find non-zero elements of a matrix as fast as possible. Having CUDA \ Jacket in mind, I've learned that this is much slower than the "regular" CPU version of Matlab's find, probably due to memory allocation issues, since the size of the output is not known prior to the find function. However, using bwlabel and regionprops (both supported in Jacket) does effectively yield info regarding the non-zero elements, and much faster than Matlab's built in Image Processing Toolbox functions. Is there a way to harness this to get the non-zero elements? Is there instead a way to do some processing on each of the labeled objects that are found using bwlabel?

@nate, can you post some code of what you are doing and how you are benchmarking ? find is one of the faster functions in Jacket, and you should not be having any trouble. Also mention if you are using sparse matrices. — Pavan Yalamanchili, Aug 26 '12 at 04:06
@pavan , see my response to gpu below. Jacket is fine as long as you feed it big enough matrices. I somehow forgot that... — bla, Aug 26 '12 at 04:42

score 2 · Accepted Answer · answered Aug 25 '12 at 21:50

In my experience, the FIND implementation supported by Jacket is very fast, at least for matrices larger than 300x300 or so. I tested this on my laptop and share the results in what follows. My HW specs are:

>> ginfo
Jacket v2.2 (build 77be88c) by AccelerEyes (64-bit Windows)
License: Standalone (C:\Program Files\AccelerEyes\Jacket\2.2\engine\jlicense.dat)
Addons: MGL16, JMC, SDK, DLA, SLA
CUDA toolkit 4.2, driver 4.2 (296.10)
GPU1 GeForce GT 540M, 2048 MB, Compute 2.1 (single,double)
Display Device: GPU1 GeForce GT 540M
Memory Usage: 1697 MB free (2048 MB total)

CPU is Intel Core i7-2630QM.

I get that Jacket hits ~3X speedups on the FIND function over the CPU. Here is the benchmark code I used:

% time Jacket vs CPU
for n = 5:12;
    x(n) = 2^n;
    Ac = single(rand(x(n)));
    Ag = gsingle(Ac);
    t_cpu(n) = timeit(@() find(Ac > 0.5));
    t_gpu(n) = timeit(@() find(Ag > 0.5));
end

% plot results
plot(x, t_cpu ./ t_gpu);
xlabel('Matrix Edge Size', 'FontSize', 14);
ylabel('Jacket (GPU) Speedup over CPU', 'FontSize', 14);

Here are the results of running this code:

Jacket (GPU) Speedup over CPU

I'm sure the BWLABEL and REGIONPROPS functions supported by Jacket are also very fast, but you might be able to get by with FIND itself given the benchmarks above.

I embarrassingly fell again into the stupid thought of feeding the gpu small blocks of signal that were up to 50x50 size, and thinking that will solve it better. So by calculating many times '[idx idy]=find(gdouble(m)>threshold);' where m is a 10x10 to 50x50 signal matrix that is obtained serially, I got using 'timeit' times that were 5-20 times slower per find iteration... Instead, accumulating a larger matrix indeed shows the factor 3 improvement (and I use gdouble once). Sorry for my stupidity, and thanks again... — bla, Aug 26 '12 at 04:40

score 1 · Answer 2 · answered Jul 30 '18 at 05:06

Another possibility, besides that discussed by arrayfire, is using CUDA Thrust. Below, I'm posting a simple example in which particles on a 2D domain are selected according to an (x, y) threshold. I whould be easily adapted to the case of interest to the poster or, more generally, to simulate the behavior of Matlab's find.

CODE

#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <iostream>

struct isWithinThreshold {

    double2 thresholdPoint;

    __host__ __device__ isWithinThreshold(double2 thresholdPoint_) { thresholdPoint = thresholdPoint_; };

    __host__ __device__ bool operator()(const double2 r) {
        return ((r.x > thresholdPoint.x) && (r.y > thresholdPoint.y));
    }
};

/********/
/* MAIN */
/********/
int main()
{
    const int N = 5;
    double2 thresholdPoint = make_double2(0.5, 0.5);

    thrust::host_vector<double2> particleCoordinates(N);
    particleCoordinates[0].x = 0.45;    particleCoordinates[0].y = 0.4;
    particleCoordinates[1].x = 0.1;     particleCoordinates[1].y = 0.3;
    particleCoordinates[2].x = 0.8;     particleCoordinates[2].y = 0.9;
    particleCoordinates[3].x = 0.7;     particleCoordinates[3].y = 0.9;
    particleCoordinates[4].x = 0.7;     particleCoordinates[4].y = 0.45;

    // --- Find out the indices
    thrust::host_vector<int> indices(N);
    thrust::host_vector<int>::iterator end = thrust::copy_if(thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(N),
        particleCoordinates.begin(),
        indices.begin(),
        isWithinThreshold(thresholdPoint));
    int size = end - indices.begin();
    indices.resize(size);

    // --- Fetch values corresponding to the selected indices
    thrust::host_vector<double2> values(size);
    thrust::copy(thrust::make_permutation_iterator(particleCoordinates.begin(), indices.begin()),
        thrust::make_permutation_iterator(particleCoordinates.end(), indices.end()),
        values.begin());

    for (int k = 0; k < size; k++)
        printf("k = %d; index = %d; value.x = %f; value.y = %f\n", k, indices[k], values[k].x, values[k].y);

    return 0;
}

Equivalent of Matlab's find command in CUDA

2 Answers2