
I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements - set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.

What is the most efficient way to run a CUDA or OpenCL kernel only over elements from S? Can I run a kernel over array A? All examples I've seen so far deal with contiguous 1D, 2D, or 3D arrays. Is there any problem with introducing one layer of indirection?

Paul Jurczak
  • Is the large array already GPU-resident? I remember transfer rates to the GPU not being so great, so just copying A to the GPU may save time, especially if the test for A-ness is O(1). – rampion Aug 15 '10 at 02:53
  • It is GPU resident (I edited the question to reflect this). – Paul Jurczak Aug 15 '10 at 03:29
  • The CUDA architecture exposes different kinds of memory. Which are you using? – karlphillip Aug 15 '10 at 03:53
  • I have not done any CUDA or OpenCL programming yet. I have a reasonably good general understanding of GPGPU, but no practical experience. The answer to this question will help me decide how well GPGPU applies to my set of problems. – Paul Jurczak Aug 15 '10 at 06:15

3 Answers


In CUDA, contiguous (rather than random) memory access is preferred because it allows memory coalescing. It's not a big deal to create an array of randomly distributed indexes and process one index from A per thread, something like this:

__global__ void kernel_func(const unsigned * A, float * S, unsigned n)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n)                   // guard threads past the end of A
        return;

    const unsigned S_idx = A[idx];  // one level of indirection

    S[S_idx] *= 5; // for example...
    ...
}
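
Launching it is then just a matter of sizing the grid to |A| rather than |S|. A minimal host-side sketch, assuming d_A and d_S are device pointers that have already been filled in and numIndexes is the size of A:

const unsigned blockSize = 256;  // a typical starting point; tune this
const unsigned gridSize  = (numIndexes + blockSize - 1) / blockSize;
kernel_func<<<gridSize, blockSize>>>(d_A, d_S, numIndexes);  // one thread per index in A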

But the accesses to S[S_idx] are effectively random, so they will be slow (this is the most likely bottleneck).

If you decide to use CUDA, you will have to experiment a lot with block/grid sizes, minimize register consumption per thread (to maximize the number of blocks per multiprocessor), and maybe sort A so that nearby threads use nearby values of S_idx...
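
A sketch of that last suggestion, using Thrust (shipped with the CUDA toolkit) and assuming d_A is a raw device pointer to the numIndexes entries of A:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Sort the index array in place on the device so that neighboring
// threads end up reading neighboring elements of S.
thrust::sort(thrust::device, d_A, d_A + numIndexes);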

KoppeKTop

If you sort your indexes, or build the list already sorted, that will help performance a lot. If there are clusters of indexes, try using texture memory. And if you are accessing a number of elements from each thread with some overlap, I found that using shared memory gives a significant performance boost.
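
A minimal sketch of the texture idea, using the texture-reference API of that era (since deprecated; texS must be bound to S with cudaBindTexture before the launch):

texture<float, 1, cudaReadModeElementType> texS;  // 1D texture over S

__global__ void kernel_func_tex(const unsigned * A, float * out, unsigned n)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n)
        return;
    // Texture fetches go through a dedicated cache, which pays off
    // when the indexes in A are clustered.
    out[idx] = tex1Dfetch(texS, A[idx]);
}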

Eri

No problem at all with the one level of indirection. I use that a fair amount in my own CUDA code. Is the set S likely to remain static over time? If so, it may very well be worth generating the lookup A like you said.
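
Since your membership test is cheap, A can even be built on the GPU by stream compaction. A sketch using Thrust's copy_if, where needs_processing is a hypothetical stand-in for your test and N is the size of the big array:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

struct needs_processing            // hypothetical membership test for S
{
    const float * S;
    __host__ __device__ bool operator()(unsigned i) const
    {
        return S[i] > 0.0f;        // placeholder criterion
    }
};

// Compact the indexes 0..N-1 that pass the test into A; the result
// comes out already sorted, which helps coalescing later.
thrust::device_vector<unsigned> A(N);
needs_processing pred = { d_S };
unsigned numIndexes = thrust::copy_if(
    thrust::counting_iterator<unsigned>(0),
    thrust::counting_iterator<unsigned>(N),
    A.begin(), pred) - A.begin();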

Also, texture memory will be your friend in providing cache locality. The type of texture you use (1D, 2D, or 3D) will depend on your problem.

peakxu