
I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements - set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.

What is the most efficient way to run a CUDA or OpenCL kernel only over elements from S? Can I run a kernel over array A? All examples I've seen so far deal with contiguous 1D, 2D, or 3D arrays. Is there any problem with introducing one layer of indirection?

Paul Jurczak
  • Is the large array already GPU-resident? I remember transfer rates to the GPU not being so great, so just copying A to the GPU may save time, especially if the test for A-ness is O(1). – rampion Aug 15 '10 at 02:53
  • It is GPU resident (I edited the question to reflect this). – Paul Jurczak Aug 15 '10 at 03:29
  • The CUDA architecture exposes different kinds of memory. Which are you using? – karlphillip Aug 15 '10 at 03:53
  • I have not done any CUDA or OpenCL programming yet. I have a reasonably good general understanding of GPGPU, but no practical experience. The answer to this question will help me decide how well GPGPU applies to my set of problems. – Paul Jurczak Aug 15 '10 at 06:15

3 Answers


In CUDA, contiguous (rather than random) memory access is preferred because it allows memory coalescing. It's not a big deal to create an array of randomly distributed indexes and process one index from A per thread, something like this:

__global__ void kernel_func(const unsigned * A, float * S, unsigned n)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n)                   // guard threads past the end of A
        return;

    const unsigned S_idx = A[idx];  // one level of indirection

    S[S_idx] *= 5; // for example...
    ...
}
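
Launching it is then just a matter of sizing the grid to |A| rather than |S|. A minimal host-side sketch, assuming d_A and d_S are device pointers that have already been filled in and numIndexes is the size of A:

const unsigned blockSize = 256;  // a typical starting point; tune this
const unsigned gridSize  = (numIndexes + blockSize - 1) / blockSize;
kernel_func<<<gridSize, blockSize>>>(d_A, d_S, numIndexes);  // one thread per index in A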

But the accesses to S[S_idx] are effectively random, so they will be slow (this is the most likely bottleneck).

If you decide to use CUDA, you will have to experiment a lot with block/grid sizes, minimize register consumption per thread (to maximize the number of blocks per multiprocessor), and maybe sort A so that nearby threads use nearby values of S_idx...
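
A sketch of that last suggestion, using Thrust (shipped with the CUDA toolkit) and assuming d_A is a raw device pointer to the numIndexes entries of A:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Sort the index array in place on the device so that neighboring
// threads end up reading neighboring elements of S.
thrust::sort(thrust::device, d_A, d_A + numIndexes);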

KoppeKTop

If you sort your indexes, or build the list already sorted, that will help performance a lot. If there are clusters of indexes, try using texture memory. And if you are accessing a number of elements from each thread with some overlap, I found that using shared memory gives a significant performance boost.
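
A minimal sketch of the texture idea, using the texture-reference API of that era (since deprecated; texS must be bound to S with cudaBindTexture before the launch):

texture<float, 1, cudaReadModeElementType> texS;  // 1D texture over S

__global__ void kernel_func_tex(const unsigned * A, float * out, unsigned n)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n)
        return;
    // Texture fetches go through a dedicated cache, which pays off
    // when the indexes in A are clustered.
    out[idx] = tex1Dfetch(texS, A[idx]);
}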

Eri

No problem at all with the one level of indirection. I use that a fair amount in my own CUDA code. Is the set S likely to remain static over time? If so, it may very well be worth generating the lookup A like you said.
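
Since your membership test is cheap, A can even be built on the GPU by stream compaction. A sketch using Thrust's copy_if, where needs_processing is a hypothetical stand-in for your test and N is the size of the big array:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

struct needs_processing            // hypothetical membership test for S
{
    const float * S;
    __host__ __device__ bool operator()(unsigned i) const
    {
        return S[i] > 0.0f;        // placeholder criterion
    }
};

// Compact the indexes 0..N-1 that pass the test into A; the result
// comes out already sorted, which helps coalescing later.
thrust::device_vector<unsigned> A(N);
needs_processing pred = { d_S };
unsigned numIndexes = thrust::copy_if(
    thrust::counting_iterator<unsigned>(0),
    thrust::counting_iterator<unsigned>(N),
    A.begin(), pred) - A.begin();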

Also, texture memory will be your friend in providing cache locality. The type of texture you use (1D, 2D, or 3D) will depend on your problem.

peakxu