This doesn't give the GPU much work to do (each thread performs only a single add), so the array will have to be a considerable size before you'd see a benefit. Anyway:
I use C++ and am not familiar with C# or CUDAfy, but it should be easy to port the logic. A kernel function that stores the sum of each pair of elements in an array is:
template<typename T>
__global__ void sum_combinations_of_array( const T* arr, const size_t len, T* dest )
{
    // each thread handles one (tx,ty) pair of elements
    const int tx = blockIdx.x*blockDim.x+threadIdx.x;
    const int ty = blockIdx.y*blockDim.y+threadIdx.y;
    // only the strict upper triangle (tx < ty) is computed, so each pair is summed once
    if( tx < len && ty < len && tx < ty ) {
        dest[tx*len+ty] = arr[tx]+arr[ty];
    }
}
You're just using 2D thread blocks to decide which elements of the array to add (tx and ty take the place of i and j in your code, roughly as in the serial sketch below). arr should be at least len elements in size, and dest should be at least len*len elements in size.
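For comparison, here is a minimal serial sketch of what the kernel computes (assuming your original code fills a len-by-len row-major result the same way, which is an assumption on my part):

// Serial sketch: sum every pair arr[i]+arr[j] with i < j into a len*len
// row-major matrix; all other entries are left at 0.
#include <vector>

std::vector<int> sum_combinations_serial( const std::vector<int>& arr )
{
    const size_t len = arr.size();
    std::vector<int> dest( len*len, 0 );
    for( size_t i = 0; i < len; ++i ) {
        for( size_t j = i+1; j < len; ++j ) {
            dest[i*len+j] = arr[i]+arr[j];
        }
    }
    return dest;
}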
The host code to set all of this up and run it would be something like:
const int len = 1000;
int* arr;
cudaMalloc( &arr, len*sizeof(int) );
int* matrix;
cudaMalloc( &matrix, len*len*sizeof(int) );
// cudaMallocPitch could also be used here, but then you'll
// have to pay attention to the pitch
cudaMemset( matrix, 0, len*len*sizeof(int) );
// copy host array to arr with cudaMemcpy
// ...
const int numThreads = ???; // depends on your hardware; 256 per block is a common starting point
// grid x covers tx (one block per element), grid y together with the block covers ty
dim3 grid( len, (len+numThreads-1)/numThreads ), threads( 1, numThreads );
sum_combinations_of_array<int><<<grid,threads>>>( arr, len, matrix );
cudaDeviceSynchronize(); // wait for completion
// copy device matrix to host with cudaMemcpy (or cudaMemcpy2D)
// remember any element with i >= j (the diagonal and below) will be 0
// ...
cudaFree( arr );
cudaFree( matrix );
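The copies that the comments above gloss over would look something like the sketch below; host_arr and host_matrix are hypothetical host-side buffers I've made up for illustration, and error checking is omitted:

// Hypothetical host buffers: input of size len, result of size len*len
std::vector<int> host_arr( len );        // fill with your data
std::vector<int> host_matrix( len*len );

// host -> device, before the kernel launch
cudaMemcpy( arr, host_arr.data(), len*sizeof(int), cudaMemcpyHostToDevice );

// device -> host, after cudaDeviceSynchronize()
cudaMemcpy( host_matrix.data(), matrix, len*len*sizeof(int), cudaMemcpyDeviceToHost );

// host_matrix[i*len+j] now holds host_arr[i]+host_arr[j] for every i < j

In a real program you'd also want to check the cudaError_t returned by each CUDA call.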