This doesn't give the GPU much work to do (each thread performs only a single add), so the array will have to be a considerable size before you'd see a benefit. Anyway:
I use C++ and am not familiar with C# or CUDAfy, but it should be easy to port the logic. A kernel function that stores the sum of each pair of elements in an array is:
template<typename T>
__global__ void sum_combinations_of_array( const T* arr, const size_t len, T* dest )
{
    // each thread handles one (tx,ty) pair of elements
    const int tx = blockIdx.x*blockDim.x+threadIdx.x;
    const int ty = blockIdx.y*blockDim.y+threadIdx.y;
    // only the strict upper triangle (tx < ty) is computed, so each pair is summed once
    if( tx < len && ty < len && tx < ty ) {
        dest[tx*len+ty] = arr[tx]+arr[ty];
    }
}
You're just using 2D thread blocks to decide which elements of the array to add (tx and ty take the place of i and j in your code, roughly as in the serial sketch below). arr should be at least len elements in size, and dest should be at least len*len elements in size.
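For comparison, here is a minimal serial sketch of what the kernel computes (assuming your original code fills a len-by-len row-major result the same way, which is an assumption on my part):

// Serial sketch: sum every pair arr[i]+arr[j] with i < j into a len*len
// row-major matrix; all other entries are left at 0.
#include <vector>

std::vector<int> sum_combinations_serial( const std::vector<int>& arr )
{
    const size_t len = arr.size();
    std::vector<int> dest( len*len, 0 );
    for( size_t i = 0; i < len; ++i ) {
        for( size_t j = i+1; j < len; ++j ) {
            dest[i*len+j] = arr[i]+arr[j];
        }
    }
    return dest;
}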
The host code to set all of this up and run it would be something like:
const int len = 1000;
int* arr;
cudaMalloc( &arr, len*sizeof(int) );
int* matrix;
cudaMalloc( &matrix, len*len*sizeof(int) );
// cudaMallocPitch could also be used here, but then you'll
// have to pay attention to the pitch
cudaMemset( matrix, 0, len*len*sizeof(int) );
// copy host array to arr with cudaMemcpy
// ...
const int numThreads = ???; // depends on your hardware; 256 per block is a common starting point
// grid x covers tx (one block per element), grid y together with the block covers ty
dim3 grid( len, (len+numThreads-1)/numThreads ), threads( 1, numThreads );
sum_combinations_of_array<int><<<grid,threads>>>( arr, len, matrix );
cudaDeviceSynchronize(); // wait for completion
// copy device matrix to host with cudaMemcpy (or cudaMemcpy2D)
// remember any element with i >= j (the diagonal and below) will be 0
// ...
cudaFree( arr );
cudaFree( matrix );
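The copies that the comments above gloss over would look something like the sketch below; host_arr and host_matrix are hypothetical host-side buffers I've made up for illustration, and error checking is omitted:

// Hypothetical host buffers: input of size len, result of size len*len
std::vector<int> host_arr( len );        // fill with your data
std::vector<int> host_matrix( len*len );

// host -> device, before the kernel launch
cudaMemcpy( arr, host_arr.data(), len*sizeof(int), cudaMemcpyHostToDevice );

// device -> host, after cudaDeviceSynchronize()
cudaMemcpy( host_matrix.data(), matrix, len*len*sizeof(int), cudaMemcpyDeviceToHost );

// host_matrix[i*len+j] now holds host_arr[i]+host_arr[j] for every i < j

In a real program you'd also want to check the cudaError_t returned by each CUDA call.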