TLDR: I am trying to write GPU code that computes a blockwise reduction on an array. The input looks like [block_0, trash_0, block_1, trash_1, ..., block_n, trash_n], and I want to compute block_0 + block_1 + ... + block_n. Relevant sizes: each block (and trash block) has ~1,000 elements, and there are ~10,000-100,000 blocks.
Longer Explanation/Context: I am working on a GPU code that computes the matrix-vector product of a matrix whose blocks are lower triangular and Toeplitz. Here is what I have so far:
Consider a block-row of the matrix, T = [T_{i0}, T_{i1}, ..., T_{in}]. Each T_{ij} is lower triangular and Toeplitz, so it will look like, for example, [[a, 0, 0], [b, a, 0], [c, b, a]]. Consider also the vector v = [v_0, v_1, ..., v_n], also in block form, so that each v_j looks like, for example, [x, y, z]. To get the i'th block of the output, we need to compute the sum T_{i0}v_0 + T_{i1}v_1 + ... + T_{in}v_n. Since each T_{ij} is Toeplitz, we compute the matvecs T_{ij}v_j as follows:
1. Pad the first column of T_{ij} and v_j with zeros to twice the length. Continuing the 3x3 example from above, this looks like T_{ij} -> [a, b, c, 0, 0, 0] and v_j -> [x, y, z, 0, 0, 0]. We do this for each block j = 0...n and join the padded vectors together. After this step, we will have the arrays T_i_pad -> [a0, b0, c0, 0, 0, 0, a1, b1, c1, 0, 0, 0, ..., an, bn, cn, 0, 0, 0] and v_pad -> [x0, y0, z0, 0, 0, 0, x1, y1, z1, 0, 0, 0, ..., xn, yn, zn, 0, 0, 0].
2. Take the FFT of the T_i_pad and v_pad arrays. For this, I am using the cuFFT library with the cufftPlanMany() function to do the FFTs in a batched way. I only deal with real inputs (double precision), so I use the D2Z options. (A minimal sketch of this setup and the next step follows this list.)
3. Do a pointwise multiply of the FFT'd arrays and scale by 1/sqrt(length_of_each_padded_block) (length_of_each_padded_block = 6 in this example).
4. Take an inverse batched FFT (Z2D). The result will be an array the same size as the initial v_pad array, containing the block matvecs I care about as well as some other terms that don't matter, arranged as result = [T_{i0}v_0, ***, T_{i1}v_1, ***, ..., T_{in}v_n, ***]. In the 3x3 example, each T_{ij}v_j will have 3 elements, and so will each ***.
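For reference, here is a minimal sketch of steps 2 and 3 (with illustrative names like pad_len and n_blocks; this is not my actual code, just the shape of the calls):

#include <cufft.h>

// sketch of the batched D2Z plan: n_blocks contiguous real blocks of length
// pad_len in, n_blocks contiguous complex blocks of length pad_len/2 + 1 out
void make_forward_plan(cufftHandle *fwd, int pad_len, int n_blocks)
{
    int n[1] = {pad_len};
    cufftPlanMany(fwd, 1, n,
                  NULL, 1, pad_len,         // input layout
                  NULL, 1, pad_len / 2 + 1, // output layout
                  CUFFT_D2Z, n_blocks);
    // the Z2D plan is the same call with the two layouts swapped and
    // CUFFT_Z2D; execution is then one cufftExecD2Z / cufftExecZ2D call
}

// pointwise complex multiply with scaling for step 3,
// where n_total = n_blocks * (pad_len/2 + 1)
__global__ void PointwiseMultiplyScale(const cufftDoubleComplex *t_hat,
                                       cufftDoubleComplex *v_hat,
                                       double scale, int n_total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_total)
    {
        cufftDoubleComplex a = t_hat[i], b = v_hat[i];
        v_hat[i] = {scale * (a.x * b.x - a.y * b.y),
                    scale * (a.x * b.y + a.y * b.x)};
    }
}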
All of this I have already done, and it is working correctly. The next step is to do a reduction over the result array so that I get the output T_{i0}v_0 + T_{i1}v_1 + ... + T_{in}v_n. This is where I am not sure what the best option is. Also, while I used a 3x3 example here, the scale I am actually looking to run this on is blocks of size O(10^3), with O(10^4) or O(10^5) such blocks.
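For concreteness, the reduction I want is equivalent to this simple CPU loop (m is the unpadded block size; the factor of 2 in the stride skips the trash blocks):

#include <cstddef>

// CPU reference: out[j] = block_0[j] + block_1[j] + ... + block_n[j]
void BlockwiseReduceRef(const double *result, double *out, int m, int num_blocks)
{
    for (int j = 0; j < m; j++)
        out[j] = 0.0;
    for (int i = 0; i < num_blocks; i++)
        for (int j = 0; j < m; j++)
            out[j] += result[(size_t)i * 2 * m + j]; // stride of 2*m doubles
}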
The way I am currently implementing the reduction is with CUDA cooperative groups and a grid-wide sync. I am using the standard binary-tree reduction algorithm, but at the threadblock level rather than at the thread level as is much more common. For this example, let us assume there are 16 blocks that need to be reduced, each with 3 elements as above. The algorithm I have now proceeds as follows:
1. Launch a grid of 8 threadblocks. At the first stage of the binary tree, each threadblock reads 2 of the blocks of result and computes a sum, writing to the left in global memory. For example, threadblock 0 will add T_{i0}v_0 + T_{i1}v_1 and write this over T_{i0}v_0.
2. Synchronize across blocks between each stage of the binary tree with the cooperative groups grid sync method, and continue the binary-tree reduction pattern.
3. At the end, the result will be stored where T_{i0}v_0 originally was.
Here is the kernel code and the host code that calls it. Note that there are some further complications to ensure that no more than the maximum number of blocks that can run at one time are launched, and to handle non-power-of-2 cases. I am also using the vector data type double2, since I read in a lot of places that vectorized loads are better.
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cooperative_groups.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
// wall-clock time in microseconds since `start`
unsigned long long dtime_usec(unsigned long long start=0){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#define gpuErrchk(x) x // error checking stubbed out for brevity; see the note in the first edit below
const int MAX_BLOCK_SIZE=1024;
// elementwise a += b over `size` double2 elements, strided across the block
static __device__ void VectorAdd(double2 *a, double2 *b, int size)
{
    double2 tmp1, tmp2;
    for (int i = threadIdx.x; i < size; i += blockDim.x)
    {
        tmp1 = a[i];
        tmp2 = b[i];
        a[i] = {tmp1.x + tmp2.x, tmp1.y + tmp2.y};
    }
}
// binary-tree reduction across threadblocks, synchronized with a grid-wide sync
static __global__ void StridedVectorReduceKernel(double *d_in, int length, int num_vectors, int stride)
{
    cooperative_groups::grid_group grid = cooperative_groups::this_grid();
    int bid = blockIdx.x;
    int offset = 1;
    double2 *d_in2 = reinterpret_cast<double2 *>(d_in);
    for (int s = gridDim.x; s > 1; s = (s + 1) / 2)
    {
        grid.sync(); // wait for the previous tree level to finish
        int index = 2 * offset * bid;
        int index2 = index + offset;
        if (bid < s && index2 < num_vectors)
        {
            VectorAdd(d_in2 + (index * length * stride) / 2, d_in2 + (index2 * length * stride) / 2, (length + 1) / 2);
        }
        offset *= 2;
    }
    grid.sync();
    // fold in the leftover partial sum for non-power-of-2 cases
    if (bid == 0 && offset < num_vectors)
    {
        VectorAdd(d_in2, d_in2 + (offset * length * stride) / 2, (length + 1) / 2);
    }
}
// grid-stride reduction: each block accumulates a strided subset of the
// vectors, then the last block to finish folds the per-block partials together
__global__ void SVR2Kernel(volatile double *d_in, int num_vectors, int length, int stride, int *bc)
{
    __shared__ int my_block;
    int t = threadIdx.x;
    double val;
    for (int j = t; j < length; j += blockDim.x){
        val = 0;
        for (int i = blockIdx.x; i < num_vectors; i += gridDim.x)
            val += d_in[i*length+j];
        d_in[blockIdx.x*length+j] = val; // partial sum for this block
    }
    __threadfence(); // make this thread's partial sums visible grid-wide
    __syncthreads(); // ensure the whole block has written before taking a ticket
    // use block draining
    if (t == 0) my_block = atomicAdd(bc, 1);
    __syncthreads();
    if (my_block == gridDim.x-1){
        // I am last block: fold the gridDim.x partial vectors into vector 0
        for (int j = t; j < length; j += blockDim.x){
            val = d_in[j];
            for (int i = 1; i < gridDim.x; i++) val += d_in[i*length+j];
            d_in[j] = val;
        }
    }
}
// double2 variant of SVR2Kernel; a volatile double2 cannot be copied
// directly, so the .x/.y components are read and written separately
__global__ void SVR3Kernel(volatile double2 *d_in, int num_vectors, int length, int stride, int *bc)
{
    __shared__ int my_block;
    int t = threadIdx.x;
    double2 val;
    for (int j = t; j < length; j += blockDim.x){
        val = {0, 0};
        for (int i = blockIdx.x; i < num_vectors; i += gridDim.x)
        {
            val.x += d_in[i*length + j].x;
            val.y += d_in[i*length + j].y;
        }
        d_in[blockIdx.x*length+j].x = val.x;
        d_in[blockIdx.x*length+j].y = val.y;
    }
    __threadfence(); // make this thread's partial sums visible grid-wide
    __syncthreads(); // ensure the whole block has written before taking a ticket
    // use block draining
    if (t == 0) my_block = atomicAdd(bc, 1);
    __syncthreads();
    if (my_block == gridDim.x-1){
        // I am last block
        for (int j = t; j < length; j += blockDim.x){
            val.x = d_in[j].x;
            val.y = d_in[j].y;
            for (int i = 1; i < gridDim.x; i++) {
                val.x += d_in[i*length+j].x;
                val.y += d_in[i*length+j].y;
            }
            d_in[j].x = val.x;
            d_in[j].y = val.y;
        }
    }
}
// host driver: launches cooperative grids over chunks of at most
// 2 * MaxNumBlocks vectors, then recurses on the per-chunk results
void StridedVectorReduce(double *d_in, int num_vectors, int length, int stride, int device)
{
    int numBlocksPerSm = 0;
    cudaDeviceProp deviceProp;
    gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
    gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, StridedVectorReduceKernel, MAX_BLOCK_SIZE, 0));
    int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
    int num_threads = std::min((length + 1) / 2, MAX_BLOCK_SIZE);
    dim3 dimBlock(num_threads, 1, 1);
    int chunks = (num_vectors + 2*MaxNumBlocks - 1)/(2*MaxNumBlocks);
    double *start;
    int num;
    for (int i = 0; i < chunks - 1; i++)
    {
        num = 2 * MaxNumBlocks;
        start = d_in + i * stride * num * length;
        void *kernelArgs[] = {(void *)&start, (void *)&length, (void *)&num, (void *)&stride};
        dim3 dimGrid(MaxNumBlocks, 1, 1);
        gpuErrchk(cudaLaunchCooperativeKernel((void *)StridedVectorReduceKernel, dimGrid, dimBlock, kernelArgs));
    }
    // last (possibly partial) chunk
    num = num_vectors - 2 * (chunks - 1) * MaxNumBlocks;
    start = d_in + (chunks - 1) * 2 * stride * MaxNumBlocks * length;
    void *kernelArgs[] = {(void *)&start, (void *)&length, (void *)&num, (void *)&stride};
    dim3 dimGrid((num+1)/2, 1, 1);
    gpuErrchk(cudaLaunchCooperativeKernel((void *)StridedVectorReduceKernel, dimGrid, dimBlock, kernelArgs));
    if (chunks == 1)
        return;
    // reduce the per-chunk results
    StridedVectorReduce(d_in, chunks, length, stride * 2 * MaxNumBlocks, device);
}
void SVR2(double *d_in, int num_vectors, int length, int stride, int device, int *bc)
{
    int numBlocksPerSm = 0;
    cudaDeviceProp deviceProp;
    gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
    gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, SVR2Kernel, MAX_BLOCK_SIZE, 0));
    int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
    int num_threads = std::min((length + 1) / 2, MAX_BLOCK_SIZE);
    dim3 dimBlock(num_threads, 1, 1);
    dim3 dimGrid(MaxNumBlocks, 1, 1);
    SVR2Kernel<<<dimGrid, dimBlock>>>(d_in, num_vectors, length, stride, bc);
}
void SVR3(double *d_in, int num_vectors, int length, int stride, int device, int *bc)
{
    int numBlocksPerSm = 0;
    cudaDeviceProp deviceProp;
    gpuErrchk(cudaGetDeviceProperties(&deviceProp, device));
    gpuErrchk(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, SVR3Kernel, MAX_BLOCK_SIZE, 0));
    int MaxNumBlocks = numBlocksPerSm * deviceProp.multiProcessorCount;
    int num_threads = std::min((length + 3) / 4, MAX_BLOCK_SIZE);
    dim3 dimBlock(num_threads, 1, 1);
    dim3 dimGrid(MaxNumBlocks, 1, 1);
    SVR3Kernel<<<dimGrid, dimBlock>>>(reinterpret_cast<double2 *>(d_in), num_vectors, length/2, stride, bc);
}
int main(int argc, char **argv)
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s num_vectors length\n", argv[0]);
        return 1;
    }
    int device = 0;
    gpuErrchk(cudaSetDevice(device));
    int num_vectors = atoi(argv[1]);
    int length = atoi(argv[2]);
    double *r = (double *)malloc(sizeof(double)*length);
    double *h_in = (double *)malloc(sizeof(double) * length * num_vectors);
    // each vector is [1, 2, ..., length/2, 0, ..., 0]; the zero half models the padding
    for (int k = 0; k < num_vectors; k++)
    {
        for (int i = 0; i < length / 2; i++)
            h_in[k * length + i] = i+1;
        for (int i = length / 2; i < length; i++)
            h_in[k * length + i] = 0.0;
    }
    double *d_in;
    gpuErrchk(cudaMalloc((void **)&d_in, sizeof(double) * length * num_vectors));
    gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
    unsigned long long dt = dtime_usec(0);
    StridedVectorReduce(d_in, num_vectors, length / 2, 2, device);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
    int start = (length / 2 < 100) ? 0 : length / 2 - 10;
    for (int i = start; i < length / 2; i++)
        printf("h_out: %i %f \n", i, r[i]);
    // output should be [1, 2, 3, ..., length / 2] * num_vectors (same for each run below)
    printf("duration 1: %f seconds\n", dt/(float)USECPSEC);
    gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
    int *bc;
    gpuErrchk(cudaMalloc(&bc, sizeof(int)));
    gpuErrchk(cudaMemset(bc, 0, sizeof(int)));
    dt = dtime_usec(0);
    SVR2(d_in, num_vectors, length, 2, device, bc);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
    for (int i = start; i < length / 2; i++)
        printf("h_out: %i %f \n", i, r[i]);
    printf("duration 2 (double): %f seconds\n", dt/(float)USECPSEC);
    gpuErrchk(cudaMemcpy(d_in, h_in, sizeof(double) * length * num_vectors, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemset(bc, 0, sizeof(int))); // reset the drain counter for the next run
    dt = dtime_usec(0);
    SVR3(d_in, num_vectors, length, 2, device, bc);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    gpuErrchk(cudaMemcpy(r, d_in, sizeof(double) * length, cudaMemcpyDeviceToHost));
    for (int i = start; i < length / 2; i++)
        printf("h_out: %i %f \n", i, r[i]);
    printf("duration 3 (double2): %f seconds\n", dt/(float)USECPSEC);
    free(h_in);
    free(r);
    gpuErrchk(cudaFree(bc));
    gpuErrchk(cudaFree(d_in));
    return 0;
}
// output with num_vectors = 100000 and length = 5000
h_out: 2490 249100000.000000
h_out: 2491 249200000.000000
h_out: 2492 249300000.000000
h_out: 2493 249400000.000000
h_out: 2494 249500000.000000
h_out: 2495 249600000.000000
h_out: 2496 249700000.000000
h_out: 2497 249800000.000000
h_out: 2498 249900000.000000
h_out: 2499 250000000.000000
duration 1: 0.009309 seconds
h_out: 2490 249100000.000000
h_out: 2491 249200000.000000
h_out: 2492 249300000.000000
h_out: 2493 249400000.000000
h_out: 2494 249500000.000000
h_out: 2495 249600000.000000
h_out: 2496 249700000.000000
h_out: 2497 249800000.000000
h_out: 2498 249900000.000000
h_out: 2499 250000000.000000
duration 2 (double): 0.003667 seconds
h_out: 2490 249100000.000000
h_out: 2491 249200000.000000
h_out: 2492 249300000.000000
h_out: 2493 249400000.000000
h_out: 2494 249500000.000000
h_out: 2495 249600000.000000
h_out: 2496 249700000.000000
h_out: 2497 249800000.000000
h_out: 2498 249900000.000000
h_out: 2499 250000000.000000
duration 3 (double2): 0.004646 seconds
This algorithm does work correctly and computes the desired reduction. However, I have no idea whether it is the right approach for this problem. I read about the Thrust strided iterators in some other posts, but that approach seems to be at least 2x slower than what I have here, since I have to make a call to thrust::reduce() for each element of the block (so O(10^3) calls). I also don't know if there is a way to have the IFFT output organized in a way that makes the reduction part easier.
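For completeness, here is a sketch of a single-call variant of the Thrust idea (my own functor names, untested on my side): it views the array in transposed order so that one thrust::reduce_by_key call produces all m output elements, though the permuted reads are poorly coalesced, so I would not expect it to beat a custom kernel.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// map element k of the transposed view to its row-major location,
// skipping the trash half of each padded block (2*m doubles per block)
struct transpose_index {
    long long V, m; // V = number of blocks, m = unpadded block size
    __host__ __device__ long long operator()(long long k) const {
        return (k % V) * 2 * m + (k / V); // block index, then offset in block
    }
};

// all V source elements of output j share the key j
struct row_key {
    long long V;
    __host__ __device__ long long operator()(long long k) const { return k / V; }
};

void ThrustStridedReduce(thrust::device_vector<double> &d_in,
                         thrust::device_vector<double> &d_out, // size m
                         long long V, long long m)
{
    auto cnt  = thrust::make_counting_iterator<long long>(0);
    auto keys = thrust::make_transform_iterator(cnt, row_key{V});
    auto vals = thrust::make_permutation_iterator(
        d_in.begin(), thrust::make_transform_iterator(cnt, transpose_index{V, m}));
    thrust::reduce_by_key(keys, keys + V * m, vals,
                          thrust::make_discard_iterator(), d_out.begin());
}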
Any guidance on this would be greatly appreciated! This is my first foray into GPU programming and my first ever question on Stack Overflow, so I apologize in advance for any mistakes on my part.
Edit: I updated the code to include a full test case. I realize I was not clear about how I would call the StridedVectorReduce function. As I have it now, length refers to the size of each padded block, num_vectors refers to the number of (padded) blocks, and stride refers to the step, in units of length, between the blocks we want to sum. So the function is called as StridedVectorReduce(d_in, num_vectors, length / 2, 2, 0), where length/2 is the size of the unpadded block, and there is a step of 2 unpadded block sizes between each of the blocks we want to sum. I haven't included the gpuErrchk() code so as not to make this too long (a typical definition is sketched at the end of this edit), but the code should work without it.
This code should have no errors under compute-sanitizer. I have also removed the std::ceil() calls and replaced them with integer ceiling division. I will try the grid-stride loop in the kernel next, though I am not sure how to use only one grid sync in the kernel; I thought we needed to sync between each level of the binary tree.
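For reference, a typical definition of the error-checking macro (the usual idiom; it would replace the stub #define gpuErrchk(x) x in the code above):

#include <cstdio>
#include <cstdlib>

// standard CUDA error-checking macro
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}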
Edit 2: Using the code from Robert Crovella's answer, I have updated the code above to include my implementation of their suggestion to use horizontal and vertical striding. There is no need to handle any strides other than 2 with the new algorithm, so that parameter is not actually used in the code. I also tried making a double2 version to see if it would do any better; it ends up doing a bit worse than the pure double version. I'm not sure if that's because lines like val.x = d_in[j].x; val.y = d_in[j].y; cause multiple memory reads. I wanted to create a temp variable and do something like double2 tmp = d_in[j] instead, but that caused compiler errors, since tmp is not volatile. Anyway, thanks a lot for the help, and sorry for the incompleteness of my previous code. This version should compile directly as written.
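One possible workaround for the volatile double2 load that I have not tried yet (an untested sketch using inline PTX; it assumes the pointer refers to global memory):

// untested sketch: single 16-byte volatile load of a double2 from global memory
static __device__ double2 LoadVolatileDouble2(const volatile double2 *p)
{
    double2 v;
    asm volatile("ld.volatile.global.v2.f64 {%0, %1}, [%2];"
                 : "=d"(v.x), "=d"(v.y)
                 : "l"(const_cast<const double2 *>(p)));
    return v;
}

A call like double2 val = LoadVolatileDouble2(d_in + j); would then replace the pair of component reads in SVR3Kernel.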