I have five large arrays, A(N*5), B(N*5), C(N*5), D(N*5) and E(N*2), where the numbers 5 and 2 represent the components of these variables in different planes/axes. I have structured the arrays this way so I can visualize the data while writing my code. N ~ 200^3 ~ 8e06 nodes.
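To make the layout concrete, this is the indexing convention I use inside the kernels (illustration only; the helper `component` is not part of my real code):

// component a (0..4) of node idx lives at A[idx*5 + a]; E has only two
// components per node, at E[idx*2 + 0] and E[idx*2 + 1]
__device__ double component(const double *arr, int idx, int a, int ncomp)
{
    return arr[idx*ncomp + a];   // e.g. component(A, idx, a, 5) or component(E, idx, b, 2)
}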
For example, this is what my kernel looks like in its simplest form, where all the calculations are done on global memory:
#define N (200*200*200)
__global__ void kernel(double *A, double *B, double *C,
                       double *D, double *E, double *res1, double *res2,
                       double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= N) { return; }

    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A[idx*5+a]*B[idx*5+a] + C[idx*5+a];
        res2[idx] += D[idx*5+a]*C[idx*5+a] + E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a] - C[idx*5+a];
        res4[idx] += C[idx*5+a]*E[idx*2+1] - D[idx*5+a];
    }
}
I know the "for" loop can be eliminated, but I have left it in because it makes the code easier to read. This works, but it is obviously extremely inefficient and slow on a Tesla K40 card, even after removing the "for" loop. The arithmetic shown inside the loop is just to give an idea; the actual calculations are much longer and more convoluted, with res1, res2, ... also getting into the mix.
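For reference, by "eliminating the loop" I mean something along these lines: accumulating in local variables and writing each result once, instead of updating res1[idx], ..., in global memory on every iteration. This is only a sketch of the simple kernel above (the name kernel_unrolled is just a placeholder), not my full calculation:

__global__ void kernel_unrolled(const double *A, const double *B, const double *C,
                                const double *D, const double *E, double *res1,
                                double *res2, double *res3, double *res4)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= N) { return; }

    // keep the running sums in registers; E does not depend on a, so read it once
    double r1 = 0., r2 = 0., r3 = 0., r4 = 0.;
    double e0 = E[idx*2 + 0], e1 = E[idx*2 + 1];

    #pragma unroll
    for (int a = 0; a < 5; a++)
    {
        double c = C[idx*5 + a], d = D[idx*5 + a];
        r1 += A[idx*5 + a]*B[idx*5 + a] + c;
        r2 += d*c + e0;
        r3 += e0*d - c;
        r4 += c*e1 - d;
    }

    // single store per result
    res1[idx] = r1; res2[idx] = r2; res3[idx] = r3; res4[idx] = r4;
}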
I have implemented the following, with limited improvement, but I would like to improve it further by loading more of the data into shared memory.
#define THREADS_PER_BLOCK 256
__global__ void kernel_shared(double *A, double *B, double *C,
                              double *D, double *E, double *res1, double *res2,
                              double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    int ix = threadIdx.x;

    __shared__ double A_sh[5*THREADS_PER_BLOCK];
    __shared__ double B_sh[5*THREADS_PER_BLOCK];
    __shared__ double C_sh[5*THREADS_PER_BLOCK];
    __shared__ double D_sh[5*THREADS_PER_BLOCK];
    __shared__ double E_sh[2*THREADS_PER_BLOCK];

    // Of course this will not work with all arrays in shared memory,
    // so I am allowed to put any 2 or 3 variables (A's and B's) of
    // my choice in shared memory and leave the rest in global memory.

    if (idx < N) {
        for (int a = 0; a < 5; a++)
        {
            A_sh[ix*5 + a] = A[idx*5 + a];
            B_sh[ix*5 + a] = B[idx*5 + a];
        }
    }
    __syncthreads();

    if (idx >= N) { return; }

    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A_sh[ix*5+a]*B_sh[ix*5+a] + C[idx*5+a];
        res2[idx] += B_sh[ix*5+a]*C[idx*5+a]  + E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a]    - C[idx*5+a];
        res4[idx] += B_sh[ix*5+a]*E[idx*2+1]  - D[idx*5+a];
    }
}
This helps a little, but I would like to implement one of those reduction approaches (without bank conflicts) to improve performance further, where I can put all my variables in shared memory (perhaps a tiling approach) and then do the calculation part. I saw the reduction example in the CUDA samples folder, but that example only sums over a single vector in shared memory, without any complex arithmetic over multiple arrays held in shared memory. I would appreciate any help or suggestions on how to extend my existing kernel_shared approach to include such a reduction/tiling approach.
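To be concrete about what I have in mind for a tiled version, something like the sketch below: a component-major layout for the shared-memory tiles (which I think keeps neighbouring threads on neighbouring banks, though I am not sure), and the partial sums kept in registers. The name kernel_tiled and the TILE size are just placeholders, and the block is assumed to be launched with TILE threads:

#define TILE 256   // placeholder; must equal the block size at launch

__global__ void kernel_tiled(const double *A, const double *B, const double *C,
                             const double *D, const double *E, double *res1,
                             double *res2, double *res3, double *res4)
{
    // component-major tiles: consecutive threads touch consecutive doubles
    __shared__ double A_sh[5][TILE];
    __shared__ double B_sh[5][TILE];

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int ix  = threadIdx.x;

    if (idx < N) {
        for (int a = 0; a < 5; a++) {
            A_sh[a][ix] = A[idx*5 + a];   // the global read is still strided here
            B_sh[a][ix] = B[idx*5 + a];
        }
    }
    __syncthreads();
    if (idx >= N) { return; }

    double r1 = 0., r2 = 0., r3 = 0., r4 = 0.;
    double e0 = E[idx*2 + 0], e1 = E[idx*2 + 1];

    for (int a = 0; a < 5; a++)
    {
        double c = C[idx*5 + a], d = D[idx*5 + a];
        r1 += A_sh[a][ix]*B_sh[a][ix] + c;
        r2 += B_sh[a][ix]*c + e0;
        r3 += e0*d - c;
        r4 += B_sh[a][ix]*e1 - d;
    }

    res1[idx] = r1; res2[idx] = r2; res3[idx] = r3; res4[idx] = r4;
}

The component-major ordering of the tiles is just my guess at avoiding bank conflicts; is something along these lines the right direction for my case?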