I have five large arrays, A(N*5), B(N*5), C(N*5), D(N*5) and E(N*2), where the numbers 5 and 2 represent the components of these variables in different planes/axes. I have structured the arrays this way so I can visualize the data while writing my code. N ~ 200^3 ~ 8e06 nodes.
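To make the layout concrete, this is the indexing convention I use inside the kernels (illustration only; the helper `component` is not part of my real code):

// component a (0..4) of node idx lives at A[idx*5 + a]; E has only two
// components per node, at E[idx*2 + 0] and E[idx*2 + 1]
__device__ double component(const double *arr, int idx, int a, int ncomp)
{
    return arr[idx*ncomp + a];   // e.g. component(A, idx, a, 5) or component(E, idx, b, 2)
}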
For example, this is what my kernel looks like in its simplest form, where all the calculations are done on global memory:
#define N (200*200*200)
__global__ void kernel(double *A, double *B, double *C,
                       double *D, double *E, double *res1, double *res2,
                       double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= N) { return; }

    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A[idx*5+a]*B[idx*5+a] + C[idx*5+a];
        res2[idx] += D[idx*5+a]*C[idx*5+a] + E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a] - C[idx*5+a];
        res4[idx] += C[idx*5+a]*E[idx*2+1] - D[idx*5+a];
    }
}
I know the "for" loop can be eliminated, but I have left it in because it makes the code easier to read. This works, but it is obviously extremely inefficient and slow on a Tesla K40 card, even after removing the "for" loop. The arithmetic shown inside the loop is just to give an idea; the actual calculations are much longer and more convoluted, with res1, res2, ... also getting into the mix.
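For reference, by "eliminating the loop" I mean something along these lines: accumulating in local variables and writing each result once, instead of updating res1[idx], ..., in global memory on every iteration. This is only a sketch of the simple kernel above (the name kernel_unrolled is just a placeholder), not my full calculation:

__global__ void kernel_unrolled(const double *A, const double *B, const double *C,
                                const double *D, const double *E, double *res1,
                                double *res2, double *res3, double *res4)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= N) { return; }

    // keep the running sums in registers; E does not depend on a, so read it once
    double r1 = 0., r2 = 0., r3 = 0., r4 = 0.;
    double e0 = E[idx*2 + 0], e1 = E[idx*2 + 1];

    #pragma unroll
    for (int a = 0; a < 5; a++)
    {
        double c = C[idx*5 + a], d = D[idx*5 + a];
        r1 += A[idx*5 + a]*B[idx*5 + a] + c;
        r2 += d*c + e0;
        r3 += e0*d - c;
        r4 += c*e1 - d;
    }

    // single store per result
    res1[idx] = r1; res2[idx] = r2; res3[idx] = r3; res4[idx] = r4;
}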
I have implemented the following, with limited improvement, but I would like to improve it further by loading more of the data into shared memory.
#define THREADS_PER_BLOCK 256
__global__ void kernel_shared(double *A, double *B, double *C,
                              double *D, double *E, double *res1, double *res2,
                              double *res3, double *res4)
{
    int a, idx = threadIdx.x + blockIdx.x * blockDim.x;
    int ix = threadIdx.x;

    __shared__ double A_sh[5*THREADS_PER_BLOCK];
    __shared__ double B_sh[5*THREADS_PER_BLOCK];
    __shared__ double C_sh[5*THREADS_PER_BLOCK];
    __shared__ double D_sh[5*THREADS_PER_BLOCK];
    __shared__ double E_sh[2*THREADS_PER_BLOCK];

    // Of course this will not work with all arrays in shared memory,
    // so I am allowed to put any 2 or 3 variables (A's and B's) of
    // my choice in shared memory and leave the rest in global memory.

    if (idx < N) {
        for (int a = 0; a < 5; a++)
        {
            A_sh[ix*5 + a] = A[idx*5 + a];
            B_sh[ix*5 + a] = B[idx*5 + a];
        }
    }
    __syncthreads();

    if (idx >= N) { return; }

    res1[idx] = 0.; res2[idx] = 0.;
    res3[idx] = 0.; res4[idx] = 0.;

    for (a = 0; a < 5; a++)
    {
        res1[idx] += A_sh[ix*5+a]*B_sh[ix*5+a] + C[idx*5+a];
        res2[idx] += B_sh[ix*5+a]*C[idx*5+a]  + E[idx*2+0];
        res3[idx] += E[idx*2+0]*D[idx*5+a]    - C[idx*5+a];
        res4[idx] += B_sh[ix*5+a]*E[idx*2+1]  - D[idx*5+a];
    }
}
This helps a little, but I would like to implement one of those reduction approaches (without bank conflicts) to improve performance further, where I can put all my variables in shared memory (perhaps a tiling approach) and then do the calculation part. I saw the reduction example in the CUDA samples folder, but that example only sums over a single vector in shared memory, without any complex arithmetic over multiple arrays held in shared memory. I would appreciate any help or suggestions on how to extend my existing kernel_shared approach to include such a reduction/tiling approach.
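To be concrete about what I have in mind for a tiled version, something like the sketch below: a component-major layout for the shared-memory tiles (which I think keeps neighbouring threads on neighbouring banks, though I am not sure), and the partial sums kept in registers. The name kernel_tiled and the TILE size are just placeholders, and the block is assumed to be launched with TILE threads:

#define TILE 256   // placeholder; must equal the block size at launch

__global__ void kernel_tiled(const double *A, const double *B, const double *C,
                             const double *D, const double *E, double *res1,
                             double *res2, double *res3, double *res4)
{
    // component-major tiles: consecutive threads touch consecutive doubles
    __shared__ double A_sh[5][TILE];
    __shared__ double B_sh[5][TILE];

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int ix  = threadIdx.x;

    if (idx < N) {
        for (int a = 0; a < 5; a++) {
            A_sh[a][ix] = A[idx*5 + a];   // the global read is still strided here
            B_sh[a][ix] = B[idx*5 + a];
        }
    }
    __syncthreads();
    if (idx >= N) { return; }

    double r1 = 0., r2 = 0., r3 = 0., r4 = 0.;
    double e0 = E[idx*2 + 0], e1 = E[idx*2 + 1];

    for (int a = 0; a < 5; a++)
    {
        double c = C[idx*5 + a], d = D[idx*5 + a];
        r1 += A_sh[a][ix]*B_sh[a][ix] + c;
        r2 += B_sh[a][ix]*c + e0;
        r3 += e0*d - c;
        r4 += B_sh[a][ix]*e1 - d;
    }

    res1[idx] = r1; res2[idx] = r2; res3[idx] = r3; res4[idx] = r4;
}

The component-major ordering of the tiles is just my guess at avoiding bank conflicts; is something along these lines the right direction for my case?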