1

Edit: achievements over time are listed at the end of this question (~1 TFlops so far).

I'm writing a math library for C# that uses OpenCL (GPU) through a C++ DLL, and I have already done some optimizations on single-precision square matrix-matrix multiplication (for learning purposes and for possible reuse in a neural-network program later). The kernel code below takes the 1D array v1 as the rows of matrix 1 (1024x1024) and the 1D array v2 as the columns of matrix 2 (1024x1024, transpose optimization), and writes the result into the 1D array v3 as the rows of matrix 3 (1024x1024).

For now, the kernel execution time for a 1024x1024 square matrix-matrix multiplication is 3.6 ms on the HD7870.

Optimizations done:

  • Transposition of the second matrix (improved time; a host-side sketch is given after this list).
  • Computing in local memory using 32x32 sub-matrices (4x 16x16, because the maximum workgroup size is 256 on my HD7870 and the GPU does not accept more than 24 kB of local memory for some reason, although online sources say 64 kB?). (Improved time by a good margin anyway.)
  • Increasing data re-use with private variables before writing the result to local and global memory (improved time).
  • Column-major access to the local 2D arrays in the innermost loop (improved time).
  • Splitting the addition into two accumulator registers per patch (improved time, decreased numerical stability).
  • Loop-unrolling the innermost loop did not improve time (it even got worse after the 4th unroll), so the integer ALUs must already be relaxed.
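
For reference, the transposition itself is nothing special; one straightforward way is to do it on the host before uploading matrix 2, so the kernel can read its columns as contiguous rows of v2. A minimal sketch in plain C (helper name is hypothetical; arrays are row-major n x n):

/* Hypothetical helper: writes the transpose of src (row-major, n x n) into dst. */
void transposeSquare(const float *src, float *dst, int n)
{
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            dst[c * n + r] = src[r * n + c];
}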

Question: I could not finish some optimizations, such as eliminating all local (LDS) bank conflicts and re-ordering instructions to hide memory latency. What can I do to polish this math function's performance?

This kernel is certainly local-memory bandwidth (conflict) bound. At 3.6 ms per multiplication:

(1024*1024*1024 inner iterations) * (1 add + 1 mult = 2 flops) / 0.0036 seconds = 596x10^9 floating-point operations per second (596 GFlops). I saw an online CUDA benchmark on a GTX680 that has broken the 1 TFlops point. Is that because it has more local memory per compute unit, more cores, or both?

(1024*1024*1024) * (2 float reads) * (4 bytes per float) / 0.0036 sec = 2386x10^9 bytes per second. But this kernel reads 8 floats and uses them 16 times, which is a data re-use of 2 per float:

2386x10^9 bytes/s / re-use (2) = 1193 GB/s
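
(The same arithmetic in code form, in case someone wants to re-check the numbers; the constants are just the figures above.)

#include <stdio.h>

int main(void)
{
    const double n = 1024.0, t = 0.0036;        /* matrix size and measured kernel time in seconds */
    double flops = n * n * n * 2.0 / t;         /* 1 add + 1 mult per inner iteration */
    double bytes = n * n * n * 2.0 * 4.0 / t;   /* 2 float reads of 4 bytes per inner iteration */
    printf("%.0f GFlops, %.0f GB/s raw, %.0f GB/s after re-use of 2\n",
           flops / 1e9, bytes / 1e9, bytes / 2.0 / 1e9);
    return 0;
}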

Theoretical maxima for the HD7870 are given here (Appendix D):

Compute power = 2560 giga floating-point operations per second, LDS bandwidth = 2560 GB/s, and register access bandwidth = 15360 GB/s.

Here is the kernel:

__kernel void squareGpuMatrixMul(__global float * v1, __global float * v2, __global float * v3) 
{
    int localRow = get_local_id(0); 
    int localCol = get_local_id(1);  
    int selectRowFromA = get_group_id(0)*32;     
    int selectColFromB = get_group_id(1)*32;     
    int lid= localCol*16+localRow; 
    __local float Lcache1[ 16][ 16]; 
    __local float Lcache2[ 16][ 16]; 
    __local float Lcache3[ 16][ 16]; 

    __local float Lcache1a[ 16][ 16]; 
    __local float Lcache2a[ 16][ 16]; 
    __local float Lcache3a[ 16][ 16]; 

    __local float Lcache1b[ 16][ 16]; 
    __local float Lcache2b[ 16][ 16]; 
    __local float Lcache3b[ 16][ 16]; 

    __local float Lcache1c[ 16][ 16]; 
    __local float Lcache2c[ 16][ 16]; 
    __local float Lcache3c[ 16][ 16]; 

    float tmp0=0.0f; 
    float tmp1=0.0f; 
    float tmp2=0.0f; 
    float tmp3=0.0f; 

    float tmp4=0.0f; 
    float tmp5=0.0f; 
    float tmp6=0.0f; 
    float tmp7=0.0f; 

    float sumPatch=0.0f; 
    float sumPatcha=0.0f; 
    float sumPatchb=0.0f; 
    float sumPatchc=0.0f; 
    float sumPatch2=0.0f; 
    float sumPatcha2=0.0f; 
    float sumPatchb2=0.0f; 
    float sumPatchc2=0.0f; 

    barrier(CLK_LOCAL_MEM_FENCE); 
    Lcache3[localRow][localCol]=0.0f; 
    Lcache3a[localRow][localCol]=0.0f; 
    Lcache3b[localRow][localCol]=0.0f; 
    Lcache3c[localRow][localCol]=0.0f; 
    barrier(CLK_LOCAL_MEM_FENCE); 
    for(int i=0;i<1024;i+=32)  // this is A's row and B's column parsed by sub-matrices
    { 
        barrier(CLK_LOCAL_MEM_FENCE); 
        Lcache1[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024];
        Lcache2[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024];
        Lcache1a[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024+ 16];
        Lcache2a[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024+ 16];
        Lcache1b[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024+16384];
        Lcache2b[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024+16384];
        Lcache1c[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024+ 16+16384];
        Lcache2c[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024+ 16+16384];
        barrier(CLK_LOCAL_MEM_FENCE); 
        sumPatch=0.0f; 
        sumPatcha=0.0f; 
        sumPatchb=0.0f; 
        sumPatchc=0.0f; 
        sumPatch2=0.0f; 
        sumPatcha2=0.0f; 
        sumPatchb2=0.0f; 
        sumPatchc2=0.0f; 
        for(int kk=0;kk< 16;kk++) //this is sub-matrix multiplication
        {   
            read_mem_fence(CLK_LOCAL_MEM_FENCE); 
            tmp0=Lcache1[kk][localRow];  // row-major
            tmp1=Lcache1a[kk][localRow]; // accesses
            tmp2=Lcache1b[kk][localRow]; //to local memory
            tmp3=Lcache1c[kk][localRow]; 
            tmp4=Lcache2[kk][localCol]; 
            tmp5=Lcache2a[kk][localCol]; 
            tmp6=Lcache2b[kk][localCol]; 
            tmp7=Lcache2c[kk][localCol]; 
            read_mem_fence(CLK_LOCAL_MEM_FENCE); 
            sumPatch+=tmp0*tmp4; 
            sumPatcha+=tmp0*tmp6; 
            sumPatchb+=tmp2*tmp4; 
            sumPatchc+=tmp2*tmp6; 
            sumPatch2+=tmp1*tmp5; 
            sumPatcha2+=tmp1*tmp7; 
            sumPatchb2+=tmp3*tmp5; 
            sumPatchc2+=tmp3*tmp7; 
        } 
        Lcache3[localRow][localCol]+=sumPatch+sumPatch2; 
        Lcache3a[localRow][localCol]+=sumPatcha+sumPatcha2; 
        Lcache3b[localRow][localCol]+=sumPatchb+sumPatchb2; 
        Lcache3c[localRow][localCol]+=sumPatchc+sumPatchc2; 
    } 
    barrier(CLK_LOCAL_MEM_FENCE); 
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024]=Lcache3[localRow][localCol];                   
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+ 16]=Lcache3a[localRow][localCol];              
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+16384]=Lcache3b[localRow][localCol];     
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+ 16+16384]=Lcache3c[localRow][localCol];     
    barrier(CLK_LOCAL_MEM_FENCE); 
}

Here is what I've tried to eliminate bank conflicts, but the kernel execution time increased by around 20%:

for(int kk=0;kk< 16;kk++) 
{   
    int nc=(kk+lid)&15;//different for all local threads
                       //but does not exceed 0-15 range
                       //summation order is not important
                       //0.+1.+...15. or 14.+15.+0.+..13.
                       //gives correct answer
    read_mem_fence(CLK_LOCAL_MEM_FENCE); 
    tmp0=Lcache1[nc][localRow]; 
    tmp1=Lcache1a[nc][localRow]; 
    tmp2=Lcache1b[nc][localRow]; 
    tmp3=Lcache1c[nc][localRow]; 
    tmp4=Lcache2[nc][localCol]; 
    tmp5=Lcache2a[nc][localCol]; 
    tmp6=Lcache2b[nc][localCol]; 
    tmp7=Lcache2c[nc][localCol]; 
    read_mem_fence(CLK_LOCAL_MEM_FENCE);
    sumPatch+=tmp0*tmp4;
    sumPatcha+=tmp0*tmp6;
    sumPatchb+=tmp2*tmp4;
    sumPatchc+=tmp2*tmp6;
    sumPatch2+=tmp1*tmp5;
    sumPatcha2+=tmp1*tmp7;
    sumPatchb2+=tmp3*tmp5;
    sumPatchc2+=tmp3*tmp7;
} 

Could this be the broadcasting capability of newer GPUs? Also, does summing over 16 elements mean that only 16 banks are used? The device has 32 banks for local access.
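
One standard trick I have not tried yet is padding each local tile by one extra column, so that consecutive rows start in different LDS banks; whether it pays off here depends on which index actually varies across a wavefront, so treat this only as a sketch of the idea:

// Padded tiles: 17 floats per row instead of 16, which staggers the rows
// across banks; the extra column is never read or written by the math.
__local float Lcache1[16][16 + 1];
__local float Lcache2[16][16 + 1];
// ... fill the tiles exactly as before ...
for (int kk = 0; kk < 16; kk++)
{
    tmp0 = Lcache1[kk][localRow];   // same indexing, different bank mapping
    tmp4 = Lcache2[kk][localCol];
    sumPatch += tmp0 * tmp4;
    // ... remaining tiles/accumulators as in the original loop ...
}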

Here is what I've tried to hide memory latency:

for(int kk=0;kk< 16;kk++) 
{   
    int nc=(kk+lid)&15;//different for all local threads
                       //but does not exceed 0-15 range
                       //summation order is not important
                       //0.+1.+...15. or 14.+15.+0.+..13.
                       //gives correct answer
    read_mem_fence(CLK_LOCAL_MEM_FENCE); 
    tmp0=Lcache1[nc][localRow]; 
    tmp4=Lcache2[nc][localCol];
    sumPatch+=tmp0*tmp4; 
    tmp6=Lcache2b[nc][localCol];
    sumPatcha+=tmp0*tmp6; 
    tmp1=Lcache1a[nc][localRow];
    tmp7=Lcache2c[nc][localCol]; 
    sumPatcha2+=tmp1*tmp7; 
    tmp5=Lcache2a[nc][localCol];
    sumPatch2+=tmp1*tmp5; 
    tmp2=Lcache1b[nc][localRow]; 
    sumPatchb+=tmp2*tmp4;
    sumPatchc+=tmp2*tmp6; 
    tmp3=Lcache1c[nc][localRow]; 
    sumPatchb2+=tmp3*tmp5;
    sumPatchc2+=tmp3*tmp7;  
    read_mem_fence(CLK_LOCAL_MEM_FENCE);//this lines' position does not change time 
}

But this did not increase or decrease the execution time.

How can I improve the kernel time? Is it doable?

Device: HD7870 @ 1000 MHz / 1200 MHz. Host: FX8150 @ 4 GHz. Headers and LIB files are from Khronos's site, opencl.dll from AMD's drivers.

Time sampling is done by cycling the kernel 100 times and dividing the total time by 100.0, using a Stopwatch's start() and stop() methods. Only execution is timed; array copies are not included.

All results are compared against a naive 3-nested-loop version with the same random-matrix inputs (results are within m(ij) +/- delta, where delta is 0.001f).
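
(The naive reference and the tolerance check look roughly like this; helper names are hypothetical and everything is row-major.)

#include <math.h>

/* Naive O(n^3) reference: C = A*B for row-major n x n float matrices. */
void naiveMatMul(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Element-wise comparison of the GPU result against the reference, +/- delta. */
int resultsMatch(const float *ref, const float *gpu, int n, float delta)
{
    for (int i = 0; i < n * n; i++)
        if (fabsf(ref[i] - gpu[i]) > delta)
            return 0;
    return 1;
}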

The kernel here is a simplified version of a more general one (for different matrix and patch sizes).

Kernel parameters of this version: Global = 512,512; Local = 16,16; Reference (offset) = 0,0.
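
(For completeness, the host-side enqueue for these parameters looks roughly like this; the queue and kernel variable names are hypothetical.)

size_t offset[2]     = { 0, 0 };
size_t globalSize[2] = { 512, 512 };   // each work-item writes 4 elements of the 1024x1024 result
size_t localSize[2]  = { 16, 16 };     // one 16x16 workgroup per 32x32 output patch
cl_int err = clEnqueueNDRangeKernel(queue, squareGpuMatrixMulKernel,
                                    2, offset, globalSize, localSize,
                                    0, NULL, NULL);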

For an 8320x8320 matrix ---> Global = 4160,4160; Local = 16,16; ref = 0,0; time = 1.87 seconds.

Edit: Replacing the local Lcache3 with a private version, as suggested by DarkZeros, improved the 1024x1024 time to 2.7 ms. That is 795 GFlops. This must come from a better occupancy ratio.
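
The change amounts to keeping each work-item's four result elements in registers instead of in the Lcache3* tiles (each work-item only ever touched its own [localRow][localCol] slot anyway). A rough sketch of the idea, not the exact generalized kernel:

// Private accumulators replace Lcache3, Lcache3a, Lcache3b, Lcache3c.
float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
for (int i = 0; i < 1024; i += 32)
{
    // ... load the Lcache1*/Lcache2* tiles and compute sumPatch* as before ...
    acc0 += sumPatch  + sumPatch2;
    acc1 += sumPatcha + sumPatcha2;
    acc2 += sumPatchb + sumPatchb2;
    acc3 += sumPatchc + sumPatchc2;
}
v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024]          = acc0;
v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+16]       = acc1;
v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+16384]    = acc2;
v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+16+16384] = acc3;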

Edit 2: Lower local-memory usage opened the possibility of using 48x48 (9 x 16x16) patches, which brought a 1056x1056 multiplication down to 2.4 ms ----> 981 GFlops. An 8208x8208 multiplication is done in 961 ms, which is more than 1150 GFlops.

huseyin tugrul buyukisik
  • Compiler will already take this into account, so you are manually doing the compiler optimization. It is logical that you don't see a gain. – DarkZeros Aug 08 '13 at 22:13
  • Why do you repeat the kk loop inside the i loop? I think "i" does not affect in the last loop and it could be taken out with some minor modifications. This could lead to 32x speed gain. – DarkZeros Aug 08 '13 at 22:54
  • The "i" selects 32 blocks to be multiplied and results are accumulated on a target block. 32 blocks for each sub-matrices-row and col. Its like a parsing through a row of A and a col of B. – huseyin tugrul buyukisik Aug 09 '13 at 11:32

1 Answer

2

Why so many fences? In fact, I think you do not even need them at all. You only need a fence when a value that one thread writes to local memory will be read by another thread, not when a thread reads and writes its own local memory.

BTW, fences are much better than barriers. A barrier forces the threads to be in sync, and this kills performance in some cases.

I think you can rewrite your code to gain quite a lot in speed by changing the memory access model.

You can try whether this works better (I made many obvious optimizations without knowing exactly what your code is doing):

__kernel void squareGpuMatrixMul(__global float * v1, __global float * v2, __global float * v3) 
{
    int localRow = get_local_id(0); 
    int localCol = get_local_id(1);  
    int selectRowFromA = get_group_id(0)*32;     
    int selectColFromB = get_group_id(1)*32;     
    int lid= localCol*16+localRow; 
    __local float Lcache1[ 16][ 16]; 
    __local float Lcache2[ 16][ 16]; 
    __local float Lcache3[ 16][ 16]; 

    __local float Lcache1a[ 16][ 16]; 
    __local float Lcache2a[ 16][ 16]; 
    __local float Lcache3a[ 16][ 16]; 

    __local float Lcache1b[ 16][ 16]; 
    __local float Lcache2b[ 16][ 16]; 
    __local float Lcache3b[ 16][ 16]; 

    __local float Lcache1c[ 16][ 16]; 
    __local float Lcache2c[ 16][ 16]; 
    __local float Lcache3c[ 16][ 16]; 

    float tmp0=0.0f; 
    float tmp1=0.0f; 
    float tmp2=0.0f; 
    float tmp3=0.0f; 

    float tmp4=0.0f; 
    float tmp5=0.0f; 
    float tmp6=0.0f; 
    float tmp7=0.0f; 

    float sumPatch=0.0f; 
    float sumPatcha=0.0f; 
    float sumPatchb=0.0f; 
    float sumPatchc=0.0f; 
    float sumPatch2=0.0f; 
    float sumPatcha2=0.0f; 
    float sumPatchb2=0.0f; 
    float sumPatchc2=0.0f; 

    Lcache3[localRow][localCol]=0.0f; 
    Lcache3a[localRow][localCol]=0.0f; 
    Lcache3b[localRow][localCol]=0.0f; 
    Lcache3c[localRow][localCol]=0.0f; 
    for(int i=0;i<1024;i+=32)  // this is A's row and B's column parsed by sub-matrices
    { 
        Lcache1[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024];
        Lcache2[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024];
        Lcache1a[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024+ 16];
        Lcache2a[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024+ 16];
        Lcache1b[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024+16384];
        Lcache2b[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024+16384];
        Lcache1c[localCol][localRow]=v1[selectRowFromA*1024+i+localCol+localRow*1024+ 16+16384];
        Lcache2c[localRow][localCol]=v2[selectColFromB*1024+i+localRow+localCol*1024+ 16+16384];
        mem_fence(CLK_LOCAL_MEM_FENCE);  
        sumPatch=0.0f; 
        sumPatcha=0.0f; 
        sumPatchb=0.0f; 
        sumPatchc=0.0f; 
        sumPatch2=0.0f; 
        sumPatcha2=0.0f; 
        sumPatchb2=0.0f; 
        sumPatchc2=0.0f; 
        for(int kk=0;kk< 16;kk++) //this is sub-matrix multiplication
        {   
            tmp0=Lcache1[kk][localRow];  // row-major
            tmp1=Lcache1a[kk][localRow]; // accesses
            tmp2=Lcache1b[kk][localRow]; //to local memory
            tmp3=Lcache1c[kk][localRow]; 
            tmp4=Lcache2[kk][localCol]; 
            tmp5=Lcache2a[kk][localCol]; 
            tmp6=Lcache2b[kk][localCol]; 
            tmp7=Lcache2c[kk][localCol]; 
            sumPatch+=tmp0*tmp4; 
            sumPatcha+=tmp0*tmp6; 
            sumPatchb+=tmp2*tmp4; 
            sumPatchc+=tmp2*tmp6; 
            sumPatch2+=tmp1*tmp5; 
            sumPatcha2+=tmp1*tmp7; 
            sumPatchb2+=tmp3*tmp5; 
            sumPatchc2+=tmp3*tmp7; 
        } 
        Lcache3[localRow][localCol]+=sumPatch+sumPatch2; 
        Lcache3a[localRow][localCol]+=sumPatcha+sumPatcha2; 
        Lcache3b[localRow][localCol]+=sumPatchb+sumPatchb2; 
        Lcache3c[localRow][localCol]+=sumPatchc+sumPatchc2; 
    } 
    mem_fence(CLK_LOCAL_MEM_FENCE); 
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024]=Lcache3[localRow][localCol];                   
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+ 16]=Lcache3a[localRow][localCol];              
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+16384]=Lcache3b[localRow][localCol];     
    v3[selectRowFromA*1024+selectColFromB+localCol+localRow*1024+ 16+16384]=Lcache3c[localRow][localCol];     

}
DarkZeros
  • I agree with you that there are too many useless barriers, but in your code the first mem_fence should be replaced by a barrier. And IMO the second mem_fence is useless. Also, this statement "You only need a fence when a value that one thread writes to local memory will be read by another thread" is incorrect too. In that case, you need to synchronize your threads, hence a barrier. Also, "fences are much better than barriers" doesn't make sense: fences order loads/stores at the **workitem level**, barriers sync threads at the **workgroup level**. They are 2 different things, **they are not interchangeable**. – CaptainObvious Aug 09 '13 at 09:25
  • Performance decreased by ~100GFlops when I commented out as in your example. – huseyin tugrul buyukisik Aug 09 '13 at 11:25
  • +1 for the unnecessary barriers anyway. But those fences seemed to be making coalesced access. The outermost barriers were too many, decreased them. – huseyin tugrul buyukisik Aug 09 '13 at 11:31
  • Deleting outermost barriers improved 0.1ms, this is nearly 20GFlops. Thanks. But main problem is my card has 32 channels and I am using 16 of them in the innermost loop. I will try 64x16 block later but that will have bugs without too much attention. – huseyin tugrul buyukisik Aug 09 '13 at 11:50
  • The key is how to access the data. I think you can switch to private memory, since using local all the time is not really a good choise. Could you write your algorithm in plain C? This way we could see how to optimally implement it. For example Lcache3 is only used by each thread, use private memory there. – DarkZeros Aug 09 '13 at 13:37
  • @CaptainObvious I agree what you said. But when the standard says that fences orders the workitem load/stores, it does it for all the workitems that pass that fence. This will act perfectly if all the threads will go in sync. However this is unsure and then a barrier is needed. So, yes, a barrier is needed in this case. However fences are more efficient if they can be used. – DarkZeros Aug 09 '13 at 14:01
  • Lcache3 is written by 16x16 threads so making 1 private for each would decrease local-accesses by (1/16 inner loop iterations)*(1/8 reads per iteration) that would release some local space which is very important for me, thank you, now I can continue to 48x48 blocks which must give %20-%30 performance at least. – huseyin tugrul buyukisik Aug 09 '13 at 14:46
  • @DarkZeros OMG!!! 2.9ms now this is 740GFLOPS with just private result instead of local for same 32x32 patch. Surely will go over 900GFLOPS after 48x48. This increase because of increased occupation because less locals per thread wave ? – huseyin tugrul buyukisik Aug 09 '13 at 15:36
  • The private memory is much faster than local memory. And if you don't use it it will sit there unused. However if you abuse of private memory, the compiler will automatically jump to local/global memory if needed. So, as long as you can, use private memory. See if you can move something more to private. – DarkZeros Aug 09 '13 at 15:52
  • @DarkZeros thank you, your suggestion let me do 48x48 blocks to get 981 GFlops/s . Still trying to replace one of 9 16x16 patches by private. – huseyin tugrul buyukisik Aug 09 '13 at 17:18