
I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication improves dramatically if the matrices have a "large" number of zeros in them, to the point that it beats its cuBLAS cousin by around 100 times. This could most probably be attributed to some automatic detection of sparsity, and a suitable format conversion, by the cblas_sgemm function.

Unfortunately, its CUDA counterpart, cublasSgemm, exhibits no such behavior.

So the question is: how can I get the same kind of optimization in cublasSgemm for matrices that may have a large number of zeros?

And what technique does cblas_sgemm use to automatically adjust to sparse matrices?

Please do not recommend cuSPARSE / CUSP etc., because:

  1. I am not sure about the sparsity of the input matrices beforehand.
  2. I am working on an iterative algorithm where the matrices may be sparse for the first few iterations but gradually become dense as time goes on.

Thanks in advance

Edited to include code that reproduces the above scenario:

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
using namespace std;

int main()
{
const int m = 5000; 

timespec blas_start, blas_end, cublas_start, cublas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m

// input matrix A: every 32nd element is non-zero
for(i = 0; i < m; i++)
{       
    A[i] = 0;
    if( i % 32 == 0)    //adjust for sparsity
        A[i] = i;
}

// input matrix B: identity matrix
// (symmetric, so col-major and row-major layouts coincide)
for(i = 0; i < m; i++)
    for(j = 0; j < m; j++)
    {
        if (i==j)           
            B[j*m + i] = 1;
        else
            B[j*m + i] = 0;     
    }

clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1, A, m, B, m, 0, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
/*
for(i = 0; i < 12; i++)
    printf("%f ", C[i]);
*/

//cublas section

cudaError_t cudaStat;   
cublasHandle_t handle;
cublasCreate(&handle);
//Declaring Device Variables
float *A_d, *B_d, *C_d;

//Allocating Memory for Device Variables
cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for A_d\n");

cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for B_d\n");

cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for C_d\n");

// Moving values of A, B onto Device variables
cublasSetVector(m, sizeof(float), A, 1, A_d, 1);
cublasSetMatrix(m, m, sizeof(float), B, m, B_d, m); 

// Do the actual multiplication
float alpha = 1.0f, beta = 0.0f;
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_start);

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, m, m, &alpha, A_d, 1, B_d, m, &beta, C_d, 1);

cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_end);

cublasGetVector(m, sizeof(float), C, 1, C_d, 1);
/*
for(i = 0; i < 12; i++)
    printf("%f ", C[i]);
*/

// Print times
// blas time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
    totalnsec += 1e9;
    totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"BLAS Time = "<< totaltime << "\n";

//cublas
totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
if(totalnsec < 0)
{
    totalnsec += 1e9;
    totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"CUBLAS Time = "<< totaltime << "\n";

// Cleanup
cublasDestroy(handle);
cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
delete[] A; delete[] B; delete[] C;

return 0;
}

Ran it and got the following results:

malang@ubuntu:~/uas/stackoverflow$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
malang@ubuntu:~/uas/stackoverflow$ ./blascomp.o
BLAS Time = 0.000964504
CUBLAS Time = 0.0365322

EDIT

Edited after @Eric's answer.

Using cublasSgemv has greatly enhanced the performance on the GPU. But I still have the problem of cblas_sgemm being much more efficient for sparse matrices on the CPU. What could be the possible reasons?

EDIT

Executed the following commands at the suggestion of @Eric, @osgx and @Robert Crovella:

erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
    linux-gate.so.1 =>  (0xb76f6000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
    /lib/ld-linux.so.2 (0xb76f7000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)

erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root   26 Mar 13  2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root   27 Mar 13  2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root   29 Mar 13  2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root   29 Mar 13  2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 Mar 13  2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root   27 Mar 13  2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root   28 Mar 13  2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root   30 Mar 13  2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root   32 Mar 13  2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf
1 Answer


Your code has a problem: you are using the wrong BLAS API. You are calling the matrix-matrix multiplication routine gemm() to perform a vector-matrix multiplication.

For vec-mat-mul or mat-vec-mul you should use gemv(). Of course gemm() can give the correct result for a matrix with only one row, but this is a corner case that gemv() is meant to handle, so you may not get peak performance on the GPU and/or CPU.

You could change to gemv() and benchmark again.
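
For reference, here is a minimal sketch of the gemv() equivalents of the sgemm() calls in the question (the same calls the asker later confirms in the comments). Note that B is the identity matrix in this test, so it is symmetric and the transpose setting does not change the result; for a general B, a row vector times a matrix corresponds to a transposed gemv.

// CPU: C (1 x m) = A (1 x m) * B (m x m), with the row-major buffers from the question
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);

// GPU: the same operation with the device buffers from the question
cublasSgemv(handle, CUBLAS_OP_N, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1);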


EDIT

Here's my benchmark result with single-threaded MKL. The values of A and B are the same as in your code. I cannot reproduce your CPU result of 0.000964504 s. You could check the correctness of your result vector; there's a chance that your cblas library has a bug.

Using gemm()

BLAS Time = 0.0169784
CUBLAS Time = 0.00356155

Using gemv()

BLAS Time = 0.0167557
CUBLAS Time = 0.0013809

EDIT2

I can now reproduce the FAST result on Ubuntu 14.04 with the package libblas-dev.

The reason is explained in the following question:

cblas gemm time dependent on input matrix values - Ubuntu 14.04

In that particular version of BLAS, there is code that checks for zero elements. The checking cost is O(n^2), so it is worth doing for matrix-matrix multiplication, whose cost is O(n^3).
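
As an illustration (a sketch of the loop structure, not the library's exact source), the reference implementation computes C column by column and tests each element of B before doing the corresponding column update, so a single zero element skips O(n) work; beta scaling and the transpose cases are omitted here:

// C = alpha*A*B + beta*C, column-major, no transposes; beta handling omitted
for (int j = 0; j < n; ++j)
    for (int l = 0; l < k; ++l) {
        float b = B[j*ldb + l];
        if (b != 0.0f) {                    // the O(n^2) zero check
            float temp = alpha * b;
            for (int i = 0; i < m; ++i)     // O(n) work skipped per zero
                C[j*ldc + i] += temp * A[l*lda + i];
        }
    }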

For GPU gemm(), since the order of computation is different (block by block instead of line by line), such an optimization may not be feasible. But it is doable for GPU gemv(), where the time spent loading the matrix from global memory could be saved.
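
To illustrate that last point, here is a hypothetical CUDA kernel (my own sketch, not a cuBLAS API) computing y = A*x with one thread per row of a column-major m x n matrix A. The test on x[j] means that columns of A multiplied by a zero are never loaded from global memory:

__global__ void sgemv_skip_zeros(int m, int n, const float* A, int lda,
                                 const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    float acc = 0.0f;
    for (int j = 0; j < n; ++j) {
        float xj = x[j];          // broadcast read, served from cache across the warp
        if (xj != 0.0f)           // skip loading column j of A entirely
            acc += A[j * lda + row] * xj;
    }
    y[row] = acc;
}

// Possible launch: sgemv_skip_zeros<<<(m + 255) / 256, 256>>>(m, n, A_d, m, x_d, y_d);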

  • Replaced with "cublasSgemv(handle, CUBLAS_OP_N, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1);". For dense vector A: BLAS Time = 0.0291584, CUBLAS Time = 0.00639274, which is good, BUT.... – malang Jun 25 '16 at 20:56
  • You are right. blas -> cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1, B, m, A, 1, 0, C, 1); and cublas -> cublasSgemv(handle, CUBLAS_OP_N, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1); Results: BLAS Time = 0.0300957, CUBLAS Time = 0.00634591 – malang Jun 25 '16 at 22:26
  • You mean never replace gemv with gemm. – kangshiyin Jun 25 '16 at 22:50
  • Yes, never replace gemv with gemm. – malang Jun 25 '16 at 23:31
  • could you please verify the arguments of this call to be correct for A=1xm, B=mxm and C=1xm? cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m); I mean could it be possible that I might be accessing out of bounds memory? – malang Jun 25 '16 at 23:35
  • Use of cublasSgemv has greatly enhanced the performance on the GPU. But, I still have this problem of cblas_sgemm being much more efficient for **sparse matrices**. What could be the possible reasons? – malang Jun 26 '16 at 18:55
  • @RobertCrovella Would you PLEASE run the code at http://stackoverflow.com/questions/38042451/cblas-gemm-time-dependent-on-input-matrix-values?noredirect=1#comment63524988_38042451 just once. I will be EXTREMELY GRATEFUL. Please, could you give a few hints for the possible reasons for this bizarre behaviour. I have tested this code on another computer - same ubuntu flavour - with the same results – malang Jun 27 '16 at 15:53
  • Done, commented there. I can't explain the behavior or offer any suggestions. As Eric has already suggested, it would be useful to know the actual blas that is being used in your case. – Robert Crovella Jun 27 '16 at 16:13
  • @RobertCrovella edited questions to add some more info – malang Jun 27 '16 at 19:03
  • I took a look at the default blas implementation in Ubuntu 14.04 and it does have an optimization for the cases where certain matrix elements are exactly zero. I've updated your other question with an answer detailing my findings. So I retracted my previous statement (comment). – Robert Crovella Jun 28 '16 at 04:36
  • 'But it is doable for GPU gemv(), where the time spent on loading the matrix from global memory could be saved.' could you please elaborate a bit. Do you mean that we should check each element of the vector for being '0' before calling cublas gemv? Would that save time for vectors with MANY zeros @Eric – malang Jun 28 '16 at 15:57
  • 1
    @malang No. it means you could implement the kernel function gemv() by yourself, and apply this optimization strategy to it. – kangshiyin Jun 28 '16 at 16:06
  • @Eric it takes more time just to spawn 'm' threads without any code than it takes cblas_sgemm to multiply those matrices. – malang Jun 28 '16 at 18:02
  • @malang what do you mean? where does the 'threads' thing come from? – kangshiyin Jun 28 '16 at 18:14
  • @Eric I mean I would need at least 'm' threads in CUDA to implement an optimized version of cublasSgemv for this situation. But just creating 'm' threads of a simple kernel without any code on the GPU takes more time than cblas_sgemm needs to multiply 'these' matrices on the host. And I have 'grown up' under the impression that GPUs are better for large matrix multiplications. – malang Jun 28 '16 at 18:34
  • @Eric 'm' being number of elements in the vector – malang Jun 28 '16 at 18:46
  • @malang I cannot agree. An empty kernel with m threads usually takes less than 1e-5 s. – kangshiyin Jun 28 '16 at 18:50
  • @Eric I shouldn't have said 'more time'; I should have said 'of the same order'. Maybe it's dependent on the system. With m = 6000 and 95% zeros, cblas_sgemm takes 7.8964e-05 sec, while it takes CUDA 2.3e-5 sec to create 6000 threads of an empty kernel. – malang Jun 28 '16 at 19:29