I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication dramatically improves if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cublas cousin by around 100 times. This could be most probably attributed to some automatic detection of sparsity and suitable format conversion by cblas_sgemm function.
Unfortunately, no such behavior is exhibited by its cuda counterpart i.e. cublasSgemm.
So, the question is, how can I get the same type of optimization on cublasSgemm for matrices that may have a large number of zeros.
And what technique does cblas_sgemm use to automatically adjust to sparse matrices?
Please, do not recommend cuSparse / CUSP etc because
- I am not sure about the sparsity of input matrices beforehand
- I am working on an iterative algorithm where for initial few iterations the matrices may be sparse but gradually become dense as time goes on.
Thanks in advance
Edited to include code to reproduce the above scenario
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
using namespace std;
int main()
{
const int m = 5000;
timespec blas_start, blas_end, cublas_start, cublas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m
// input martix A: every 32nd element is non-zero
for(i = 0; i < m; i++)
{
A[i] = 0;
if( i % 32 == 0) //adjust for sparsity
A[i] = i;
}
// input matrix B: identity matrix
// col major = row major
for(i = 0; i < m; i++)
for(j = 0; j < m; j++)
{
if (i==j)
B[j*m + i] = 1;
else
B[j*m + i] = 0;
}
clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1, A, m, B, m, 0, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
/*
for(i = 0; i < 12; i++)
printf("%f ", C[i]);
*/
//cublas section
cudaError_t cudaStat;
cublasHandle_t handle;
cublasCreate(&handle);
//Declaring Device Variables
float *A_d, *B_d, *C_d;
//Allocating Memory for Device Variables
cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for A_d\n");
cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for B_d\n");
cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for C_d\n");
// Moving values of A, B onto Device variables
cublasSetVector(m, sizeof(float), A, 1, A_d, 1);
cublasSetMatrix(m, m, sizeof(float), B, m, B_d, m);
// Do the actual multiplication
float alpha = 1.0f, beta = 0.0f;
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_start);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, m, m, &alpha, A_d, 1, B_d, m, &beta, C_d, 1);
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_end);
cublasGetVector(m, sizeof(float), C, 1, C_d, 1);
/*
for(i = 0; i < 12; i++)
printf("%f ", C[i]);
*/
// Print times
// blas time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"BLAS Time = "<< totaltime << "\n";
//cublas
totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"CUBLAS Time = "<< totaltime << "\n";
return 0;
}
Ran it to get the following results
malang@ubuntu:~/uas/stackoverflow$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
malang@ubuntu:~/uas/stackoverflow$ ./blascomp.o
BLAS Time = 0.000964504
CUBLAS Time = 0.0365322
EDIT
Edited after the answer of @Eric
Use of cublasSgemv has greatly enhanced the performance on the GPU. But, I still have this problem of cblas_sgemm being much more efficient for sparse matrices on the CPU. What could be the possible reasons?
EDIT Executed the following commands on the suggestion of @Eric @osgx @Robert Crovella
erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
linux-gate.so.1 => (0xb76f6000)
libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)
erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root 26 مارچ 13 2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ 13 2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root 28 مارچ 13 2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root 30 مارچ 13 2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root 32 مارچ 13 2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf