
I'm trying to use Gemm for matrix multiplication on Alea GPU, but this code gives the wrong result.

Gpu gpu = Gpu.Default;
Blas blas = new Blas(gpu);

int m=2,n=3;    //output C will be an m x n (2x3) matrix
int k=4;        //shared inner dimension: A is m x k, B is k x n

//column major: each initializer row below holds one column of the matrix
float[,] A = new float[4,2] { {100,200},{2,6},{3,7},{4,8} };    //2x4 matrix
float[,] B = new float[3,4] { {1,4,7,10}, {2,5,8,11}, {3,6,9,12} }; //4x3 matrix
float[,] C = new float[3,2] { {-1,-1}, {-1,-1}, {-1,-1}  }; //2x3 matrix

var dA = gpu.AllocateDevice<float>(A);  
var dB = gpu.AllocateDevice<float>(B);  
var dC = gpu.AllocateDevice<float>(C);

blas.Gemm(Operation.N,Operation.N,m,n,k,1f,dA.Ptr,m,dB.Ptr,k,0f,dC.Ptr,m);

var result = Gpu.Copy2DToHost(dC);

This is the result I get. It just copies some numbers from matrix A, and some entries of matrix C keep their initialization values.

100 -1 -1
200 -1 -1

Is there anything wrong with the code? Please help.

I'm using Alea 3.0.3 with CUDA Toolkit 8.0.

UPDATE 1: I've found that it gives the correct result when I flatten the A, B, and C matrices into 1D arrays. However, I still want to know what's wrong with the 2D arrays.
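For reference, here is a minimal sketch of the flattened variant (the 1D AllocateDevice overload and Gpu.CopyToHost are my assumptions, analogous to the 2D calls above; the expected values in the final comment are just the hand-computed product):

//Same data as above, flattened to 1D arrays in column-major order
//(each group below is one column).
float[] A = new float[8] { 100,200, 2,6, 3,7, 4,8 };           //2x4
float[] B = new float[12] { 1,4,7,10, 2,5,8,11, 3,6,9,12 };    //4x3
float[] C = new float[6] { -1,-1, -1,-1, -1,-1 };              //2x3

var dA = gpu.AllocateDevice(A);
var dB = gpu.AllocateDevice(B);
var dC = gpu.AllocateDevice(C);

//1D storage is dense, so the leading dimensions are simply m, k, and m.
blas.Gemm(Operation.N,Operation.N,m,n,k,1f,dA.Ptr,m,dB.Ptr,k,0f,dC.Ptr,m);

var result = Gpu.CopyToHost(dC);
//Expected result (column-major): 169, 353, 278, 574, 387, 795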


1 Answer


I've found that gpu.AllocateDevice for a 2D array does not lay the data out on the GPU the same way it is laid out on the CPU. The distance between the first elements of two consecutive columns (the pitch) is surprisingly large.
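You can see this by printing the pitch of the allocation next to the logical column height (a quick sketch; dA and m are the variables from the question):

var dA = gpu.AllocateDevice<float>(A);
//On the CPU consecutive columns are m elements apart;
//on the GPU each column is padded out to the pitch.
Console.WriteLine($"pitch = {dA.PitchInElements.ToInt32()} elements, m = {m}");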

Therefore, the leading dimension arguments must be changed:

blas.Gemm(Operation.N,Operation.N,m,n,k,1f,dA.Ptr,dA.PitchInElements.ToInt32(),dB.Ptr,dB.PitchInElements.ToInt32(),0f,dC.Ptr,dC.PitchInElements.ToInt32());
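In column-major storage the leading dimension is the stride, in elements, between consecutive columns, so passing the pitch tells cuBLAS to step over the padding at the end of each column. As a sketch of the index arithmetic (the helper is hypothetical, purely for illustration):

//Hypothetical helper (not from Alea): index of element (row, col)
//in column-major storage with leading dimension ld.
int Index(int row, int col, int ld) => col * ld + row;
//CPU layout: ld = m (columns are packed)
//GPU layout: ld = dA.PitchInElements.ToInt32() (columns are padded)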

Now I get the correct result. However, are there any documents that explain how the allocation of a 2D array on the GPU really works in Alea?

I can only see http://www.aleagpu.com/release/3_0_3/api/html/6f0dc687-7191-91ba-6c30-bb379dded567.htm which has no explanation.

  • Most likely, it is using [cudaMallocPitch](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g32bd7a39135594788a542ae72217775c). The reason for the pitch is to align matrix rows with physical memory channels for better performance in some kernels. – Aleksandr Dubinsky Nov 10 '17 at 14:06