Multiplying hundreds of matrices using CUDA

Question

I am writing a program that needs to multiply hundreds of matrices in parallel using CUDA. Can somebody explain how to perform this operation?

I have seen that the Kepler architecture is capable of dynamic parallelism. Has anybody used this architecture and, if so, with which Nvidia graphics card?

What size are the matrices? Where does the dynamic parallelism come into play? Or is that a different question? — Bart, Oct 24 '12 at 14:33
The latest CUBLAS libraries come with a batch mode for matrix multiplication which allows exactly this as long as the matrices are the same size - http://docs.nvidia.com/cuda/cublas/index.html#topic_3_6 — Jonathan Dursi, Oct 24 '12 at 14:38
[batchCUBLAS example](http://docs.nvidia.com/cuda/cuda-samples/index.html#batchcublas) and [api reference](http://docs.nvidia.com/cuda/cublas/index.html#topic_8_2). — Robert Crovella, Oct 27 '12 at 05:44
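
For reference, the batch mode mentioned in the comments above might be used roughly as follows. This is only a minimal sketch, not the code from the linked batchCUBLAS sample. It assumes single-precision matrices that all have the same dimensions, are stored column-major, and are packed back to back in one host buffer. The helper name `batched_matmul` and the `CHECK_CUDA` / `CHECK_CUBLAS` macros are made up for illustration; `cublasSgemmBatched` is the actual cuBLAS routine.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative macro: abort with a message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s\n",                      \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Illustrative macro: abort if a cuBLAS call fails.
#define CHECK_CUBLAS(call)                                                \
    do {                                                                  \
        if ((call) != CUBLAS_STATUS_SUCCESS) {                            \
            std::fprintf(stderr, "cuBLAS error at line %d\n", __LINE__);  \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical helper: returns C with C[i] = A[i] * B[i] for i in [0, batch).
// Each A[i] is m x k, each B[i] is k x n, each C[i] is m x n (column-major).
std::vector<float> batched_matmul(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int m, int k, int n, int batch) {
    const size_t sizeA = (size_t)m * k, sizeB = (size_t)k * n,
                 sizeC = (size_t)m * n;

    // Copy all input matrices to the device in two bulk transfers.
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    CHECK_CUDA(cudaMalloc(&dA, batch * sizeA * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&dB, batch * sizeB * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&dC, batch * sizeC * sizeof(float)));
    CHECK_CUDA(cudaMemcpy(dA, A.data(), batch * sizeA * sizeof(float),
                          cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dB, B.data(), batch * sizeB * sizeof(float),
                          cudaMemcpyHostToDevice));

    // The batched call takes device arrays of device pointers, one per matrix.
    std::vector<const float*> hAp(batch), hBp(batch);
    std::vector<float*> hCp(batch);
    for (int i = 0; i < batch; ++i) {
        hAp[i] = dA + i * sizeA;
        hBp[i] = dB + i * sizeB;
        hCp[i] = dC + i * sizeC;
    }
    const float **dAp = nullptr, **dBp = nullptr;
    float **dCp = nullptr;
    CHECK_CUDA(cudaMalloc(&dAp, batch * sizeof(float*)));
    CHECK_CUDA(cudaMalloc(&dBp, batch * sizeof(float*)));
    CHECK_CUDA(cudaMalloc(&dCp, batch * sizeof(float*)));
    CHECK_CUDA(cudaMemcpy(dAp, hAp.data(), batch * sizeof(float*),
                          cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dBp, hBp.data(), batch * sizeof(float*),
                          cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dCp, hCp.data(), batch * sizeof(float*),
                          cudaMemcpyHostToDevice));

    // One call multiplies the whole batch: C[i] = 1 * A[i] * B[i] + 0 * C[i].
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));
    const float alpha = 1.0f, beta = 0.0f;
    CHECK_CUBLAS(cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    m, n, k, &alpha,
                                    dAp, m,   // lda = rows of A[i]
                                    dBp, k,   // ldb = rows of B[i]
                                    &beta,
                                    dCp, m,   // ldc = rows of C[i]
                                    batch));

    // Blocking copy back to the host, then release device resources.
    std::vector<float> C(batch * sizeC);
    CHECK_CUDA(cudaMemcpy(C.data(), dC, batch * sizeC * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CHECK_CUBLAS(cublasDestroy(handle));
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(dAp); cudaFree(dBp); cudaFree(dCp);
    return C;
}
```

Built with something like `nvcc example.cu -lcublas`, a call such as `batched_matmul(A, B, 8, 8, 8, 200)` would multiply 200 pairs of 8x8 matrices in one batched call. If every product should use the same right-hand matrix, each entry of the B pointer table can point at a single device copy of that matrix instead.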

score 1 · Answer 1 · answered Oct 24 '12 at 16:04

The easiest way to get fast parallel matrix multiplication with CUDA is the ArrayFire CUDA library and its GFOR loop. Here's some code that does what you want:

int n = 8, m = 8;       // dimensions (square, so A*B is well-defined)
int t = 10;             // number of different matrices
array A = randu(m,n,t); // many matrices
array B = randu(m,n);   // one matrix
array C = zeros(m,n,t); // destination

// multiply C=A*B for all A, at the same time
gfor (array i, A.dims(2)) {
    C(span,span,i) = matmul(A(span,span,i), B);
}

print( A );
print( B );
print( C );

ArrayFire automatically tiles the computation for efficient execution on the GPU, and all of that optimization happens behind the scenes. I find it to be faster than writing it by hand myself.
