The problem is that CUBLAS etc. are designed to use all of the SMs to multiply large matrices. That's not what you want; you want to do lots of little matrix multiplications.
There may be some way to cast this into something CUBLAS could do well for you, but I'm not seeing it. My suggestion would be the following:
Write a kernel that uses one thread block to multiply two of your small matrices, and output the result.
Then launch the kernel log2(N) times with tonnes of blocks and tackle the multiplication pairwise:
- Step 1: multiply M1 x M2, M3 x M4, ..., M(N-1) x MN, outputting M'1, M'2, ..., M'(N/2)
- Step 2: multiply M'1 x M'2, M'3 x M'4, ..., M'(N/2-1) x M'(N/2), outputting M''1, M''2, ..., M''(N/4)
etc.
There'll be roughly 50% extra memory overhead for the intermediate results, but I think you'll make better use of your cores this way.
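If it helps, here's a minimal sketch of what I have in mind, assuming 100x100 single-precision matrices stored back to back in a device buffer and N a power of two; the names multiply_pairs and chain_product are made up for illustration, and the inner loop is deliberately naive.

```
#include <cuda_runtime.h>

#define MDIM 100   // matrix dimension, per the question

// One thread block computes C = A * B for one pair of MDIM x MDIM matrices.
// Each thread strides over the output elements, so any reasonable block size works.
__global__ void multiply_pairs(const float *in, float *out, int npairs)
{
    int pair = blockIdx.x;
    if (pair >= npairs) return;

    const float *A = in  + (size_t)(2 * pair)     * MDIM * MDIM;
    const float *B = in  + (size_t)(2 * pair + 1) * MDIM * MDIM;
    float       *C = out + (size_t)pair           * MDIM * MDIM;

    for (int idx = threadIdx.x; idx < MDIM * MDIM; idx += blockDim.x) {
        int row = idx / MDIM, col = idx % MDIM;
        float sum = 0.0f;
        for (int k = 0; k < MDIM; k++)
            sum += A[row * MDIM + k] * B[k * MDIM + col];
        C[idx] = sum;
    }
}

// Host side: reduce a chain of n matrices pairwise, ping-ponging between the
// input buffer and a scratch buffer big enough for n/2 matrices, until one
// product remains. Returns whichever buffer holds the final result.
float *chain_product(float *d_in, float *d_scratch, int n)
{
    while (n > 1) {
        int npairs = n / 2;
        multiply_pairs<<<npairs, 256>>>(d_in, d_scratch, npairs);
        float *tmp = d_in; d_in = d_scratch; d_scratch = tmp;   // swap roles
        n = npairs;
    }
    return d_in;   // the final MDIM x MDIM product lives here
}
```

The scratch buffer only has to hold one intermediate stage, which is where the ~50% memory overhead comes from; the 256-thread block size is arbitrary.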
Update
Ok, if you really don't want to do this in stages, you could do it this way, but it'll require more coding, and performance will probably be worse than what you could get with something like cuBLAS plus asynchronous transfers. I'm assuming you're on a Fermi and that you've configured it to prefer shared memory over L1 cache, so you have 48K of shared memory per SM.
Store the 100 matrices in 2x2 block form, with each block contiguous in memory. So matrix[matrixnum,i,j] starts at matrices[matrixnum*100*100 + i*100*50 + j*50*50]. Note that each block is 50*50*4 bytes ~ 10K, so four of them fit comfortably in shared memory.
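In code, the layout I'm describing is just this (block_offset is a made-up helper for illustration, not anything from the CUDA toolkit):

```
#include <cstddef>

// Block (i, j) of matrix number matrixnum starts here; i and j are 0 or 1,
// and each 50x50 block is stored row-major and contiguous.
__host__ __device__ inline size_t block_offset(int matrixnum, int i, int j)
{
    return (size_t)matrixnum * 100 * 100   // skip over whole matrices
         + (size_t)i * 100 * 50            // skip the top row of blocks if i == 1
         + (size_t)j * 50 * 50;            // skip the left block in that row if j == 1
}

// Element (r, c) of that block is then matrices[block_offset(m, i, j) + r * 50 + c].
```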
Assign each set of 4 threadblocks an (Nmatrices/Nblocks)-long chain of the matrices to multiply, with one of the four being responsible for one block of the multiplication.
Let's say you're threadblock 1 of the 4, and the first of the matrices you're to multiply is A x B. You're responsible for block (1,1) of the result: (AB)1,1 = A1,1*B1,1 + A1,2*B2,1. You pre-load A1,1 into myblock[0] in shared memory.
- load in myblock[1] = B1,1 from global memory
- myblock[3] = myblock[0] * myblock[1] (matrix mult, all in shared memory)
- load in myblock[1] = A1,2 from global
- load in myblock[2] = B2,1 from global
- myblock[0] = myblock[3] + (myblock[1] * myblock[2]) (matrix mult and addition, all in shared memory).
Now you can repeat this for the rest of the sequence of matrices in your part of the chain, outputting only when done.
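Sketched out, one A x B step looks roughly like the kernel below. It only shows a single product with four thread blocks, one per output block; the loop over the rest of the chain, and how those four blocks would hand the running product to each other between steps, is left out. tile_mult is another made-up helper, and block_offset is the one sketched above.

```
#define BDIM 50   // each 100x100 matrix is four 50x50 blocks

// Multiply two 50x50 tiles held in shared memory, optionally accumulating
// into acc. Threads stride over the 2500 output elements; naive inner product.
__device__ void tile_mult(const float *a, const float *b, float *acc, bool accumulate)
{
    for (int idx = threadIdx.x; idx < BDIM * BDIM; idx += blockDim.x) {
        int r = idx / BDIM, c = idx % BDIM;
        float sum = 0.0f;
        for (int k = 0; k < BDIM; k++)
            sum += a[r * BDIM + k] * b[k * BDIM + c];
        acc[idx] = accumulate ? acc[idx] + sum : sum;
    }
    __syncthreads();
}

// One thread block computes output block (bi, bj) of C = A * B using the
// myblock[0..3] staging above. Launch as blocked_mult<<<4, 256>>>(dA, dB, dC);
// the 4 x ~10K of shared memory needs the 48K shared-memory configuration.
__global__ void blocked_mult(const float *A, const float *B, float *C)
{
    __shared__ float myblock[4][BDIM * BDIM];

    int bi = blockIdx.x / 2;   // which output block this thread block owns
    int bj = blockIdx.x % 2;

    // Load A(bi,0) and B(0,bj), multiply into myblock[3].
    for (int idx = threadIdx.x; idx < BDIM * BDIM; idx += blockDim.x) {
        myblock[0][idx] = A[block_offset(0, bi, 0) + idx];
        myblock[1][idx] = B[block_offset(0, 0, bj) + idx];
    }
    __syncthreads();
    tile_mult(myblock[0], myblock[1], myblock[3], false);

    // Load A(bi,1) and B(1,bj), multiply and add into the same result tile.
    for (int idx = threadIdx.x; idx < BDIM * BDIM; idx += blockDim.x) {
        myblock[1][idx] = A[block_offset(0, bi, 1) + idx];
        myblock[2][idx] = B[block_offset(0, 1, bj) + idx];
    }
    __syncthreads();
    tile_mult(myblock[1], myblock[2], myblock[3], true);

    // Write block (bi, bj) of the product back to global memory.
    for (int idx = threadIdx.x; idx < BDIM * BDIM; idx += blockDim.x)
        C[block_offset(0, bi, bj) + idx] = myblock[3][idx];
}
```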
When you're done, you'll end up with (#SMs) matrices in global memory, which still have to be multiplied, but there won't be any additional temporary storage in global memory, and you won't have had to copy data into global memory other than the original matrices and the lists of which ones to tackle.
Again, there's no real reason to do this except that you can't be bothered to ship data to the GPU in stages, and performance will almost certainly be worse; there are fewer global memory writes, but you'll probably pay for that with a hand-rolled GEMM. The good news is that 50 isn't a multiple of 8, so you probably won't have too much in the way of shared memory bank conflicts.
Again, for bonus points, you can precompute all the blocks of all pairwise matrix products first and then halve the length of your list.