5

I'm implementing an algorithm that, in essence, is a series of matrix-matrix multiplications like this:

Res = M1.M2.M3. ... .Mn

My matrices are really small (100x100, floats), but the sequence is really long, on the order of billions.

I tried using CUBLAS to do the matrix multiplications, but this was slow. I did, however, notice something interesting.

Multiplying a 100x100 matrix with a 100x100 matrix was slow, but multiplying a 1,000,000x100 matrix with a 100x100 matrix was relatively fast. This made me think: what if, instead of having one scan from left to right, I had 10,000 scans in parallel? That should be pretty fast, and if I multiplied the partial results together when I was done, I would get the same result -- just faster.

Res1    = M1 . M2 . M3 . ... . M(n/1000)
Res2    = M(n/1000 + 1) . M(n/1000 + 2) . ... . M(2n/1000)
...
Res1000 = M(999n/1000 + 1) . M(999n/1000 + 2) . ... . Mn
Res     = Res1 . Res2 . ... . Res1000
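
To make the idea concrete, here is a plain, sequential C sketch of the decomposition (mat_mul, chain_product and NCHUNKS are just illustrative placeholders, not my CUDA code):

#include <stddef.h>
#include <string.h>

#define DIM     100    /* matrix edge length        */
#define NCHUNKS 1000   /* independent partial scans */

/* C = A * B for DIM x DIM row-major matrices; C must not alias A or B. */
static void mat_mul(float *C, const float *A, const float *B)
{
    for (int r = 0; r < DIM; ++r)
        for (int c = 0; c < DIM; ++c) {
            float s = 0.0f;
            for (int k = 0; k < DIM; ++k)
                s += A[r * DIM + k] * B[k * DIM + c];
            C[r * DIM + c] = s;
        }
}

/* Res = M[0] * M[1] * ... * M[n-1], computed as NCHUNKS partial products
 * that are multiplied together in order at the end. The outer loop over
 * chunks is what would run in parallel; assumes NCHUNKS divides n evenly. */
void chain_product(const float *const M[], size_t n, float *Res)
{
    static float partial[NCHUNKS][DIM * DIM];  /* too large for the stack */
    float tmp[DIM * DIM];
    const size_t chunk = n / NCHUNKS;

    for (size_t c = 0; c < NCHUNKS; ++c) {     /* chunks are independent  */
        memcpy(partial[c], M[c * chunk], sizeof tmp);
        for (size_t i = c * chunk + 1; i < (c + 1) * chunk; ++i) {
            mat_mul(tmp, partial[c], M[i]);
            memcpy(partial[c], tmp, sizeof tmp);
        }
    }

    memcpy(Res, partial[0], sizeof tmp);       /* combine in order        */
    for (size_t c = 1; c < NCHUNKS; ++c) {
        mat_mul(tmp, Res, partial[c]);
        memcpy(Res, tmp, sizeof tmp);
    }
}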

It's worth noting that M1 ... Mn are drawn from a set of only about 100 different matrices, so space consumption isn't really a problem; all I need is to be able to do multiple multiplications in one operation.

Now here is my problem. I've written a matrix-matrix multiplication (sgemm) implementation inspired by the one NVIDIA demonstrates in their documentation, but it is about 4 times slower than CUBLAS. Does anyone know how CUBLAS works, and whether the code is available somewhere?

talonmies
Martin Kristiansen
  • Are the matrices special in any helpful way? If they're simultaneously diagonalizable this gets a lot simpler. – Jonathan Dursi Feb 10 '12 at 13:08
  • @JonathanDursi: The only special characteristics of the matrices are that for each matrix all the values sum to 1, and the matrices are square, but that should be clear from the description. – Martin Kristiansen Feb 10 '12 at 13:19

3 Answers

12

Have you looked at the latest CUBLAS (version 4.1)? It includes a new batched GEMM mode specifically intended for large batches of small matrix-matrix multiplies. I would suggest doing a pairwise multiplication tree as Jonathan Dursi suggested in his answer, but using the CUBLAS batched API to accelerate it rather than writing your own custom kernel as he proposes.

CUBLAS 4.1 is included with the CUDA Toolkit v4.1.

CUBLAS BATCHED GEMM API IMPROVES PERFORMANCE OF BATCHES OF SMALL MATRICES
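
For reference, here is a rough sketch of a call to the batched interface (cublasSgemmBatched); the slab allocation, batch size and pointer-array setup are just illustrative, and error checking is mostly omitted:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N     100    /* matrix dimension            */
#define BATCH 1000   /* number of independent GEMMs */

int main(void)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    /* One slab per operand; matrix i lives at offset i*N*N (~120 MB total here). */
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, sizeof(float) * N * N * BATCH);
    cudaMalloc((void **)&d_B, sizeof(float) * N * N * BATCH);
    cudaMalloc((void **)&d_C, sizeof(float) * N * N * BATCH);
    /* ... copy the input matrices into d_A and d_B with cudaMemcpy ... */

    /* The batched API takes device-resident arrays of pointers, one per matrix. */
    const float **h_Aptrs = (const float **)malloc(BATCH * sizeof(float *));
    const float **h_Bptrs = (const float **)malloc(BATCH * sizeof(float *));
    float **h_Cptrs = (float **)malloc(BATCH * sizeof(float *));
    for (int i = 0; i < BATCH; ++i) {
        h_Aptrs[i] = d_A + (size_t)i * N * N;
        h_Bptrs[i] = d_B + (size_t)i * N * N;
        h_Cptrs[i] = d_C + (size_t)i * N * N;
    }
    const float **d_Aptrs; const float **d_Bptrs; float **d_Cptrs;
    cudaMalloc((void **)&d_Aptrs, BATCH * sizeof(float *));
    cudaMalloc((void **)&d_Bptrs, BATCH * sizeof(float *));
    cudaMalloc((void **)&d_Cptrs, BATCH * sizeof(float *));
    cudaMemcpy(d_Aptrs, h_Aptrs, BATCH * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Bptrs, h_Bptrs, BATCH * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Cptrs, h_Cptrs, BATCH * sizeof(float *), cudaMemcpyHostToDevice);

    /* C[i] = A[i] * B[i] for every i in the batch, in a single call. */
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t stat = cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                             N, N, N, &alpha,
                                             d_Aptrs, N, d_Bptrs, N,
                                             &beta, d_Cptrs, N, BATCH);
    if (stat != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSgemmBatched failed: %d\n", (int)stat);
    cudaDeviceSynchronize();

    /* ... fetch the results, then cudaFree/free/cublasDestroy ... */
    return 0;
}

Memory scales linearly with the batch size, so for very long chains the batch has to be processed in pieces that fit on the card.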

harrism
  • I wonder what the limitations are for this call; on my MacBook Pro (256M NV 330M) it simply falls into an endless loop if I try to run 10,000 112x112 multiplications in one go. No error, no nothing. But for smaller amounts it works like a charm :-) – Martin Kristiansen Feb 15 '12 at 15:29
  • 2
    @MartinKristiansen: 112 x 112 x 10,000 x 3 x 4 bytes is well over 1 GB of memory required. You have 256 MB. I'm not sure what you mean by "endless loop", but it sounds to me like you aren't properly checking the error codes returned from cudaMalloc(). – harrism Feb 16 '12 at 01:43
  • @harrism are those matrices single or double precision? – avgn Jul 06 '18 at 19:24
2

The problem is that CUBLAS and similar libraries are designed for using all of the SMs to multiply large matrices. That's not what you want; you want to do lots of little matrix multiplications.

There may be some way to cast this into something CUBLAS could do well for you, but I'm not seeing it. My suggestion would be the following:

Write a kernel that uses one thread block to multiply two of your small matrices, and output the result.
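
A naive sketch of such a kernel (one thread block per adjacent pair of matrices in a packed buffer; the names are illustrative, and a tuned version would stage tiles through shared memory):

#define DIM 100  /* matrix edge length */

/* Block b multiplies the adjacent pair (in[2b], in[2b+1]) and writes out[b].
 * Matrices are DIM x DIM, row-major, stored back to back. */
__global__ void pairwise_multiply(const float *in, float *out)
{
    const float *a = in  + (size_t)(2 * blockIdx.x)     * DIM * DIM;
    const float *b = in  + (size_t)(2 * blockIdx.x + 1) * DIM * DIM;
    float       *c = out + (size_t)blockIdx.x           * DIM * DIM;

    /* Each thread strides over the DIM*DIM output elements. */
    for (int idx = threadIdx.x; idx < DIM * DIM; idx += blockDim.x) {
        const int row = idx / DIM, col = idx % DIM;
        float sum = 0.0f;
        for (int k = 0; k < DIM; ++k)
            sum += a[row * DIM + k] * b[k * DIM + col];
        c[idx] = sum;
    }
}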

Then launch the kernel log2(N) times with tonnes of blocks and tackle the multiplication pairwise:

  • Step 1: multiply M1 x M2, M3 x M4, ..., M(N-1) x M(N), outputting M'1, M'2, ..., M'(N/2)
  • Step 2: multiply M'1 x M'2, M'3 x M'4, ..., M'(N/2-1) x M'(N/2), outputting M''1, M''2, ..., M''(N/4)

etc.

There'll be a factor of 50% memory overhead, but I think you'll make better use of your cores this way.
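
Driven from the host, the pairwise reduction might look roughly like this, building on the pairwise_multiply sketch above, with d_a and d_b as ping-pong device buffers each large enough for the whole list; the loop runs about log2(count) times:

#include <cuda_runtime.h>

/* Repeatedly halves the packed list of count DIM x DIM matrices in d_a and
 * returns the buffer that ends up holding the single final product. */
float *reduce_chain(float *d_a, float *d_b, int count)
{
    while (count > 1) {
        const int pairs = count / 2;
        pairwise_multiply<<<pairs, 256>>>(d_a, d_b);
        if (count & 1)  /* odd leftover: carry the last matrix through */
            cudaMemcpy(d_b + (size_t)pairs * DIM * DIM,
                       d_a + (size_t)(count - 1) * DIM * DIM,
                       sizeof(float) * DIM * DIM, cudaMemcpyDeviceToDevice);
        count = pairs + (count & 1);
        float *t = d_a; d_a = d_b; d_b = t;  /* swap ping-pong buffers */
    }
    return d_a;  /* the remaining matrix is the whole chain product */
}

Since only part of the chain fits in memory at once, the full sequence would be streamed through this reduction in pieces, as discussed in the comments below.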

Update

Ok, if you really don't want to do this in stages, you could do it this way, but it'll require more coding, and performance will probably be worse than what you could get with something like CUBLAS and asynchronous transfers. I'm assuming you're using a Fermi, and that you've turned off the L1 cache so you have 48K of shared memory per SM.

Store the 100 matrices in 2x2 block form, with each block contiguous in memory. So matrix[matrixnum,i,j] starts at matrices[matrixnum*100*100 + i*100*50 + j*50*50]. Note that each block is 50*50*4 bytes ~ 10K, so four comfortably fit in shared memory.

Assign each group of four threadblocks an (Nmatrices/Nblocks)-long chain of the matrices to multiply, with one of the four being responsible for each block of the multiplication.

Let's say you're threadblock 1 of 4 and the first of the matrices you're to multiply is A x B. You're responsible for block (1,1) of the result: (AB)(1,1) = A(1,1)*B(1,1) + A(1,2)*B(2,1). You pre-load A(1,1) into myblock[0] in shared memory.

  • load myblock[1] = B(1,1) from global memory
  • myblock[3] = myblock[0] * myblock[1] (matrix mult, all in shared memory)
  • load myblock[1] = A(1,2) from global memory
  • load myblock[2] = B(2,1) from global memory
  • myblock[0] = myblock[3] + (myblock[1] * myblock[2]) (matrix mult and addition, all in shared memory)

Now you can repeat this for the rest of the sequence of matrices in your part of the chain, outputting only when done.
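
Here's a rough CUDA sketch of just those myblock[] steps: one thread block computing the (1,1) block of a single product A*B, with matrices stored in the 2x2 block layout described above (the helper and kernel names are made up for illustration):

#define BS 50  /* 50x50 sub-blocks; four of them (~40 KB) fit in 48 KB of shared memory */

/* Copy the 50x50 block (bi,bj) of a 100x100 matrix stored in 2x2 block form;
 * block (bi,bj) is contiguous and starts at mat + bi*100*50 + bj*50*50. */
__device__ void load_block(float *dst, const float *mat, int bi, int bj)
{
    const float *src = mat + bi * 100 * 50 + bj * 50 * 50;
    for (int idx = threadIdx.x; idx < BS * BS; idx += blockDim.x)
        dst[idx] = src[idx];
}

/* dst = (add ? add : 0) + a*b for 50x50 blocks held in shared memory;
 * dst must not alias a or b. */
__device__ void block_mult(float *dst, const float *a, const float *b,
                           const float *add)
{
    for (int idx = threadIdx.x; idx < BS * BS; idx += blockDim.x) {
        const int r = idx / BS, c = idx % BS;
        float sum = add ? add[idx] : 0.0f;
        for (int k = 0; k < BS; ++k)
            sum += a[r * BS + k] * b[k * BS + c];
        dst[idx] = sum;
    }
}

/* One thread block computes the (1,1) block of C = A*B following the steps
 * listed above; A and B are 100x100 matrices in 2x2 block form. */
__global__ void product_block11(const float *A, const float *B, float *Cblock11)
{
    __shared__ float myblock[4][BS * BS];

    load_block(myblock[0], A, 0, 0);                      /* A(1,1) */
    load_block(myblock[1], B, 0, 0);                      /* B(1,1) */
    __syncthreads();
    block_mult(myblock[3], myblock[0], myblock[1], 0);    /* A(1,1)*B(1,1) */
    __syncthreads();
    load_block(myblock[1], A, 0, 1);                      /* A(1,2) */
    load_block(myblock[2], B, 1, 0);                      /* B(2,1) */
    __syncthreads();
    /* (AB)(1,1) = A(1,1)*B(1,1) + A(1,2)*B(2,1), overwriting myblock[0] */
    block_mult(myblock[0], myblock[1], myblock[2], myblock[3]);
    __syncthreads();

    for (int idx = threadIdx.x; idx < BS * BS; idx += blockDim.x)
        Cblock11[idx] = myblock[0][idx];                  /* write result block */
}

The (1,2), (2,1) and (2,2) blocks follow the same pattern with different block indices, which is what the other three threadblocks in the group would do.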

When you're done, you'll end up with (#SMs) matrices in global memory, which still have to be multiplied, but there won't be any additional temporary storage in global memory, and you won't have had to copy data into global memory other than the original matrices and the lists of which ones to tackle.

Again, there's no real reason to do this except that you can't be bothered to ship data to the GPU in stages, and performance will almost certainly be worse; there are fewer global memory writes, but you'll probably pay for that with a hand-rolled GEMM. The good news is that 50 isn't a multiple of 8, so you probably won't have too much in the way of shared memory bank conflicts.

Again, for bonus points, you can precompute all the blocks of all pairwise matrix products first and then halve the length of your list.
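
To illustrate that: since the chain only draws on ~100 distinct matrices, all pairwise products can be formed once and the index sequence rewritten over that larger alphabet at half the length. A host-side sketch with made-up names (NMAT, halve_sequence and the pair-id convention are all illustrative):

#include <stddef.h>

#define NMAT 100   /* number of distinct matrices in the set */

/* Given the index sequence seq[0..n-1] (values 0..NMAT-1) describing the
 * chain, emit a half-length sequence of pair ids; pair id p = i*NMAT + j
 * refers to the precomputed product M[i]*M[j]. Assumes n is even. */
size_t halve_sequence(const int *seq, size_t n, int *pair_seq)
{
    for (size_t t = 0; t + 1 < n; t += 2)
        pair_seq[t / 2] = seq[t] * NMAT + seq[t + 1];
    return n / 2;
}

The reduction then proceeds exactly as before, just over the table of precomputed products instead of the original matrices.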

Jonathan Dursi
  • As I stated, the matrices, sized at about 50K each, are drawn from a set of only about 100 different matrices, making the memory demand very small. What you are suggesting will mean using memory linear in the length of my sequence of matrices; since the sequence is extremely long, on the order of billions, this will not be feasible. – Martin Kristiansen Feb 10 '12 at 10:18
  • You're still going to have intermediate products; you can make those larger or smaller by doing (say) three- or four-way multiplications, and only deal with a subset of the matrices at one time, but the problem will remain. The good news is that @harrism pointed out you won't actually have to write the kernels; I hadn't known about the new batched operations, that's pretty cool. – Jonathan Dursi Feb 10 '12 at 12:39
  • You are right, but the size of my intermediate results need not be that big. The implementation you describe takes up something like `sizeof(Matrix)*n/2`, and since my matrices take up something like 40K and the sequence is 2,000,000,000 matrices, this becomes something like 37TB in the first step, which is more memory than my card has. The result can be chopped up into smaller pieces, but the issue remains that this implementation will have a large amount of memory access. And a crazy memory usage. And yes, the batched thing is awesome :-) – Martin Kristiansen Feb 10 '12 at 12:50
  • Right, so you don't store them all at once. You start with almost as many as will fit in memory, and crunch them down pairwise (or whatever) to one (or, probably better, a small number). Then you bring over (or, more likely, copy) the next bunch. You don't have to (can't) have everything in memory at once. But any xGEMM library (like CUBLAS) is going to need to have those matrices sitting there in memory to multiply, so they will need to be there at some point - you can't avoid that. – Jonathan Dursi Feb 10 '12 at 12:52
  • You might be able to save yourself some time, given the number of matrices, by doing a precomputing step; doing all possible 100^2 pairwise matrix multiplications ahead of time will only require 1.2GB, and then you'll be able to cut the sequence of initial matrices from ~1 billion to half that, which is a pretty good savings. But then you'd have to keep the 1/3 GB (as opposed to ~4MB) of pairwise results sitting there in memory for the rest of the steps... – Jonathan Dursi Feb 10 '12 at 12:59
  • Ok, I might have been unclear on my implementation. My algorithm goes something like this: I have an array of matrices `Matrix m[100]` and then have to calculate `Matrix res = m[1]*m[4]*m[some_idx]...m[3];` does this make sense? Hence I don't really take up that much memory. At this point I have to admit that I actually implemented what you suggested ;-) – Martin Kristiansen Feb 10 '12 at 13:00
  • I agree your initial data doesn't take up much room (although note that it is linear in the ~billion matrices, just down by a factor of 40K). But you still have to do those matrix mults. – Jonathan Dursi Feb 10 '12 at 13:04
  • See update. I think it meets your requirements, but I think it's a worse solution all around. – Jonathan Dursi Feb 10 '12 at 18:21
0

LIBXSMM, a library targeting Intel Architecture for small, dense or sparse matrix multiplications and small convolutions, is exactly meant to exploit the best performance for small matrix multiplications.

In contrast to NVIDIA CUBLAS (or Intel MKL), LIBXSMM does not rely on a batch interface. Instead, one can arrange individual calls and also supply "next locations", i.e., where the operands/matrices of the next multiplications are located in memory. The advantage is that an explicit data structure or index format describing the batch is not needed.

#include <libxsmm.h>
#include <stdlib.h>

int main()
{
  /* would be passed to the dispatch below when using the prefetch variant */
  const libxsmm_gemm_prefetch_type prefetch = LIBXSMM_PREFETCH_AUTO;
  const double alpha = 1.0, beta = 1.0;  /* beta = 1: accumulate the C-matrix */
  const int m = 23, n = 23, k = 23;      /* some problem size */
  const int asize = m * k, bsize = k * n, csize = m * n;
  const int nbatch = 1000;               /* number of multiplications */
  libxsmm_dmmfunction xmm = NULL;        /* function pointer */

  /* operand storage: nbatch A- and B-matrices, one accumulated C-matrix */
  double *const a = malloc(sizeof(double) * asize * nbatch);
  double *const b = malloc(sizeof(double) * bsize * nbatch);
  double *const c = calloc(csize, sizeof(double));
  /* ... fill a and b with the input matrices ... */

  xmm = libxsmm_dmmdispatch(m, n, k,
          NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
          &alpha, &beta, NULL/*flags*/,
          NULL/*&prefetch*/);

  if (xmm) { /* JIT'ted code has been generated */
    /* every call accumulates into the same C, so this loop stays serial;
       a parallel version would need per-thread C-buffers or a reduction */
    for (int i = 0; i < nbatch; ++i) {
      const double *const ai = a + i * asize;
      const double *const bi = b + i * bsize;
      /* e.g., matrix C is accumulated (instead of streamed) */
      double *const ci = c /*+ i * csize*/;
      /* optionally provide "next locations" for prefetching */
      xmm(ai, bi, ci/*,
          ai + 1 * asize,
          bi + 1 * bsize,
          ci + 0 * csize
      */);
    }
  }

  free(a); free(b); free(c);
  return 0;
}

LIBXSMM produces highly optimized and specialized code just-in-time (JIT), which exploits the latest instruction set extensions (SSE3, AVX, AVX2, and AVX-512). LIBXSMM is available under a permissive license (BSD 3-clause).

NOTE: This is not about CUBLAS (as originally asked for).

hfp