Multiply one fixed matrix by a huge number of vectors

Question

I need to change the basis of some 10^7 vectors, each having 200 coordinates, so I will multiply one [200 x 200] matrix by 10^7 [200 x 1] vectors. I need it to run very fast, but I also need to code it fast (one day or less) and my CUDA is poor, so I don't want to write it from scratch in CUDA or OpenCL. Maybe some existing library can do it for me? Note that, if the solution uses GPGPU, the matrix should be transferred to the GPU only once; otherwise the performance will be poor. Could I use OpenACC (or OpenMP, I don't know)? Is it possible to do this in a day?

I prefer open source solutions (for both convenience and ethical reasons) but I can tolerate a closed source solution, even paid (assuming it is not too costly).

This is for my dissertation. Thank you for your attention.

Mike Dunlavey · Answer 1 · 2013-07-31T15:20:49.217

You're going to multiply 10 million big vectors by a huge matrix that is the same for all of them. It would be fastest if all possible decision-making could be compiled-out ahead of time. In other words, there are lots of index calculations and loop testing that would be identically repeated millions of times. This sounds like a perfect case for pre-compilation:

Write a small program that would take as input your 200x200 matrix data values, and have it print out a piece of program text defining a function capable of inputting the input vector and outputting the result vector. It could look something like this:

void multTheMatrixByTheVector(double a[200], double b[200]){
  b[0] = 0
    + a[0] * <a constant, the value of mat[0][0]>
    + a[1] * <a constant, the value of mat[1][0]>
    ...
    + a[199] * <a constant, the value of mat[199][0]>
    ;
  b[1] = 0
    + a[0] * <a constant, the value of mat[0][1]>
    + a[1] * <a constant, the value of mat[1][1]>
    ...
    + a[199] * <a constant, the value of mat[199][1]>
    ;
  ...
  b[199] = etc. etc.
}

You see, that function will be around 40000 lines long, but a decent compiler should be able to handle it. Of course, if any of the matrix elements are zero, i.e. there's some sparsity, you can omit those lines (or let the compiler optimizer do it). To do this on CUDA or vectorized instructions, you'd have to modify it accordingly, but that should be do-able.

When you include that function in your main program, it should be able to run about as fast as the machine can go. It's not wasting any cycles doing index calculations, loop testing, or multiplying by empty matrix cells.
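The generator program described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name emit_matvec_source is hypothetical, the matrix is passed as a flat row-major array (so mat[i*n + j] plays the role of mat[i][j]), all entries are finite, and zero entries are skipped as suggested for the sparse case. It keeps the same index convention as the hand-written outline, b[j] = sum over i of a[i] * mat[i][j].

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the C source of a fully unrolled product
   to `out` and returns the number of multiply-add terms it emitted.
   Assumes mat holds n*n finite values in row-major order. */
int emit_matvec_source(FILE *out, const double *mat, int n)
{
    int terms = 0;
    assert(out != NULL && mat != NULL && n > 0);

    fprintf(out, "void multTheMatrixByTheVector(const double a[%d], double b[%d]){\n", n, n);
    for (int j = 0; j < n; j++) {
        fprintf(out, "  b[%d] = 0\n", j);
        for (int i = 0; i < n; i++) {
            double m = mat[i * n + j];
            if (m == 0.0)
                continue;               /* sparsity: omit zero entries */
            /* %.17g prints enough digits to round-trip a double */
            fprintf(out, "    + a[%d] * %.17g\n", i, m);
            terms++;
        }
        fprintf(out, "    ;\n");
    }
    fprintf(out, "}\n");
    return terms;
}
```

One would call this once with the real 200x200 matrix, redirect the output to a .c file, and compile that file together with the main program.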

Then if it takes 10 ns per multiply-add, my back-of-the-envelope estimate is 200 x 200 = 40,000 multiply-adds, so about 400 usec per vector, or 4000 seconds for all 10^7 vectors: a little over an hour.
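For anyone who wants to redo that estimate with different numbers, here is a tiny sketch of the same arithmetic. The function name estimate_seconds is made up for illustration, and the 10 ns per multiply-add is the assumed figure from the estimate, not a measurement.

```c
#include <assert.h>

/* Hypothetical helper: rough run time in seconds for multiplying an
   n x n matrix by num_vectors vectors, given an assumed cost in
   nanoseconds per multiply-add. */
double estimate_seconds(double n, double num_vectors, double ns_per_madd)
{
    assert(n > 0 && num_vectors > 0 && ns_per_madd > 0);
    double madds_per_vector = n * n;                       /* 200 * 200 = 40,000 */
    double ns_per_vector = madds_per_vector * ns_per_madd; /* 400,000 ns = 400 usec */
    return ns_per_vector * num_vectors * 1e-9;             /* ns -> seconds */
}
```

With n = 200, 10^7 vectors and 10 ns per multiply-add this gives roughly 4000 seconds.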

score 1 · Accepted Answer · answered Jul 30 '13 at 11:38

You can put your vectors in a matrix. 200 * 10^7 values (about 16 GB in double precision) is perhaps too much to hold at once, depending on your system, so you can split it into chunks. Then use any code that is optimized for matrix-matrix multiplication, like BLAS. There are many implementations for CPUs, GPUs (cuBLAS, MAGMA, ...), multicores (PLASMA, ...), or distributed memory. Since you will have big matrices, you will get a better speedup than by doing matrix-vector multiplications one at a time.
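The chunking idea can be sketched like this. It is a minimal illustration, not a tuned implementation: the names multiply_chunk and multiply_all are hypothetical, the vectors are assumed to be stored one after another (one vector per row of X, row-major), each result is taken as y = M * x, and the plain triple loop in multiply_chunk is only a stand-in for the optimized matrix-matrix routine (for example a BLAS GEMM call) that would be used in practice.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an optimized matrix-matrix multiply on one chunk.
   M is n x n (row-major); X and Y hold `count` vectors of length n,
   one per row. Computes Y[v] = M * X[v] for every vector in the chunk. */
static void multiply_chunk(const double *M, const double *X, double *Y,
                           size_t n, size_t count)
{
    for (size_t v = 0; v < count; v++) {
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += M[i * n + k] * X[v * n + k];
            Y[v * n + i] = sum;
        }
    }
}

/* Processes `total` vectors in chunks of at most `chunk` vectors, so only
   one chunk at a time has to be handed to the multiply routine. */
void multiply_all(const double *M, const double *X, double *Y,
                  size_t n, size_t total, size_t chunk)
{
    assert(chunk > 0);
    for (size_t start = 0; start < total; start += chunk) {
        size_t remaining = total - start;
        size_t count = remaining < chunk ? remaining : chunk;
        multiply_chunk(M, X + start * n, Y + start * n, n, count);
    }
}
```

The matrix M is the same in every call, so on a GPU it would only need to be transferred once while the chunks of vectors stream through.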
