1

I have written an MPI routine to parallelize matrix-vector multiplication. The speedup has been disappointing to non-existent. I have found a lot of routines on the net, and I am handling this about the same way that most of them do. What I haven't been able to find is much data on real speedup on real machines. I am working with what I guess is a modest-sized problem -- a matrix ranging in size from 100x100 to 1000x1000 and number of processors from 2 up to 64. I am decomposing the matrix in a roughly square, checkerboard fashion. Can anyone point me to any data on what kind of speedup I can realistically hope for in this range of problem size and processor number? Thanks.

bob.sacamento
  • 6,283
  • 10
  • 56
  • 115
  • My guess is that your MPI routine isn't also using the full register width. Supposedly from using SIMD instructions alone you can get a 2x speedup; see this relevant but slightly outdated PDF from Intel: http://download.intel.com/design/PentiumIII/sml/24504501.pdf – Levi Morrison Nov 15 '13 at 04:28

1 Answer

6

It takes 2*N^2 FP operations to multiply an N x N matrix by a vector of length N. With N equal to 1000 this results in 2×10^6 operations. A modern CPU core executes 4 FP operations per cycle and runs at around 2×10^9 cycles/second. Therefore it only takes 250 µs to do the matrix-vector multiplication on a single CPU core. It takes quadratically less time with smaller matrices. Now divide that time by the number of CPU cores working together.

Every parallelisation technology introduces some kind of overhead. It only makes sense to employ such technology if this overhead is substantially smaller than the amount of work being done by each processing element (= CPU core).

If you increase the matrix size, you end up with a problem that takes more time and therefore the overhead would be relatively less. But you would end up with a completely different problem - memory bandwidth. Matrix-vector multiplication is a memory bound problem and on modern CPUs the bandwidth of a single socket could easily be "eaten" by one or two threads doing the multiplication. Having more threads would do nothing since there simply won't be enough memory bandwidth to feed the threads with data. Only adding additional CPU sockets would improve the performance since it will effectively increase the available memory bandwidth.

That's it - matrix-vector multiplication is a very simple but also very tricky problem when it comes to parallelisation.

Hristo Iliev
  • 72,659
  • 12
  • 135
  • 186