
I wrote some C++ code for matrix multiplication. I used `vector<double>` to store the matrix entries, and a series of 3 nested for loops to calculate the multiplication entry by entry. It turns out that this is super slow: multiplying a 900×500 matrix by a 500×500 matrix takes about 10 seconds on my MacBook Air. What is the reason? Did I use a bad representation of the matrix, or are there big flaws in the code?

    for (int c_b = 0; c_b < B.n_c; c_b++)
    {
        vector<double> vtmp(A.n_r);
        for (int r_a = 0; r_a < A.n_r; r_a++)
        {
            double sum = 0;
            for (int i = 0; i < A.n_c; i++)
            {
                // column-major indexing: element (r, c) lives at r + c*n_rows
                sum = sum + A.mat[r_a + i*A.n_r] * B.mat[i + c_b*B.n_r];
            }
            vtmp[r_a] = sum;
        }
        Cvv[c_b] = vtmp;
    }

UPDATE: This issue has been resolved by using subroutines from LAPACK.

Resorter
  • Instead of creating `vtmp` how about just doing `Cvv[c_b][r_a] = sum;` – SirGuy Oct 04 '16 at 18:48
  • I would suggest using one of the many [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) implementations out there. – Robert Prévost Oct 04 '16 at 18:49
  • Use a profiler to see where your code is spending most of its time vs. a known fast implementation. – Gillespie Oct 04 '16 at 18:50
  • You can check the answers here: http://stackoverflow.com/questions/4455645/what-is-the-best-matrix-multiplication-algorithm – krishnakant Oct 04 '16 at 18:53
  • I don't see a `vector<vector<double>>` in your code, nor declarations for `A`, `B`, or `Cvv`, nor details of their types. I'm inclined to say that your matrix representation is inefficient, but that would be speculative when I don't actually have that representation to evaluate. – John Bollinger Oct 04 '16 at 18:54
  • Also if you could tell us the sizes of your matrices and how long "super slow" actually corresponds to in terms of running time vs. how fast you want it to go – SirGuy Oct 04 '16 at 18:54
  • @GuyGreer It does not change anything. – Resorter Oct 04 '16 at 18:56
  • Do note, however, that accessing the `double`s inside a `vector<vector<double>>` can be expected to be substantially slower than accessing the elements of a plain 1D or 2D array (directly), and it affords much less opportunity for optimization. – John Bollinger Oct 04 '16 at 18:58
  • SIMD can make your code faster – Guillaume Racicot Oct 04 '16 at 19:02
  • Try setting the compiler optimizations for high-speed and running. Also, don't execute the program in debug mode. – Thomas Matthews Oct 04 '16 at 19:03
  • Move your `vtmp` declaration to before the first `for` loop. – Thomas Matthews Oct 04 '16 at 19:04
  • Your program could be suffering from `data cache misses`. Design your matrix accesses so they stay within a cache line. – Thomas Matthews Oct 04 '16 at 19:07
  • maybe `Cvv[c_b]=std::move(vtmp);` – Marian Spanik Oct 04 '16 at 19:07
  • @ThomasMatthews I tried GuyGreer's suggestion and it did not improve much. I guess your suggestion has the same motivation as his. – Resorter Oct 04 '16 at 19:12
  • @MarianSpanik I tried GuyGreer's suggestion and it did not improve much. I guess your suggestion has the same motivation as his. – Resorter Oct 04 '16 at 19:12
  • Can this algorithm be changed so that `A.mat` and `B.mat` can be accessed in a more predictable or easier manner? – Thomas Matthews Oct 04 '16 at 19:30

1 Answer


Here are some suggestions for performance improvement.

Move the vector outside the loop
Constructing a vector takes time. Move the declaration above all of the for loops:

    vector<double> vtmp(A.n_r);
    for (int c_b = 0; c_b < B.n_c; c_b++)
    {
        for (int r_a = 0; r_a < A.n_r; r_a++)
        {
            //...
        }
    }

Dissect the assignment calculation.
Without any profiling or benchmarking, the assignment statement looks like it takes the most time. Break it apart into separate steps to help the compiler, and so you can see whether the calculation can be performed more optimally.
Original:

    sum = sum + A.mat[r_a + i*A.n_r] * B.mat[i + c_b*B.n_r];

Dissected[1]:

    const int A_Index = r_a + i * A.n_r;
    const int B_Index = i + c_b * B.n_r;
    sum = sum + A.mat[A_Index] * B.mat[B_Index];

Dissected[2] (using more variables):

    const int temp1 = i * A.n_r;
    const int temp2 = c_b * B.n_r;
    const int A_Index = r_a + temp1;
    const int B_Index = i + temp2;
    sum = sum + A.mat[A_Index] * B.mat[B_Index];

The above may assist the compiler in choosing the optimal processor instructions.
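Taken one step further, `c_b * B.n_r` depends on neither `r_a` nor `i`, so it can be hoisted out of both inner loops entirely. A self-contained sketch of the whole routine with that invariant hoisted (the flattened-argument signature here is illustrative, not the question's actual interface):

```cpp
#include <cassert>
#include <vector>
using std::vector;

// Illustrative sketch: hoist loop-invariant index arithmetic out of the
// inner loops. A_mat/B_mat are assumed column-major, as in the question,
// and Cvv must be pre-sized to B_n_c columns of A_n_r entries.
void multiply(const vector<double>& A_mat, int A_n_r, int A_n_c,
              const vector<double>& B_mat, int B_n_r, int B_n_c,
              vector<vector<double>>& Cvv) {
    for (int c_b = 0; c_b < B_n_c; c_b++) {
        const int B_col = c_b * B_n_r;      // invariant across r_a and i
        vector<double>& col = Cvv[c_b];     // write in place, no copy per column
        for (int r_a = 0; r_a < A_n_r; r_a++) {
            double sum = 0;
            for (int i = 0; i < A_n_c; i++) {
                sum += A_mat[r_a + i * A_n_r] * B_mat[i + B_col];
            }
            col[r_a] = sum;
        }
    }
}
```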

Using local variables
Ideally you want the processor to fetch as many sequential locations from the matrix as possible while they are still in the data cache, before a cache line has to be reloaded. Something like this:

    // w..z and e..h stand for four consecutive indices into each matrix
    double ATemp1 = A[w];
    double ATemp2 = A[x];
    double ATemp3 = A[y];
    double ATemp4 = A[z];

    double BTemp1 = B[e];
    double BTemp2 = B[f];
    double BTemp3 = B[g];
    double BTemp4 = B[h];

    sum = sum + ATemp1 * BTemp1;
    sum = sum + ATemp2 * BTemp2;
    sum = sum + ATemp3 * BTemp3;
    sum = sum + ATemp4 * BTemp4;
Thomas Matthews
  • There are two issues that I see that are hindering performance. The first is the calculations for the vector indices and second is that the two indices may be falling outside the processor's cache. The processor may be wasting time reloading the data cached when calculating the sum. – Thomas Matthews Oct 04 '16 at 19:28
  • Your comment "The processor may be wasting time reloading the data cached when calculating the sum" is interesting to me. Can you give some suggestions on how to avoid that? – Resorter Oct 04 '16 at 19:42
  • Figure out, from the index calculations, how to structure the loops so that the array accesses are closer together. You could use some temporary variables that load the sequential values (such as `ATemp1`, `ATemp2`) and then calculate the sum based on the temporary values. – Thomas Matthews Oct 04 '16 at 19:59
  • Do you mean dividing the matrix into blocks and doing the calculation block-wise? – Resorter Oct 04 '16 at 22:16
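For what the last comment asks about, a cache-blocked (tiled) multiply would look roughly like the sketch below. The matrices are square column-major arrays here only to keep the example short, and the tile size `BS` is an illustrative placeholder that would be tuned to the cache in practice:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>
using std::vector;

// Illustrative cache-blocked multiply of column-major n x n matrices.
// Tiling the j (column) and k (inner) loops lets each tile of B and A
// be reused while it is still resident in cache.
void matmul_blocked(const vector<double>& A, const vector<double>& B,
                    vector<double>& C, int n, int BS = 64) {
    std::fill(C.begin(), C.end(), 0.0);
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int j = jj; j < std::min(jj + BS, n); j++)
                for (int k = kk; k < std::min(kk + BS, n); k++) {
                    const double b = B[k + (size_t)j * n];
                    for (int i = 0; i < n; i++)
                        C[i + (size_t)j * n] += A[i + (size_t)k * n] * b;
                }
}
```

This is only a sketch of the technique; for production use, the BLAS/LAPACK route mentioned in the question's update does this (and much more) with hand-tuned kernels.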