
I'm trying to vectorize matrix multiplication using blocking and vector intrinsics. It seems to me that the addition step of the multiply-accumulate cannot be vectorized. Could you please see whether I can improve my code to vectorize it further?

    double dd[4], bb[4];
    __m256d op_a, op_b, op_d;
    for(i = 0; i < num_blocks; i++){
        for(j = 0; j < num_blocks; j++){
            for(k = 0; k < num_blocks; k++){
                for(ii = 0; ii < block_size; ii++){
                    for(kk = 0; kk < block_size; kk++){
                        for(jj = 0; jj < block_size; jj += 4){

                            aoffset = n*(i*block_size + ii) + j*block_size + jj;
                            boffset = n*(j*block_size + jj) + k*block_size + kk;
                            coffset = n*(i*block_size + ii) + k*block_size + kk;

                            /* gather a column of b into bb */
                            bb[0] = b[n*(j*block_size + jj)     + k*block_size + kk];
                            bb[1] = b[n*(j*block_size + jj + 1) + k*block_size + kk];
                            bb[2] = b[n*(j*block_size + jj + 2) + k*block_size + kk];
                            bb[3] = b[n*(j*block_size + jj + 3) + k*block_size + kk];

                            op_a = _mm256_loadu_pd(a + aoffset);
                            op_b = _mm256_loadu_pd(bb);
                            op_d = _mm256_mul_pd(op_a, op_b);
                            _mm256_storeu_pd(dd, op_d);
                            c[coffset] += dd[0] + dd[1] + dd[2] + dd[3];

                        }
                    }
                }
            }
        }
    }

Thanks.

  • Your kernel has several inefficiencies. For one, you should reorder the inner loops so that you read `b` horizontally rather than gathering it vertically. You should also store directly into `c` and not use `dd`. But there are more problems. I'll try to give an answer over the next few days if someone else does not. You should measure the efficiency of your method so you have a reference. – Z boson Oct 27 '14 at 09:17
  • Can't you use an existing implementation? Matrix multiplication is well-trod turf, and there are multiple existing libraries which accelerate it already. What are you trying to achieve beyond the prior art? – John Zwinck Oct 27 '14 at 12:21
  • @JohnZwinck I'm trying to learn matrix-matrix multiplication using vectorization and vector intrinsics. Once I've got a handle on single-threaded vectorization, I will move on to MPI and OpenMP. It's not a professional task; it's self-learning. – the_naive Oct 27 '14 at 21:03
  • @the_naive, there is nothing wrong with that. I'm doing the same thing. If you want to cut to the chase then look at the source code of [openblas](http://www.openblas.net/). – Z boson Oct 28 '14 at 08:35

2 Answers


You can use this version of the matrix multiplication algorithm (c[i,j] += a[i,k]*b[k,j]). First the scalar version:

for(int i = 0; i < i_size; ++i)
{
    for(int j = 0; j < j_size; ++j)
        c[i][j] = 0;

    for(int k = 0; k < k_size; ++k)
    {
        double aa = a[i][k];
        for(int j = 0; j < j_size; ++j)
            c[i][j] += aa*b[k][j];
    }
}

And the vectorized version (note this requires `j_size` to be a multiple of 4, and `c[i]` and `b[k]` to be 32-byte aligned, since it uses aligned loads and stores):

for(int i = 0; i < i_size; ++i)
{
    for(int j = 0; j < j_size; j += 4)
        _mm256_store_pd(c[i] + j, _mm256_setzero_pd());

    for(int k = 0; k < k_size; ++k)
    {
        __m256d aa = _mm256_set1_pd(a[i][k]);
        for(int j = 0; j < j_size; j += 4)
        {
            __m256d cc = _mm256_load_pd(c[i] + j);
            cc = _mm256_add_pd(cc, _mm256_mul_pd(aa, _mm256_load_pd(b[k] + j)));
            _mm256_store_pd(c[i] + j, cc);
        }
    }
}
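The aligned `_mm256_load_pd`/`_mm256_store_pd` intrinsics fault if the address is not 32-byte aligned, so each row of `b` and `c` must be allocated accordingly. A minimal allocation sketch (hypothetical `alloc_row` helper, assuming C11 `aligned_alloc` is available):

```c
#include <stdlib.h>

/* Allocate an n-element double row, 32-byte aligned for _mm256_load_pd.
   C11 aligned_alloc requires the size to be a multiple of the alignment,
   so round the byte count up to the next multiple of 32. */
double *alloc_row(size_t n)
{
    size_t bytes = (n * sizeof(double) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);
}
```

Free the rows with plain `free` when done.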
ErmIg
  • This has terrible cache characteristics compared to the code in the question, and it also doesn't read the right values from `b` (or else `a`). One is a column vector, one is a row vector, and you're treating both equally. – Ben Voigt Oct 27 '14 at 15:18

"Horizontal add" is a more recent addition to the SSE instruction set (`haddpd` arrived with SSE3), so you can't use the accelerated version if compatibility with many different processors is your goal.

However, you definitely can vectorize the additions. Note that the innermost loop only affects a single `coffset`. You should move the `coffset` calculation out of that loop (the compiler will do this automatically, but the code is more readable if you do), and keep four running partial sums in a vector accumulator, performing the horizontal add only once per `coffset`. This is an improvement even if a vector horizontal add is used, and with the scalar horizontal add it's a big one.

Something like:

for(kk = 0; kk < block_size; kk++){
    __m256d op_e = _mm256_setzero_pd();  /* four running partial sums */

    for(jj = 0; jj < block_size; jj += 4){
        aoffset = n*(i*block_size + ii) + j*block_size + jj;

        /* gather a column of b */
        bb[0] = b[n*(j*block_size + jj)     + k*block_size + kk];
        bb[1] = b[n*(j*block_size + jj + 1) + k*block_size + kk];
        bb[2] = b[n*(j*block_size + jj + 2) + k*block_size + kk];
        bb[3] = b[n*(j*block_size + jj + 3) + k*block_size + kk];

        op_a = _mm256_loadu_pd(a + aoffset);
        op_b = _mm256_loadu_pd(bb);
        op_d = _mm256_mul_pd(op_a, op_b);
        op_e = _mm256_add_pd(op_e, op_d);
    }
    _mm256_storeu_pd(dd, op_e);
    coffset = n*(i*block_size + ii) + k*block_size + kk;
    c[coffset] += dd[0] + dd[1] + dd[2] + dd[3];  /* += so contributions from earlier j blocks are kept */
}

You can also speed this up by doing a transpose on b beforehand, instead of gathering the vector inside the innermost loop.
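A minimal sketch of that transpose (hypothetical `bt` buffer; assumes the same n×n row-major layout as the question):

```c
#include <stddef.h>

/* Transpose the n x n row-major matrix b into bt, so that a column of b
   becomes a contiguous row of bt. The four scalar gathers into bb can
   then be replaced by a single _mm256_loadu_pd from bt. */
void transpose(const double *b, double *bt, size_t n)
{
    for (size_t r = 0; r < n; r++)
        for (size_t col = 0; col < n; col++)
            bt[col * n + r] = b[r * n + col];
}
```

After the transpose, the gather collapses to `op_b = _mm256_loadu_pd(bt + n*(k*block_size + kk) + j*block_size + jj);`.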

Ben Voigt