
I am learning to program with AVX, so I wrote a simple program that multiplies 4x4 matrices. With no compiler optimizations the AVX version is slightly faster than the non-AVX version, but with -O3 the non-AVX version becomes almost twice as fast as the AVX version. Any tips on how I can improve the performance of the AVX version? The full code follows.

#include <immintrin.h>
#include <stdio.h>       
#include <stdlib.h>      

#define MAT_SIZE    4
#define USE_AVX

double A[MAT_SIZE][MAT_SIZE];
double B[MAT_SIZE][MAT_SIZE];
double C[MAT_SIZE][MAT_SIZE];

union {
    double m[4][4];
    __m256d row[4];
} matB;

void init_matrices()
{
    for(int i = 0; i < MAT_SIZE; i++)
        for(int j = 0; j < MAT_SIZE; j++)
        {
            A[i][j] = (float)(i+j);
            B[i][j] = (float)(i+j+1);
            matB.m[i][j] = B[i][j];
        }
}

void print_result()
{
    for(int i = 0; i < MAT_SIZE; i++)
    {
        for(int j = 0; j < MAT_SIZE; j++)
        {
            printf("%.1f\t", C[i][j]);
        }
        printf("\n");
    }
}

void withoutAVX()
{
    for(int row = 0; row < MAT_SIZE; row++)
        for(int col = 0; col < MAT_SIZE; col++)
        {
            double sum = 0;
            for(int e = 0; e < MAT_SIZE; e++)
                sum += A[row][e] * B[e][col];
            C[row][col] = sum;
        }
}

void withAVX()
{
    for(int row = 0; row < 4; row++)
    {
        //calculate_resultant_row(row);
        const double* rowA = (const double*)&A[row];
        __m256d* pr = (__m256d*)(&C[row]);

        *pr = _mm256_mul_pd(_mm256_broadcast_sd(&rowA[0]), matB.row[0]);
        for(int i = 1; i < 4; i++)
            *pr = _mm256_add_pd(*pr, _mm256_mul_pd(_mm256_broadcast_sd(&rowA[i]), 
                matB.row[i]));
    }
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

int main() 
{
    init_matrices();

    // start timer
    unsigned long long cycles = rdtsc();
#ifdef USE_AVX
    withAVX();
#else
    withoutAVX();
#endif
    // stop timer
    cycles = rdtsc() - cycles;

    printf("\nTotal time elapsed : %ld\n\n", cycles); 
    print_result();
    return 0;
}
pythonic
  • You should time more than one repeat of your matmul so it takes long enough to measure accurately. Use `perf stat` to time the whole program with performance counters, so you don't have to worry about CPU frequency changes (powersaving + turbo). – Peter Cordes Dec 10 '16 at 15:24
  • Or use `cpuid; rdtsc` to serialize the pipeline for RDTSC, to prevent out-of-order execution from running RDTSC before the matmul is really done. – Peter Cordes Dec 10 '16 at 15:25
  • Also note that time with `-O0` is not particularly useful or meaningful. Checking with `-O2` might be meaningful, since gcc only auto-vectorizes at `-O3`. – Peter Cordes Dec 10 '16 at 15:32
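
A minimal sketch of the `cpuid; rdtsc` serialization suggested in the comments above (the function name and the exact asm constraints are mine, for illustration only, and assume x86-64 with GCC/Clang inline asm). `cpuid` is a serializing instruction, so all earlier instructions, here the matmul, must complete before the timestamp is read:

static unsigned long long rdtsc_serialized(void)
{
    unsigned hi, lo;
    /* cpuid (leaf 0) serializes the pipeline; rdtsc then reads the TSC.
       eax/edx end up holding the rdtsc result; cpuid also clobbers ebx/ecx. */
    __asm__ __volatile__ ("cpuid\n\t"
                          "rdtsc"
                          : "=a"(lo), "=d"(hi)
                          : "a"(0)
                          : "%rbx", "%rcx");
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

Bracketing the matmul with this instead of the plain rdtsc() would make the stop timestamp wait for the computation to finish.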

1 Answer


It's hard to be sure without knowing exactly which compiler and system you are using; you would need to check the assembly of the generated code to be certain. Below are some possible reasons.

The compiler probably generated extra loads and stores. These will cost you.

The innermost loop broadcasts elements from A, so you have extra loads. Optimal code should require only 8 loads, 4 each for A and B, and 4 stores back into C. However, your code performs at least 16 extra loads because of your use of broadcastsd. These will cost you as much as the computation itself, and probably more.
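
For illustration, here is a variant of the question's row loop that keeps its accumulator in a register and writes each row of C exactly once (a minimal sketch, not a tuned kernel; it still performs one broadcast load per element of A, so those extra loads remain):

void withAVX_accumulate()
{
    for(int row = 0; row < 4; row++)
    {
        /* accumulate the row of C in a register instead of through *pr */
        __m256d acc = _mm256_mul_pd(_mm256_broadcast_sd(&A[row][0]), matB.row[0]);
        for(int i = 1; i < 4; i++)
            acc = _mm256_add_pd(acc,
                _mm256_mul_pd(_mm256_broadcast_sd(&A[row][i]), matB.row[i]));
        _mm256_storeu_pd(&C[row][0], acc);   /* exactly one store per row of C */
    }
}

Whether this helps at -O3 depends on whether the compiler was already keeping *pr in a register; checking the generated assembly, as suggested above, is the way to confirm. The unaligned store also avoids assuming that C is 32-byte aligned, which the __m256d* cast in the question does.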

Edit (too long for comments)

There are situations where the compiler won't be able to do smart optimizations, or is sometimes "too clever" for its own good. Recently I even needed to use assembly to avoid a compiler optimization that actually led to bad code! That said, if what you need is performance and you don't really care how you get there, I would suggest you first look for good libraries. For example, Eigen for linear algebra will fit your needs in this example perfectly.

If you do want to learn SIMD programming, I suggest you start with simpler cases, such as adding two vectors. Most likely you will find that the compiler can generate better vectorized binaries than your first few attempts, but the simpler cases are more straightforward, so you will see where you need to improve more easily. In the process of trying to produce code as good as or better than what a compiler can generate, you will learn everything you need to write optimal code, and eventually you will be able to provide optimal implementations for code that the compiler cannot optimize.

One thing to keep in mind is that the lower level you go, the less the compiler can do for you. You will have more control over what binary is generated, but it is also your responsibility to make it optimal. This advice is pretty vague; sorry I cannot be of more help.
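
For example, a vector addition written with AVX intrinsics could look like the sketch below (the function name and the assumption that the length is a multiple of 4 are mine, purely for illustration):

#include <immintrin.h>
#include <stddef.h>

/* Sketch: add two double arrays 4 elements at a time with AVX.
   Assumes n is a multiple of 4; a real version would also handle the tail. */
void add_vectors(const double *a, const double *b, double *out, size_t n)
{
    for(size_t i = 0; i < n; i += 4)
    {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&out[i], _mm256_add_pd(va, vb));
    }
}

Comparing this against the assembly gcc emits for the plain C loop at -O3 is a quick way to see how close to (or ahead of) the compiler a hand-written version is.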

Yan Zhou
  • Any idea how we can avoid these extra loads? What about _mm256_set1_pd? – pythonic Dec 10 '16 at 11:56
  • @pythonic To be honest, an optimal implementation of small matrix multiplication is actually quite difficult. You can look into Eigen's source and see how it does it. An idea off the top of my head: at the entry of the function you do 8 loads from A and B, either row by row or column by column, depending on how you store the data. Then you can transpose one of them with the unpack instructions, do the multiplication, do the horizontal addition with hadd, and finally do four stores into the C matrix. However, it's hard to say whether this will be better or worse without a lot of experiments. – Yan Zhou Dec 10 '16 at 12:00
  • I see. But I see this code (https://gist.github.com/rygorous/4172889), which looks very similar to mine, and they report great improvements over the original code. It's strange, isn't it? Also, the compiler I'm using is gcc 6.1. Maybe it does super optimizations on the code; that is why it's so fast! With compilers becoming smarter and smarter, do you think AVX programming is even worth learning? Or should I just switch to learning GPUs :)? – pythonic Dec 10 '16 at 12:05
  • @pythonic I added some comments in the answer, which is too long for the comments – Yan Zhou Dec 10 '16 at 12:16
  • Thanks a lot for such useful information. I was finally able to get a speedup of more than 2x by using an 8x8 matrix with single precision. In the unoptimized version the speedup goes up to 14x, so it is very significant! I'll also post an answer so that it is useful for other people too. – pythonic Dec 10 '16 at 13:18
  • 1
    @pythonic: Have a look at Intel's [AVX for small matrices](https://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices). But note that it's talking about AVX1 only, where lane-crossing shuffles with granularity less than 128b aren't available (other than broadcast-loads from memory). Although that's all you need for matmul, isn't it? – Peter Cordes Dec 10 '16 at 15:30