
I am considering a simple problem: speeding up the computation of the component-wise product of two arrays of doubles. I have noticed that using AVX instructions I get only around a 20% speedup compared to sequential multiplication in a loop.

I decided to check the latencies for both cases and became confused by the assembly code of the load operation:

    // __m256d pd;
    // pd = _mm256_load_pd(a);
    movq      -112(%rbp), %rax    // Load the pointer from the stack
    vmovapd   (%rax), %ymm0       // Load 32 bytes from memory into ymm0
    vmovapd   %ymm0, -80(%rbp)    // What is
    vmovapd   -80(%rbp), %ymm0    // happening here?
    vmovapd   %ymm0, -48(%rbp)    // Quite a slowdown, since a vmovapd costs about as much as a vmulpd

Above is part of the assembly for the following C code:

    #include <immintrin.h>

    inline int test(double * a) {
        __m256d pd;
        pd = _mm256_load_pd(a);
        return 1;
    }

In the description of `_mm256_load_pd` it is said that the operation works this way:

    dst[255:0] := MEM[mem_addr+255:mem_addr]
    dst[MAX:256] := 0

i.e. in reverse order? But what do these lines of assembly code have to do with that?

Tzoiker
    You're compiling with optimization disabled, so gcc makes braindead slow code. With optimization, `test()` compiles away to just the `return 1`, because `pd` is never used. If the 20% speedup is with `-O0`, then try with `-O3`. You have to enable optimization if you want code to run fast. – Peter Cordes Mar 25 '16 at 10:36
    AT&T syntax uses `op src2,src1,dest` with `%` decorators on register names, while Intel syntax uses `op dest, src1, src2`. The Intel manual's pseudocode isn't asm at all, it just describes the operation. – Peter Cordes Mar 25 '16 at 10:39
  • 20% speedup is the result for -O3 flag with icpc. – Tzoiker Mar 25 '16 at 11:04
  • Is Intel's C++ compiler already auto-vectorizing your scalar code? The only useful way to understand/explain microbenchmark results is to look at the optimized asm. Preferably from `-march=native` so the auto-vectorizer isn't limited to SSE2 as a baseline. – Peter Cordes Mar 25 '16 at 11:13
  • I specified -no-vec flag as well. Thanks for pointing out `-march=native` though. – Tzoiker Mar 25 '16 at 11:15

1 Answer


It is very hard to answer this question without a bit more context beyond compilation flags. Depending on the number of threads operating and the size of the arrays of doubles, it may be that your problem is memory bound, that is, performance is limited by memory access.

VERY LARGE DATA You want to have a look at the STREAM benchmark. Using AVX intrinsics helps with software and hardware prefetch as well as data alignment, which might explain the +20%. In any case you will be memory bound (loading data from system memory into the L1 cache), and whether the multiplications are performed sequentially or with AVX becomes largely irrelevant: the arithmetic is hidden by the data loading happening in parallel (thanks to the magic of prefetch).
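For reference, here is a minimal sketch of the kind of kernel under discussion (the function names, and the assumption that n is a multiple of 4 and the pointers are 32-byte aligned, are mine, not from the question):

    #include <immintrin.h>
    #include <stddef.h>

    /* Scalar baseline: one multiply per element. */
    static void mul_scalar(const double *a, const double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }

    /* AVX version: 4 doubles per iteration. Assumes n is a multiple of 4 and
       that a, b and c are 32-byte aligned (otherwise use _mm256_loadu_pd). */
    static void mul_avx(const double *a, const double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_load_pd(a + i);
            __m256d vb = _mm256_load_pd(b + i);
            _mm256_store_pd(c + i, _mm256_mul_pd(va, vb));
        }
    }

Once the arrays stop fitting in cache, both versions run at the speed of the memory bus, which is consistent with the modest gap you measured.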

DATA IN L2 CACHE Depending on the number of concurrent threads, data may be fed at a higher pace. However, a single multiply per element is not enough computation for the arithmetic to outweigh the cost of moving the data around.
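To put rough numbers on that (my estimate, not from the answer): each c[i] = a[i] * b[i] moves 24 bytes (two 8-byte loads plus one 8-byte store) for a single multiply, an arithmetic intensity of roughly 1/24 FLOP per byte, so even a fast cache level keeps the multipliers waiting on data rather than the other way around.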

DATA IN L1 CACHE In this case you may see some improvement in performance, but there can be other surprises, such as execution dependencies and the latency of the L1 cache loads. Still, loading aligned data into 256-bit registers is probably the best-performing option.
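As a sketch of what aligned 256-bit loads look like in practice (assuming C11's aligned_alloc is available; the 32-byte alignment is what _mm256_load_pd requires):

    #include <immintrin.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1024;   /* 8 KB per array: stays in L1; multiple of 4 doubles */
        double *a = aligned_alloc(32, n * sizeof *a);   /* 32-byte aligned for AVX */
        double *b = aligned_alloc(32, n * sizeof *b);
        double *c = aligned_alloc(32, n * sizeof *c);
        if (!a || !b || !c)
            return 1;

        for (size_t i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

        for (size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_load_pd(a + i);   /* aligned 32-byte load */
            __m256d vb = _mm256_load_pd(b + i);
            _mm256_store_pd(c + i, _mm256_mul_pd(va, vb));
        }

        free(a);
        free(b);
        free(c);
        return 0;
    }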

Florent DUGUET