I am considering a simple problem: speeding up the computation of the component-wise product of two arrays of doubles. I have noticed that using AVX intrinsics I get only around a 20% speedup compared to sequential multiplication in a loop.
I decided to check the latencies for both cases and was confused by the assembly code of the load operation:
__m256d pd;
pd = _mm256_load_pd(a);
movq -112(%rbp), %rax    // Load the pointer a from the stack into rax
vmovapd (%rax), %ymm0    // Load 32 bytes from memory into ymm0
vmovapd %ymm0, -80(%rbp) // What is
vmovapd -80(%rbp), %ymm0 // happening here?
vmovapd %ymm0, -48(%rbp) // Quite a slowdown, since vmovapd costs about as much as vmulpd
The above is part of the assembly generated for the following C code:
#include <immintrin.h>

inline int test(double *a) {
    __m256d pd;
    pd = _mm256_load_pd(a);
    return 1;
}
In the description of `_mm256_load_pd` the operation is specified as:
dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
Does this mean the elements are loaded in reverse order? And what do those extra `vmovapd` instructions have to do with that?