I am looking for the most efficient way to multiply two aligned int16_t arrays, whose length is a multiple of 16, with AVX2.
After multiplication into a vector x, I started with _mm256_extracti128_si256 and _mm256_castsi256_si128 to have the low and high part of x and added them with _mm_add_epi16.
I copied the result register and applied _mm_move_epi64 to the original register and added both again with _mm_add_epi16. Now, I think that I have:
-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0
within the 128-bit register. But now I am stuck: I don't know how to efficiently sum the remaining four entries or how to extract the 16-bit result.
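
For reference, the reduction described above could be finished roughly as in the sketch below. It is only a sketch under some assumptions: the names AVX2_FN, hsum_epi16_avx2 and mul_sum_i16 are hypothetical, the products are assumed to be accumulated across 16-element chunks before the final reduction, and every sum wraps modulo 2^16 because the lanes stay 16 bits wide. One thing worth checking in the original attempt: _mm_move_epi64 copies the low 64 bits and zeroes the upper 64 bits, so it does not bring the upper half down. The sketch uses _mm_unpackhi_epi64 for that fold instead, then two more shuffle-and-add steps, and reads lane 0 with _mm_cvtsi128_si32.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC or Clang. The target attribute lets this build without
   -mavx2; when compiling with -mavx2 the macro can simply expand to nothing. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: sum of all 16 int16_t lanes of x, wrapping mod 2^16. */
AVX2_FN static int16_t hsum_epi16_avx2(__m256i x)
{
    __m128i lo = _mm256_castsi256_si128(x);       /* lanes x0..x7  */
    __m128i hi = _mm256_extracti128_si256(x, 1);  /* lanes x8..x15 */
    __m128i s  = _mm_add_epi16(lo, hi);           /* 8 partial sums */

    /* Fold the upper 64 bits onto the lower 64 bits: 4 partial sums remain
       in lanes 0..3 (this is the state shown in the question). */
    s = _mm_add_epi16(s, _mm_unpackhi_epi64(s, s));

    /* Fold 32-bit element 1 (lanes 2,3) onto element 0 (lanes 0,1). */
    s = _mm_add_epi16(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 1, 1, 1)));

    /* Fold lane 1 onto lane 0. */
    s = _mm_add_epi16(s, _mm_shufflelo_epi16(s, _MM_SHUFFLE(1, 1, 1, 1)));

    /* Lane 0 now holds the total; keep only the low 16 bits. */
    return (int16_t)_mm_cvtsi128_si32(s);
}

/* Hypothetical wrapper: wrapping sum of the element-wise products of a and b.
   Assumes n is a multiple of 16 and both pointers are 32-byte aligned. */
AVX2_FN int16_t mul_sum_i16(const int16_t *a, const int16_t *b, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 16) {
        __m256i va = _mm256_load_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_load_si256((const __m256i *)(b + i));
        /* Low 16 bits of each product, accumulated per lane. */
        acc = _mm256_add_epi16(acc, _mm256_mullo_epi16(va, vb));
    }
    return hsum_epi16_avx2(acc);
}
```

The horizontal sum runs only once, after the loop, so its cost matters little for long arrays. If 16-bit wraparound is a concern, _mm256_madd_epi16 multiplies and adds adjacent pairs into 32-bit lanes, which would call for a 32-bit reduction instead of the one sketched here.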