I need to multiply two vectors of 16-bit integers element by element and get the results as 32-bit integers, because the products can overflow 16 bits. For example:
A = [ 1, 2, 3, 4, 5, 6, 7, 8]
B = [ 1, 3, 5, 6, 8, 9, 10, 12 ]
C1 = [ 1*1, 2*3, 3*5, 4*6 ]
C2 = [ 5*8, 6*9, 7*10, 8*12 ]
I was able to do this by first splitting A and B into vectors of 32-bit elements and then using my multiplication function below:
static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b); /* mul 2,0 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* mul 3,1 */
    /* shuffle results to [63..0] and pack */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}
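To make the current approach concrete, here is a minimal sketch of the whole path: widen the 16-bit inputs to 32-bit lanes, then call muly on each half. It assumes the inputs are unsigned 16-bit values and that only SSE2 is available. The helper name mul_u16_widen_first and the use of zero-extension via _mm_unpacklo_epi16 / _mm_unpackhi_epi16 are assumptions for illustration, not necessarily the exact code used.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// muly from the question, reproduced so the sketch compiles on its own.
static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b);                                       // lanes 0 and 2
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); // lanes 1 and 3
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}

// Hypothetical helper: multiplies 8 unsigned 16-bit values pairwise and
// writes 8 unsigned 32-bit products to out.
static inline void mul_u16_widen_first(const uint16_t *a, const uint16_t *b, uint32_t *out)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));

    // Zero-extend each 16-bit lane to 32 bits by interleaving with zeros.
    __m128i a_lo = _mm_unpacklo_epi16(va, zero);  // elements 0..3
    __m128i a_hi = _mm_unpackhi_epi16(va, zero);  // elements 4..7
    __m128i b_lo = _mm_unpacklo_epi16(vb, zero);
    __m128i b_hi = _mm_unpackhi_epi16(vb, zero);

    _mm_storeu_si128(reinterpret_cast<__m128i *>(out),     muly(a_lo, b_lo)); // C1
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), muly(a_hi, b_hi)); // C2
}
```

With the A and B above, out[0..3] holds C1 and out[4..7] holds C2.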
But I believe this is not efficient, and that _mm_mullo_epi16 could be used
to make it more efficient. Can someone please suggest or post code to achieve this?
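For reference, the kind of approach this has in mind might look like the sketch below. It pairs _mm_mullo_epi16 (low 16 bits of each product) with _mm_mulhi_epu16 (high 16 bits) and interleaves the two halves into 32-bit results. It assumes unsigned inputs; for signed values _mm_mulhi_epi16 would take the place of _mm_mulhi_epu16. The helper name mul_u16_to_u32 is hypothetical, and no performance claim is made here.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: multiplies 8 unsigned 16-bit values pairwise and
// writes 8 unsigned 32-bit products to out, without widening the inputs first.
static inline void mul_u16_to_u32(const uint16_t *a, const uint16_t *b, uint32_t *out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));

    __m128i lo = _mm_mullo_epi16(va, vb);  // low 16 bits of each of the 8 products
    __m128i hi = _mm_mulhi_epu16(va, vb);  // high 16 bits (unsigned multiply)

    // Interleave low and high halves: each 32-bit lane becomes lo | (hi << 16).
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out),     _mm_unpacklo_epi16(lo, hi)); // products 0..3
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), _mm_unpackhi_epi16(lo, hi)); // products 4..7
}
```

This uses two multiplies and two unpacks for all eight products, where the widen-first route calls muly twice on top of the widening step.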