How to Multiply 2 16 bit vectors and store result in 32 bit vector in sse?

Question

I need to multiply two vectors of 16-bit integers and store the results in 32-bit vectors to avoid overflow, as in the example below.

   A = [ 1, 2, 3, 4, 5, 6, 7, 8]
   B = [ 1, 3, 5, 6, 8, 9, 10 ,12 ]

   C1 = [ 1*1, 2*3, 3*5, 4*6 ]
   C2 = [ 5*8, 6*9, 7*10, 8*12 ]

I was able to do this by first splitting A and B into 32-bit vectors and then using my multiplication function below:

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b); /* mul 2,0*/
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* mul 3,1 */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0))); /* shuffle results to [63..0] and pack */
}

But I believe this is not efficient, and that _mm_mullo_epi16 could be used to make it faster. Can someone please suggest or post code to achieve this?

Paul R · Accepted Answer · 2016-03-09T08:45:37.217

Yes, you can do it like this:

static inline void muly(__m128i &vh, __m128i &vl,           // output - 2x4xint32_t
                        const __m128i v0, const __m128i v1) // input  - 2x8xint16_t
{
    const __m128i vhi = _mm_mulhi_epi16(v0, v1);            // high 16 bits of each signed product
    const __m128i vlo = _mm_mullo_epi16(v0, v1);            // low 16 bits of each product
    vh = _mm_unpackhi_epi16(vlo, vhi);                      // interleave lo/hi halves: products 4..7
    vl = _mm_unpacklo_epi16(vlo, vhi);                      // interleave lo/hi halves: products 0..3
}
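
As a quick sanity check, here is a minimal sketch (not part of the original answer) that loads two arrays of eight int16_t values, applies the same mulhi/mullo/unpack technique, and stores the eight 32-bit products in order. The names `muly_widen` and `multiply8` are hypothetical, chosen only for illustration, and unaligned loads and stores are assumed so that plain arrays can be passed in.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Same technique as the answer: signed 16x16 -> 32-bit multiply.
// (Hypothetical name, to avoid clashing with the muly functions above.)
static inline void muly_widen(__m128i &vh, __m128i &vl,
                              const __m128i v0, const __m128i v1)
{
    const __m128i vhi = _mm_mulhi_epi16(v0, v1);  // high 16 bits of each product
    const __m128i vlo = _mm_mullo_epi16(v0, v1);  // low 16 bits of each product
    vh = _mm_unpackhi_epi16(vlo, vhi);            // 32-bit products 4..7
    vl = _mm_unpacklo_epi16(vlo, vhi);            // 32-bit products 0..3
}

// Hypothetical helper: multiply 8 int16_t pairs, return 8 int32_t products.
static std::array<int32_t, 8> multiply8(const int16_t *a, const int16_t *b)
{
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    __m128i vh, vl;
    muly_widen(vh, vl, va, vb);

    std::array<int32_t, 8> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()), vl);      // elements 0..3
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data() + 4), vh);  // elements 4..7
    return out;
}
```

With A and B from the question, `multiply8(A, B)` should hold 1, 6, 15, 24, 40, 54, 70, 96.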

Note that for 16x16 multiply you might want to consider a fixed point multiply instead. This approach is commonly used for DSP and image processing tasks such as filtering. It is much more efficient than a full 16x16->32 multiply, and also avoids the need for data widening and scaling back down to 16 bits.

See: __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b) (requires SSSE3)
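
To make the fixed-point behaviour concrete, here is a scalar sketch of what a single 16-bit lane of _mm_mulhrs_epi16 computes, treating the inputs as Q15 fixed-point values. The function name `mulhrs_lane` is hypothetical. The sketch assumes that right-shifting a negative signed integer is an arithmetic shift, which holds on mainstream compilers and is guaranteed from C++20 onwards.

```cpp
#include <cstdint>

// Scalar model of one lane of _mm_mulhrs_epi16 (hypothetical helper name).
// The full 32-bit product is formed, rounded, and scaled back down so that
// only a 16-bit result remains.
static inline int16_t mulhrs_lane(int16_t a, int16_t b)
{
    const int32_t product = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    const int32_t rounded = (product >> 14) + 1;  // keep one extra bit, add rounding bit
    return static_cast<int16_t>(rounded >> 1);    // drop the extra bit: net shift of 15
}
```

For example, 16384 represents 0.5 in Q15, so `mulhrs_lane(16384, 16384)` yields 8192, which represents 0.25. Small products are rounded away entirely: `mulhrs_lane(1, 5461)` yields 0.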

edited Mar 09 '16 at 08:45

answered Mar 09 '16 at 08:37

Paul R


Thanks so much again, Paul – Bharat Ahuja Mar 09 '16 at 08:42
Hi Paul, if I understand it correctly, _mm_mulhrs_epi16 multiplies two 16-bit values, produces a 32-bit temporary, and then truncates it to 16 bits, right? The reason I am using a 32-bit result is that I want to accumulate all the Gaussian coefficient multiplication results before truncating, since I observe a significant difference between adding all the 32-bit Gaussian products and then truncating, and truncating during each multiplication. – Bharat Ahuja Mar 09 '16 at 08:55
OK - that might be an issue then, if 16 bit intermediate terms are not accurate enough. If you are ultimately generating 8 bit pixel components though I would have thought that it would be good enough, unless your filter is very large (?). Anyway, carry on with the full 32 bit approach - you can always consider further optimisations later, once the first version is working. – Paul R Mar 09 '16 at 10:04
Thanks for helping with this, Paul. Yes, the filter is very large in some cases. Anyway, I was first trying to get it working, and I have a significant improvement for now. I will analyze this further if more improvement is required. Thanks again – Bharat Ahuja Mar 09 '16 at 12:08
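
The 32-bit accumulation discussed in the comments above could be sketched as follows. This is an illustration under stated assumptions, not the asker's actual filter: the name `filter8_q15`, the row-major layout of `rows` (n rows of 8 pixels each), and the Q15 scaling of the coefficients are all hypothetical. The sketch also assumes the 32-bit sums do not overflow for the given filter size.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Weighted sum of n rows of 8 pixels, with Q15 coefficients (hypothetical helper).
// Products are accumulated at full 32-bit precision and rounded only once, at the end.
static std::array<int16_t, 8> filter8_q15(const int16_t *rows,
                                          const int16_t *coeff, int n)
{
    __m128i acc_lo = _mm_setzero_si128();  // 32-bit sums for pixels 0..3
    __m128i acc_hi = _mm_setzero_si128();  // 32-bit sums for pixels 4..7

    for (int r = 0; r < n; ++r) {
        const __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rows + 8 * r));
        const __m128i c  = _mm_set1_epi16(coeff[r]);  // broadcast this row's coefficient
        const __m128i hi = _mm_mulhi_epi16(v, c);     // high halves of the products
        const __m128i lo = _mm_mullo_epi16(v, c);     // low halves of the products
        acc_lo = _mm_add_epi32(acc_lo, _mm_unpacklo_epi16(lo, hi));
        acc_hi = _mm_add_epi32(acc_hi, _mm_unpackhi_epi16(lo, hi));
    }

    // Round to nearest and scale back from Q15, once per output pixel.
    const __m128i half = _mm_set1_epi32(1 << 14);
    acc_lo = _mm_srai_epi32(_mm_add_epi32(acc_lo, half), 15);
    acc_hi = _mm_srai_epi32(_mm_add_epi32(acc_hi, half), 15);

    std::array<int16_t, 8> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()),
                     _mm_packs_epi32(acc_lo, acc_hi));  // saturating narrow to int16_t
    return out;
}
```

The difference shows up with small terms: for six rows of pixel value 1 and six coefficients of 5461, each individual product rounds to 0 at 16-bit precision, whereas the 32-bit sum of 32766 rounds to 1.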
