I am trying to implement a fixed-point 7×7 convolution on large signed 16-bit images (1000×1000). The (float) kernel is scaled up (by 1<<14) to get valid fixed-point results, and the final results are scaled back down.
I am implementing it using SSE.
The main problem when working on integer vectors is that every multiplication intrinsic either gives a partial result (`_mm_mullo_epi16`/`_mm_mulhi_epi16` return only the lower/upper 16 bits) or immediately scales the result down (`_mm_mulhrs_epi16`).
To overcome this I am forced to widen the 16-bit products to 32 bits:
kernelVec = _mm_set1_epi16(kernel); \
/* arithmetic shift: the pixels are signed, so _mm_srli_epi16 would corrupt negative values */ \
inputVec = _mm_srai_epi16(_mm_lddqu_si128((const __m128i *)vInput), (shift)); \
mulLowVec = _mm_mullo_epi16(inputVec, kernelVec);  /* lower 16 bits of each product  */ \
mulHighVec = _mm_mulhi_epi16(inputVec, kernelVec); /* upper 16 bits (signed)         */ \
/* interleave the halves into full 32-bit products, then accumulate */ \
sumLeft = _mm_add_epi32(sumLeft, _mm_unpacklo_epi16(mulLowVec, mulHighVec)); \
sumRight = _mm_add_epi32(sumRight, _mm_unpackhi_epi16(mulLowVec, mulHighVec)); \
And all this is needed for a single multiplication of 8 elements by a single kernel value.
So I tried converting the input data to float and implementing it with AVX intrinsics (the data has no 32-byte alignment, so I have to use unaligned loads throughout):
kernelVec = _mm256_set1_ps(kernel); \
inputVec = _mm256_loadu_ps(fInput); \
sum = _mm256_fmadd_ps(inputVec, kernelVec, sum); /* sum += input * kernel */ \
The result is then converted back to 16-bit shorts. The floating-point implementation proved to be 2.3× faster than the integer one.
I know the IPP library has ippsConv_16s_Sfs, which should do the same. Does anyone have any suggestions?