I am trying to implement a fixed-point 7×7 convolution on large signed 16-bit images (1000×1000). The (float) kernel is scaled up (by 1<<14) to get valid fixed-point results, and the final results are scaled back down.
I am implementing it using SSE.
The main problem when working on integer vectors is that every multiplication intrinsic either gives a partial result (`_mm_mullo_epi16`/`_mm_mulhi_epi16` return only the lower/upper 16 bits) or immediately scales the result down (`_mm_mulhrs_epi16`).
To overcome this I am forced to widen the 16-bit products to 32 bits:
kernelVec = _mm_set1_epi16(kernel); \
/* arithmetic shift: the pixels are signed, so _mm_srli_epi16 would corrupt negative values */ \
inputVec = _mm_srai_epi16(_mm_lddqu_si128((const __m128i *)vInput), (shift)); \
mulLowVec = _mm_mullo_epi16(inputVec, kernelVec);  /* lower 16 bits of each product  */ \
mulHighVec = _mm_mulhi_epi16(inputVec, kernelVec); /* upper 16 bits (signed)         */ \
/* interleave the halves into full 32-bit products, then accumulate */ \
sumLeft = _mm_add_epi32(sumLeft, _mm_unpacklo_epi16(mulLowVec, mulHighVec)); \
sumRight = _mm_add_epi32(sumRight, _mm_unpackhi_epi16(mulLowVec, mulHighVec)); \
And all this is needed for a single multiplication of 8 elements by a single kernel value.
So I tried converting the input data to float and implementing it with AVX intrinsics (the data has no 32-byte alignment, so I have to use unaligned loads throughout):
kernelVec = _mm256_set1_ps(kernel); \
inputVec = _mm256_loadu_ps(fInput); \
sum = _mm256_fmadd_ps(inputVec, kernelVec, sum); /* sum += input * kernel */ \
The result is then converted back to 16-bit shorts. The floating-point implementation proved to be 2.3× faster than the integer one.
I know the IPP library has ippsConv_16s_Sfs, which should do the same. Does anyone have any suggestions?