Breaking the input into low and high bytes, one can compute all the products with two 16-bit multiplies:
__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
__m128i lo = _mm_mullo_epi16(y, x);
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
__m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
AFAIK, the high bits YY of YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8 bits ??ll, and likewise the product YY00|YY00 * 00xx|00xx produces the correct 8-bit product at HH00. These two results, each already at its correct alignment, need to be blended.
Here the inputs are __m128i x = _mm_set1_epi16(scalar_x); (the 8-bit scalar broadcast into every 16-bit lane) and __m128i y = _mm_loadu_si128(...); (16 packed bytes).
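For completeness, here is a minimal sketch that wraps the snippet above in a loop. The function name mul_u8_by_scalar, its signature, and the requirement that n be a multiple of 16 are my own assumptions for illustration. The TARGET_SSE41 macro is a GCC/Clang convenience so the function builds without -msse4.1; on other compilers, enable SSE4.1 globally instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 */

#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: dst[i] = (uint8_t)(src[i] * scalar_x).
   Assumes n is a multiple of 16; a real version would handle the tail. */
TARGET_SSE41
void mul_u8_by_scalar(uint8_t *dst, const uint8_t *src, size_t n,
                      uint8_t scalar_x)
{
    assert(n % 16 == 0);
    const __m128i kff00ff00 = _mm_set1_epi32((int)0xff00ff00u);
    const __m128i x = _mm_set1_epi16(scalar_x);  /* 00xx in each 16-bit lane */

    for (size_t i = 0; i < n; i += 16) {
        __m128i y  = _mm_loadu_si128((const __m128i *)(src + i));
        /* Low byte of each lane holds yy * xx mod 256. */
        __m128i lo = _mm_mullo_epi16(y, x);
        /* High byte of each lane holds YY * xx mod 256, low byte is 0. */
        __m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
        /* Take odd bytes from hi, even bytes from lo. */
        __m128i z  = _mm_blendv_epi8(lo, hi, kff00ff00);
        _mm_storeu_si128((__m128i *)(dst + i), z);
    }
}
```

Since the low byte of hi is always zero, the blend could also be replaced by an AND of lo with 0x00ff followed by an OR, which would need only SSE2.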
An alternative is to use pshufb (_mm_shuffle_epi8) as a pair of 16-entry lookup tables, calculating LutLo[y & 15] + LutHi[y >> 4], where unfortunately the shift must also be emulated by _mm_and_si128(_mm_srli_epi16(y,4),_mm_set1_epi8(15)), since SSE has no 8-bit shift.
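A rough sketch of the lookup-table variant follows. The table contents are my assumption about what LutLo and LutHi would hold for a fixed multiplier: LutLo[i] = i * x mod 256 and LutHi[i] = (16 * i) * x mod 256, so their byte-wise sum wraps to y * x mod 256. The function name and the multiple-of-16 restriction are likewise hypothetical, and TARGET_SSSE3 is a GCC/Clang convenience in place of -mssse3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: dst[i] = (uint8_t)(src[i] * scalar_x) via two
   nibble-indexed tables. Assumes n is a multiple of 16. */
TARGET_SSSE3
void mul_u8_by_scalar_lut(uint8_t *dst, const uint8_t *src, size_t n,
                          uint8_t scalar_x)
{
    assert(n % 16 == 0);

    /* Build the two 16-entry tables for this multiplier. */
    uint8_t lo_tab[16], hi_tab[16];
    for (int i = 0; i < 16; i++) {
        lo_tab[i] = (uint8_t)(i * scalar_x);       /* low nibble part  */
        hi_tab[i] = (uint8_t)(i * 16 * scalar_x);  /* high nibble part */
    }
    const __m128i lut_lo = _mm_loadu_si128((const __m128i *)lo_tab);
    const __m128i lut_hi = _mm_loadu_si128((const __m128i *)hi_tab);
    const __m128i k0f    = _mm_set1_epi8(15);

    for (size_t i = 0; i < n; i += 16) {
        __m128i y      = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i lo_idx = _mm_and_si128(y, k0f);
        /* No 8-bit shift exists: shift 16-bit lanes, then mask each byte. */
        __m128i hi_idx = _mm_and_si128(_mm_srli_epi16(y, 4), k0f);
        __m128i z      = _mm_add_epi8(_mm_shuffle_epi8(lut_lo, lo_idx),
                                      _mm_shuffle_epi8(lut_hi, hi_idx));
        _mm_storeu_si128((__m128i *)(dst + i), z);
    }
}
```

Masking both index vectors matters here, because pshufb writes zero for any index byte whose top bit is set.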