I was looking for an appropriate AVX2 multiplication instruction to use in my code, and came across the vpmulhrsw (_mm256_mulhrs_epi16(__m256i a, __m256i b)) instruction.

The description on the Intel Intrinsics Guide says:

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.
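
Spelled out for a single 16-bit lane, that description might look like the scalar sketch below. `mulhrs_lane` and `check_mulhrs_examples` are hypothetical names used only for illustration, and the code assumes that right-shifting a negative value is an arithmetic shift (guaranteed since C++20, and what mainstream compilers do in practice).

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 16-bit lane of vpmulhrsw.
// mulhrs_lane is a hypothetical name used only for this illustration.
constexpr int16_t mulhrs_lane(int16_t a, int16_t b)
{
    int32_t product = int32_t(a) * int32_t(b); // intermediate signed 32-bit integer
    int32_t top18 = product >> 14;             // drop the low 14 bits, keeping the top 18
    int32_t rounded = top18 + 1;               // round by adding 1
    return static_cast<int16_t>(rounded >> 1); // store bits [16:1]
}

// Treating the second operand as a Q15 fraction (value / 32768).
inline void check_mulhrs_examples()
{
    assert(mulhrs_lane(16384, 8192) == 4096); // 0.5 * 0.25 = 0.125 in Q15
    assert(mulhrs_lane(1000, 16384) == 500);  // 1000 * 0.5
    assert(mulhrs_lane(3, 16384) == 2);       // 1.5 rounds up to 2
    assert(mulhrs_lane(-3, 16384) == -1);     // -1.5 also rounds up, to -1
}
```

The three steps collapse to `(a * b + 0x4000) >> 15`, that is, a multiply by a fraction with 15 fractional bits, rounded to nearest.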

I understand what the instruction does, but it sounds like it is tailored to some very specific use case. What is this use case?

Bernard
  • Looks like these are primarily used in various video codecs to implement various takes on DCT/FFT, like so: https://github.com/webmproject/libvpx/blob/a5d499e16570d00d5e1348b1c7977ced7af3670f/vpx_dsp/x86/inv_txfm_ssse3.h#L47-L52 – oakad Oct 04 '22 at 04:27
  • Related: [why does \_mm\_mulhrs\_epi16() always do biased rounding to positive infinity?](https://stackoverflow.com/q/28246447) Also, Stephen Canon's answer on [Expected speedup from the use of SSSE3 on an Intel machine](https://stackoverflow.com/q/13232668) mentions that the use-cases include "color space transformations, some alpha operations". – Peter Cordes Oct 04 '22 at 05:40
  • It’s intended for fixed point multiply/accumulate, which is widely used in DSP, image processing, control systems, etc. – Paul R Oct 04 '22 at 06:41

1 Answer

The use case is scaling 16-bit numbers.

Here’s an example in C++ from a project where I needed to apply a volume setting to 16-bit PCM audio.

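#include <cstdint>
#include <tmmintrin.h> // SSSE3: _mm_mulhrs_epi16

// Applies a volume in [0, 255] to 16-bit PCM samples, 8 samples at a time.
class ScaleVolume
{
    __m128i scale; // Q15 multiplier, broadcast to all 8 lanes
public:
    ScaleVolume( uint8_t v ) noexcept
    {
        // Map 0..255 to a Q15 multiplier. v = 255 gives 0x8000, which wraps
        // to -32768 as int16_t and would negate the samples, so clamp it.
        uint32_t scaling = 0x8000u * v / 255;
        if( scaling > 0x7FFFu )
            scaling = 0x7FFFu;
        scale = _mm_set1_epi16( (int16_t)scaling );
    }

    // Loads 8 samples (unaligned) and returns them scaled, rounded to nearest.
    // __forceinline is MSVC-specific.
    __forceinline __m128i load8( const int16_t* integers ) const noexcept
    {
        __m128i src = _mm_loadu_si128( ( const __m128i* )integers );
        return _mm_mulhrs_epi16( src, scale );
    }
};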

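_mm_mulhi_epi16 doesn’t work for that: it keeps the upper 16 bits of the product, so it cannot scale by 50% or more.
50% scaling needs the multiplier 0x8000 = +32768, which doesn't fit into a signed int16_t.
100% scaling needs the multiplier 0x10000, which doesn't fit into 16 bits at all, signed or not.
_mm_mulhrs_epi16 effectively shifts by 15 bits instead of 16, so positive multipliers up to 0x7FFF cover scaling up to just under 100%.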
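
To make the difference concrete, here is a small scalar sketch of one lane of each instruction. `mulhi_model`, `mulhrs_model` and `check_scaling_examples` are hypothetical names for illustration only, and the code assumes arithmetic right shift for negative values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar models of one 16-bit lane; not part of any library.
constexpr int16_t mulhi_model(int16_t a, int16_t b)
{
    // _mm_mulhi_epi16: upper 16 bits of the 32-bit product
    return static_cast<int16_t>((int32_t(a) * int32_t(b)) >> 16);
}

constexpr int16_t mulhrs_model(int16_t a, int16_t b)
{
    // _mm_mulhrs_epi16: rounded product shifted right by 15
    return static_cast<int16_t>((int32_t(a) * int32_t(b) + 0x4000) >> 15);
}

inline void check_scaling_examples()
{
    // Largest positive multiplier with mulhi: still just under 50%.
    assert(mulhi_model(10000, 0x7FFF) == 4999);
    // The same range of multipliers with mulhrs reaches 75% and beyond.
    assert(mulhrs_model(10000, 0x6000) == 7500);
    assert(mulhrs_model(10000, 0x7FFF) == 10000);
}
```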

Soonts
  • Hmm, for your case, it seems that the rounding direction isn't important? – Bernard Oct 06 '22 at 02:19
  • @Bernard It’s important to support scaling multiplier >= 50%. About rounding mode, rounding to nearest integer is nice, but indeed unimportant for the use case. – Soonts Oct 06 '22 at 08:17