Why on earth would I want to use PMULHRSW/VPMULHRSW?

Question

I was looking for an appropriate AVX2 multiplication instruction to use in my code, and came across the vpmulhrsw (_mm256_mulhrs_epi16(__m256i a, __m256i b)) instruction.

The description on the Intel Intrinsics Guide says:

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.
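
To make the quoted description concrete, here is a scalar sketch of what happens in a single 16-bit lane. The name `mulhrs_lane` is made up for illustration. The code assumes that `>>` on a negative signed integer is an arithmetic shift, which is guaranteed since C++20 and is what mainstream compilers do in practice.

```cpp
#include <cstdint>

// Scalar model of one 16-bit lane of PMULHRSW / _mm_mulhrs_epi16.
// Hypothetical helper, for illustration only.
inline int16_t mulhrs_lane( int16_t a, int16_t b )
{
    // Exact signed 32-bit product of the two 16-bit inputs.
    int32_t product = int32_t( a ) * int32_t( b );
    // Keep the 18 most significant bits (assumes an arithmetic shift).
    int32_t top18 = product >> 14;
    // "Round by adding 1" to the bit that is about to be dropped.
    int32_t rounded = top18 + 1;
    // Store bits [16:1]: drop the lowest bit and keep the next 16.
    return int16_t( uint16_t( rounded >> 1 ) );
}
```

If both inputs are read as Q15 fixed-point fractions (value / 32768), the result is their product in Q15: `mulhrs_lane(16384, 16384)` is 0.5 × 0.5 and returns 8192, which is 0.25.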

I understand what the instruction does, but the instruction sounds like it is tailored for some very specific use case. What is this use case?

Looks like these are primarily used in various video codecs to implement various takes on DCT/FFT, like so: https://github.com/webmproject/libvpx/blob/a5d499e16570d00d5e1348b1c7977ced7af3670f/vpx_dsp/x86/inv_txfm_ssse3.h#L47-L52 — oakad, Oct 04 '22 at 04:27
Related: [why does \_mm\_mulhrs\_epi16() always do biased rounding to positive infinity?](https://stackoverflow.com/q/28246447) Also, Stephen Canon's answer on [Expected speedup from the use of SSSE3 on an Intel machine](https://stackoverflow.com/q/13232668) mentions that the use-cases include "color space transformations, some alpha operations". — Peter Cordes, Oct 04 '22 at 05:40
It’s intended for fixed point multiply/accumulate, which is widely used in DSP, image processing, control systems, etc. — Paul R, Oct 04 '22 at 06:41

Soonts · Answer 1 · 2022-10-05T01:44:05.490

The use case is scaling 16-bit numbers.

Here’s an example in C++. In that project I needed to apply a volume setting to 16-bit PCM audio.

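#include <cstdint>
#include <tmmintrin.h> // SSSE3: _mm_mulhrs_epi16

// Applies a volume in [0, 255] to 16-bit PCM samples, 8 at a time.
class ScaleVolume
{
    __m128i scale; // Q15 multiplier broadcast to all 8 lanes
public:
    ScaleVolume( uint8_t v ) noexcept
    {
        // Map 0..255 to 0..0x8000, i.e. 0.0..1.0 in Q15 fixed point.
        // Note: v == 255 gives 0x8000, which wraps to -32768 in the int16_t
        // lanes, i.e. a factor of -1.0 rather than +1.0.
        uint32_t scaling = 0x8000u * v / 255;
        scale = _mm_set1_epi16( (int16_t)(uint16_t)scaling );
    }

    // Loads 8 samples (unaligned) and returns them scaled, rounded to nearest.
    __forceinline __m128i load8( const int16_t* integers ) const noexcept
    {
        __m128i src = _mm_loadu_si128( ( const __m128i* )integers );
        return _mm_mulhrs_epi16( src, scale );
    }
};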

_mm_mulhi_epi16 doesn’t work for that: it returns the high 16 bits of the product, so the effective scale is multiplier / 0x10000, and scaling by 50% or more is impossible.
50% scaling needs a multiplier of 0x8000 = +32768, and that number doesn't fit into a signed int16_t.
100% scaling needs a multiplier of 0x10000, and that number doesn't fit into 16 bits at all, signed or not.
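
The difference in reachable range can be checked with a small scalar sketch of both instructions. The helper names `mulhi_lane` and `mulhrs_lane` are made up for illustration, and the code assumes `>>` on negative signed integers is an arithmetic shift (guaranteed since C++20, and the usual compiler behavior before that).

```cpp
#include <cstdint>

// Scalar model of one lane of _mm_mulhi_epi16:
// the high 16 bits of the signed 32-bit product, i.e. (a * b) / 65536.
inline int16_t mulhi_lane( int16_t a, int16_t b )
{
    return int16_t( ( int32_t( a ) * int32_t( b ) ) >> 16 );
}

// Scalar model of one lane of _mm_mulhrs_epi16:
// roughly (a * b) / 32768, rounded to nearest.
inline int16_t mulhrs_lane( int16_t a, int16_t b )
{
    int32_t top18 = ( int32_t( a ) * int32_t( b ) ) >> 14;
    return int16_t( uint16_t( ( top18 + 1 ) >> 1 ) );
}
```

With the largest positive multiplier, 0x7FFF, `mulhi_lane(20000, 0x7FFF)` gives 9999 (just under half), while `mulhrs_lane(20000, 0x7FFF)` gives 19999 (just under full scale).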

Hmm, for your case, it seems that the rounding direction isn't important? — Bernard, Oct 06 '22 at 02:19
@Bernard It’s important to support scaling multiplier >= 50%. About rounding mode, rounding to nearest integer is nice, but indeed unimportant for the use case. — Soonts, Oct 06 '22 at 08:17
