Pmaddubsw is a fascinating instruction because it performs unsigned-by-signed multiplication. In practice this means that the order of the operands matters: if an unsigned value higher than 127 ends up in the signed operand, it is interpreted as a negative number, and you get rather unexpected results in the final bit representation.
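To make the operand-order issue concrete, here is a minimal scalar sketch of a single 16-bit output lane of pmaddubsw, based on my reading of the documented behavior (unsigned bytes from the first operand, signed bytes from the second, pairwise products added with signed saturation). The helper names `saturate_i16`, `as_signed_byte`, and `pmaddubsw_lane` are made up for illustration.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the int16_t range, modelling the signed
   saturation pmaddubsw applies to the sum of the two products. */
static int16_t saturate_i16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Reinterpret the bit pattern of an unsigned byte as a signed byte,
   which is what happens to a value placed in the signed operand. */
static int8_t as_signed_byte(uint8_t x) {
    return (int8_t)(x < 128 ? x : x - 256);
}

/* Scalar model of one output lane: a0, a1 come from the unsigned
   operand, b0, b1 from the signed operand. */
static int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    return saturate_i16(sum);
}
```

With this model, `pmaddubsw_lane(200, 0, 2, 0)` gives 400, but if the 200 lands in the signed operand instead, `pmaddubsw_lane(2, 0, as_signed_byte(200), 0)` gives -112.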
This is of interest to me because I'm proposing an expanding (8-bit to 16-bit) horizontal multiply-add instruction for the WebAssembly SIMD instruction set.
What I'd like to do, then, is determine the best implementation of that instruction on x86_64 targeting AVX. If I could use pmaddubsw as a single op, it would be better than the 7-op solution required with pmullw, pand, and psrlw. But because pmaddubsw is limited to unsigned-by-signed multiplication, I'm not sure whether it's possible to end up with an unsigned result in one op or, at the very least, with a solution better than the pmullw, pand, psrlw one.
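For reference, this is roughly what I mean by the 7-op pmullw, pand, psrlw approach, sketched with SSE2 intrinsics. It assumes the unsigned 16-bit sum is allowed to wrap modulo 2^16 rather than saturate, and the function name `dot_u8_to_u16_pmullw` is just a placeholder for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* For each 16-bit lane i: result[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1],
   with all bytes treated as unsigned and the sum wrapping mod 2^16. */
static __m128i dot_u8_to_u16_pmullw(__m128i a, __m128i b) {
    const __m128i lo_mask = _mm_set1_epi16(0x00FF);

    /* pand x2: keep the even (low) byte of each 16-bit lane. */
    __m128i a_even = _mm_and_si128(a, lo_mask);
    __m128i b_even = _mm_and_si128(b, lo_mask);

    /* psrlw x2: move the odd (high) byte down, zero-extending it. */
    __m128i a_odd = _mm_srli_epi16(a, 8);
    __m128i b_odd = _mm_srli_epi16(b, 8);

    /* pmullw x2: 8-bit by 8-bit unsigned products fit in 16 bits. */
    __m128i p_even = _mm_mullo_epi16(a_even, b_even);
    __m128i p_odd  = _mm_mullo_epi16(a_odd, b_odd);

    /* paddw: horizontal add of the pair, wrapping on overflow. */
    return _mm_add_epi16(p_even, p_odd);
}
```

That is two pand, two psrlw, two pmullw, and one paddw, not counting the constant load for the mask.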
The only way I've come up with that matches the behavior is to mask, shift, and call pmaddubsw twice, which yields the same number of instructions and is not necessarily any faster.
You can see both on Godbolt here.
Side note: someone has a different task but a similar objective in this question -- Unsigned Multiplication using Signed Multiplier