Is there any way to use pmaddubsw for unsigned-by-unsigned multiplication more efficiently than pmullw?

Question

pmaddubsw is a fascinating instruction because it performs unsigned-by-signed multiplication. In practice, this means the order of the operands matters: any byte above 127 in the signed operand is interpreted as a negative number, so if you meant it to be unsigned, you get rather unexpected results in the final bit representation.
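To make the operand asymmetry concrete, here is a minimal scalar sketch in C of what one 16-bit output lane of pmaddubsw computes, based on its documented behavior (unsigned bytes from the first operand, signed bytes from the second, pairwise sum saturated to the int16 range). The function names are made up for illustration; they are not intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a byte as a signed 8-bit value (two's complement). */
static int as_signed(uint8_t b) { return b < 128 ? b : (int)b - 256; }

/* Saturate to the int16_t range, as pmaddubsw does for its pairwise sum. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical helper: one 16-bit output lane of pmaddubsw.
   a0, a1 come from the unsigned operand; b0, b1 from the signed one. */
int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    int32_t p0 = (int32_t)a0 * as_signed(b0);
    int32_t p1 = (int32_t)a1 * as_signed(b1);
    return sat16(p0 + p1);
}

/* Worked examples; a zero second pair leaves a single product. */
void pmaddubsw_examples(void) {
    assert(pmaddubsw_lane(200, 0, 100, 0) == 20000);  /* 200 * 100 */
    assert(pmaddubsw_lane(100, 0, 200, 0) == -5600);  /* 100 * (200 - 256) */
    assert(pmaddubsw_lane(255, 255, 127, 127) == INT16_MAX); /* 64770 saturates */
}
```

Swapping the operands changes the result from 20000 to -5600 (bit pattern 0xEA20), even though the unsigned product is 20000 either way.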

This is of interest to me because I'm proposing an expanding (8-bit to 16-bit) horizontal multiply-add instruction for the WebAssembly SIMD instruction set.

What I'd like to do is determine the ideal implementation of that instruction on x86_64 targeting AVX. If I could use pmaddubsw as a single op, that would be better than the 7-op solution required with pmullw, pand, and psrlw. But since pmaddubsw is limited to unsigned-by-signed multiplication, I'm not sure whether it's possible to get an unsigned result in one op, or at the very least to end up with something better than the pmullw, pand, psrlw solution.
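For reference, the target semantics can be sketched in scalar C, assuming the operation is the unsigned analogue of pmaddubsw: widen each byte, multiply pairwise, and add the two adjacent products into one 16-bit lane. The names are hypothetical. Both a wrapping and a saturating variant are shown, because the sum of two products can reach 130050 and so needs 17 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: one output lane, wrapping modulo 2^16. */
uint16_t umadd_lane_wrap(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    uint32_t sum = (uint32_t)a0 * b0 + (uint32_t)a1 * b1;
    return (uint16_t)sum;
}

/* Hypothetical name: one output lane, saturating to [0, 0xffff]. */
uint16_t umadd_lane_sat(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    uint32_t sum = (uint32_t)a0 * b0 + (uint32_t)a1 * b1;
    return sum > UINT16_MAX ? UINT16_MAX : (uint16_t)sum;
}

/* Worked examples for both variants. */
void umadd_examples(void) {
    assert(umadd_lane_wrap(100, 0, 200, 0) == 20000);
    assert(umadd_lane_sat(100, 0, 200, 0) == 20000);
    /* 2 * 255 * 255 = 130050, which does not fit in 16 bits. */
    assert(umadd_lane_wrap(255, 255, 255, 255) == 64514); /* 130050 - 65536 */
    assert(umadd_lane_sat(255, 255, 255, 255) == 65535);
}
```

Any candidate instruction sequence can be compared against one of these two functions, depending on which overflow behavior the proposal settles on.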

The only way I've come up with that matches the behavior is to mask, shift, and call pmaddubsw twice, which yields the same number of instructions and is not necessarily a better solution.

You can see both on Godbolt here.

Side note: someone has a different task but a similar objective in this question -- Unsigned Multiplication using Signed Multiplier

Just to clarify: You want the same behavior as `pmaddubsw` just with two unsigned inputs? (i.e., horizontal addition of two products and saturation -- in this case to `[0,0xffff]`). — chtz, Oct 22 '20 at 11:56
That would be acceptable I think. With or without saturation. — Dan Weber, Oct 22 '20 at 17:06
The saturation is what kills this idea. Converting signed/unsigned multiplication to unsigned/unsigned would be easy enough, just add `((signed>>8)&unsigned)<<8` to the result. Even the `add` part of `pmaddubsw` wouldn't normally hurt, because normally binary addition (signed or unsigned) is associative. But saturating addition is *not* associative, so you can't fix the signedness of the multiplication if the addition saturated. — EOF, Oct 22 '20 at 17:12
If it's called twice, then the saturation isn't at play right? — Dan Weber, Oct 22 '20 at 17:25
On Intel uarchs, the code that computes the multiplication of the uppermost bit separately uses just as many long-latency multiplying instructions as the `vpmullw`-using code, and as many instructions in total. I would expect it to have similar throughput, but it has worse latency, because its dependency tree is deeper. You might try to replace one of the `vpmaddubsw` with an `AND` with the result of a byte-comparison with zero, because it only has a single bit of input, avoiding the costly multiplying instruction, but using more shift instructions. — EOF, Oct 22 '20 at 21:26
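EOF's point about saturation in the comments above can be checked with a small scalar model in C. This is only a sketch: the helper names are invented, and the correction term is one reading of the `((signed>>8)&unsigned)<<8` formula, namely "add `a << 8` for every byte `b` whose top bit is set". The fix-up is exact when the lane holds a single product, but it breaks once the pairwise sum inside pmaddubsw saturates.

```c
#include <assert.h>
#include <stdint.h>

static int as_signed(uint8_t b) { return b < 128 ? b : (int)b - 256; }

static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of one pmaddubsw output lane (a unsigned, b signed). */
static int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    return sat16((int32_t)a0 * as_signed(b0) + (int32_t)a1 * as_signed(b1));
}

/* Assumed reading of the fix-up: the signed view of b is low by 256
   whenever its top bit is set, so the product is low by a * 256. */
static uint16_t correction(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    uint32_t c = 0;
    if (b0 & 0x80) c += (uint32_t)a0 << 8;
    if (b1 & 0x80) c += (uint32_t)a1 << 8;
    return (uint16_t)c;
}

/* pmaddubsw result plus correction, wrapping modulo 2^16. */
uint16_t corrected_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    uint16_t s = (uint16_t)pmaddubsw_lane(a0, a1, b0, b1);
    return (uint16_t)(s + correction(a0, a1, b0, b1));
}

/* True unsigned result, wrapping modulo 2^16. */
uint16_t umadd_lane_wrap(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    return (uint16_t)((uint32_t)a0 * b0 + (uint32_t)a1 * b1);
}

/* Count single-product cases (second pair zeroed) where the fix-up fails. */
int single_product_mismatches(void) {
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (corrected_lane((uint8_t)a, 0, (uint8_t)b, 0) !=
                umadd_lane_wrap((uint8_t)a, 0, (uint8_t)b, 0))
                bad++;
    return bad;
}

void sign_fix_examples(void) {
    assert(single_product_mismatches() == 0);  /* no saturation possible */
    /* 255*(-128) twice is -65280, which saturates to -32768 first. */
    assert(corrected_lane(255, 255, 128, 128) == 32256);
    assert(umadd_lane_wrap(255, 255, 128, 128) == 65280);
}
```

With both products in play, the inputs (255, 255) and (128, 128) give 32256 after correction versus the true 65280, which is the non-associativity of saturating addition described in the comment.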


0 Answers