Do x86 and other architectures have a fused shift and add?

Question

A number of architectures support fused multiply and add, such as x86 with `pmaddwd` (and its SSE extensions), but I am unaware of any x86 fused shift and add that would be the shift-based equivalent of FMA. This question is predominantly about x86, but knowing about other architectures would be useful as well.

Is there a way to effectively get a fused shift and add, perhaps by relying on CPU-family-specific IPC?

Both shx/shr/shl and add/adc/sub are listed with a latency of one cycle and a reciprocal throughput of 0.25 according to AMD's Family 17h Instruction Latencies version 1.00 spreadsheet.

But for a use case that matches a fused shift/add, the two operations work on the same input and depend on each other, so they necessarily execute sequentially in two cycles. Using FMA instead would take three cycles regardless (with worse throughput).
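To make the dependency concrete, here is a minimal C sketch of the operation, assuming the `a + (b << s)` form of shift-add. The function names `shift_add` and `shift_add_by_3` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-add in the a + (b << s) form (hypothetical helper name). */
static inline uint64_t shift_add(uint64_t a, uint64_t b, unsigned s)
{
    assert(s < 64);       /* shifting a 64-bit value by >= 64 is undefined in C */
    uint64_t t = b << s;  /* first operation: the shift */
    return a + t;         /* second operation: cannot start until t is ready */
}

/* Fixed shift of 3: this has the base + index*8 shape that an x86
 * `lea` can express, so a compiler may emit one instruction here. */
static inline uint64_t shift_add_by_3(uint64_t a, uint64_t b)
{
    return a + (b << 3);
}
```

For example, `shift_add(1, 1, 4)` evaluates to 17. With a variable shift count, the add has to wait for the shift result, which is the two-cycle chain described above.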

There is `lea` if you want to shift left by 1, 2 or 3, but otherwise I don't know of any. — Nate Eldredge, Dec 31 '21 at 17:46
ARM and ARM64 do have this; there is a form of the `add` instruction that applies a shift to one operand before adding. ARM32 allows either a constant or variable shift; ARM64 has constant shift only. — Nate Eldredge, Dec 31 '21 at 17:48
But whether the execution is actually "fused" is up to the specific microarchitecture. On my Cortex-A72 it appears that the shifted version of `add` takes an extra uop, so the only benefit over separate shift and add instructions is to reduce code size and avoid the need for a scratch register. — Nate Eldredge, Dec 31 '21 at 17:53
What if you want to do something like `x << (-popcnt(x) - 1)`? — AMDG, Dec 31 '21 at 17:54
@AMDG That is just a *simple shift* with variable shift count: `x << (~popcnt(x))`. Where does shift-add come into play? Did you mean to place parentheses differently? FWIW, various NVIDIA GPU architectures have an `ISCADD` instruction which is left-shift & add (presumably the name is derived from **i**nteger **sc**ale & **add**). — njuffa, Dec 31 '21 at 20:56
Note "fused shift-add" suggests to me `a + (b << c)`, not `(a+b) << c` or `a << (b + c)`, and that's the analogy of what FMA does. The other versions don't seem likely to exist as instructions. — Nate Eldredge, Dec 31 '21 at 22:48
Historical perspective: The PA-RISC architecture had `SH1ADD`, `SH2ADD`, and `SH3ADD` instructions, so fused shift-add `(a << s) + b` with `s` limited to a fixed value of 1, 2, or 3. As with x86 `LEA` this was presumably primarily intended for address arithmetic. — njuffa, Dec 31 '21 at 22:59
@njuffa I mean fused shift-add as a direct analog to FMA: given `a * b + a * c` where `a` is a perfect power of two and `b` and `c` are signed or unsigned integers, compute `(b << a) + (c << a)`. — AMDG, Jan 01 '22 at 01:08
@AMDG As Nate Eldredge already pointed out, in established terminology fused left-shift-add is `(a << b) + c`; compare analogous `fma (a, b, c) = a * b + c`. Given that, how does this operation fit with the given example `x << (-popcnt(x) - 1) == x << (~popcnt(x))`, which is just a simple shift? — njuffa, Jan 01 '22 at 01:21
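The identity `-popcnt(x) - 1 == ~popcnt(x)` used in the comments above follows from two's-complement arithmetic. Below is a small C sketch that checks it. It assumes the shift count is meant to be taken modulo 64, as x86 does for 64-bit shifts; that reading of the original expression is an assumption, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count: number of set bits in x. */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Computes x << (~popcnt(x)) with the count reduced modulo 64.
 * The explicit mask is needed in C, where an out-of-range shift
 * count is undefined behavior. */
static uint64_t shift_by_not_popcount(uint64_t x)
{
    unsigned p = popcount64(x);
    unsigned count_a = (0u - p - 1u) & 63u;  /* -popcnt(x) - 1 */
    unsigned count_b = (~p) & 63u;           /* ~popcnt(x)     */
    assert(count_a == count_b);              /* -p - 1 == ~p in two's complement */
    return x << count_b;
}
```

Either way the count is computed, the result is a single variable shift with no add involved.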
@njuffa Suppose we want to compute a rational divide or a reciprocal using a linear approximation, $f(x) = \left(3 - x \cdot 2^{-\lfloor \log_2 x \rfloor}\right) 2^{-\lfloor \log_2 x \rfloor}$, then apply iterations of the form $f(x) f(x f(x))$ for rapid convergence (better than Newton-Raphson in certain cases). On its own, $f(x)$ can be computed in 4 cycles. Having a fused shift-add would make it 3 cycles for $f(x)$ alone, making subsequent iterations faster. — AMDG, Jan 01 '22 at 01:23
`a*b + a*c` is what `pmaddwd` does, you're right, but the term "fused multiply-add" more commonly means `a+(b*c)`. https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation — Nate Eldredge, Jan 01 '22 at 01:38
Correction: the outermost coefficient in $f(x)$ should be $2^{-1 - \lfloor \log_2 x \rfloor}$. — AMDG, Jan 01 '22 at 02:12
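For readers who want to see the corrected approximation $f(x) = (3 - x \cdot 2^{-k}) \cdot 2^{-1-k}$ with $k = \lfloor \log_2 x \rfloor$ in integer form, here is a hedged C sketch. The Q32 fixed-point scaling, the 32-bit input width, and the names `floor_log2_u32` and `recip_linear_q32` are all assumptions made for illustration and do not come from the thread. The follow-up iteration $f(x) f(x f(x))$ is not implemented.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(x)) for x != 0, written as a portable loop. */
static unsigned floor_log2_u32(uint32_t x)
{
    unsigned k = 0;
    while (x >>= 1)
        k++;
    return k;
}

/* Linear reciprocal approximation in Q32 fixed point:
 * returns roughly 2^32 / x, exact when x is a power of two. */
static uint64_t recip_linear_q32(uint32_t x)
{
    assert(x != 0);
    unsigned k = floor_log2_u32(x);
    uint64_t m = (uint64_t)x << (31 - k);  /* m = x * 2^-k in Q31, within [1, 2) */
    uint64_t t = (UINT64_C(3) << 31) - m;  /* (3 - m) in Q31 */
    return t >> k;                         /* apply 2^(-1-k) and rescale Q31 to Q32 */
}
```

For example, `recip_linear_q32(4)` returns `1 << 30`, which is 0.25 in Q32, while `recip_linear_q32(3)` returns `3 << 29`, which is 0.375 against a true value of about 0.333. The body is a dependent chain of shift, subtract, shift after the log step, which is the kind of sequence the question hopes to shorten.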
