A number of architectures support fused multiply-add, such as x86 with pmaddwd (and its SSE extensions), but I am unaware of any x86 fused shift-and-add that is effectively equivalent to FMA. This question is predominantly about x86, but knowing about other architectures would be useful as well.
Is there a way to effectively get a fused shift-and-add, perhaps by relying on CPU-family-specific IPC?
Both shx/shr/shl and add/adc/sub are listed with a latency of one cycle and a reciprocal throughput of 0.25, according to AMD's Family 17h Instruction Latencies version 1.00 spreadsheet.
But for a use case that matches a fused shift/add, the two instructions operate on the same input and the add depends on the result of the shift, so they will necessarily execute sequentially in two cycles. Using FMA instead would be three cycles regardless (with worse throughput).
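To make the dependency concrete, here is a minimal sketch in C of the pattern I mean. The function names `shift_add` and `mul_add` are hypothetical and only for illustration. How a given compiler and CPU actually schedule them is exactly what I am asking about.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the shift-then-add pattern in question.
   The add consumes the result of the shift, so the two operations
   form a dependency chain and cannot issue in the same cycle. */
static uint64_t shift_add(uint64_t acc, uint64_t x, unsigned n)
{
    assert(n < 64);              /* shifting by >= 64 is undefined in C */
    uint64_t shifted = x << n;   /* step 1: shift */
    return acc + shifted;        /* step 2: add, depends on step 1 */
}

/* Multiply-based equivalent, for comparison with a multiply-add:
   x << n is the same as x * 2^n (modulo 2^64). */
static uint64_t mul_add(uint64_t acc, uint64_t x, unsigned n)
{
    assert(n < 64);
    return acc + x * ((uint64_t)1 << n);
}
```

Both functions return the same value for any valid `n`, for example `shift_add(5, 3, 4)` and `mul_add(5, 3, 4)` both give 5 + 3 * 16 = 53. The difference I care about is only the latency of the dependency chain.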