Context: I'm doing repeated bitwise rotations, and I've found that the standard rotate can be replicated with a shift and a bfi. Of note: when I compile an HLSL file that uses rotates, FXC emits lots of bfi instructions for this purpose.
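To make the two forms concrete, here is a minimal C model of the equivalence. It is a sketch, not the actual shader code: the function names (`bfi`, `rotl_xor`, `rotl_bfi`, `rotl_forms_agree`) are made up for illustration, the `bfi` helper assumes the documented Shader Model 5 behavior (low `width` bits of the source inserted at `offset`, with only the low 5 bits of width and offset used), and rotate amounts are assumed to be in the range 1..31.

```c
#include <stdint.h>

/* Software model of the SM5 bfi instruction (assumed semantics):
   insert the low `width` bits of `src` into `base` at bit `offset`. */
static uint32_t bfi(uint32_t width, uint32_t offset, uint32_t src, uint32_t base)
{
    width &= 31u;   /* only the low 5 bits are used */
    offset &= 31u;
    uint32_t mask = ((1u << width) - 1u) << offset;
    return ((src << offset) & mask) | (base & ~mask);
}

/* Standard rotate left: shift, shift, xor. Valid for n in 1..31. */
static uint32_t rotl_xor(uint32_t x, uint32_t n)
{
    return (x << n) ^ (x >> (32u - n));
}

/* Rotate left with one shift and one bfi. Valid for n in 1..31.
   The right shift produces the low n bits of the result; bfi then
   places the low (32 - n) bits of x above them. */
static uint32_t rotl_bfi(uint32_t x, uint32_t n)
{
    uint32_t low = x >> (32u - n);
    return bfi(32u - n, n, x, low);
}

/* Returns 1 if both forms agree for a few sample values and all n in 1..31. */
static int rotl_forms_agree(void)
{
    const uint32_t samples[] = { 0u, 1u, 0x80000001u, 0x12345678u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        for (uint32_t n = 1; n < 32; ++n)
            if (rotl_xor(samples[i], n) != rotl_bfi(samples[i], n))
                return 0;
    return 1;
}
```

Both forms need one shift to produce the wrapped-around bits; the bfi form replaces the second shift plus the xor with a single instruction, so any performance difference should come from how the hardware and driver schedule bfi rather than from the arithmetic itself.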
When I benchmark my own hand-written shader assembly, the bytecode using the standard rotate (shift, shift, xor) performs better than the bytecode using the shift/bfi rotate.
HOWEVER, when I run the equivalent FXC-compiled bytecode, shift/bfi performs as well as or better than shift/shift/xor.
Is there some way to structure the shift/bfi so as to maximize its performance? Some sort of register-usage or instruction-ordering wizardry?