SM5 HLSL assembly: under what circumstances will a bfi + a shift have equal or better efficiency than two shifts and an or/xor?

Asked Jan 13 '16 at 17:26

Active Jan 13 '16 at 17:26

Viewed 240 times

Context: I'm doing repeated bitwise rotations, and I've found that the standard rotate (two shifts and an or/xor) can be replicated with one shift and a bfi. Notably, when I compile an HLSL file that uses rotates with FXC, the output uses many bfi instructions for this purpose.
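For reference, here is a minimal sketch in C of the two rotate formulations being compared. It assumes the `bfi` behavior described in the Shader Model 5 assembly documentation (width and offset taken from the low 5 bits; result is `((src2 << offset) & mask) | (src3 & ~mask)`). The function names are hypothetical, and this only illustrates the arithmetic equivalence; it says nothing about what any GPU driver actually generates. Because the two shifted halves never overlap, or and xor give the same result in the three-instruction form.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates SM5 bfi as documented (assumption: low 5 bits of width/offset are used). */
static uint32_t bfi_emulated(uint32_t width, uint32_t offset,
                             uint32_t src2, uint32_t src3)
{
    width &= 31u;
    offset &= 31u;
    uint32_t mask = ((1u << width) - 1u) << offset;
    return ((src2 << offset) & mask) | (src3 & ~mask);
}

/* Standard rotate-left: two shifts and an or. */
static uint32_t rotl_shift_shift_or(uint32_t x, uint32_t n)
{
    n &= 31u;
    /* Shift amounts are masked to 5 bits, as DXBC shifts do; this also avoids UB in C. */
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Rotate-left with one shift and one bfi. */
static uint32_t rotl_shift_bfi(uint32_t x, uint32_t n)
{
    n &= 31u;
    uint32_t low = x >> ((32u - n) & 31u); /* bits that wrap around to the bottom */
    /* Insert the low (32 - n) bits of x at bit offset n, keeping `low` underneath. */
    return bfi_emulated(32u - n, n, x, low);
}

/* Checks that both formulations agree for every rotate amount. */
static void check_rotl_variants(void)
{
    const uint32_t samples[] = { 0u, 1u, 0x80000000u, 0x12345678u, 0xFFFFFFFFu };
    for (uint32_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        for (uint32_t n = 0; n < 32u; ++n)
            assert(rotl_shift_bfi(samples[i], n) == rotl_shift_shift_or(samples[i], n));
}
```

Calling `check_rotl_variants()` exercises all 32 rotate amounts on a few sample values, so any difference between the two forms would be in cost, not in result.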

When I run my own hand-written shader assembly bytecode, the version using the standard rotate (shift, shift, xor) performs better than my version using the shift/bfi rotate.

However, when I run the equivalent FXC-compiled bytecode, the shift/bfi version performs as well as or better than the shift/shift/xor version.

Is there some way to structure the shift/bfi sequence so as to maximize its performance, such as some register-usage or instruction-ordering wizardry?

MNagy

Is there a particular GPU you're testing on that gives you the results you're claiming? – Adam Miles Jan 13 '16 at 21:55
AMD Radeon 7800 HD Series is what I'm currently using – MNagy Jan 14 '16 at 00:55
In that case, have you tried using ShaderAnalyzer for GCN, which is part of GPU PerfStudio? It will show you the actual assembly for that particular GPU rather than DXBC, which is merely an intermediate language that often bears little relation to the final microcode. – Adam Miles Jan 14 '16 at 05:07

0 Answers