I am in the process of learning HLA assembly from the book Art of Assembly Language, 2nd Edition. I just started learning about the shr and shl instructions, and I would like to know whether shifting by a larger amount takes more time than shifting by a smaller amount, e.g. shr(1,dest) vs. shr(7,dest).

I'm sorry if the syntax for the instructions is wrong.

Raymond Chen
Fence_rider
    It depends on the processor. You need to read the data sheet for the performance characteristics for the processor you are targeting. Modern implementations use a barrel shifter which doesn't care how much you are shifting by. – Raymond Chen Sep 06 '15 at 22:43

1 Answer

http://agner.org/optimize/ has instruction timing tables for x86 CPUs, as well as microarchitecture guides.

Shift and rotate with an immediate (compile-time-constant) count are single cycle latency on recent AMD and Intel.
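To illustrate why the count doesn't matter, here is a rough software model of the barrel shifter mentioned in the comment on the question. It is only a sketch: `barrel_shr` is a name I made up, and real hardware evaluates the stages as a fixed chain of multiplexers rather than a loop. The point is that the value always passes through the same five stages for a 32-bit operand, whether the count is 1 or 7.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit logarithmic (barrel) right shifter.
 * Each of the 5 stages either passes the value through unchanged or
 * shifts it by a fixed power of two, selected by one bit of the count.
 * The number of stages is the same for every count, which is why the
 * hardware latency does not depend on the shift amount. */
uint32_t barrel_shr(uint32_t value, unsigned count)
{
    assert(count < 32);  /* x86 masks 32-bit shift counts to 5 bits */

    for (unsigned stage = 0; stage < 5; stage++) {
        unsigned step = 1u << stage;  /* 1, 2, 4, 8, 16 */
        if (count & step)             /* this stage's select bit */
            value >>= step;
    }
    return value;
}
```

For example, `barrel_shr(x, 7)` enables the 1, 2 and 4 stages, while `barrel_shr(x, 1)` enables only the first; both still go through all five.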

Rotate-through-carry (RCL / RCR) by any count other than 1 is slow, but probably constant-time. (Data-dependent timing would make out-of-order execution dependency tracking even trickier, so I think they just take the maximum.)

Another strange thing: apparently IvyBridge / Haswell take an extra uop for the short-form ROL / ROR rotate-by-1 opcode, so throughput is half compared to the normal opcode with an imm8 count of 1.

Re: HLA: C and C++ compilers have good support for intrinsics now (functions that compile to single inline instructions). I remember reading, though I can't recall the source (sorry >.<), that there's not as much of a use-case for HLA anymore, and that these days you might as well just learn normal asm. A lot of the time, you can get speedups from using vector instructions (or bit-manipulation instructions, like popcount) through intrinsics in C/C++.
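As a small sketch of what that looks like, here is popcount done two ways. This assumes GCC or Clang, since `__builtin_popcount` is their builtin (other compilers spell it differently); the function names are just ones I picked. When the target is known to support it (e.g. building with `-mpopcnt` or `-march=native`), these compilers typically turn the builtin into a single `popcnt` instruction, and otherwise fall back to a slower generic sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Portable reference: clear the lowest set bit until none remain. */
unsigned popcount_loop(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;  /* drops the lowest set bit */
        n++;
    }
    return n;
}

/* GCC/Clang builtin (an assumption about your compiler). */
unsigned popcount_builtin(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}

/* Quick sanity check that both versions agree on a few values. */
void popcount_selfcheck(void)
{
    const uint32_t samples[] = { 0u, 1u, 0xFFu, 0x80000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(popcount_loop(samples[i]) == popcount_builtin(samples[i]));
}
```

You stay in C, but still get at the instruction you wanted, and the compiler handles register allocation around it.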

If you're having fun learning HLA, and think it's useful, then best of luck to you, though.

Peter Cordes