
Here are two ways I could potentially implement a shift left by >= 64 bits with SSE intrinsics. The second variation treats the (shift == 64) case specially, avoiding one SSE instruction but adding the cost of an if check:

inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;

   r = _mm_slli_si128( a, 8 ) ; // a << 64

   r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;

   return r ;
}

inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;

   r = _mm_slli_si128( a, 8 ) ; // a << 64

   if ( shift > 64 )
   {
      r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   }

   return r ;
}

I was wondering, roughly, how the cost of this if() check compares with the cost of the shift instruction itself (perhaps relative to the time or cycle count of a normal ALU shift-left instruction).

Peeter Joot
  • How about profiling? (Also, what platform and compiler? In GCC, you could influence branch prediction, in MSVC you can't AFAIK.) –  Apr 02 '12 at 19:54
  • This will be very difficult to profile/benchmark. Mainly because the performance will be very sensitive to the branch predictor. If your benchmark isn't "representative" of the actual data you will use, your numbers could be very misleading. – Mysticial Apr 02 '12 at 19:57
  • No, I haven't done any sort of micro-benchmark yet. I was hoping to capitalize on somebody else having done that. – Peeter Joot Apr 02 '12 at 19:58
  • You're probably better off doing two shifts and using a conditional move to select the correct one. – Mysticial Apr 02 '12 at 19:58
  • @Mysticial: If shift == 64, the second operation is a no-op, so I don't need any sort of conditional move for correctness. – Peeter Joot Apr 02 '12 at 19:59
  • I'm assuming this is related to your [previous question](http://stackoverflow.com/questions/9980801/looking-for-sse-128-bit-shift-operation-for-non-immediate-shift-value). I'm saying that you can do two shifts (`shift`, and `shift - 64`). Then use a conditional move to select the correct result. It'll probably still beat branching. – Mysticial Apr 02 '12 at 20:01
  • Yes, it was related. Note that I've got 5 intrinsics in the < 64 case, and 3 in the >= 64 case (perhaps not of equal weight). The implication of your comment is that it is probably less expensive to do all 8, plus an SSE cmove instruction, than to do the branching. If that's the case, then for this more specific question where I have one (possibly no-op) instruction plus a possible branch, I should expect the cost of the branch to be more. – Peeter Joot Apr 02 '12 at 20:23
  • This may be a silly suggestion depending on your use cases, but can you make this a compile-time decision? Or can you know beforehand whether or not a call to this function will have more or less than 64 bits and just split it into two functions? – Mahmoud Al-Qudsi Apr 02 '12 at 21:31
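
A minimal sketch of the two-shifts-plus-select idea from the comments could look like the following (the function name and structure are illustrative, not from the original posts), assuming 0 <= shift <= 127 and only SSE2:

#include <emmintrin.h>

// Sketch only: compute both the (shift < 64) and (shift >= 64) results,
// then pick one with a mask instead of branching.
inline __m128i shiftLeft128ByBits( const __m128i & a, const unsigned shift )
{
   // (shift < 64) path: shift each 64-bit lane, then OR in the bits that
   // carry from the low qword into the high qword.
   __m128i laneShifted = _mm_sll_epi64( a, _mm_cvtsi32_si128( (int)shift ) ) ;
   __m128i loInHi      = _mm_slli_si128( a, 8 ) ;   // [ lo : 0 ]
   __m128i carry       = _mm_srl_epi64( loInHi, _mm_cvtsi32_si128( (int)( 64 - shift ) ) ) ;
   __m128i lt64        = _mm_or_si128( laneShifted, carry ) ;

   // (shift >= 64) path: the low qword moves into the high qword, then shifts.
   __m128i ge64        = _mm_sll_epi64( loInHi, _mm_cvtsi32_si128( (int)( shift - 64 ) ) ) ;

   // All-ones mask when shift < 64 (signed compare is fine since shift <= 127).
   // Whichever path was computed with an out-of-range count is all zeros and
   // gets discarded here anyway.
   __m128i mask        = _mm_cmpgt_epi32( _mm_set1_epi32( 64 ), _mm_set1_epi32( (int)shift ) ) ;

   return _mm_or_si128( _mm_and_si128( mask, lt64 ), _mm_andnot_si128( mask, ge64 ) ) ;
}

The select is done with a compare plus and/andnot/or because SSE2 has no variable blend instruction; whether this actually beats either branching variant would still need to be measured.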

1 Answer


Answered with a microbenchmark, using code like:

void timingWithIf( volatile __m128i * pA, volatile unsigned long * pShift, unsigned long n )
{
   __m128i r = *pA ;

   for ( unsigned long i = 0 ; i < n ; i++ )
   {
      r = _mm_slli_si128( r, 8 ) ; // a << 64

      unsigned long shift = *pShift ;

      // does it hurt more to do the check, or just do the operation?
      if ( shift > 64 )
      {
         r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
      }
   }

   *pA = r ;
}
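
The timing harness itself isn't shown here; a minimal driver along these lines (hypothetical, assuming Linux and clock_gettime) would produce the per-run times reported below:

#include <emmintrin.h>
#include <stdio.h>
#include <time.h>

void timingWithIf( volatile __m128i * pA, volatile unsigned long * pShift, unsigned long n ) ; // defined above

int main( void )
{
   __m128i a = _mm_set_epi32( 0, 0, 0, 1 ) ;
   unsigned long shift = 64 ;          // 64 or 65: the two cases measured
   unsigned long n = 1000000000UL ;    // 1 billion iterations

   struct timespec start, stop ;
   clock_gettime( CLOCK_MONOTONIC, &start ) ;
   timingWithIf( &a, &shift, n ) ;
   clock_gettime( CLOCK_MONOTONIC, &stop ) ;

   double seconds = ( stop.tv_sec - start.tv_sec )
                  + ( stop.tv_nsec - start.tv_nsec ) * 1e-9 ;
   printf( "%.2fs\n", seconds ) ;

   return 0 ;
}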

The timing loop compiled to the following code:

    xor    %eax,%eax
    movdqa (%rdi),%xmm0
    test   %rdx,%rdx
    movdqa %xmm0,0xffffffffffffffe8(%rsp)
    jbe    F0
    pxor   %xmm0,%xmm0
B0: movdqa 0xffffffffffffffe8(%rsp),%xmm2
    pslldq $0x8,%xmm2
    movdqa %xmm2,0xffffffffffffffe8(%rsp)
    mov    (%rsi),%rcx
    cmp    $0x40,%rcx
    jbe    F1
    add    $0xffffffffffffffc0,%rcx
    movd   %ecx,%xmm1
    punpckldq %xmm0,%xmm1
    punpcklqdq %xmm0,%xmm1
    psllq  %xmm1,%xmm2
    movdqa %xmm2,0xffffffffffffffe8(%rsp)
F1: inc    %rax
    cmp    %rdx,%rax
    jb     B0
F0: movdqa 0xffffffffffffffe8(%rsp),%xmm0
    movdqa %xmm0,(%rdi)
    retq
    nopl   0x0(%rax)

Observe that the shift that the branch avoids actually takes three SSE instructions (four if you count the ALU -> XMM reg move), plus one ALU add operation:

    add    $0xffffffffffffffc0,%rcx
    movd   %ecx,%xmm1
    punpckldq %xmm0,%xmm1
    punpcklqdq %xmm0,%xmm1
    psllq  %xmm1,%xmm2
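
The movd plus the two punpck instructions come from _mm_set_epi32( 0, 0, 0, shift - 64 ), which builds the count and zeroes the three upper 32-bit lanes. Since _mm_sll_epi64 (psllq) only reads the low 64 bits of its count operand, one possible (untested) way to shave off the punpck pair is to build the count with _mm_cvtsi32_si128, which zero-extends with a single movd:

// Hypothetical alternative: psllq uses only the low 64 bits of the count,
// so a plain zero-extending 32 -> 128 bit move is enough to build it.
r = _mm_sll_epi64( r, _mm_cvtsi32_si128( (int)( shift - 64 ) ) ) ;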

With 1 billion loops I measure:

1) shift == 64:

~2.5s with the if (avoiding the no-op shift).

~2.8s executing the no-op shift.

2) shift == 65:

~2.8s with or without the if.

Timings were taken on an "Intel(R) Xeon(R) CPU X5570 @ 2.93GHz" (per /proc/cpuinfo) and were relatively consistent.

Even when the if check is pure overhead (shift == 65), I don't see much difference in the time required for the operation, but the check definitely pays off when (shift == 64), since it avoids the instructions that would perform an SSE shift by zero (a no-op).

Peeter Joot
  • Looking at the generated code, it's rather bizarre that the compiler (Intel's, at -O2) chooses to spill to the stack and reload over and over within the loop, when it already has the values it needs in registers. – Peeter Joot Apr 02 '12 at 21:44