Here are two ways I could shift left by >= 64 bits with SSE intrinsics. The second variation treats the (shift == 64) case specially, avoiding one SSE instruction at the cost of an if check:
inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
    __m128i r ;
    r = _mm_slli_si128( a, 8 ) ; // a << 64
    r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
    return r ;
}
inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
    __m128i r ;
    r = _mm_slli_si128( a, 8 ) ; // a << 64
    if ( shift > 64 )
    {
        r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
    }
    return r ;
}
I was wondering roughly how the cost of this if() check compares with the cost of the shift instruction itself (say, in cycles, relative to a normal ALU shift-left instruction).
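Before timing the two variants against each other, it may help to confirm that they really compute the same thing. The following is a minimal sketch, not a benchmark: the function names `shiftLeftGte64Unconditional`, `shiftLeftGte64Branching` and `variantsAgree` are made up here so both versions can coexist in one file, `_mm_cvtsi32_si128` is swapped in for `_mm_set_epi32( 0, 0, 0, ... )` as an equivalent way to put the count in the low lane, and the code assumes a little-endian x86 target with SSE2 and 64 <= shift < 128.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Variant 1 (hypothetical name): always issue the second shift,
// even when shift == 64 and the extra count is zero.
inline __m128i shiftLeftGte64Unconditional(__m128i a, unsigned shift)
{
    assert(shift >= 64 && shift < 128); // assumed precondition
    __m128i r = _mm_slli_si128(a, 8);   // a << 64 (byte shift by 8)
    return _mm_sll_epi64(r, _mm_cvtsi32_si128(static_cast<int>(shift - 64)));
}

// Variant 2 (hypothetical name): skip the second shift when shift == 64.
inline __m128i shiftLeftGte64Branching(__m128i a, unsigned shift)
{
    assert(shift >= 64 && shift < 128); // assumed precondition
    __m128i r = _mm_slli_si128(a, 8);   // a << 64 (byte shift by 8)
    if (shift > 64)
    {
        r = _mm_sll_epi64(r, _mm_cvtsi32_si128(static_cast<int>(shift - 64)));
    }
    return r;
}

// Compare both variants against a scalar reference for every shift in
// [64, 127]. For such shifts the low 64 bits of the result are zero and
// the high 64 bits are the original low half shifted by (shift - 64).
bool variantsAgree(std::uint64_t lo, std::uint64_t hi)
{
    const __m128i a = _mm_set_epi64x(static_cast<long long>(hi),
                                     static_cast<long long>(lo));
    for (unsigned shift = 64; shift < 128; ++shift)
    {
        std::uint64_t out1[2];
        std::uint64_t out2[2];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out1),
                         shiftLeftGte64Unconditional(a, shift));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out2),
                         shiftLeftGte64Branching(a, shift));

        const std::uint64_t expectedHi = lo << (shift - 64);
        if (out1[0] != 0 || out1[1] != expectedHi) return false;
        if (out2[0] != 0 || out2[1] != expectedHi) return false;
    }
    return true;
}
```

Calling `variantsAgree` with a few bit patterns exercises every shift count in range for both variants. Any timing comparison built on top of this would need to account for how predictable the `shift > 64` branch is in the real workload, since that is likely to matter as much as the instruction it skips.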