How to rotate an SSE/AVX vector

Question

I need to perform a rotate operation in as few clock cycles as possible. In the first case, let's assume __m128i as the source and dest type:

source: || A0 || A1 || A2 || A3 ||

  dest: || A1 || A2 || A3 || A0 ||

dest = (__m128i)_mm_shuffle_epi32((__m128i)source, _MM_SHUFFLE(0,3,2,1));
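
A minimal, self-contained sketch of the SSE2 case above, assuming GCC or Clang on x86-64. The function name `rotate_left_1_sse2` and the array-based interface are hypothetical, chosen only to make the snippet easy to call:

```c
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32, _mm_loadu_si128, _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical helper: dst = { src[1], src[2], src[3], src[0] }. */
static void rotate_left_1_sse2(const int32_t src[4], int32_t dst[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* _MM_SHUFFLE lists the selectors from the highest element down to the
       lowest: r[3] = v[0], r[2] = v[3], r[1] = v[2], r[0] = v[1]. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1));
    _mm_storeu_si128((__m128i *)dst, r);
}
```

Calling it on `{10, 11, 12, 13}` should leave `{11, 12, 13, 10}` in `dst`.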

Now I want to do the same with AVX intrinsics. So let's assume this time __m256i as source and dest type:

source: || A0 || A1 || A2 || A3 || A4 || A5 || A6 || A7 ||

  dest: || A1 || A2 || A3 || A4 || A5 || A6 || A7 || A0 ||

The AVX intrinsics are missing most of the corresponding SSE integer operations. Maybe there is some way to get the desired output using the floating-point versions.

I've tried with:

dest = (__m256i)_mm256_shuffle_ps((__m256)source, (__m256)source, _MM_SHUFFLE(0,3,2,1));

but what I get is:

|| A0 || A2 || A3 || A4 || A5 || A6 || A7 || A1 ||

Any idea how to solve this efficiently (without mixing SSE and AVX operations and without "manually" swapping A0 and A1)?

Thanks in advance!

Don't have much experience with SSE and AVX, but in the second line of code, if dest type is `__m256`, why are you casting to `__m128i`? — dario_ramos, Aug 10 '12 at 18:05
Seems like all the useful instructions are in AVX2 (why didn't they release that one *first*?) — harold, Aug 11 '12 at 09:01
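
For comparison, on hardware that does support AVX2 the rotation can be written as a single cross-lane permute. This is only a sketch: it assumes GCC or Clang (for the `target` attribute) and an AVX2-capable CPU, so it does not help on AVX-only machines like the one the question targets. The name `rotate_left_1_avx2` is hypothetical:

```c
#include <immintrin.h>  /* AVX2: _mm256_permutevar8x32_epi32 */
#include <stdint.h>

/* Hypothetical helper: dst = { src[1], ..., src[7], src[0] }.
   The target attribute enables AVX2 for this function only (GCC/Clang). */
__attribute__((target("avx2")))
static void rotate_left_1_avx2(const int32_t src[8], int32_t dst[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    /* Index vector in memory order: result element i is v[idx[i]]. */
    const __m256i idx = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);
    __m256i r = _mm256_permutevar8x32_epi32(v, idx);
    _mm256_storeu_si256((__m256i *)dst, r);
}
```

Unlike the per-lane shuffles, this permute can move an element across the 128-bit lane boundary, which is why no extra blend step is needed.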

score 16 · Accepted Answer · edited Dec 06 '12 at 13:15

My solution:

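// Rotate within each 128-bit lane: tmp = A1 A2 A3 A0 | A5 A6 A7 A4
__m256 tmp = _mm256_permute_ps((__m256)_source, _MM_SHUFFLE(0,3,2,1));
// Swap the lanes of tmp, then take elements 3 and 7 (mask 136 = 0x88) from the swapped copy
*(_dest) = (__m256i)_mm256_blend_ps(tmp, _mm256_permute2f128_ps(tmp, tmp, 1), 136);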
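
The three steps can be traced in a self-contained sketch. It assumes GCC or Clang (for the `target` attribute) and uses the cast intrinsics instead of C-style vector casts. The name `rotate_left_1_avx` and the array interface are hypothetical:

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stdint.h>

/* Hypothetical helper: dst = { src[1], ..., src[7], src[0] }, AVX only.
   The target attribute enables AVX for this function only (GCC/Clang). */
__attribute__((target("avx")))
static void rotate_left_1_avx(const int32_t src[8], int32_t dst[8])
{
    __m256 v = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)src));

    /* Step 1: rotate inside each 128-bit lane:  A1 A2 A3 A0 | A5 A6 A7 A4 */
    __m256 tmp = _mm256_permute_ps(v, _MM_SHUFFLE(0, 3, 2, 1));

    /* Step 2: swap the two lanes of tmp:        A5 A6 A7 A4 | A1 A2 A3 A0 */
    __m256 swapped = _mm256_permute2f128_ps(tmp, tmp, 1);

    /* Step 3: mask 0x88 (= 136) takes elements 3 and 7 from swapped and
       everything else from tmp:                 A1 A2 A3 A4 | A5 A6 A7 A0 */
    __m256 r = _mm256_blend_ps(tmp, swapped, 0x88);

    _mm256_storeu_si256((__m256i *)dst, _mm256_castps_si256(r));
}
```

The casts between `__m256i` and `__m256` only reinterpret the bits, and permutes and blends move bits without modifying them, so integer data passes through the floating-point instructions unchanged.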

answered Dec 02 '12 at 00:48 by user1584773
edited Dec 06 '12 at 13:15 by Benedikt Waldvogel

Any chance of an explanation of the 2 immediates you're passing in on the second line (1 and 136)? I've read the docs, but I still don't understand why these particular values are what you want here. – Orvid King May 14 '14 at 19:44
@OrvidKing: `permute2f128(tmp,tmp,1)` swaps the upper and lower 128b lanes. 136 = 0x88 = take the high element from one vector, other elements from the other (so, 0x8 in each lane, because `blendps` uses the two halves of the `imm8` for the two lanes.) – Peter Cordes Sep 10 '15 at 07:15

score 3 · Answer 2 · answered May 05 '17 at 07:32

I have not yet checked how things are with AVX, but at least for SSE, did you consider _mm_align*?

For instance, this rotates a byte vector by 2 bytes:

__m128i v;
v = _mm_alignr_epi8(v, v, 2); // v = v[2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1]

This can be a single instruction. Such operations also have latency 1 and throughput 1, i.e. they are fast.

AVX is likely a bit of a hassle with this approach, so an adaptation may not be useful.
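
A self-contained sketch of the byte rotation above, assuming GCC or Clang and a CPU with SSSE3 (which `_mm_alignr_epi8` requires). The name `rotate_bytes_2_ssse3` is hypothetical:

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 */
#include <stdint.h>

/* Hypothetical helper: dst[i] = src[(i + 2) % 16].
   The target attribute enables SSSE3 for this function only (GCC/Clang). */
__attribute__((target("ssse3")))
static void rotate_bytes_2_ssse3(const uint8_t src[16], uint8_t dst[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* Concatenate v:v, shift the 32-byte value right by 2 bytes and keep
       the low 16 bytes, which rotates v by 2 bytes. */
    __m128i r = _mm_alignr_epi8(v, v, 2);
    _mm_storeu_si128((__m128i *)dst, r);
}
```

With a byte count of 4 instead of 2, the same call rotates four 32-bit elements by one position, matching the SSE case in the question. The 256-bit `_mm256_alignr_epi8` (AVX2) shifts within each 128-bit lane separately, so it does not extend this trick directly to a full 8-element rotate.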
