I'm trying to pack 16-bit data down to 8 bits using _mm256_shuffle_epi8, but the result is not what I'm expecting.


auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
                               17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32);

__m256i vperm = _mm256_setr_epi8( 0,  2,  4,  6,  8, 10, 12, 14,
                                 16, 18, 20, 22, 24, 26, 28, 30,
                                 -1, -1, -1, -1, -1, -1, -1, -1,
                                 -1, -1, -1, -1, -1, -1, -1, -1);

auto result = _mm256_shuffle_epi8(srcData, vperm);

I expect result to contain:

1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31,
0, 0, 0, 0, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0

But I get this instead:

1, 3, 5, 7, 9, 11, 13, 15,  1,  3,  5,  7,  9, 11, 13, 15,
0, 0, 0, 0, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0

I have surely misunderstood how the shuffle works. If anyone can enlighten me, it would be much appreciated :)

bolov
Sly14
  • https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_shuffle_epi8&expand=5155 , https://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_c/main_cls/intref_cls/common/intref_avx2_shuffle8.htm – bolov Sep 12 '19 at 02:31
  • Is your original input from memory or from registers (also, do you have more than 32 bytes of input)? Do you have any guarantees on the range of the input data (i.e., will it always be in the range `[0,255]` or `[-128,127]`)? If not: do you want wrap-around behavior (which is what your shuffle implementation gives), or saturation (which is what `packuswb` or `packsswb` would do)? – chtz Sep 12 '19 at 12:21
  • `vpshufb ymm` is two in-lane 128-bit shuffles, not a 32-byte lane-crossing permute. See [Where is VPERMB in AVX2?](//stackoverflow.com/q/37980209) – Peter Cordes Sep 17 '19 at 05:52

1 Answer

Yeah, that's to be expected. Look at the docs for _mm_shuffle_epi8. The 256-bit AVX2 version simply duplicates the behaviour of that 128-bit instruction for the two 16-byte halves (lanes) of the YMM register.

So you can shuffle within the first 16 bytes, or within the last 16 bytes; however, you cannot shuffle values across the 16-byte boundary. Only the low 4 bits of each index are used, and they select within the same lane. (You'll notice that every value you expected to be above 16 comes out as that value minus 16, e.g. 19 -> 3, 31 -> 15, etc.)

You'll need to do this with an additional step. First, gather the even bytes into the low 8 bytes of each lane:
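To make the in-lane rule concrete, here is a minimal scalar sketch of what `_mm256_shuffle_epi8` does, based on the documented `vpshufb` behaviour. The names `Bytes32` and `shuffle_epi8_model` are made up for this illustration and are not part of any library.

```cpp
#include <array>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;  // hypothetical helper type

// Scalar model of _mm256_shuffle_epi8 (vpshufb ymm): two independent
// 16-byte shuffles, one per 128-bit lane.
Bytes32 shuffle_epi8_model(const Bytes32& src,
                           const std::array<std::int8_t, 32>& mask) {
    Bytes32 out{};
    for (int lane = 0; lane < 2; ++lane) {
        const int base = lane * 16;
        for (int i = 0; i < 16; ++i) {
            const auto m = static_cast<std::uint8_t>(mask[base + i]);
            // Bit 7 set: write zero. Otherwise only the low 4 bits are
            // used, and they index into the SAME lane as the output byte.
            out[base + i] = (m & 0x80) ? 0 : src[base + (m & 0x0F)];
        }
    }
    return out;
}
```

Feeding it the source and mask from the question reproduces the observed output: indices 16..30 in the low lane are reduced to 0..14, so the first eight odd values appear twice.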

__m256i vperm = _mm256_setr_epi8( 0,  2,  4,  6,  8, 10, 12, 14,
                                 -1, -1, -1, -1, -1, -1, -1, -1,
                                  0,  2,  4,  6,  8, 10, 12, 14,
                                 -1, -1, -1, -1, -1, -1, -1, -1);

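and then use a lane-crossing permute to pull the 0th and 2nd 64-bit quadwords into the first 128 bits. With AVX2 that is _mm256_permute4x64_epi64; _mm256_permute2f128_si256 on its own cannot do it, because it only moves whole 128-bit lanes.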
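Putting the two steps together, a rough sketch could look like the following. The function name `pack_even_bytes` is hypothetical, and the immediate `0xD8` (that is, `_MM_SHUFFLE(3, 1, 2, 0)`) is my reading of the qword order needed here. The intrinsics path is only compiled when AVX2 is enabled (for example with `-mavx2`); otherwise a plain loop with the same result is used so the snippet still builds.

```cpp
#include <array>
#include <cstdint>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Keeps the even-indexed bytes of src in the low 16 bytes, zeros the rest.
std::array<std::uint8_t, 32> pack_even_bytes(
        const std::array<std::uint8_t, 32>& src) {
    std::array<std::uint8_t, 32> out{};
#ifdef __AVX2__
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src.data()));
    // Step 1: per lane, move even bytes to the low 8 bytes, zero the rest.
    const __m256i vperm = _mm256_setr_epi8(
         0,  2,  4,  6,  8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  2,  4,  6,  8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1);
    const __m256i inlane = _mm256_shuffle_epi8(v, vperm);
    // Step 2: qwords are now [even_lo, 0, even_hi, 0]; reorder them to
    // [even_lo, even_hi, 0, 0]. 0xD8 == _MM_SHUFFLE(3, 1, 2, 0).
    const __m256i packed = _mm256_permute4x64_epi64(inlane, 0xD8);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), packed);
#else
    // Portable fallback producing the same bytes when AVX2 is not enabled.
    for (int i = 0; i < 16; ++i) out[i] = src[2 * i];
#endif
    return out;
}
```

With the question's input (bytes 1 through 32) this should give 1, 3, ..., 31 in the first 16 bytes and zeros after that. Note that this truncates each 16-bit element to its low byte; if saturation is wanted, the pack instructions mentioned in the comments under the question are the alternative.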

robthebloke
  • There is no way to finish the permutation with just a single `_mm256_permute2f128_si256`. If you have AVX2 (which you need for `_mm256_shuffle_epi8`) you can use `_mm256_permute4x64_epi64`; with just AVX1 you would need to blend or do a bit-or after permuting. – chtz Sep 13 '19 at 09:35