I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, that does something similar, but it doesn't perform a cross-lane shuffle.

How can I do it using AVX2 instructions?

  • A specific shuffle or any shuffle as if you had a cross-lane `pshufb`? – harold Jun 05 '15 at 15:26
  • I use _mm_shuffle_epi8 to optimize SSE code. But mixing AVX and SSE instructions isn't a good idea, is it? –  Jun 08 '15 at 05:15
  • It's fine as long as the SSE instructions are VEX-encoded. – harold Jun 08 '15 at 06:37
  • Usually you can structure things to use the available instructions to get your data sorted out. It would be nice if there was a byte-element cross-lane shuffle, for those cases where it'd be really useful, but there's only a 32b-element shuffle that crosses lanes. My point is, rather than directly using one of these nice answers, you can often avoid using that many insns in the context of your actual loop. – Peter Cordes Jul 03 '15 at 23:52

2 Answers

There is a way to emulate this operation, but it is not very beautiful:

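#include <immintrin.h>

// Bias for shuffling the original value: adding 0x70 (low lane) or 0xF0 (high lane)
// sets bit 7 of every index that points into the other lane, so vpshufb zeroes it.
const __m256i K0 = _mm256_setr_epi8(
    0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
    0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0);

// Bias for shuffling the lane-swapped value: the roles of the two lanes are reversed.
const __m256i K1 = _mm256_setr_epi8(
    0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
    0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70);

// shuffle holds byte indices 0..31; result byte i is value byte shuffle[i].
inline const __m256i Shuffle(const __m256i & value, const __m256i & shuffle)
{
    // First term: bytes taken from the same lane. Second term: bytes taken from
    // the other lane (0x4E swaps the two 128-bit halves). OR merges the two.
    return _mm256_or_si256(_mm256_shuffle_epi8(value, _mm256_add_epi8(shuffle, K0)),
        _mm256_shuffle_epi8(_mm256_permute4x64_epi64(value, 0x4E), _mm256_add_epi8(shuffle, K1)));
}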
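
To see why the 0x70 / 0xF0 biases work, here is a scalar model of the same arithmetic. It is only a sketch for illustration, not part of the answer's code: the names `Bytes32`, `InLaneShuffle` and `CrossLaneShuffleModel` are made up for this example, and it assumes every index is in the range 0..31.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;  // hypothetical stand-in for __m256i

// Scalar model of _mm256_shuffle_epi8: each 16-byte lane is shuffled on its own,
// only the low 4 bits of a control byte select, and bit 7 set yields zero.
inline Bytes32 InLaneShuffle(const Bytes32& value, const Bytes32& control)
{
    Bytes32 out{};
    for (int i = 0; i < 32; ++i) {
        const int lane = i & 16;  // 0 for the low lane, 16 for the high lane
        out[i] = (control[i] & 0x80) ? 0 : value[lane + (control[i] & 0x0F)];
    }
    return out;
}

// Scalar model of the emulated cross-lane shuffle (assumes indices 0..31).
inline Bytes32 CrossLaneShuffleModel(const Bytes32& value, const Bytes32& shuffle)
{
    Bytes32 swapped{}, c0{}, c1{};
    for (int i = 0; i < 32; ++i) {
        assert(shuffle[i] < 32);
        swapped[i] = value[i ^ 16];  // models _mm256_permute4x64_epi64(value, 0x4E)
        const bool high = i >= 16;
        // Byte-wise wrapping add, like _mm256_add_epi8 with K0 and K1.
        c0[i] = static_cast<std::uint8_t>(shuffle[i] + (high ? 0xF0 : 0x70));
        c1[i] = static_cast<std::uint8_t>(shuffle[i] + (high ? 0x70 : 0xF0));
    }
    const Bytes32 same = InLaneShuffle(value, c0);     // bytes from the same lane
    const Bytes32 other = InLaneShuffle(swapped, c1);  // bytes from the other lane
    Bytes32 out{};
    for (int i = 0; i < 32; ++i) out[i] = same[i] | other[i];
    return out;
}
```

For every output byte exactly one of the two biased indices keeps bit 7 clear, so one of the two in-lane shuffles supplies the byte and the other supplies zero. That is why a plain OR is enough to merge them.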
ErmIg
  • It isn't beautiful, but it works in my case. Thank you. –  Jun 08 '15 at 05:19
  • Actually it's quite beautiful to me. – BeeOnRope Jun 23 '16 at 01:36
  • Working awesome even after translation into .NET 5 C#. `private static Vector256 Shuffle(Vector256 value, Vector256 shuffle) => Avx2.Or(Avx2.Shuffle(value, Avx2.Add(shuffle, K0)), Avx2.Shuffle(Avx2.Permute4x64(value.AsInt64(), 0x4E).AsByte(), Avx2.Add(shuffle, K1)));` Thank you! – aepot Jan 02 '21 at 18:48

First, a clarification: Intel's specification of VPSHUFB uses only bits 0-3 of each control byte as the index within a 128-bit lane, and bit 7 as a "write zero" flag. Since you want a cross-lane shuffle, your shuffle pattern uses bit 4 as well, to address bytes at indices above 15 in the YMM register.

Assumptions: the data you want to shuffle is in YMM0, and the pattern (indices 0..31) is in YMM1.

The code is below:

mask_pattern_0  db      0FH
mask_pattern_1  db      10H

vpbroadcastb    ymm2,byte ptr mask_pattern_0    ; Broadcast the mask value 0FH
vmovdqa     ymm5,ymm2                   ; Keep a copy of the mask for the second half
vpcmpgtb    ymm3,ymm1,ymm2              ; YMM3 is 0FFH for every index above 15 in the shuffle pattern, zero otherwise
vpor        ymm4,ymm1,ymm3              ; YMM4 is the shuffle pattern with bit 7 set for every index above 15
vperm2i128  ymm2,ymm0,ymm0,00010001b    ; Copy the upper 128 bits of YMM0 into both halves of YMM2
vperm2i128  ymm0,ymm0,ymm0,00100000b    ; Copy the lower 128 bits of YMM0 into both halves of YMM0
vpshufb     ymm0,ymm0,ymm4              ; Places every byte with an index below 16, and zeroes the other bytes
;We now process the entries in the shuffle pattern with an index above 15
vpsubb      ymm3,ymm1,ymm5              ; Positive for indices above 15, zero for index 15, negative below 15
vpsignb     ymm4,ymm1,ymm3              ; YMM4 keeps indices above 15, is zero for index 15, and is negated for indices below 15
vpbroadcastb    ymm5,byte ptr mask_pattern_1    ; Broadcast the mask value 10H
vpsubb      ymm4,ymm4,ymm5              ; Indices 16..31 become 0..15; every other byte ends up with bit 7 set
vpshufb     ymm2,ymm2,ymm4              ; Places every byte with an index above 15, and zeroes the other bytes
vpaddb      ymm0,ymm0,ymm2              ; Combine: each byte is nonzero in at most one of the two results

This also leaves the pattern in YMM1 untouched, just as the VPSHUFB instruction itself does.
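
The second half of the listing depends on the three-way behaviour of VPSIGNB: it keeps a byte where the sign operand is positive, negates it where the operand is negative, and writes zero where the operand is zero. A scalar sketch of that step, for illustration only (`SignByte` and `HighLaneControl` are names made up here, and indices 0..31 are assumed):

```cpp
#include <cstdint>

// Scalar model of one byte of vpsignb dst, a, b:
// a if b > 0, zero if b == 0, -a if b < 0 (b read as a signed byte).
inline std::uint8_t SignByte(std::uint8_t a, std::uint8_t b)
{
    const auto sb = static_cast<std::int8_t>(b);
    if (sb == 0) return 0;
    return sb < 0 ? static_cast<std::uint8_t>(0 - a) : a;
}

// Control byte fed to the high-lane vpshufb for an index idx in 0..31.
inline std::uint8_t HighLaneControl(std::uint8_t idx)
{
    const auto diff = static_cast<std::uint8_t>(idx - 0x0F);  // vpsubb  ymm3,ymm1,ymm5
    const std::uint8_t signedIdx = SignByte(idx, diff);       // vpsignb ymm4,ymm1,ymm3
    return static_cast<std::uint8_t>(signedIdx - 0x10);       // vpsubb  ymm4,ymm4,ymm5
}
```

For indices 16..31 the result is the in-lane index 0..15. For indices 0..15 the result always has bit 7 set, so VPSHUFB writes zero for those bytes. This includes index 15, where VPSIGNB produces zero and the subtraction turns it into 0F0H.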

Trust this helps...

quasar66