
Problem: I have 4 × 256-bit AVX2 vectors (A, B, C, D), and I need to swap 128-bit lanes between pairs of them (A with B, and C with D). Here is the transformation I need to do:

             Original                      Transformed
    || Low Lane || High Lane||     || Low Lane || High Lane||
A = ||    L1    ||    H1    || = > ||    L1    ||    L2    ||
B = ||    L2    ||    H2    || = > ||    H1    ||    H2    ||
C = ||    L3    ||    H3    || = > ||    L3    ||    L4    ||
D = ||    L4    ||    H4    || = > ||    H3    ||    H4    ||


Basically, I need to store the output to an array in the following order: L1, L2, L3, L4, H1, H2, H3, H4.
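
In scalar terms, the copy I want is equivalent to this (just a sketch with hypothetical names, viewing each vector as 8 uints: indices 0..3 are the low lane, 4..7 the high lane):

using System;

static void TransposeLanesScalar(uint[] a, uint[] b, uint[] c, uint[] d, uint[] output)
{
    Array.Copy(a, 0, output, 0, 4);   // L1
    Array.Copy(b, 0, output, 4, 4);   // L2
    Array.Copy(c, 0, output, 8, 4);   // L3
    Array.Copy(d, 0, output, 12, 4);  // L4
    Array.Copy(a, 4, output, 16, 4);  // H1
    Array.Copy(b, 4, output, 20, 4);  // H2
    Array.Copy(c, 4, output, 24, 4);  // H3
    Array.Copy(d, 4, output, 28, 4);  // H4
}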

My current solution uses:
4x _mm256_blend_epi32 (worst-case: latency 1, throughput 0.35)
4x _mm256_permute2x128_si256 (worst-case: latency 3, throughput 1)

// (a, c) = block0, (b, d) = block1
a = Avx2.Permute2x128(a, a, 1);
var template = Avx2.Blend(a, b, 0b1111_0000); // H1 H2
a = Avx2.Blend(a, b, 0b0000_1111); // L2 L1
a = Avx2.Permute2x128(a, a, 1); // L1 L2
b = template;

c = Avx2.Permute2x128(c, c, 1);
template = Avx2.Blend(c, d, 0b1111_0000); // H3 H4
c = Avx2.Blend(c, d, 0b0000_1111);  // L4 L3
c = Avx2.Permute2x128(c, c, 1); // L3 L4
d = template;

// Store keystream into buffer (in corrected order = [block0, block1])
Avx2.Store(outputPtr, a);
Avx2.Store(outputPtr + Vector256<uint>.Count, c);
Avx2.Store(outputPtr + Vector256<uint>.Count * 2, b);
Avx2.Store(outputPtr + Vector256<uint>.Count * 3, d);

Note: I'm using C#/.NET Core for the AVX2 intrinsics, in case you are wondering! Feel free to use examples in C/C++.

Is there any better or more efficient way to do it?

Edit

The accepted answer, translated to C#:

var tmp = Avx2.Permute2x128(a, b, 0x20);  // L1 L2
b = Avx2.Permute2x128(a, b, 0x31);        // H1 H2
a = tmp;
tmp = Avx2.Permute2x128(c, d, 0x20);      // L3 L4
d = Avx2.Permute2x128(c, d, 0x31);        // H3 H4
c = tmp;
xtremertx

2 Answers


If I understand you correctly, I think you could get away without the blend instructions for this 2x4 transpose by creating new variables that pick the lanes you want. Something like:

__m256i a;    // L1 H1
__m256i b;    // L2 H2
__m256i c;    // L3 H3
__m256i d;    // L4 H4

__m256i A = _mm256_permute2x128_si256(a, b, 0x20);  // L1 L2
__m256i B = _mm256_permute2x128_si256(a, b, 0x31);  // H1 H2
__m256i C = _mm256_permute2x128_si256(c, d, 0x20);  // L3 L4
__m256i D = _mm256_permute2x128_si256(c, d, 0x31);  // H3 H4

You still have the 3-cycle latency of the vperm2i128 instruction, but you always pay that when data crosses 128-bit lanes. These 4 shuffles are independent, so they can pipeline (ILP); Intel and Zen 2 run vperm2i128 at one per clock (https://agner.org/optimize/, https://uops.info/).

If you're lucky, a compiler will optimize the L1,L2 and L3,L4 shuffles into vinserti128, which AMD Zen 1 runs much more efficiently (1 uop instead of 8; Zen 1 splits lane-crossing shuffles into multiple 128-bit uops).
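
Written out explicitly in C# (a sketch only; I haven't verified that the JIT emits vinserti128 for InsertVector128 here, and the method name is made up):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static (Vector256<uint>, Vector256<uint>, Vector256<uint>, Vector256<uint>) TransposeLanes(
    Vector256<uint> a, Vector256<uint> b, Vector256<uint> c, Vector256<uint> d)
{
    // The L1,L2 and L3,L4 outputs only need the low half of their second source,
    // so they can be written as InsertVector128 (vinserti128, 1 uop on Zen 1).
    // The H1,H2 and H3,H4 outputs take the high halves of both sources, so they
    // stay as vperm2i128.
    var A = Avx2.InsertVector128(a, b.GetLower(), 1);  // L1 L2
    var B = Avx2.Permute2x128(a, b, 0x31);             // H1 H2
    var C = Avx2.InsertVector128(c, d.GetLower(), 1);  // L3 L4
    var D = Avx2.Permute2x128(c, d, 0x31);             // H3 H4
    return (A, B, C, D);
}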


These 4 shuffles take 4 uops for the shuffle port (port 5 on Intel); Intel and Zen 2 can only run one of these shuffles per clock. If that would be a bottleneck in your loop, consider @chtz's answer, which costs more front-end throughput but needs only 2 shuffles to line up the 4 lanes that have to move, finishing with cheap blends (vpblendd). Related: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Peter Cordes
Jason R
  • That is a great solution, eliminating the 4x blend ops! Still learning the AVX2 set... – xtremertx May 18 '20 at 12:46
  • Can I use _mm256_permute2x128_si256 instead of _mm256_permute2f128_ps? Or is there a specific reason for it? OK, it works with the first variant too... – xtremertx May 18 '20 at 12:53
  • No specific reason; you can use the integer version. I misread your original example and didn't catch the fact that you had integer data. I'll edit. – Jason R May 18 '20 at 12:54

You can do your operation with two permutes and 4 blends, giving an absolute throughput of 2 cycles (the two vperm2i128 compete for the single shuffle port, so the block can issue once every 2 cycles; the blends can run on any of three ports):

void foo(
    __m256i a,    // L1 H1
    __m256i b,    // L2 H2
    __m256i c,    // L3 H3
    __m256i d,    // L4 H4
    __m256i* outputPtr
)
{
    // permute. Port usage: 1*p5, Latency 3 on both inputs
    __m256i BA = _mm256_permute2x128_si256(a, b, 0x21);  // H1 L2 
    __m256i DC = _mm256_permute2x128_si256(c, d, 0x21);  // H3 L4

    // blend. Port usage: 1*p015, Latency 1 on both inputs
    __m256i A = _mm256_blend_epi32(a, BA, 0xf0);  // L1 L2
    __m256i B = _mm256_blend_epi32(BA, b, 0xf0);  // H1 H2
    __m256i C = _mm256_blend_epi32(c, DC, 0xf0);  // L3 L4
    __m256i D = _mm256_blend_epi32(DC, d, 0xf0);  // H3 H4

    _mm256_store_si256(outputPtr+0, A);
    _mm256_store_si256(outputPtr+1, B);
    _mm256_store_si256(outputPtr+2, C);
    _mm256_store_si256(outputPtr+3, D);
}

However, depending on context (especially if a, ..., d are originally read from memory), it may also be better to use a sequence of vmovdqu and vinserti128 instructions with m128 memory operands. You'll have twice as many loads, but no inter-lane latency and no bottleneck on port 5: regarding latency and port usage, a memory-source vinserti128 behaves like a blend.
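
For example, something like this C# sketch (assuming the inputs are still sitting in memory in your layout; the helper and pointer names are hypothetical):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Build one 256-bit output from two 128-bit halves in memory. The low half is a
// plain vmovdqu; the JIT may fold the second load into vinserti128's memory operand.
// The undefined upper bits from ToVector256Unsafe are immediately overwritten by the insert.
static unsafe Vector256<uint> Combine128(uint* lo, uint* hi)
{
    return Avx2.InsertVector128(Sse2.LoadVector128(lo).ToVector256Unsafe(),
                                Sse2.LoadVector128(hi), 1);
}

// Usage (pointers hypothetical): A = Combine128(pL1, pL2); B = Combine128(pH1, pH2); etc.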

chtz
  • Interesting; as I'm very new to code vectorization, I'm not familiar with the term "ports". Can you elaborate? I'm reading from memory using broadcast: `uint* state = new uint[32] { 0, 1, 2, 3, 20, 21, 22, 23, 4, 5, 6, 7, 24, 25, 26, 27, 8, 9, 10, 11, 28, 29, 30, 31, 12, 13, 14, 15, 32, 33, 34, 35 }; a = Avx2.BroadcastVector128ToVector256(state); b = Avx2.BroadcastVector128ToVector256(state + Vector128.Count); c = Avx2.BroadcastVector128ToVector256(state + Vector128.Count * 2); d = Avx2.BroadcastVector128ToVector256(state + Vector128.Count * 3);` – xtremertx May 18 '20 at 16:55
  • Just to clarify, I have a reason to use "broadcast": I'm calculating 2 blocks of the ChaCha20 cipher simultaneously. Also, how do you calculate absolute throughput? `giving an absolute throughput of 2 cycles` – xtremertx May 18 '20 at 17:09
  • 1
    @xtremertx: I added links to Jason's answer with sources for the 3 cycle latency number. "Thoughput" only matters for the entire block including surrounding code; if you still wouldn't bottleneck on shuffle-port throughput but instead on front-end uop throughput (or worse, on latency), then use the 4-instruction way in Jason's answer instead of the 6 instruction way in this answer. e.g. if you have a lot of AND / OR / shift work between these shuffle steps in a loop, probably optimize for fewer instructions. – Peter Cordes May 18 '20 at 17:34
  • 1
    @xtremertx Is the broadcast-load happening directly before the "transposing"? Or are there instructions happening in-between? Also, just that I understand the C#-AVX syntax correctly: After the broadcast `a={0,1,2,3, 0,1,2,3}`, `b={20,21,22,23, 20,21,22,23}` and so on? – chtz May 18 '20 at 19:40
  • @chtz Yes, there are instructions between the broadcast-loads and the "transposing": multiple Add, Xor, and LogicalShift calls, like `a = Avx2.Add(a, b); d = Avx2.Xor(d, a); d = Avx2.ShiftLeftLogical(d, 16);` and so on... And yes, you are correct about `a={0,1,2,3, 0,1,2,3}`, `b={20,21,22,23, 20,21,22,23}`, .... – xtremertx May 18 '20 at 19:56
  • 1
    @chtz There are most likely multiple solutions how to vectorize chacha20 using AVX2, however such discussion would require a new topic. I'm using following [paper-pdf](https://eprint.iacr.org/2013/759.pdf) that explains some things in a greater detail and at the time of release they had better performance than chromium project. I'm basically computing 2 blocks of keystream simultatenously using AVX2 (they call it double-quaterround in the paper). – xtremertx May 18 '20 at 20:20
  • I agree, we should leave this question as it is. – chtz May 18 '20 at 20:24