Problem: I have 4 x 256-bit AVX2 vectors (A, B, C, D) and I need to perform a swaping operation of their respective 128-bit parts and between two different vectors. Here is the transformation I need to do.
Original Transformed
|| Low Lane || High Lane|| || Low Lane || High Lane||
A = || L1 || H1 || = > || L1 || L2 ||
B = || L2 || H2 || = > || H1 || H2 ||
C = || L3 || H3 || = > || L3 || L4 ||
D = || L4 || H4 || = > || H3 || H4 ||
Basically I need to store output in the following order L1, L2, L3, L4, H1, H2, H3, H4 to an array.
My current solution is using:
4x _mm256_blend_epi32 (worst-case: latency 1, throughput 0.35)
4x _mm256_permute2x128_si256 (worst-case: latency 3, throughput 1)
// (a, c) = block0, (b, d) = block1
a = Avx2.Permute2x128(a, a, 1);
var template = Avx2.Blend(a, b, 0b1111_0000); // H1 H2
a = Avx2.Blend(a, b, 0b0000_1111); // L2 l1
a = Avx2.Permute2x128(a, a, 1); // L1 l2
b = template;
c = Avx2.Permute2x128(c, c, 1);
template = Avx2.Blend(c, d, 0b1111_0000); // H3 H4
c = Avx2.Blend(c, d, 0b0000_1111); // L4 L3
c = Avx2.Permute2x128(c, c, 1); // L3 l4
d = template;
// Store keystream into buffer (in corrected order = [block0, block1])
Avx2.Store(outputPtr, a);
Avx2.Store(outputPtr + Vector256<uint>.Count, c);
Avx2.Store(outputPtr + Vector256<uint>.Count * 2, b);
Avx2.Store(outputPtr + Vector256<uint>.Count * 3, d);
Note: I'm using C#/NetCore to do AVX2 if you are wondering! Feel free to use examples in C/C++.
Is there any better or more efficient way to do it?
Edit
Accepted answer as C#
var tmp = Avx2.Permute2x128(a, b, 0x20);
b = Avx2.Permute2x128(a, b, 0x31);
a = tmp;
tmp = Avx2.Permute2x128(c, d, 0x20);
d = Avx2.Permute2x128(c, d, 0x31);
c = tmp;