
I'm doing interpolation on data returned by an A/D converter in the form of sequential 12-bit samples packed into 16-bit values (A1 B1 A2 B2 ...). My C program works, but I'd like to make it faster using AVX2 (while remaining Rocket Lake and Skylake compatible). My interpolator takes 4 sequential values per channel as inputs, which are loaded as uint16 and then processed as floating point:

        //channel 1 input samples - 32 bit aligned
        UINT16 a3 = linedata[pos - 6] & 0x0FFF;     
        UINT16 a2 = linedata[pos - 4] & 0x0FFF;
        UINT16 a1 = linedata[pos - 2] & 0x0FFF;
        UINT16 a0 = linedata[pos]     & 0x0FFF;

        //channel 2 input samples
        UINT16 b3 = linedata[pos - 5] & 0x0FFF;
        UINT16 b2 = linedata[pos - 3] & 0x0FFF;
        UINT16 b1 = linedata[pos - 1] & 0x0FFF;
        UINT16 b0 = linedata[pos + 1] & 0x0FFF;
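
For context, the actual interpolation kernel isn't shown above, so the 4-tap form below, with placeholder coefficients c0..c3, is only an assumption about what the per-channel floating-point step might look like:

        //hypothetical 4-tap interpolation; c0..c3 stand in for whatever
        //coefficients the real interpolator computes for this output position
        float c0 = 0.0f, c1 = 0.0f, c2 = 0.0f, c3 = 0.0f;   //placeholder values only
        float outA = c3 * (float)a3 + c2 * (float)a2 + c1 * (float)a1 + c0 * (float)a0;
        float outB = c3 * (float)b3 + c2 * (float)b2 + c1 * (float)b1 + c0 * (float)b0;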

To vectorize this, I process 4 A/B output pairs at a time (which may overlap if the interpolated positions are close together), aiming to fill 256-bit AVX registers so that each register holds the 8 unique samples (4 A/B pairs) for one of the interpolation coefficients 0, 1, 2 and 3, like so:

(image: diagram of the memory layout and the intended register contents)
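
The diagram isn't reproduced here, but from the scalar code and the description above, the intended contents of the four target registers are roughly the following (the exact element order within each register isn't specified, so treat this as a reconstruction):

//zero  : the 4 A0/B0 pairs, i.e. linedata[pos], linedata[pos + 1] for each of the 4 output positions
//one   : the 4 A1/B1 pairs, i.e. linedata[pos - 2], linedata[pos - 1]
//two   : the 4 A2/B2 pairs, i.e. linedata[pos - 4], linedata[pos - 3]
//three : the 4 A3/B3 pairs, i.e. linedata[pos - 6], linedata[pos - 5]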

While the processing and the store of results were straightforward, I couldn't figure out an efficient way to unpack the 16-bit samples and then shuffle them into those vectors. I ended up having to load them into vectors that each contain the 4 A/B coefficient pairs for one output sample, and then do multiple shuffle/permute passes to convert to the final shape:

//load all 8 coefficients for each sample pair at once
__m128i zero2 = _mm_lddqu_si128((__m128i const*)&(linedata[pos - 6]));
__m128i zero2_2 = _mm_lddqu_si128((__m128i const*)&(linedata[pos2 - 6]));
__m128i zero2_3 = _mm_lddqu_si128((__m128i const*)&(linedata[pos3 - 6]));
__m128i zero2_4 = _mm_lddqu_si128((__m128i const*)&(linedata[pos4 - 6]));

//now convert u16 to 32-bit ints
__m256i zero3 = _mm256_cvtepu16_epi32(zero2);
__m256i zero3_2 = _mm256_cvtepu16_epi32(zero2_2);
__m256i zero3_3 = _mm256_cvtepu16_epi32(zero2_3);
__m256i zero3_4 = _mm256_cvtepu16_epi32(zero2_4);

//AND off the upper 4 bits to be safe since we have 12-bit values
//(constant_mask is assumed to be _mm256_set1_epi32(0x0FFF), defined outside this block)
zero3 = _mm256_and_si256(zero3, constant_mask);
zero3_2 = _mm256_and_si256(zero3_2, constant_mask);
zero3_3 = _mm256_and_si256(zero3_3, constant_mask);
zero3_4 = _mm256_and_si256(zero3_4, constant_mask);

//convert to float32
__m256 A = _mm256_cvtepi32_ps(zero3);
__m256 B = _mm256_cvtepi32_ps(zero3_2);
__m256 C = _mm256_cvtepi32_ps(zero3_3);
__m256 D = _mm256_cvtepi32_ps(zero3_4);

//shuffle and then permute into separate vectors
__m256 tempshuffle1 = _mm256_shuffle_ps(C, A, _MM_SHUFFLE(3, 2, 3, 2));
__m256 tempshuffle2 = _mm256_shuffle_ps(D, B, _MM_SHUFFLE(3, 2, 3, 2));
__m256 temppermute1 = _mm256_permutevar8x32_ps(tempshuffle1, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 temppermute2 = _mm256_permutevar8x32_ps(tempshuffle2, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 zero = _mm256_shuffle_ps(temppermute2, temppermute1, _MM_SHUFFLE(3, 2, 3, 2));
__m256 two = _mm256_shuffle_ps(temppermute2, temppermute1, _MM_SHUFFLE(1, 0, 1, 0));

__m256 tempshuffle3 = _mm256_shuffle_ps(C, A, _MM_SHUFFLE(1, 0, 1, 0));
__m256 tempshuffle4 = _mm256_shuffle_ps(D, B, _MM_SHUFFLE(1, 0, 1, 0));
__m256 temppermute3 = _mm256_permutevar8x32_ps(tempshuffle3, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 temppermute4 = _mm256_permutevar8x32_ps(tempshuffle4, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 one = _mm256_shuffle_ps(temppermute4, temppermute3, _MM_SHUFFLE(3, 2, 3, 2));
__m256 three = _mm256_shuffle_ps(temppermute4, temppermute3, _MM_SHUFFLE(1, 0, 1, 0));

//zero, one, two and three vectors now contain values to be interpolated with from 8 independent samples (4 pairs)

This ends up being about 3 times as fast as the C version, but looking at the assembly, the block above compiles into more instructions than the actual processing and the code for writing out the results combined. Is there a more efficient strategy I can use for loading and shuffling the data into these vectors?

  • Minor optimization: AND before PMOVZXWD (`_mm256_cvtepu16_epi32`) so you need half as many AND operations. And perhaps consider `_mm256_unpacklo_epi16` to interleave 16-bit data from two registers *before* widening to 32-bit. – Peter Cordes Aug 14 '21 at 01:45
  • Also, instead of pmovzxwd / `_mm256_cvtepu16_epi32`, in-lane unpack with zero lets you recover the original order with `_mm256_packus_epi32` ([`vpackusdw`](https://www.felixcloutier.com/x86/packusdw)) unsigned saturation instead of inconvenient lane-crossing shuffles when you're eventually re-packing to 16-bit. This is nice if you don't really need data in-order in SIMD vectors, otherwise might not be usable. – Peter Cordes Aug 14 '21 at 01:47
  • Err I guess if you're mixing before widening, perhaps `_mm_unpacklo_epi16`. Or `vinserti128` to get all the data into one vector? If it comes in pairs of 16-bit elements, then you can use stuff like `vpermd` before unpacking each element to 32-bit. – Peter Cordes Aug 14 '21 at 02:01
  • 1
    @PeterCordes Excellent point about not needing to have the data in order in the SIMD vector. 4 permutes were wasted essentially just flipping BA back to AB when I could leave it backwards and just flip the pointers when I write it out to memory. I've simplified the code with your suggestion. I should think about that permute-shuffle-permute-shuffle block and see if I can do a further simplification. – user1850479 Aug 14 '21 at 02:35
  • Even replacing some of the shuffles with blends will help back-end throughput, since Intel CPUs (especially before Ice Lake) only have one execution port that can handle shuffle uops, but `vpblendd` can run on any of the 3 vector ALU ports. Also, memory-source `vinserti128` can get some shuffling done while you're loading, not needing port 5. Might be helpful to have a look at existing 4x4 and 8x8 transpose (of 32-bit elements) for inspiration. – Peter Cordes Aug 14 '21 at 02:42
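
A minimal sketch of the load/mask/widen front end along the lines of the comments above (vinserti128 to combine two 16-byte loads, one AND on the still-packed 16-bit data per 256-bit register, then in-lane unpacks against zero instead of vpmovzxwd). The regrouping that would follow differs from the shuffles in the question and is omitted, so this is an illustration of the idea rather than a drop-in replacement:

//combine the 16-byte loads for two output positions into one 256-bit register
//(the compiler can fold the second load into a memory-source vinserti128)
__m256i pair12 = _mm256_inserti128_si256(
    _mm256_castsi128_si256(_mm_loadu_si128((__m128i const*)&(linedata[pos - 6]))),
    _mm_loadu_si128((__m128i const*)&(linedata[pos2 - 6])), 1);

//mask while the data is still 16-bit: one AND now covers two loads
const __m256i mask12 = _mm256_set1_epi16(0x0FFF);
pair12 = _mm256_and_si256(pair12, mask12);

//widen to 32-bit with in-lane unpacks against zero instead of vpmovzxwd
//lo32 = {a3,b3,a2,b2 of pos | a3,b3,a2,b2 of pos2}
//hi32 = {a1,b1,a0,b0 of pos | a1,b1,a0,b0 of pos2}
const __m256i zeroes = _mm256_setzero_si256();
__m256i lo32 = _mm256_unpacklo_epi16(pair12, zeroes);
__m256i hi32 = _mm256_unpackhi_epi16(pair12, zeroes);

//convert to float; shuffling into the zero/one/two/three layout (or a permuted
//variant of it, per the comments) is still needed but not shown here
__m256 loF = _mm256_cvtepi32_ps(lo32);
__m256 hiF = _mm256_cvtepi32_ps(hi32);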

0 Answers