I'm doing interpolation on data returned by an A/D converter in the form of sequential 12-bit samples packed into 16-bit values, A1B1A2B2... My C program works, but I'd like to make it faster using AVX2 (while staying Rocket Lake and Skylake compatible). My interpolator takes 4 sequential values per channel as input; they are loaded as uint16 and then processed as floating point:
//channel 1 input samples - 32-bit aligned
UINT16 a3 = linedata[pos - 6] & 0x0FFF;
UINT16 a2 = linedata[pos - 4] & 0x0FFF;
UINT16 a1 = linedata[pos - 2] & 0x0FFF;
UINT16 a0 = linedata[pos] & 0x0FFF;
//channel 2 input samples
UINT16 b3 = linedata[pos - 5] & 0x0FFF;
UINT16 b2 = linedata[pos - 3] & 0x0FFF;
UINT16 b1 = linedata[pos - 1] & 0x0FFF;
UINT16 b0 = linedata[pos + 1] & 0x0FFF;
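For context, each output is then computed from the four taps per channel. A simplified sketch of that step, using a generic 4-tap kernel with placeholder coefficients c0..c3 (not my actual kernel):
float c0, c1, c2, c3;                          //placeholder kernel coefficients for this output position
float outA = c0*a0 + c1*a1 + c2*a2 + c3*a3;    //channel 1 output
float outB = c0*b0 + c1*b1 + c2*b2 + c3*b3;    //channel 2 output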
To vectorize this I process 4 pairs of A/B samples at a time (which may overlap if the interpolated positions are close together), aiming to fill 256-bit AVX registers with the 8 unique samples for each of the 0, 1, 2 and 3 interpolation coefficients, like so:
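(Layout reconstructed from the shuffle code below; lowest vector element first, with pos through pos4 denoting the four output positions.)
//zero  = { pos4.a0, pos4.b0, pos3.a0, pos3.b0, pos2.a0, pos2.b0, pos.a0, pos.b0 }
//one   = { pos4.a1, pos4.b1, pos3.a1, pos3.b1, pos2.a1, pos2.b1, pos.a1, pos.b1 }
//two   = { pos4.a2, pos4.b2, pos3.a2, pos3.b2, pos2.a2, pos2.b2, pos.a2, pos.b2 }
//three = { pos4.a3, pos4.b3, pos3.a3, pos3.b3, pos2.a3, pos2.b3, pos.a3, pos.b3 }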
While the processing and saving were straightforward, I couldn't figure out an efficient way to unpack the 16-bit samples and shuffle them into those vectors. I ended up loading them into vectors that each contain the 4 A/B coefficient pairs for a single output sample, and then doing multiple shuffle/permute rounds to convert to the final shape:
//load all 8 input values (4 taps x 2 channels) for each output sample at once
__m128i zero2 = _mm_lddqu_si128((__m128i const*)&(linedata[pos - 6]));
__m128i zero2_2 = _mm_lddqu_si128((__m128i const*)&(linedata[pos2 - 6]));
__m128i zero2_3 = _mm_lddqu_si128((__m128i const*)&(linedata[pos3 - 6]));
__m128i zero2_4 = _mm_lddqu_si128((__m128i const*)&(linedata[pos4 - 6]));
//now widen the u16 values to 32-bit ints
__m256i zero3 = _mm256_cvtepu16_epi32(zero2);
__m256i zero3_2 = _mm256_cvtepu16_epi32(zero2_2);
__m256i zero3_3 = _mm256_cvtepu16_epi32(zero2_3);
__m256i zero3_4 = _mm256_cvtepu16_epi32(zero2_4);
//and mask off the upper 4 bits to be safe, since these are 12-bit values
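//(constant_mask is assumed here to be _mm256_set1_epi32(0x0FFF), matching the scalar masking above)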
zero3 = _mm256_and_si256(zero3, constant_mask);
zero3_2 = _mm256_and_si256(zero3_2, constant_mask);
zero3_3 = _mm256_and_si256(zero3_3, constant_mask);
zero3_4 = _mm256_and_si256(zero3_4, constant_mask);
//convert to float32
__m256 A = _mm256_cvtepi32_ps(zero3);
__m256 B = _mm256_cvtepi32_ps(zero3_2);
__m256 C = _mm256_cvtepi32_ps(zero3_3);
__m256 D = _mm256_cvtepi32_ps(zero3_4);
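//A..D each now hold, lowest element first, { a3, b3, a2, b2, a1, b1, a0, b0 } for pos, pos2, pos3 and pos4 respectively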
//shuffle and then permute into separate vectors
__m256 tempshuffle1 = _mm256_shuffle_ps(C, A, _MM_SHUFFLE(3, 2, 3, 2));
__m256 tempshuffle2 = _mm256_shuffle_ps(D, B, _MM_SHUFFLE(3, 2, 3, 2));
__m256 temppermute1 = _mm256_permutevar8x32_ps(tempshuffle1, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 temppermute2 = _mm256_permutevar8x32_ps(tempshuffle2, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
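//temppermute1 = { pos3.a2, pos3.b2, pos3.a0, pos3.b0, pos.a2, pos.b2, pos.a0, pos.b0 }
//temppermute2 = { pos4.a2, pos4.b2, pos4.a0, pos4.b0, pos2.a2, pos2.b2, pos2.a0, pos2.b0 }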
__m256 zero = _mm256_shuffle_ps(temppermute2, temppermute1, _MM_SHUFFLE(3, 2, 3, 2));
__m256 two = _mm256_shuffle_ps(temppermute2, temppermute1, _MM_SHUFFLE(1, 0, 1, 0));
__m256 tempshuffle3 = _mm256_shuffle_ps(C, A, _MM_SHUFFLE(1, 0, 1, 0));
__m256 tempshuffle4 = _mm256_shuffle_ps(D, B, _MM_SHUFFLE(1, 0, 1, 0));
__m256 temppermute3 = _mm256_permutevar8x32_ps(tempshuffle3, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 temppermute4 = _mm256_permutevar8x32_ps(tempshuffle4, _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0));
__m256 one = _mm256_shuffle_ps(temppermute4, temppermute3, _MM_SHUFFLE(3, 2, 3, 2));
__m256 three = _mm256_shuffle_ps(temppermute4, temppermute3, _MM_SHUFFLE(1, 0, 1, 0));
//zero, one, two and three now each hold one tap from all 8 independent samples (4 A/B pairs)
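The processing stage that consumes these is then just a few multiplies/FMAs, roughly like this (c0..c3 stand in for the per-output coefficient vectors, which I haven't shown here):
__m256 result = _mm256_mul_ps(zero, c0);      //c0..c3 are hypothetical __m256 coefficient vectors
result = _mm256_fmadd_ps(one, c1, result);
result = _mm256_fmadd_ps(two, c2, result);
result = _mm256_fmadd_ps(three, c3, result);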
This ends up about 3 times as fast as the C version, but looking at the generated assembly, the load/shuffle sequence above compiles into more instructions than the actual processing and the store code combined. Is there a more efficient strategy I can use for loading and shuffling the data into vectors?