I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient.
How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.
Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256 bit register, I can extract the 2 single precision floating point numbers.
Edit2: Potential solution, though not necessarily optimal...
float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float sumReal = 0.0;
float sumImaginary = 0.0;
__m256i mask = _mm256_set_epi32 (7, 5, 3, 1, 6, 4, 2, 0);
// Separate real and imaginary.
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum , 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum , 1);
// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);