I have a 256-bit AVX register containing 4 single-precision complex numbers stored interleaved as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256-bit register back to memory and summing it there, but that seems inefficient.

How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.

Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256-bit register, I can extract the 2 single-precision floating-point numbers.
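
For reference, the computation I want is equivalent to the scalar sketch below. The function name and the pointer-based signature are only illustrative, and the input is assumed to use the interleaved layout described above.

```c
#include <stddef.h>

/* Scalar reference (hypothetical helper): sums n_complex interleaved
   complex values laid out as re0, im0, re1, im1, ... */
void hsum_complex_scalar(const float *in, size_t n_complex,
                         float *sum_re, float *sum_im)
{
    float re = 0.0f;
    float im = 0.0f;
    for (size_t i = 0; i < n_complex; ++i) {
        re += in[2 * i];      /* even positions hold real parts */
        im += in[2 * i + 1];  /* odd positions hold imaginary parts */
    }
    *sum_re = re;
    *sum_im = im;
}
```

Any vectorized version should produce the same two values for the 4 complex numbers held in one register, up to floating-point summation order.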

Edit 2: A potential solution, though not necessarily optimal (note that _mm256_permutevar8x32_ps requires AVX2):

float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}

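float sumReal = 0.0f;
float sumImaginary = 0.0f;

// Index vector for the cross-lane permute: even elements (real parts) go to
// the low 128 bits, odd elements (imaginary parts) to the high 128 bits.
// _mm256_set_epi32 lists its arguments from element 7 down to element 0.
__m256i mask = _mm256_set_epi32(7, 5, 3, 1, 6, 4, 2, 0);

// Separate real and imaginary (sseSum0 is the 256-bit input).
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum, 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum, 1);

// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);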
user1777820
  • We're a code site; you could explain what you want more easily with a simple example. That said, I'm missing the problem. Yes, AVX is an insane design with its 2 lanes, but in this case that doesn't hurt. Shuffle the real parts to one lane, imaginary components to the other lane (bits 0:127 and 128:255), do a horizontal add within the lane. Results end up in 0:31 and 128:159. – MSalters Jul 12 '16 at 14:50
  • What is your goal? Maybe you want to consider storing eight complex numbers in two AVX registers: the real parts in one and the imaginary parts in another. It really depends on what your goal is. – Z boson Jul 13 '16 at 06:50
  • @Zboson My goal is exactly as stated in the question. The data is stored such that real and imaginary values are alternating and this cannot be changed. The solution I proposed in my edit separates the real and imaginary components in order to sum them, but the input data will be alternating real and imaginary. – user1777820 Jul 13 '16 at 14:26

1 Answer

One fairly straightforward solution which requires only AVX (not AVX2):

__m128 v0 = _mm256_castps256_ps128(v);       // get low 2 complex values
__m128 v1 = _mm256_extractf128_ps(v, 1);     // get high 2 complex values
v0 = _mm_add_ps(v0, v1);                     // add high and low
v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2)); // swap the two 64-bit halves
v0 = _mm_add_ps(v0, v1);                     // combine two halves of result

The result will be in v0 as { sum.re, sum.im, sum.re, sum.im }.
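
As a minimal sketch, the steps above can be wrapped in a self-contained function. The name hsum_complex_avx, the pointer-based interface, the unaligned load, and the HSUM_TARGET_AVX macro are all assumptions made for illustration. The macro uses the GCC/Clang target attribute so the function compiles without -mavx; with other compilers, enable AVX through the build flags instead.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang-specific attribute; expands to nothing elsewhere. */
#if defined(__GNUC__)
#define HSUM_TARGET_AVX __attribute__((target("avx")))
#else
#define HSUM_TARGET_AVX
#endif

/* in points to 8 floats laid out as re0, im0, re1, im1, re2, im2, re3, im3.
   Hypothetical wrapper around the AVX-only sequence shown above. */
HSUM_TARGET_AVX
void hsum_complex_avx(const float *in, float *sum_re, float *sum_im)
{
    __m256 v  = _mm256_loadu_ps(in);              /* unaligned load of 4 complex values */
    __m128 v0 = _mm256_castps256_ps128(v);        /* low 2 complex values */
    __m128 v1 = _mm256_extractf128_ps(v, 1);      /* high 2 complex values */
    v0 = _mm_add_ps(v0, v1);                      /* { re0+re2, im0+im2, re1+re3, im1+im3 } */
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2)); /* swap the 64-bit halves */
    v0 = _mm_add_ps(v0, v1);                      /* { sum.re, sum.im, sum.re, sum.im } */

    float out[4];
    _mm_storeu_ps(out, v0);                       /* spill once to read both lanes */
    *sum_re = out[0];
    *sum_im = out[1];
}
```

If the sum feeds further vector code, it is probably better to keep the result in v0 rather than storing it; the store here is only to make the sketch easy to call and check.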

Paul R
  • Weird - something to do with CR/LF line endings maybe ? – Paul R Jul 13 '16 at 06:28
  • Hahaha...I am using a Random Agent Spoofer in Firefox which I recently installed to try out. Apparently, the profile it randomly choose did that (I think it was IE10). When I choose a new random profile it shows up correctly. – Z boson Jul 13 '16 at 06:39
  • Aha - that's a relief then - I hit [edit] earlier and couldn't see anything wrong with the post. – Paul R Jul 13 '16 at 09:04