
I'm using intrinsics to optimize a program of mine, and I would now like to sum the four elements of a __m128 vector in order to compare the result to a floating-point value. For instance, say I have this 128-bit vector: {a, b, c, d}. How can I compare a+b+c+d to e, where e is of type float?

Does SSE2 or SSE3 provide a simple way to do that, or do you have a code snippet that could help me? Thanks!

Merkil
  • You could do that in SSE3 with two HADDPS's, but that's not very fast. Where does this vector come from? Can whatever calculates it be rearranged such that a horizontal addition could be avoided? – harold Apr 15 '12 at 16:14
  • Well I must compare (a*a + b*b) and 4.0. To do this, I've stored a and b in a __m128 vector, such as vec = {a, b, UNUSED, UNUSED}. I obtain {a², b², X, X} by doing square = _mm_mul_ps(vec, vec). And now, I'm searching for a way to get a² + b² so that I can compare it to 4.0. That's certainly not optimal, so if you have any advice, it would be much appreciated :) – Merkil Apr 15 '12 at 16:30
  • I don't have SSE4 support on my processor, sorry. – Merkil Apr 15 '12 at 16:52
  • That's a shame, DPPS was a real good fit for this problem. Ok, I'll have to think about it. – harold Apr 15 '12 at 16:53
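
For the four-element sum in the original question, here is a minimal sketch that avoids HADDPS and needs only SSE1 shuffles. The helper names `hsum_ps` and `hsum_le` are made up for illustration, and it assumes a compiler that provides `<xmmintrin.h>`:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: returns v0 + v1 + v2 + v3 for v = [v3, v2, v1, v0]. */
static float hsum_ps(__m128 v)
{
    /* Copy the upper half of v onto the lower half: [v3, v2, v3, v2]. */
    __m128 hi   = _mm_movehl_ps(v, v);
    /* Lane 0 = v0 + v2, lane 1 = v1 + v3 (upper lanes are not used). */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* Broadcast lane 1 so that v1 + v3 also sits in lane 0. */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* Scalar add in lane 0: (v0 + v2) + (v1 + v3). */
    __m128 sum  = _mm_add_ss(sum2, odd);
    return _mm_cvtss_f32(sum);
}

/* Hypothetical helper: nonzero when a + b + c + d <= e. */
static int hsum_le(__m128 v, float e)
{
    return hsum_ps(v) <= e;
}
```

For example, `hsum_le(_mm_set_ps(d, c, b, a), e)` tests a+b+c+d against e. Note that the additions are paired as (a+c)+(b+d), so rounding can differ slightly from a left-to-right scalar sum.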

1 Answer


The best I can come up with is this:

; assumes    xmm0 = [0, B, 0, A] or similar
mulps xmm0,xmm0   ; [0, B*B, 0, A*A]
xorps xmm1,xmm1
movhlps xmm1,xmm0 ; [0, 0, 0, B * B]
addps xmm0,xmm1   ; [0, B * B, 0, A * A + B * B] (the sum is in the low element)

If A and B absolutely have to be in the low quadword then as far as I can tell you need a shuffle, which is slower on pre-Penryn (and on a Penryn the DPPS solution is available).
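
Since the question uses intrinsics, the same sequence could be sketched in C roughly as follows. The function name `sum_of_squares` is hypothetical, and it assumes A is placed in element 0 and B in element 2, as in the assembly above:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: the low element of the result holds a*a + b*b. */
static __m128 sum_of_squares(float a, float b)
{
    /* _mm_set_ps takes elements from high to low: [0, b, 0, a]. */
    __m128 v  = _mm_set_ps(0.0f, b, 0.0f, a);
    /* mulps: [0, b*b, 0, a*a] */
    __m128 sq = _mm_mul_ps(v, v);
    /* xorps + movhlps: upper half of sq moved into a zeroed register,
       giving [0, 0, 0, b*b]. */
    __m128 hi = _mm_movehl_ps(_mm_setzero_ps(), sq);
    /* addps: [0, b*b, 0, a*a + b*b]; only the low element is of interest. */
    return _mm_add_ps(sq, hi);
}
```

The scalar result can be read with `_mm_cvtss_f32(sum_of_squares(a, b))`, or the vector can be kept in a register for a later scalar comparison.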

harold
  • Thanks a lot. But now that I have this vector, how do I compare it to 4.0? Should I create a vector containing {0, 0, 0, 4} and compare them with _mm_cmpeq_ss? – Merkil Apr 15 '12 at 17:34
  • If you want, but this is floating point, so it doesn't really mix well with the EQ variant. What's it for? Can the comparison be replaced by a LE or NLT variant? – harold Apr 15 '12 at 17:41
  • Well, in fact I could just as well use LE. Thanks for your help! – Merkil Apr 15 '12 at 17:46
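
One way the LE comparison from these comments might look with intrinsics is sketched below. The helper name `low_element_le` is made up here. It uses `_mm_comile_ss`, which returns a plain int, rather than the mask-producing `_mm_cmple_ss`:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: returns 1 when the low element of v is <= limit,
   0 otherwise. The upper three elements of v are ignored. */
static int low_element_le(__m128 v, float limit)
{
    /* _mm_set_ss puts limit in the low element and zeroes the rest. */
    return _mm_comile_ss(v, _mm_set_ss(limit));
}
```

With the vector from the answer, the test would read something like `low_element_le(result, 4.0f)`. If the outcome has to stay in a register as a mask, for instance to blend values without branching, `_mm_cmple_ss` would be the variant to look at instead.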