
I am trying to optimize my code for different SIMD architectures. What is the best way to calculate the following?

For SSE:

float  s = something  
__m128 v = calculation result  

s -= v[0] + v[1] + v[2] + v[3]

At the moment I calculate the horizontal sum by:

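__m128 sum = _mm_hadd_ps( v, v );      // [v0+v1, v2+v3, v0+v1, v2+v3]  
       sum = _mm_hadd_ps( sum, sum );  // every lane now holds v0+v1+v2+v3  

s -= _mm_cvtss_f32( sum );             // take lane 0 and subtract it from s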

Is there a faster way to do this?

Bill Lynch
Maik
  • 1
    That's 3 instructions - it's hard to see how you'd beat it. – Paul R Sep 07 '14 at 07:41
  • 1
    As Paul said, there is no room for optimization since you only have 4 instructions. Do you have that embedded in a loop, or is that a pattern of a code that you have to vectorize? If so, could you perhaps show the whole code? – a3mlord Sep 08 '14 at 20:16
  • 1
    Like @a3mlord said, if that's inside a loop, only do the horizontal op at the end of the loop. Horizontal ops are significantly slower than in-lane ops. – Peter Cordes Jun 09 '15 at 02:47
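The advice in the comments can be sketched roughly as follows: keep a vector accumulator inside the loop and do the horizontal reduction only once, after the loop. This is a minimal sketch, not the asker's actual code. The names `hsum_ps_sse1` and `subtract_total`, the input array `data`, and the requirement that `n` be a multiple of 4 are all assumptions made for illustration. The reduction here uses SSE1 shuffles instead of `_mm_hadd_ps`, which is one possible alternative and not necessarily faster on every CPU.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics only */

/* Hypothetical helper: horizontal sum of the four lanes of v. */
static inline float hsum_ps_sse1(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);   /* [v2, v3, v2, v3] */
    __m128 sum2 = _mm_add_ps(v, hi);     /* [v0+v2, v1+v3, ., .] */
    /* Broadcast lane 1 of sum2 so it can be added to lane 0. */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));  /* v0+v2+v1+v3 */
}

/* Hypothetical loop: subtracts the sum of data[0..n) from s.
   Assumes n is a multiple of 4; no tail handling is shown. */
static float subtract_total(float s, const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        /* In-lane (vertical) add: no horizontal work inside the loop. */
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    /* One horizontal reduction after the loop. */
    return s - hsum_ps_sse1(acc);
}
```

Because floating-point addition is not associative, accumulating per lane can give a slightly different rounding than subtracting each group of four from `s` in turn.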

0 Answers