I am trying to optimize my code for different SIMD architectures. What is the best way to calculate the following?
For SSE:
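float s = something;                 // some scalar accumulator
__m128 v = calculation result;       // four packed floats
s -= v[0] + v[1] + v[2] + v[3];      // pseudocode: subtract the sum of all four lanes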
At the moment I calculate the horizontal sum by:
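__m128 sum = _mm_hadd_ps( v, v );   // (v0+v1, v2+v3, v0+v1, v2+v3), requires SSE3
sum = _mm_hadd_ps( sum, sum );      // every lane now holds v0+v1+v2+v3
s -= _mm_cvtss_f32( sum );          // extract lane 0 and subtract it from s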
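For reference, here is a minimal self-contained sketch of that approach, in case anyone wants to benchmark against it. The function name `hsum_hadd` is just something I made up for illustration, and the `target` attribute is an assumption about building with GCC or Clang without `-msse3`; it can be dropped if you pass that flag.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers) */

/* Hypothetical helper: horizontal sum of the four floats in v,
   using two horizontal-add steps. */
#if defined(__GNUC__)
__attribute__((target("sse3")))  /* assumption: lets GCC/Clang build this without -msse3 */
#endif
static inline float hsum_hadd(__m128 v)
{
    __m128 sum = _mm_hadd_ps(v, v);   /* (v0+v1, v2+v3, v0+v1, v2+v3) */
    sum = _mm_hadd_ps(sum, sum);      /* every lane holds v0+v1+v2+v3 */
    return _mm_cvtss_f32(sum);        /* return lane 0 as a scalar */
}
```

With this helper, the update from above becomes `s -= hsum_hadd(v);`.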
Is there a faster way to compute this horizontal sum?