I have a large piece of code whose body contains this line:
result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);
which I have vectorized as follows (everything is already a float):
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
// _mm_extract_ps returns each lane's bit pattern as an int,
// so the lanes go into an aligned int array first.
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r, 0), _mm_extract_ps(r, 1),
    _mm_extract_ps(r, 2), _mm_extract_ps(r, 3)
};
// Reinterpret those bit patterns back as floats:
// res[0] = nx*m_Lx, res[1] = ny*m_Ly, res[2] = nx*nx, res[3] = ny*ny.
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);
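For reference, here is a minimal sketch of a simpler extraction (not benchmarked): it stores the whole vector to an aligned float array with a single _mm_store_ps instead of extracting each lane individually:

// Same multiply as above.
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));

// One aligned 16-byte store instead of four per-lane extracts.
__declspec(align(16)) float res[4];
_mm_store_ps(res, r);

result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);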
The result is correct; however, my benchmarking shows that the vectorized version is slower:
- The non-vectorized version takes 3750 ms
- The vectorized version takes 4050 ms
- Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms
Given that the vectorized version contains only a single SSE multiply (computing all four products at once) instead of four individual FPU multiplications, why is it slower? Is the FPU really faster than SSE, or is there a confounding variable here?
(I'm on a mobile Core i5.)
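In case it helps diagnose the overhead, here is a sketch of a register-resident variant that avoids the memory round-trip entirely by doing the horizontal sums with _mm_hadd_ps (this assumes SSE3 is available, which it should be since the code above already uses the SSE4.1 _mm_extract_ps; not benchmarked yet):

#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));

// One horizontal add folds the pairs:
// lane 0 = nx*m_Lx + ny*m_Ly, lane 1 = nx*nx + ny*ny.
__m128 h = _mm_hadd_ps(r, r);

float num = _mm_cvtss_f32(h) + m_Lz;  // lane 0
float den = _mm_cvtss_f32(_mm_shuffle_ps(h, h, _MM_SHUFFLE(1, 1, 1, 1))) + 1.0f;  // lane 1
result = num / sqrt(den);

If the slowdown comes from moving data between the SSE and scalar domains, I would expect this version to close the gap.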