I have a large piece of code whose body contains this line:
result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);
which I have vectorized as follows (everything is already a float):
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
// _mm_extract_ps returns each lane's bit pattern as an int,
// so the lanes go into an aligned int array first.
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r, 0), _mm_extract_ps(r, 1),
    _mm_extract_ps(r, 2), _mm_extract_ps(r, 3)
};
// Reinterpret those bit patterns back as floats:
// res[0] = nx*m_Lx, res[1] = ny*m_Ly, res[2] = nx*nx, res[3] = ny*ny.
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);
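For reference, here is a minimal sketch of a simpler extraction (not benchmarked): it stores the whole vector to an aligned float array with a single _mm_store_ps instead of extracting each lane individually:

// Same multiply as above.
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));

// One aligned 16-byte store instead of four per-lane extracts.
__declspec(align(16)) float res[4];
_mm_store_ps(res, r);

result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);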
The result is correct; however, my benchmarking shows that the vectorized version is slower:
- The non-vectorized version takes 3750 ms
- The vectorized version takes 4050 ms
- Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms
Given that the vectorized version contains only a single SSE multiply (computing all four products at once) instead of four individual FPU multiplications, why is it slower? Is the FPU really faster than SSE, or is there a confounding variable here?
(I'm on a mobile Core i5.)
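In case it helps diagnose the overhead, here is a sketch of a register-resident variant that avoids the memory round-trip entirely by doing the horizontal sums with _mm_hadd_ps (this assumes SSE3 is available, which it should be since the code above already uses the SSE4.1 _mm_extract_ps; not benchmarked yet):

#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));

// One horizontal add folds the pairs:
// lane 0 = nx*m_Lx + ny*m_Ly, lane 1 = nx*nx + ny*ny.
__m128 h = _mm_hadd_ps(r, r);

float num = _mm_cvtss_f32(h) + m_Lz;  // lane 0
float den = _mm_cvtss_f32(_mm_shuffle_ps(h, h, _MM_SHUFFLE(1, 1, 1, 1))) + 1.0f;  // lane 1
result = num / sqrt(den);

If the slowdown comes from moving data between the SSE and scalar domains, I would expect this version to close the gap.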