4

I need to write a dot product using SSE2 (no _mm_dp_ps nor _mm_hadd_ps):

#include <xmmintrin.h>

inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(2, 1, 0, 3));

    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}

but I looked at the assembly generated by gcc 4.9 (experimental) with -O3, and I get:

    mulps   %xmm1, %xmm0
    movaps  %xmm0, %xmm3         // these lines
    movaps  %xmm0, %xmm2         // seem useless,
    movaps  %xmm0, %xmm1         // don't they?
    shufps  $57, %xmm0, %xmm3
    shufps  $78, %xmm0, %xmm2
    shufps  $147, %xmm0, %xmm1
    addss   %xmm3, %xmm0
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret

I am wondering why gcc copies xmm0 into xmm1, xmm2 and xmm3... Here is the code I get with the flag -march=native (it looks better):

    vmulps  %xmm1, %xmm0, %xmm1
    vshufps $78, %xmm1, %xmm1, %xmm2
    vshufps $57, %xmm1, %xmm1, %xmm3
    vshufps $147, %xmm1, %xmm1, %xmm0
    vaddss  %xmm3, %xmm1, %xmm1
    vaddss  %xmm2, %xmm1, %xmm1
    vaddss  %xmm0, %xmm1, %xmm0
    ret
matovitch
  • Are you calling this function in a loop, or are you really only doing a single 4-point dot product? If you're doing it in a loop then see this answer: http://stackoverflow.com/a/17001365/253056 and replace `_mm_hadd_ps` with scalar code. – Paul R Jun 08 '13 at 16:13
  • My compiler (not gcc) generates the same kind of code, strange coincidence. I don't see any hint that SHUFPS might be faster if it uses two distinct registers. Maybe it is on older processors. – Hans Passant Jun 08 '13 at 16:33
  • My processor is not so old: i5-2450M (so it has SSE4.2 and AVX, but this is for a "portable" version). I am just compiling the code I gave: gcc dot.c -O3 -S -o dot.s. So no loop is involved. – matovitch Jun 08 '13 at 16:39
  • The compiler doesn't pay attention to your processor. The code also needs to run on another machine. – Hans Passant Jun 08 '13 at 17:59
  • @HansPassant gcc does if, like him, you pass `-march=native` – Guillaume Jun 08 '13 at 18:39
  • You can get gcc to pay attention to your processor while still making code that can run anywhere: `-mtune=native`. It doesn't usually make a lot of difference, but at least when tuning for Intel CPUs, it tries harder to keep compare-and-branch together for macro-fusion and doesn't waste space on `rep ret`. AMD CPUs fuse `test` and `cmp` with branch instructions too, so gcc *should* always be doing that unless specifically tuning for a CPU without that feature. (And then yes, putting instructions between the flag producer and consumer might help the OOO engine). – Peter Cordes Jan 29 '16 at 16:30

4 Answers

5

Here's a dot product using only original SSE instructions, that also swizzles the result across each element:

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);
    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);

    return v0;
}

It's 5 SIMD instructions (as opposed to 7), though with no real opportunity to hide latencies. Any element will hold the result, e.g., float f = _mm_cvtss_f32(sse_dot4(a, b));

The haddps instruction has pretty awful latency. With SSE3:

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);

    return v0;
}

This is possibly slower, though it's only 3 SIMD instructions. If you can do more than one dot product at a time, you could interleave instructions in the first case. Shuffle is very fast on more recent micro-architectures.
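
For example, here is a rough sketch of interleaving two independent dot products so their shuffle/add chains can overlap and hide some of the latency (the helper name sse_dot4_x2 is mine, and this hasn't been benchmarked):

static inline void sse_dot4_x2(__m128 a0, __m128 b0, __m128 a1, __m128 b1,
                               float *out0, float *out1)
{
    __m128 m0 = _mm_mul_ps(a0, b0);
    __m128 m1 = _mm_mul_ps(a1, b1);

    /* the two chains share no registers, so the CPU can run them in parallel */
    __m128 s0 = _mm_shuffle_ps(m0, m0, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 s1 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2, 3, 0, 1));
    m0 = _mm_add_ps(m0, s0);
    m1 = _mm_add_ps(m1, s1);

    s0 = _mm_shuffle_ps(m0, m0, _MM_SHUFFLE(0, 1, 2, 3));
    s1 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(0, 1, 2, 3));
    m0 = _mm_add_ps(m0, s0);
    m1 = _mm_add_ps(m1, s1);

    *out0 = _mm_cvtss_f32(m0);
    *out1 = _mm_cvtss_f32(m1);
}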

Brett Hale
  • Thanks; nevertheless, in my case the 3 shuffles are independent and addss must be faster than addps (one more shuffle is needed to swizzle the result across each element). This deserves a benchmark. – matovitch Jun 08 '13 at 23:22
  • @matovitch *"addss must be faster than addps"* - Isn't the whole point of SSE that `addps` is *not* slower than `addss`? – Christian Rau Jun 10 '13 at 11:05
4

The first listing you pasted is plain SSE only. Most SSE instructions support only the two-operand syntax: instructions have the form a = a OP b.

In your code, a is mult. So if no copy were made and mult (xmm0 in your example) were passed directly, its value would be overwritten and then lost for the remaining _mm_shuffle_ps instructions.

By passing -march=native in the second listing, you enabled AVX instructions. AVX lets the SSE instructions use the three-operand syntax: c = a OP b. In that case, neither source operand has to be overwritten, so you no longer need the additional copies.
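
If you want to see this for yourself, here is a minimal test case (mine, not part of the original answer); compile it once with gcc -O3 -S and once with gcc -O3 -mavx -S and compare the output. Since m stays live across both shuffles, the SSE build should show the extra movaps copies, much like in the question, while the AVX build should not.

#include <xmmintrin.h>

/* m must survive both shuffles: with the destructive two-operand shufps
   the compiler has to copy it first; with three-operand vshufps it doesn't */
__m128 keep_both_shuffles(__m128 a, __m128 b)
{
    __m128 m  = _mm_mul_ps(a, b);
    __m128 s1 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(0, 3, 2, 1));
    __m128 s2 = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_add_ps(_mm_add_ps(m, s1), s2);
}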

Guillaume
  • I do not understand. As you say, `mult` is in `xmm0`, but the first shuffle does not overwrite `xmm0`, it overwrites `xmm3`, doesn't it? – matovitch Jun 08 '13 at 18:58
  • @matovitch Yes, but you can only do that because xmm0 was copied to xmm3 first. Otherwise the operation would be `xmm0 = xmm0 SHUFFLE _MM_SHUFFLE(0, 3, 2, 1)` and your initial value of `mult` would be gone. So the compiler makes a copy before the operation. – Guillaume Jun 08 '13 at 19:04
  • I get it! The shuffle needs 2 xmm registers, so it has to copy xmm0 into the register it will overwrite. Thanks! (wait... is that what you wanted to say?) – matovitch Jun 08 '13 at 19:07
  • @matovitch Yes unless you're using AVX like in your 2nd example when you pass `-march=native` – Guillaume Jun 08 '13 at 19:08
  • But since I compute the dot product in the first component only (see addss), I think this should work without copying xmm0 into xmm1..3... maybe... but the compiler has no way to understand this. I will try. – matovitch Jun 08 '13 at 19:12
  • @matovitch you should try to write down the instructions if you're not convinced. If you use pure SSE and not AVX, the register operand of the shuffle will be overwritten (it's both a source and a destination operand). If the compiler did not make a copy, the value of the source operand would be lost. But your code needs it; you keep reusing `mult`. – Guillaume Jun 08 '13 at 19:17
  • I tried, and indeed I need these copies. I don't know the shuffle instruction well enough to understand why, but I will dig into this. (Can you say that in English? ;-) – matovitch Jun 08 '13 at 19:25
  • @matovitch It's not just the shuffle instructions, it's most SSE instructions: one source operand is also the destination operand. So if you want to keep the initial value of that source operand, you need to copy it prior to the instruction. Once again, this was changed in AVX (shufps becomes vshufps): the AVX forms of the old SSE instructions allow two read-only source operands, so you do not need this copy. – Guillaume Jun 08 '13 at 19:28
  • Ok. Thank you very much. Now my dot product seems quite inefficient to me. I'm sure there is a better way to do this... – matovitch Jun 08 '13 at 19:40
  • @matovitch *"I'm sure there is a better way to do this"* - No there isn't. Really, that register-register copy isn't a problem, *and you cannot do without it, that's just how the instruction set is* (except if you're using *AVX*, of course, and if you have *AVX*, then you also have *SSE3+* and can just use `haddps` or `dpps`). – Christian Rau Jun 10 '13 at 11:02
4

Let me suggest that if you're going to use SIMD for a dot product, you try to find a way to operate on multiple vectors at once. For example, with SSE, if you have four vectors and you want the dot product of each with a fixed vector, arrange the data as (xxxx), (yyyy), (zzzz), (wwww), multiply each of those SSE vectors by the matching broadcast component of the fixed vector, and add the products: you get four dot products at once. That gets you to 100% efficiency (a four-times speedup), and it's not limited to 4-component vectors; it is 100% efficient for n-component vectors as well. Here is an example which only uses SSE.

#include <xmmintrin.h>
#include <stdio.h>

void dot4x4(float *aosoa, float *b, float *out) {   
    __m128 vx = _mm_load_ps(&aosoa[0]);
    __m128 vy = _mm_load_ps(&aosoa[4]);
    __m128 vz = _mm_load_ps(&aosoa[8]);
    __m128 vw = _mm_load_ps(&aosoa[12]);
    __m128 brod1 = _mm_set1_ps(b[0]);
    __m128 brod2 = _mm_set1_ps(b[1]);
    __m128 brod3 = _mm_set1_ps(b[2]);
    __m128 brod4 = _mm_set1_ps(b[3]);
    __m128 dot4 = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(brod1, vx), _mm_mul_ps(brod2, vy)),
        _mm_add_ps(_mm_mul_ps(brod3, vz), _mm_mul_ps(brod4, vw)));
    _mm_store_ps(out, dot4);

}

int main() {
    float *aosoa = (float*)_mm_malloc(sizeof(float)*16, 16);
    /* initialize array with AoSoA vectors v1 = (0,1,2,3), v2 = (4,5,6,7), v3 = (8,9,10,11), v4 = (12,13,14,15) */
    float a[] = {
        0,4,8,12,
        1,5,9,13,
        2,6,10,14,
        3,7,11,15,
    };
    for (int i=0; i<16; i++) aosoa[i] = a[i];

    float *out = (float*)_mm_malloc(sizeof(float)*4, 16);
    float b[] = {1,1,1,1};
    dot4x4(aosoa, b, out);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);

    _mm_free(aosoa);
    _mm_free(out);
}
  • Indeed, but in my case I wanted to do a `_mm_mul_ps` afterwards (so I needed another shuffle, or `_mm_dp_ps(a, b, 0xff)`). Thanks for the great and detailed example though. – matovitch Jun 12 '13 at 17:39
1

(In fact, and despite all the up-votes, the answers that were given at the time this question was posted did not fulfill the expectations I had. Here is the answer I was waiting for.)

The SSE instruction

shufps $IMM, xmmA, xmmB

does not work as

xmmB = f($IMM, xmmA) 
//set xmmB with xmmA's words shuffled according to $IMM

but as

xmmB = f($IMM, xmmA, xmmB) 
//set xmmB with 2 words of xmmA and 2 words of xmmB according to $IMM

This is why the copy of the mulps result from xmm0 into xmm1..3 is needed.
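
To make that concrete, here is a small illustration at the intrinsic level (my own example, added for clarity). _mm_shuffle_ps takes two inputs; without AVX, the compiled shufps uses the register holding the first input as its destination and overwrites it:

#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_setr_ps(0.f, 1.f, 2.f, 3.f);
    __m128 b = _mm_setr_ps(4.f, 5.f, 6.f, 7.f);

    /* low two words of the result come from the first argument,
       high two words from the second */
    __m128 r = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));

    float out[4];
    _mm_storeu_ps(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* prints: 0 1 6 7 */
    return 0;
}

When both arguments are the same still-live value, as with mult above, the compiler therefore has to materialise a copy before each shuffle.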

matovitch
  • Note that gcc could have used `pshufd` to copy-and-shuffle, but on some CPUs that would result in multiple cycles of bypass delay (extra latency). On Intel IvB and later, which can handle reg-reg `mov` instructions at the register-rename stage, the `mov` instructions have zero latency, and `pshufd` would save a `movaps` but add 1 cycle of latency. On SnB, it would be pure gain: the mov has latency, so you're trading one cycle of mov latency for one cycle of bypass delay. – Peter Cordes Jan 29 '16 at 16:26
  • On AMD Bulldozer, even FP shuffles run in the `ivec` domain, so for that CPU there's no downside to using `pshufd` and saving the (zero-latency) `movaps`. – Peter Cordes Jan 29 '16 at 16:28