Most efficient way to store 4 dot products into a contiguous array in C using SSE intrinsics

Question

I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.

A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,

tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

_mm_storeu_ps(C_2, tmp0);

Notice that I am going about this by using 4 temporary xmm registers to hold the result of each dot product. In each xmm register, the result is placed into a unique 32 bits relative to the other temporary xmm registers such that the end result looks like this:

tmp0 = R0-zero-zero-zero
tmp1 = zero-R1-zero-zero
tmp2 = zero-zero-R2-zero
tmp3 = zero-zero-zero-R3

I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);

Finally, I add the register holding all 4 dot products to a contiguous part of an array, so that each of the 4 array elements is incremented by one dot product (C_0n holds the 4 values currently in the array; C_2 is the address of those 4 values):

tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);

I want to know if there is a less roundabout, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. As it stands, I am doing 3 additions between registers that each hold only 1 non-zero value. It seems there should be a more efficient way to go about this.

I appreciate all help. Thank you.

celion · Accepted Answer · 2010-11-13T09:21:29.593

For code like this, I like to store the "transpose" of the A's and B's, so that {A_0m.x, A_1m.x, A_2m.x, A_3m.x} are stored in one vector, etc. Then you can do the dot product using just multiplies and adds, and when you're done, you have all 4 dot products in one vector without any shuffling.

This is used frequently in raytracing, to test 4 rays at once against a plane (e.g. when traversing a kd-tree). If you don't have control over the input data, though, the overhead of doing the transpose might not be worth it. The code will also run on pre-SSE4 machines, although that might not be an issue.
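As a rough illustration of the transposed layout, here is a minimal sketch. It assumes the four A vectors are already stored component-wise and that all four dot products share one B vector, as in the question. The function name `dot4_transposed_accumulate` and its parameters are hypothetical, not from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* plain SSE is enough here */

/* Hypothetical helper: adds 4 dot products to c[0..3].
 * a_t[k] = { A_0[k], A_1[k], A_2[k], A_3[k] }  (the A's stored transposed)
 * b      = the shared B vector, 4 floats. */
static void dot4_transposed_accumulate(float *c, const __m128 a_t[4],
                                       const float b[4])
{
    assert(c != NULL && a_t != NULL && b != NULL);

    __m128 acc = _mm_loadu_ps(c);            /* start from the old C values */
    for (int k = 0; k < 4; ++k) {
        __m128 bk = _mm_set1_ps(b[k]);       /* broadcast component k of B */
        acc = _mm_add_ps(acc, _mm_mul_ps(a_t[k], bk));
    }
    _mm_storeu_ps(c, acc);                   /* lane i now holds C[i] + dot(A_i, B) */
}
```

Each lane accumulates its own dot product, so no horizontal operation or shuffle is needed, and the old array values come along as the initial accumulator.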

A small efficiency note on the existing code: instead of this

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

It may be slightly better to do this:

tmp0 = _mm_add_ps(tmp0, tmp1);  // 0 + 1 -> 0
tmp2 = _mm_add_ps(tmp2, tmp3);  // 2 + 3 -> 2
tmp0 = _mm_add_ps(tmp0, tmp2);  // 0 + 2 -> 0
tmp0 = _mm_add_ps(tmp0, C_0n);

The first two _mm_add_ps calls are now completely independent of each other, so they do not have to wait on one another. Also, I don't know the relative timings of adding vs. shuffling, but that might be slightly faster.

Hope that helps.

score 3 · Answer 2 · answered Dec 17 '10 at 11:03

It is also possible to use the SSE3 horizontal add (_mm_hadd_ps). In some trivial tests it turned out faster than using _mm_dp_ps. The function below returns the 4 dot products in one vector, which can then be added to the array values.

static inline __m128 dot_p(const __m128 x, const __m128 y[4])
{
   __m128 z[4];

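   /* _mm_mul_ps instead of the * operator, which only GCC/Clang accept on __m128 */
   z[0] = _mm_mul_ps(x, y[0]);
   z[1] = _mm_mul_ps(x, y[1]);
   z[2] = _mm_mul_ps(x, y[2]);
   z[3] = _mm_mul_ps(x, y[3]);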
   z[0] = _mm_hadd_ps(z[0], z[1]);
   z[2] = _mm_hadd_ps(z[2], z[3]);
   z[0] = _mm_hadd_ps(z[0], z[2]);

   return z[0];
}
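To make the lane order explicit, here is a hedged, self-contained sketch of the same idea that also adds the result to the array. The name `hadd_dot4_accumulate` and the `TARGET_SSE3` macro are made up for this example; the macro is only there so GCC/Clang will compile the function without a global -msse3 flag.

```c
#include <assert.h>
#include <stddef.h>
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Assumption: GCC or Clang. Other compilers get an empty macro. */
#if defined(__GNUC__)
#define TARGET_SSE3 __attribute__((target("sse3")))
#else
#define TARGET_SSE3
#endif

/* Hypothetical helper: c[i] += dot(x, y[i]) for i = 0..3. */
TARGET_SSE3
static void hadd_dot4_accumulate(float *c, __m128 x, const __m128 y[4])
{
    assert(c != NULL && y != NULL);

    __m128 z0 = _mm_mul_ps(x, y[0]);
    __m128 z1 = _mm_mul_ps(x, y[1]);
    __m128 z2 = _mm_mul_ps(x, y[2]);
    __m128 z3 = _mm_mul_ps(x, y[3]);

    /* s01 = { z0[0]+z0[1], z0[2]+z0[3], z1[0]+z1[1], z1[2]+z1[3] } */
    __m128 s01 = _mm_hadd_ps(z0, z1);
    __m128 s23 = _mm_hadd_ps(z2, z3);

    /* dots = { sum(z0), sum(z1), sum(z2), sum(z3) } */
    __m128 dots = _mm_hadd_ps(s01, s23);

    _mm_storeu_ps(c, _mm_add_ps(dots, _mm_loadu_ps(c)));
}
```

The two-level hadd leaves dot product i in lane i, so the result lines up with the array without any extra shuffle.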

score 1 · Answer 3 · answered Dec 22 '12 at 00:41

I realize this question is old, but why use _mm_add_ps at all? Replace it with:

tmp0 = _mm_or_ps(tmp0, tmp1);
tmp2 = _mm_or_ps(tmp2, tmp3);
tmp0 = _mm_or_ps(tmp0, tmp2);

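This works because each tmp register is zero in every lane except its own, so a bit-wise OR merges them just as the additions did. You can probably hide some of the _mm_dp_ps latency. The first _mm_or_ps doesn't wait for the final 2 dot products either, and it's a (fast) bit-wise operation. Finally:

_mm_storeu_ps(C_2, _mm_add_ps(tmp0, C_0n));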
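Put together, the OR-based merge might look like the sketch below. The function name `dot4_or_accumulate`, its parameter list, and the `TARGET_SSE41` macro are assumptions for illustration; the macro only lets GCC/Clang build the function without a global -msse4.1 flag.

```c
#include <assert.h>
#include <stddef.h>
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Assumption: GCC or Clang. Other compilers get an empty macro. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: c[i] += dot(a_i, b) for i = 0..3. */
TARGET_SSE41
static void dot4_or_accumulate(float *c, __m128 a0, __m128 a1,
                               __m128 a2, __m128 a3, __m128 b)
{
    assert(c != NULL);

    /* High nibble 0xF: multiply all 4 lanes. Low nibble: destination lane. */
    __m128 t0 = _mm_dp_ps(a0, b, 0xF1);  /* { R0, 0,  0,  0  } */
    __m128 t1 = _mm_dp_ps(a1, b, 0xF2);  /* { 0,  R1, 0,  0  } */
    __m128 t2 = _mm_dp_ps(a2, b, 0xF4);  /* { 0,  0,  R2, 0  } */
    __m128 t3 = _mm_dp_ps(a3, b, 0xF8);  /* { 0,  0,  0,  R3 } */

    /* The non-result lanes are all-zero bits, so OR acts as a merge. */
    t0 = _mm_or_ps(t0, t1);
    t2 = _mm_or_ps(t2, t3);
    t0 = _mm_or_ps(t0, t2);

    _mm_storeu_ps(c, _mm_add_ps(t0, _mm_loadu_ps(c)));
}
```

Whether this beats the add-based version on a given CPU is something to measure rather than assume.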

Crashworks · Answer 4 · 2010-11-13T09:42:13.563


You could try leaving the dot product result in the low word and use the scalar store op _mm_store_ss to save that one float from each m128 register into the appropriate location of the array. Nehalem's store buffer should accumulate consecutive writes on the same line and flush them to L1 in batches.
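A minimal sketch of the scalar-store idea follows, including the add of the old array value that each store needs. The name `dot4_scalar_store_accumulate`, its parameters, and the `TARGET_SSE41` macro are hypothetical; the macro only lets GCC/Clang build the function without a global -msse4.1 flag.

```c
#include <assert.h>
#include <stddef.h>
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Assumption: GCC or Clang. Other compilers get an empty macro. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: c[i] += dot(a_i, b), one scalar store per result. */
TARGET_SSE41
static void dot4_scalar_store_accumulate(float *c, __m128 a0, __m128 a1,
                                         __m128 a2, __m128 a3, __m128 b)
{
    assert(c != NULL);

    /* Mask 0xF1 leaves every result in lane 0 of its own register. */
    __m128 t0 = _mm_dp_ps(a0, b, 0xF1);
    __m128 t1 = _mm_dp_ps(a1, b, 0xF1);
    __m128 t2 = _mm_dp_ps(a2, b, 0xF1);
    __m128 t3 = _mm_dp_ps(a3, b, 0xF1);

    /* Scalar add of the old value, then a scalar store of lane 0. */
    _mm_store_ss(c + 0, _mm_add_ss(t0, _mm_load_ss(c + 0)));
    _mm_store_ss(c + 1, _mm_add_ss(t1, _mm_load_ss(c + 1)));
    _mm_store_ss(c + 2, _mm_add_ss(t2, _mm_load_ss(c + 2)));
    _mm_store_ss(c + 3, _mm_add_ss(t3, _mm_load_ss(c + 3)));
}
```

The four load/add/store chains are independent of one another, but this trades three vector merges for four scalar loads, adds, and stores, so it is worth benchmarking against the merged version.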

The pro way to do it is celion's transpose approach. The _MM_TRANSPOSE4_PS macro from xmmintrin.h (shipped with MSVC and also provided by GCC and Clang) will do the transpose for you.
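If the inputs arrive in row order, the transpose can be done on the fly. The sketch below assumes the four A vectors sit back to back in a 16-float array and that all four dot products share one B vector; the name `dot4_transpose_accumulate` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _MM_TRANSPOSE4_PS */

/* Hypothetical helper: c[i] += dot(A_i, b) for i = 0..3.
 * a holds A_0..A_3 back to back (16 floats), b holds 4 floats. */
static void dot4_transpose_accumulate(float *c, const float *a, const float *b)
{
    assert(c != NULL && a != NULL && b != NULL);

    __m128 r0 = _mm_loadu_ps(a + 0);
    __m128 r1 = _mm_loadu_ps(a + 4);
    __m128 r2 = _mm_loadu_ps(a + 8);
    __m128 r3 = _mm_loadu_ps(a + 12);

    /* In place: afterwards r0 = { A_0.x, A_1.x, A_2.x, A_3.x }, and so on. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    __m128 acc = _mm_loadu_ps(c);
    acc = _mm_add_ps(acc, _mm_mul_ps(r0, _mm_set1_ps(b[0])));
    acc = _mm_add_ps(acc, _mm_mul_ps(r1, _mm_set1_ps(b[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(r2, _mm_set1_ps(b[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(r3, _mm_set1_ps(b[3])));
    _mm_storeu_ps(c, acc);
}
```

The transpose costs several shuffles, so this only pays off if it can be amortized, for example by reusing the transposed A's across many B vectors.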

edited Nov 13 '10 at 09:42

answered Nov 13 '10 at 09:29


You'd still have to add the old value (C_0n) to each dot product before the store. They'd all be independent, so it might not be too slow, but it's not much prettier :) – celion Nov 13 '10 at 09:37
