
I have the following function (from the open-source project "recast navigation"):

/// Derives the dot product of two vectors on the xz-plane. (@p u . @p v)
///  @param[in]     u       A vector [(x, y, z)]
///  @param[in]     v       A vector [(x, y, z)]
/// @return The dot product on the xz-plane.
///
/// The vectors are projected onto the xz-plane, so the y-values are ignored.
inline float dtVdot2D(const float* u, const float* v)
{
    return u[0]*v[0] + u[2]*v[2];
}

I ran it through the VS2010 CPU performance profiler, and it showed that, across the entire recast codebase, the line u[0]*v[0] + u[2]*v[2] in this function is the hottest.

How can I optimize this line for the CPU, for example via SSE, or via a GLSL-like library such as GLM (if that is easier or faster and appropriate in this case)?

Edit: The context in which the calls appear:

bool dtClosestHeightPointTriangle(const float* p, const float* a, const float* b, const float* c, float& h) {
    float v0[3], v1[3], v2[3];
    dtVsub(v0, c,a);
    dtVsub(v1, b,a);
    dtVsub(v2, p,a);

    const float dot00 = dtVdot2D(v0, v0);
    const float dot01 = dtVdot2D(v0, v1);
    const float dot02 = dtVdot2D(v0, v2);
    const float dot11 = dtVdot2D(v1, v1);
    const float dot12 = dtVdot2D(v1, v2);

    // Compute barycentric coordinates
    const float invDenom = 1.0f / (dot00 * dot11 - dot01 * dot01);
    const float u = (dot11 * dot02 - dot01 * dot12) * invDenom;
    const float v = (dot00 * dot12 - dot01 * dot02) * invDenom;

    // ... (the rest of the function is omitted here)
}
myWallJSON
  • I doubt this leaf function can be optimized further because it only does 3 FP operations. Possibly an optimization is possible at the calling sites. – Gunther Piez Jun 20 '12 at 17:55
  • If this function is called a lot, try using OpenMP if it makes sense. – jn1kk Jun 20 '12 at 18:18
  • IMO that isn't the right approach. SSE isn't really meant for horizontal operations. There are some horizontal instructions, but they are (almost) all slow. With SSE it's almost always better to calculate 4 of something at once rather than trying to do 1 thing 4 times as fast. – harold Jun 20 '12 at 18:34
  • I added some more information. Obviously the code could be optimized using SSE, and it is also obvious that the data layout isn't very clever for this kind of work and would profit from restructuring. – Gunther Piez Jun 21 '12 at 01:36
  • What range of values can you expect to be in u[0],v[0],u[2], and v[2]? Also, what is the minimum precision you require (i.e. how much granularity error would be acceptable)? – phonetagger Jun 25 '12 at 16:58
  • It should be noted that GLM isn't using GLSL in any way. It is designed after GLSL, but it doesn't magically do its computations inside a shader (which would be plain rubbish, besides not being that easy to achieve properly). It's a CPU 3D math library like any other (though able to use SSE internally), just modelled after GLSL. – Christian Rau Jun 27 '12 at 12:59
  • Protip: If you (in your question title) have two sets (one of which you forget to close) of parentheses, then (if pretend to be Yoda you also do) reading the title becomes a bit of a pain in the ass. Just sayin'. Hope you don't mind that I tried to make the title a bit more readable – jalf Jun 27 '12 at 13:37

4 Answers


After trying a few things on paper I've come up with something that might work for you. It is a properly parallelized/vectorized implementation of the function in SSE.

It does however require data reorganization because we'll do parallel processing for 4 triangles at once.

I will break it up into steps and use instruction names here and there, but please use the C intrinsics (_mm_load_ps(), _mm_sub_ps() et al; they're in xmmintrin.h in VC) -- when I speak of registers, this just means __m128.

STEP 1.

We do not need the Y coordinate at all, so we set up pointers to pairs of X and Z. Supply at least 4 pairs (i.e. 4 triangles total) per call. I'll call each X and Z pair a vertex.

STEP 2.

Use MOVAPS (requires the pointers to be aligned to 16 bytes) to load the first two vertices pointed to by each pointer into registers.

A register loaded from a will look like this: [ a0.x, a0.z, a1.x, a1.z ]

STEP 3.

Now, using a single subtract instruction, you can calculate the deltas (your v0, v1, v2) for 2 vertices at once.

Calculate v0, v1 and v2 not only for the first 2 triangles, but also for the latter 2! As I said you should supply a total of 4 vertices, or 8 floats, per input. Just repeat steps 2 and 3 for that data.

Now we have 2 pairs of vx registers, each pair containing the result for 2 triangles. I'll refer to them as vx_0 (first pair) and vx_1 (second pair) where x is from 0 to 2.
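
To make steps 2 and 3 concrete, here is a minimal sketch. The pointer names pXZ/aXZ/bXZ/cXZ and the packed [x0, z0, x1, z1, ...] layout are my assumptions about the reorganized data, not something from the original code:

#include <xmmintrin.h> // SSE: _mm_load_ps, _mm_sub_ps

// Steps 2-3 for the first pair of triangles. Each pointer refers to a
// 16-byte-aligned array holding [x0, z0, x1, z1, x2, z2, x3, z3].
__m128 p01  = _mm_load_ps(pXZ);      // [p0.x, p0.z, p1.x, p1.z]
__m128 a01  = _mm_load_ps(aXZ);
__m128 b01  = _mm_load_ps(bXZ);
__m128 c01  = _mm_load_ps(cXZ);
__m128 v0_0 = _mm_sub_ps(c01, a01);  // v0 = c - a for triangles 0 and 1
__m128 v1_0 = _mm_sub_ps(b01, a01);  // v1 = b - a for triangles 0 and 1
__m128 v2_0 = _mm_sub_ps(p01, a01);  // v2 = p - a for triangles 0 and 1
// Repeat with pXZ+4, aXZ+4, ... for triangles 2 and 3 to get v0_1, v1_1, v2_1.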

STEP 4.

Dot products. In order to parallelize the barycentric calculation (later on) we require the result of each dot product for each of the 4 triangles, in 1 single register.

So where you'd calculate dot01 for example, we will do the same, but for 4 triangles at once. Each v-register contains the result for 2 vectors, so we begin by multiplying them.

Let us say that u and v -- the parameters in your dot product function -- are now v0_0 and v1_0 (so as to calculate dot01):

Multiply u and v to get: [ (v0_0.x0) * (v1_0.x0), (v0_0.z0) * (v1_0.z0), (v0_0.x1) * (v1_0.x1), (v0_0.z1) * (v1_0.z1) ]

This may look confusing because of .x0 / .z0 and .x1 / .z1, but look at what was loaded at step 2: a0, a1.

If by now this still feels fuzzy, pick up a piece of paper and write along from the start.

Next, still for the same dot product, do the multiplication for v0_1 and v1_1 (i.e. the second pair of triangles).

Now we have 2 registers with 2 X & Z pairs each (4 vertices total), multiplied and ready to be added together to form 4 separate dot products. SSE3 has an instruction to do exactly this, and it's called HADDPS:

xmm0 = [A, B, C, D]
xmm1 = [E, F, G, H]

HADDPS xmm0, xmm1 does this:

xmm0 = [A+B, C+D, E+F, G+H]

It will take the X & Z pairs from our first register, those from the second, add them together and store them in the first, second, third and fourth component of the destination register. Ergo: at this point we've got this particular dot product for all 4 triangles!

Now repeat this process for all the dot products: dot00 et cetera.
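
As a concrete illustration of step 4, a minimal sketch (the helper name dtVdot2D_x4 is made up; the register layout follows the steps above):

#include <pmmintrin.h> // SSE3: _mm_hadd_ps

// Four xz dot products at once. u01/v01 hold the X/Z pairs of the first
// two triangles' vectors, u23/v23 those of the last two.
static inline __m128 dtVdot2D_x4(__m128 u01, __m128 u23, __m128 v01, __m128 v23)
{
    __m128 m01 = _mm_mul_ps(u01, v01); // [x0*x0', z0*z0', x1*x1', z1*z1']
    __m128 m23 = _mm_mul_ps(u23, v23);
    return _mm_hadd_ps(m01, m23);      // [dot0, dot1, dot2, dot3]
}

With the naming from step 3, dot01 for all 4 triangles would then be dtVdot2D_x4(v0_0, v0_1, v1_0, v1_1).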

STEP 5.

The last calculation (as far as I could see from the supplied code) is the barycentric stuff. This is a 100% scalar calculation in your code. But your inputs are now not scalar dot product results (i.e. single floats); they are SSE vectors/registers with a dot product for each of the 4 triangles.

So if you flat out vectorize this by using the parallel SSE operations that operate on all 4 floats, you'll eventually end up with 1 register (or result) carrying 4 heights, 1 for each triangle.

Since my lunch break is well past due I'm not going to spell this one out, but given the setup/idea I've given this is a last step and shouldn't be hard to figure out.
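
As a rough pointer, though, the core arithmetic could look something like this sketch (the dotXX names are assumed from step 4; each register holds the values for all 4 triangles):

// Step 5: the scalar barycentric math, performed for 4 triangles at once.
__m128 denom    = _mm_sub_ps(_mm_mul_ps(dot00, dot11), _mm_mul_ps(dot01, dot01));
__m128 invDenom = _mm_div_ps(_mm_set1_ps(1.0f), denom);
__m128 u = _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(dot11, dot02), _mm_mul_ps(dot01, dot12)), invDenom);
__m128 v = _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(dot00, dot12), _mm_mul_ps(dot01, dot02)), invDenom);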

I know that this idea is a bit of a stretch and requires some loving from the code that sits above it and perhaps some quality time with pencil and paper, but it will be fast (and you can even add OpenMP afterwards if you'd like).

Good luck :)

(and forgive my fuzzy explanation, I can whip up the function if need be =))

UPDATE

I've written an implementation, and it did not go as I expected, mainly because the Y component gets involved beyond the piece of code that you initially pasted in your question (I looked it up). What I've done here is not just ask you to rearrange all points into XZ pairs and feed them 4 at a time, but also to feed 3 pointers (for points A, B and C) with the Y values for each of the 4 triangles. From a local perspective this is fastest. I can still modify it to require less intrusive changes on the caller's end, but please let me know what's desirable.

Then a disclaimer: this code is straightforward as hell (something I've found to work pretty well with compilers in terms of SSE... they can reorganize as they see fit, and x86/x64 CPUs take their share in it too). As for the naming: it's not my style, it's not anyone's, just do with it what you see fit.

Hope it helps and if not I'll gladly go over it again. And if this is a commercial project there's also the option of getting me on board I guess ;)

Anyway, I've put it on pastebin: http://pastebin.com/20u8fMEb

nielsj

You can implement your single dot product with SSE instructions but the result will not be significantly faster (and may even be slower) than the code as written now. Your rewrite may defeat compiler optimizations that are helping the current version.

In order to reap any benefit from rewriting that with SSE or CUDA you have to optimize the loop that's calling for that dot product. This is particularly true for CUDA where the overhead of executing one dot product would be huge. You could only see a speedup if you sent thousands of vectors into the GPU to compute thousands of dot products. The same idea holds for SSE on the CPU but you may be able to see an improvement over a smaller number of operations. It will still be more than one dot product, though.

The easiest thing to try might be g++ -ftree-vectorize. GCC will be able to inline your small function and then try to optimize the loop for you (in fact it probably already does, but without SSE instructions). The tree vectorizer will attempt to do automatically what you propose to do by hand, but it's not always successful.

Ben Jackson

SSE instructions are meant for optimizing algorithms that process large blocks of data represented as integers or floating-point numbers. Typical sizes are millions or billions of numbers that need to be processed somehow. It doesn't make sense to optimize a function that processes only four (or twenty) scalars: what you gain with SSE you could lose to function-call overhead. A reasonable number of values for one call of such a function to process is at least a thousand.

It is possible that you can obtain a huge performance gain by using SSE intrinsics, but it's hard to give you specific advice tailored to your needs based on the information you provided. You should edit your question and provide a more high-level view of your problem (the functions located deeper in your call stack). For example, it is not obvious how many times dtClosestHeightPointTriangle is called per second; this number is critical for objectively judging whether a transition to SSE would have practical value.

The organization of your data is also very important. Ideally, it should be stored in as few linear segments of memory as possible, to utilize the CPU's cache subsystem efficiently.
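
For illustration only, a hypothetical structure-of-arrays layout in the spirit of this advice (the struct and field names are made up):

// Structure-of-arrays batch: the x and z components of each point are packed
// contiguously, so the data for 4 triangles sits in a few linear, cache- and
// SSE-friendly segments instead of scattered float[3] triples.
struct TriangleBatchXZ
{
    float ax[4], az[4]; // point a for 4 triangles
    float bx[4], bz[4]; // point b
    float cx[4], cz[4]; // point c
};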

truthseeker
  • If you process small blocks of floats *very* often, it certainly makes sense to vectorize that. It's harder because usually there's some shuffling (e.g. to get a single scalar result), and this is a much bigger fraction of the total time than for the sum of a massive array. So manually vectorizing hot functions that work with small amounts of data is more likely to be needed vs. auto-vectorization. – Peter Cordes Jan 22 '18 at 05:17

You asked for an SSE version of your algorithm so here it is:

// Copied and modified from xnamathvector.inl
XMFINLINE XMVECTOR XMVector2DotXZ
(
    FXMVECTOR V1, 
    FXMVECTOR V2
)
{
#if defined(_XM_NO_INTRINSICS_)

    XMVECTOR Result;

    Result.vector4_f32[0] =
    Result.vector4_f32[1] =
    Result.vector4_f32[2] =
    Result.vector4_f32[3] = V1.vector4_f32[0] * V2.vector4_f32[0] + V1.vector4_f32[2] * V2.vector4_f32[2];

    return Result;

#elif defined(_XM_SSE_INTRINSICS_)
    // Perform the dot product on x and z
    XMVECTOR vLengthSq = _mm_mul_ps(V1,V2);
    // vTemp has z splatted
    XMVECTOR vTemp = _mm_shuffle_ps(vLengthSq,vLengthSq,_MM_SHUFFLE(2,2,2,2));
    // x+z
    vLengthSq = _mm_add_ss(vLengthSq,vTemp);
    vLengthSq = _mm_shuffle_ps(vLengthSq,vLengthSq,_MM_SHUFFLE(0,0,0,0));
    return vLengthSq;
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}

bool dtClosestHeightPointTriangle(FXMVECTOR p, FXMVECTOR a, FXMVECTOR b, FXMVECTOR c, float& h)
{
    XMVECTOR v0 = XMVectorSubtract(c,a);
    XMVECTOR v1 = XMVectorSubtract(b,a);
    XMVECTOR v2 = XMVectorSubtract(p,a);

    XMVECTOR dot00 = XMVector2DotXZ(v0, v0);
    XMVECTOR dot01 = XMVector2DotXZ(v0, v1);
    XMVECTOR dot02 = XMVector2DotXZ(v0, v2);
    XMVECTOR dot11 = XMVector2DotXZ(v1, v1);
    XMVECTOR dot12 = XMVector2DotXZ(v1, v2);

    // Compute barycentric coordinates
    XMVECTOR invDenom = XMVectorDivide(XMVectorReplicate(1.0f), XMVectorSubtract(XMVectorMultiply(dot00, dot11), XMVectorMultiply(dot01, dot01)));

    XMVECTOR u = XMVectorMultiply(XMVectorSubtract(XMVectorMultiply(dot11, dot02), XMVectorMultiply(dot01, dot12)), invDenom);
    XMVECTOR v = XMVectorMultiply(XMVectorSubtract(XMVectorMultiply(dot00, dot12), XMVectorMultiply(dot01, dot02)), invDenom);

    // ... (the in-triangle test and the height output through h are omitted
    // here, as in the question's snippet)
    (void)u; (void)v; (void)h;
    return false; // placeholder so this truncated sketch compiles
}

The XMVector2Dot function is taken from xnamathvector.inl; I renamed it and modified it to operate on the X/Z coordinates.

XNAMath is a great vectorized cross-platform math library by Microsoft; I use it on OS X as well by importing the sal.h header (I'm not sure about licensing issues there, so watch out).
In fact, any platform that supports SSE intrinsics should support it.

A couple of things to watch for:

  • You need to load your floats into XMVECTORs using the XMLoadFloat3 method; this will load an unaligned float3 into an __m128 structure (see the sketch after this list).
  • You will probably not see any performance improvement from this SSE code (profile!), as there is a performance penalty for loading unaligned floats into SSE registers.
  • This is a brute-force conversion of the algorithm to SSE; you will have better luck being smarter than me, actually trying to understand the algorithm and implementing an SSE-friendly version.
  • You will have better luck converting the entire application to use XNA Math/SSE code rather than just this small portion. At least enforce the use of aligned vector types (XMFLOAT3A or struct __declspec(align(16)) myvectortype { };).
  • Straight SSE assembly is discouraged, especially on x64, in favor of intrinsics.
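
For reference, a minimal sketch of that loading step. The reinterpret_cast pattern is my assumption about how you'd bridge recast's raw float arrays; it works because XMFLOAT3 is a plain {float x, y, z;} struct:

// Hypothetical bridge from recast's float[3] arrays to XMVECTORs.
XMVECTOR P = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(p)); // unaligned load
XMVECTOR A = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(a));
XMVECTOR B = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(b));
XMVECTOR C = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(c));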
Julien Lebot