A better SSE2 implementation for float4::set_wxy (and other set-swizzle ops)?

Question

I'm writing an HLSL-compliant float4 type in C++ with SSE2/AVX intrinsics, and at the moment I'm implementing all the set-swizzle operations that HLSL provides for float4. I'm trying to find an optimal SSE2 implementation for set-swizzles that write 2 or 3 components (4-component set-swizzles are trivial to implement with a single SSE shuffle). For example, I can't find a way to implement set_wxy with fewer than 4 or 5 SSE shuffle ops, e.g.:

inline void float4::set_wxy(const float4& x)  // or __forceinline, depending on the compiler
{
    float4 tmp2 = *this;
    tmp2.set_wxyz(x);                         // set_wxyz = 1 x _mm_shuffle_ps
    const __m128 xyw_tmp = tmp2.zxyw().data;  // zxyw() = 1 x _mm_shuffle_ps
    const __m128 z_tmp = zxyw().data;         // zxyw() = 1 x _mm_shuffle_ps
    tmp2 = _mm_move_ss(xyw_tmp, z_tmp);
    set_zxyw(tmp2);                           // set_zxyw() = 1 x _mm_shuffle_ps
}

Does anyone have ideas for a better implementation that uses nothing beyond SSE2? I'm aware of _mm_blend_ps in SSE4.1/AVX, which I will use when available via preprocessor conditionals, but I want to support at least an SSE2-only code path. Thanks in advance!

EDIT: An example of the behavior of this function is:

float4 k(5,5,5,5);
k.set_wxy(float4(1,2,3,4));
// now k == (2, 3, 5, 1)

Basically, set_wxy sets the w, x, y components of the target from the x, y, z components of the argument, in that order; the original z value is preserved.
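To pin down the intended behavior before optimizing, here is a minimal scalar sketch of the same operation. The `Vec4` alias and the function name `set_wxy_reference` are hypothetical stand-ins for illustration, not part of the float4 class in the question; lanes are assumed to be stored in x, y, z, w order.

```cpp
#include <array>

// Hypothetical stand-in for the float4 type: lanes are stored as {x, y, z, w}.
using Vec4 = std::array<float, 4>;

// Scalar reference for the HLSL statement `dst.wxy = src.xyz;`.
// dst.z is deliberately left untouched.
inline void set_wxy_reference(Vec4& dst, const Vec4& src)
{
    const Vec4 s = src;  // copy first so the call stays correct if dst and src alias
    dst[3] = s[0];       // w <- src.x
    dst[0] = s[1];       // x <- src.y
    dst[1] = s[2];       // y <- src.z
}
```

A reference like this is handy as an oracle when unit-testing the SIMD code paths against each other.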

What is set_wxy supposed to do, exactly? I tried to infer it from the code, but too much of it is hidden. — harold, Jul 08 '12 at 16:09
@harold Okay I've just added some more info about the behavior of these functions, I hope this clears things up a bit. Thanks for reading. — snk_kid, Jul 08 '12 at 16:25
Have a look at the source for DirectXMath in the Windows 8 SDK (`DirectXMath.h`); it has very fast SSE2 ops for swizzles etc. (and it's written to be compliant with DX & HLSL). TBH, you could probably use DirectXMath outright and save yourself the effort. — Necrolis, Jul 08 '12 at 16:29
@Necrolis I don't believe that DirectXMath has pre-written functions for every single swizzle. You'd still have to write this function. — John Calsbeek, Jul 08 '12 at 16:30
@JohnCalsbeek: `XMVectorSelect`: http://msdn.microsoft.com/en-us/library/windows/desktop/microsoft.directx_sdk.component-wise.xmvectorselect(v=vs.85).aspx DirectXMath basically covers *everything* you could want to do these days (it even has basic collision tests). Plus, it's been optimized quite a bit with the use of templates etc., so selection overhead can be removed at compile time. — Necrolis, Jul 08 '12 at 16:33
@Necrolis Yeah, I started looking at this just before I got your message. I would rather not depend on DirectX utility libraries, as I want my project to be cross-platform/arch and to support the same interface for use with C++ AMP as well. — snk_kid, Jul 08 '12 at 16:35
@Necrolis `XMVectorSelect` does selection, but doesn't do permutation. You need at least two function calls. — John Calsbeek, Jul 08 '12 at 16:36
@snk_kid: It's just a bunch of headers (`.h` & `.inl`), very easy to add to your project; otherwise, just use their implementations as a reference for creating your own. — Necrolis, Jul 08 '12 at 16:37
@JohnCalsbeek: there is also `XMVectorPermute` & `XMVectorSwizzle`, my point is, look at the source and docs for this before going crazy.... (and if DirectXMath requires 2+ functions to do something in SSE2, you will also need 2+ [intrinsic] functions) — Necrolis, Jul 08 '12 at 16:39

John Calsbeek · Answer 1 · 2012-07-08T16:57:55.350

You're trying to emulate this line of HLSL, right?

vec2.wxy = vec1.xyz;

You can get somewhere by using the fact that _mm_shuffle_ps can combine two vectors in a limited fashion: the two low lanes of the result are picked from the first operand and the two high lanes from the second. Here's my stab at it:

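// xyzw is vec1 (source), XYZW is vec2 (destination)
__m128 xxZZ = _mm_shuffle_ps(vec1, vec2, _MM_SHUFFLE(2, 2, 0, 0)); // low lanes from vec1, high lanes from vec2
__m128 ZxZx = _mm_shuffle_ps(xxZZ, xxZZ, _MM_SHUFFLE(0, 2, 0, 2)); // bring Z and x down into the low lanes
__m128 yzZx = _mm_shuffle_ps(vec1, ZxZx, _MM_SHUFFLE(1, 0, 2, 1)); // y, z from vec1; Z, x from ZxZx

vec2 = yzZx;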
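For anyone who wants to try this out, here is a self-contained sketch that wraps the three shuffles above in a free function. The function names and the `store4` helper are hypothetical. The second function is a possible two-shuffle variant that is not part of the answer above: it assumes the first shuffle can already place Z and x in lanes that the final shuffle is able to pick from, so treat it as something to verify on your own target rather than a definitive result.

```cpp
#include <emmintrin.h>  // SSE2; also provides _mm_shuffle_ps and _MM_SHUFFLE

// The three-shuffle sequence from the answer, wrapped in a hypothetical helper.
// src lanes are (x, y, z, w); dst lanes are (X, Y, Z, W). Returns (y, z, Z, x).
inline __m128 set_wxy_3shuffle(__m128 dst, __m128 src)
{
    const __m128 xxZZ = _mm_shuffle_ps(src, dst, _MM_SHUFFLE(2, 2, 0, 0));
    const __m128 ZxZx = _mm_shuffle_ps(xxZZ, xxZZ, _MM_SHUFFLE(0, 2, 0, 2));
    return _mm_shuffle_ps(src, ZxZx, _MM_SHUFFLE(1, 0, 2, 1));
}

// Possible two-shuffle variant (an assumption, not taken from the answer):
// build (Z, Z, x, x) first, then take y, z from src and Z, x from that temporary.
inline __m128 set_wxy_2shuffle(__m128 dst, __m128 src)
{
    const __m128 ZZxx = _mm_shuffle_ps(dst, src, _MM_SHUFFLE(0, 0, 2, 2));
    return _mm_shuffle_ps(src, ZZxx, _MM_SHUFFLE(2, 0, 2, 1));
}

// Hypothetical helper for inspection: copy the four lanes out as x, y, z, w.
inline void store4(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```

With `dst = _mm_setr_ps(5, 5, 5, 5)` and `src = _mm_setr_ps(1, 2, 3, 4)`, both functions should yield (2, 3, 5, 1), matching the example in the question. Note that `_mm_setr_ps` takes its arguments in x, y, z, w order, whereas `_mm_set_ps` takes them reversed.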
