I'm writing an HLSL-compliant float4 type in C++ with SSE2/AVX intrinsics, and at the moment I'm implementing all the set-swizzle operations that HLSL allows on float4. I'm trying to find an optimal SSE2 implementation for set-swizzles that set only 2 or 3 components (4-component set-swizzles are trivial: one SSE shuffle op). For example, I can't figure out a way to implement set_wxy that needs fewer than the four shuffles plus one _mm_move_ss used here:
    inline void float4::set_wxy(const float4& x) // or __forceinline
    {
        float4 tmp2 = *this;
        tmp2.set_wxyz(x);                         // set_wxyz = 1 x _mm_shuffle_ps
        const __m128 xyw_tmp = tmp2.zxyw().data;  // zxyw()   = 1 x _mm_shuffle_ps
        const __m128 z_tmp = zxyw().data;         // zxyw()   = 1 x _mm_shuffle_ps
        tmp2 = _mm_move_ss(xyw_tmp, z_tmp);       // lane 0 = original z, lanes 1-3 = new x, y, w
        set_zxyw(tmp2);                           // set_zxyw = 1 x _mm_shuffle_ps
    }
Does anyone have any ideas for a better implementation that uses nothing beyond SSE2? I am aware of _mm_blend_ps in SSE4.1, which I will use when available via preprocessor conditionals, but I want to support at least an SSE2-only code path. Thanks in advance!
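One possible direction that stays within _mm_shuffle_ps, sketched on raw __m128 values rather than on the float4 class: the first shuffle gathers the preserved z and the incoming x into one register, and the second shuffle takes the low two lanes from the source and the high two lanes from that temporary. The free function name set_wxy_sse is made up for illustration, and the lane order assumed is x in lane 0 through w in lane 3.

```cpp
#include <xmmintrin.h> // _mm_shuffle_ps, _MM_SHUFFLE

// Hypothetical helper, not part of the float4 class in the question.
// Returns (src.y, src.z, dst.z, src.x), i.e. dst.w, dst.x, dst.y take
// src.x, src.y, src.z in that order and dst.z is preserved.
inline __m128 set_wxy_sse(__m128 dst, __m128 src)
{
    // t = (dst.z, dst.z, src.x, src.x): the two values needed in the upper lanes
    const __m128 t = _mm_shuffle_ps(dst, src, _MM_SHUFFLE(0, 0, 2, 2));
    // lanes 0-1 from src (y, z), lanes 2-3 from t (dst.z, src.x)
    return _mm_shuffle_ps(src, t, _MM_SHUFFLE(2, 0, 2, 1));
}
```

That is two shuffles instead of four shuffles plus a move. Whether the same pattern works for every 2- and 3-component set-swizzle depends on which lanes are preserved, so treat it as a starting point rather than a general recipe.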
EDIT: An example of the behavior of this function is:
    float4 k(5,5,5,5);
    k.set_wxy(float4(1,2,3,4));
    // now k == (2, 3, 5, 1)
Basically, set_wxy sets the w, x and y components of the object from the x, y and z components of the argument, in that order; the original z value is preserved and the argument's w is ignored.
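To pin down the intended semantics independently of any SSE code, here is a minimal scalar reference. The struct float4_ref and the function names are stand-ins invented for this sketch, not the real class; any SIMD version can be checked lane by lane against it.

```cpp
#include <cassert>

// Hypothetical stand-in for the float4 class; only the four lanes matter here.
struct float4_ref {
    float x, y, z, w;
};

// Scalar reference for set_wxy: dst.w, dst.x, dst.y receive src.x, src.y, src.z
// in that order; dst.z is left untouched and src.w is ignored.
inline void set_wxy_ref(float4_ref& dst, const float4_ref& src)
{
    // Copy first so the result is still right if dst and src are the same object.
    const float sx = src.x, sy = src.y, sz = src.z;
    dst.w = sx;
    dst.x = sy;
    dst.y = sz;
}

// Reproduces the example above: (5,5,5,5) becomes (2,3,5,1).
inline void check_set_wxy_ref()
{
    float4_ref k{5, 5, 5, 5};
    set_wxy_ref(k, float4_ref{1, 2, 3, 4});
    assert(k.x == 2 && k.y == 3 && k.z == 5 && k.w == 1);
}
```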