I want to return the result of _mm_add_ps(), but the return type should be a custom union that has an __m128 member inside.
I tested the performance of returning __m128 and a custom union. It seems that on MSVC this:
return _mm_add_ps(V1, V2);
is faster than this:
vector4 x = { .vector = _mm_add_ps(left.vector, right.vector) };
return x;
where vector4 is defined as:
typedef union vector4
{
    struct { float x; float y; float z; float w; };
    struct { float r; float g; float b; float a; };
    struct { float s; float t; float m; float q; };
    float points[4];
    __m128 vector;
} __declspec(align(16)) vector4;
I wonder if I can cast the result of _mm_add_ps(), which is an __m128, directly to union vector4 to avoid this performance difference. Both versions were measured in a release build.
I tried: return (vector4) _mm_add_ps(left.vector, right.vector);
But it doesn't compile. The error is: No suitable conversion from __m128 to "float" exists.
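For reference, a direct cast to a union type is, as far as I know, a GNU C extension (GCC and Clang) and not standard C, which would explain why MSVC rejects it. The closest standard form I know of is a C99 compound literal, which builds the union in the return expression without a named temporary. Below is a minimal sketch of that; the name vector4_add is only an illustration, and I left out the __declspec because the __m128 member already gives the union 16-byte alignment. Whether this actually closes the gap on MSVC would still need to be measured.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps */

/* Same members as the union above; alignment comes from the __m128 member. */
typedef union vector4
{
    struct { float x; float y; float z; float w; };
    struct { float r; float g; float b; float a; };
    struct { float s; float t; float m; float q; };
    float points[4];
    __m128 vector;
} vector4;

/* Hypothetical helper (name is an assumption): the compound literal
   (vector4){ ... } creates the union value directly in the return expression. */
static inline vector4 vector4_add(vector4 left, vector4 right)
{
    return (vector4){ .vector = _mm_add_ps(left.vector, right.vector) };
}
```

This relies on anonymous structs (C11) and compound literals (C99), so it has to be compiled as C rather than C++.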