
I have an array of structs (AoS):

typedef struct {
   float x;
   float y;
   float z;
} Point;

const int SIZE = 16;

Point* points;
points = malloc(SIZE * sizeof(Point));

I also have a struct of arrays (SoA):

typedef struct {
    float* vectorX;
    float* vectorY;
    float* vectorZ;
} arrayStruct;

arrayStruct myArrayStruct;

// Allocate Memory
myArrayStruct.vectorX = _aligned_malloc(sizeof(float)* SIZE, 32);
myArrayStruct.vectorY = _aligned_malloc(sizeof(float)* SIZE, 32);
myArrayStruct.vectorZ = _aligned_malloc(sizeof(float)* SIZE, 32);

My question: is there a fast, simple way to convert the AoS (array of structs) to an SoA (struct of arrays) using SIMD intrinsics?

Samuel Dressel
  • What do you mean by "using SIMD"? SIMD performs computations on vector registers, it cannot change the structure of the code you give it. – andreee Apr 30 '19 at 13:41
  • If speed is important, then why are you using malloc for small, fixed-size arrays? – Lundin Apr 30 '19 at 13:42
  • If SIZE is fixed and you can make arrayStruct contiguous then the problem reduces to a 16x3 transpose. – Paul R Apr 30 '19 at 14:59
  • For an AVX2 solution see: [https://stackoverflow.com/questions/44984724/whats-the-fastest-stride-3-gather-instruction-sequence](https://stackoverflow.com/questions/44984724/whats-the-fastest-stride-3-gather-instruction-sequence) – wim May 01 '19 at 20:19

1 Answer

You didn't specify an instruction set, so here's an SSE4.1 implementation (_mm_blend_ps requires SSE4.1). Whether you are using SSE, AVX2 or AVX-512, the approach is the same: a series of blend and shuffle ops (plus some additional 128-bit lane permutes for AVX and wider). On recent Intel cores both have a latency of 1 cycle, and blends have a reciprocal throughput of about 0.33 cycles (shuffles are usually closer to 1), so that should satisfy the 'quick' requirement. Starting with 4 x Vec3 in AoS format, loaded as three 4-float registers:

r0 = [x0 y0 z0 x1]
r1 = [y1 z1 x2 y2]
r2 = [z2 x3 y3 z3]

You should be able to do something along these lines:

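#include <immintrin.h>  // _mm_blend_ps needs SSE4.1 (e.g. build with -msse4.1)
#include <cstdint>

typedef __m128 f128;    // four packed floats

// Per lane i: take tr[i] where ci is true, otherwise fr[i].
template<bool c0, bool c1, bool c2, bool c3>
inline f128 blend4f(const f128 tr, const f128 fr) 
  { return _mm_blend_ps(fr, tr, (c3 << 3)  | (c2 << 2) | (c1 << 1) | c0); }

// Result is [a[X], a[Y], b[Z], b[W]].
template<uint8_t X, uint8_t Y, uint8_t Z, uint8_t W>
inline f128 shuffle4f(const f128 a, const f128 b) 
  { return _mm_shuffle_ps(a, b, _MM_SHUFFLE(W, Z, Y, X)); }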

inline void vec3_aos2_soa(
    const f128 r0, const f128 r1, const f128 r2, 
    f128& x, f128& y, f128& z)
{
  x = blend4f<1, 0, 0, 1>(r0, r1);  // x0 z1 x2 x1
  y = blend4f<1, 0, 0, 1>(r1, r2);  // y1 x3 y3 y2
  z = blend4f<1, 0, 0, 1>(r2, r0);  // z2 y0 z0 z3

  x = blend4f<1, 0, 1, 1>(x, r2);   // x0 x3 x2 x1
  y = blend4f<1, 0, 1, 1>(y, r0);   // y1 y0 y3 y2
  z = blend4f<1, 0, 1, 1>(z, r1);   // z2 z1 z0 z3

  x = shuffle4f<0, 3, 2, 1>(x, x);  // x0 x1 x2 x3
  y = shuffle4f<1, 0, 3, 2>(y, y);  // y0 y1 y2 y3
  z = shuffle4f<2, 1, 0, 3>(z, z);  // z0 z1 z2 z3
}
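
To run this over the arrays from the question, the three registers have to be loaded from the Point array and the results stored to the three output vectors. The following is only a sketch under a few assumptions: Point is exactly three packed floats, the element count is a multiple of 4, and unaligned loads/stores are acceptable. The names aos_to_soa and SSE41_FN are illustrative and not part of the answer; the blend/shuffle templates are written out as raw intrinsics with the equivalent immediates so the block stands alone.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Assumption: on GCC/Clang this enables SSE4.1 for one function, so the file
// also builds without a global -msse4.1 flag. Other compilers ignore it.
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

struct Point { float x; float y; float z; };
static_assert(sizeof(Point) == 3 * sizeof(float), "Point must be 3 packed floats");

// Hypothetical helper: converts `count` Points (count % 4 == 0) into three
// separate float arrays using the blend/blend/shuffle sequence shown above.
SSE41_FN void aos_to_soa(const Point* points, float* xs, float* ys, float* zs,
                         std::size_t count)
{
    assert(count % 4 == 0);
    const float* src = reinterpret_cast<const float*>(points);
    for (std::size_t i = 0; i < count; i += 4, src += 12) {
        const __m128 r0 = _mm_loadu_ps(src + 0);  // x0 y0 z0 x1
        const __m128 r1 = _mm_loadu_ps(src + 4);  // y1 z1 x2 y2
        const __m128 r2 = _mm_loadu_ps(src + 8);  // z2 x3 y3 z3

        // _mm_blend_ps(a, b, m): lane i comes from b when bit i of m is set.
        __m128 x = _mm_blend_ps(r1, r0, 0x9);     // x0 z1 x2 x1
        __m128 y = _mm_blend_ps(r2, r1, 0x9);     // y1 x3 y3 y2
        __m128 z = _mm_blend_ps(r0, r2, 0x9);     // z2 y0 z0 z3

        x = _mm_blend_ps(r2, x, 0xD);             // x0 x3 x2 x1
        y = _mm_blend_ps(r0, y, 0xD);             // y1 y0 y3 y2
        z = _mm_blend_ps(r1, z, 0xD);             // z2 z1 z0 z3

        x = _mm_shuffle_ps(x, x, _MM_SHUFFLE(1, 2, 3, 0));  // x0 x1 x2 x3
        y = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 3, 0, 1));  // y0 y1 y2 y3
        z = _mm_shuffle_ps(z, z, _MM_SHUFFLE(3, 0, 1, 2));  // z0 z1 z2 z3

        _mm_storeu_ps(xs + i, x);
        _mm_storeu_ps(ys + i, y);
        _mm_storeu_ps(zs + i, z);
    }
}
```

With the question's buffers this would be called as aos_to_soa(points, myArrayStruct.vectorX, myArrayStruct.vectorY, myArrayStruct.vectorZ, SIZE). Since those buffers come from _aligned_malloc with 32-byte alignment, the stores could be switched to _mm_store_ps; the loads from points stay unaligned unless that array is aligned too.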

To go back the other way, apply the same shuffles first (each of the three permutations is its own inverse), then blend the results back into r0, r1 and r2.
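
The reverse direction can be sketched the same way: shuffle each component into the rotated order, then pick lanes with two blends per output register. This is an illustration rather than code from the answer; soa_to_aos and SSE41_FN are hypothetical names, the count is assumed to be a multiple of 4, and the output is assumed to have room for 3 * count floats.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Assumption: on GCC/Clang this enables SSE4.1 for one function, so the file
// also builds without a global -msse4.1 flag. Other compilers ignore it.
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

// Hypothetical helper: interleaves three float arrays into x y z x y z ...
SSE41_FN void soa_to_aos(const float* xs, const float* ys, const float* zs,
                         float* out, std::size_t count)
{
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4, out += 12) {
        __m128 x = _mm_loadu_ps(xs + i);  // x0 x1 x2 x3
        __m128 y = _mm_loadu_ps(ys + i);  // y0 y1 y2 y3
        __m128 z = _mm_loadu_ps(zs + i);  // z0 z1 z2 z3

        // Same permutations as the forward direction (each is self-inverse).
        x = _mm_shuffle_ps(x, x, _MM_SHUFFLE(1, 2, 3, 0));  // x0 x3 x2 x1
        y = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 3, 0, 1));  // y1 y0 y3 y2
        z = _mm_shuffle_ps(z, z, _MM_SHUFFLE(3, 0, 1, 2));  // z2 z1 z0 z3

        // _mm_blend_ps(a, b, m): lane i comes from b when bit i of m is set.
        // Inner blend replaces lane 1, outer blend replaces lane 2.
        const __m128 r0 = _mm_blend_ps(_mm_blend_ps(x, y, 0x2), z, 0x4);  // x0 y0 z0 x1
        const __m128 r1 = _mm_blend_ps(_mm_blend_ps(y, z, 0x2), x, 0x4);  // y1 z1 x2 y2
        const __m128 r2 = _mm_blend_ps(_mm_blend_ps(z, x, 0x2), y, 0x4);  // z2 x3 y3 z3

        _mm_storeu_ps(out + 0, r0);
        _mm_storeu_ps(out + 4, r1);
        _mm_storeu_ps(out + 8, r2);
    }
}
```

The cost per group of four points is the same as the forward conversion: three shuffles and six blends.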

robthebloke