
Imagine that I have an array of objects, like this:

class Segment {
public:
    float x1, x2, y1, y2;
};

Segment* SegmentList[100];

Based on this array of Segments, I want to quickly extract their properties and build arrays holding all the x1, x2, y1 and y2 values, like this:

float x1List[100];

for(int i = 0; i < 100; i++) {
    x1List[i] = SegmentList[i]->x1;
}

I wonder if there is a faster way to read all the "x1" properties into an array.

UPDATE 1:

Since I will be loading this array into AVX registers, I could rephrase my question as:

"Is there a faster way to load the properties of an array of objects into AVX registers?"

Alkin
  • With AVX1, I don't think so. With AVX2, you could use `vpermps` to shuffle the low 32-bit element (or any other element) of two 128-bit lanes to the bottom of a SIMD vector. But really, changing your data layout might be a better option. Instead of storing an array of structs, can you store a segmentlist as 4 different arrays of `x1[]`, `x2[]`, `y1[]`, and `y2[]`? (Or I guess pointers, not fixed-size arrays, so you can dynamically allocate them.) Interleaving in blocks of 4 or 8 is also an option. – Peter Cordes Mar 05 '18 at 21:59
  • Or with AVX2, `vgatherdps` is worth considering if tuning for Skylake, but not for earlier CPUs (where gather is not faster than scalar load / merge); see the sketch after these comments. – Peter Cordes Mar 05 '18 at 22:07
  • You might be able to use unaligned loads + blends to create a vector with all the data you need, then shuffle that. But on some CPUs (e.g. Sandybridge), unaligned 256-bit loads are not fast. – Peter Cordes Mar 05 '18 at 22:21
  • Good, I am checking the AVX instructions you mentioned... The real-world Segment class is a bit more complex, with other properties and methods. I can't change the Segment class, because I receive this SegmentList from another module that I am not supposed to change either. My main goal is to "extract" these properties so I can run some calculations on them using AVX instructions. – Alkin Mar 05 '18 at 22:23
  • Ugh, that's unfortunate. Maybe you can overlap the computation you want to do with the strided loads into a SIMD vector? Or if you need more than one property, at least grab them all in a single pass, producing multiple output vectors at once. If so, shuffling so you can store in at least 64-bit or 128-bit chunks will probably be good, but it's a tradeoff between the store bottleneck (1 per clock) and a shuffle bottleneck (1 shuffle per clock on Intel CPUs). – Peter Cordes Mar 05 '18 at 22:31
  • See [these slides](https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/) for more about data layout and SIMD, and other links in https://stackoverflow.com/tags/sse/info. – Peter Cordes Mar 05 '18 at 22:32
  • Great =) Thanks, Peter. I will have a better look at your links and learn a bit more about SIMD instructions. – Alkin Mar 05 '18 at 22:43
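A minimal sketch of the AVX2 gather approach mentioned in the comments, assuming the Segment class as defined in the question (16 bytes, no padding) and assuming the objects are contiguous in one array rather than 100 separately allocated objects (the helper name is mine):

#include <immintrin.h>

// Gather the x1 member of 8 consecutive Segments with vgatherdps.
// sizeof(Segment) == 16 bytes == 4 floats, so the x1 of element i
// sits at float index i * 4.
__m256 gather_x1(const Segment* segments)
{
    const __m256i idx = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    return _mm256_i32gather_ps(reinterpret_cast<const float*>(segments),
                               idx, sizeof(float)); // scale = 4 bytes
}

As the comments note, this is only worthwhile on Skylake and later; on earlier CPUs gather is not faster than scalar loads plus merging.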

1 Answer


I'd just like to propose a different perspective. My contribution effectively just expands on one of @PeterCordes's comments under your original post, but it is too long to post as a reply to that comment.

I have just recently optimized an old simulation of mine, where I had a problem similar to yours. I was simulating the movement of particles, and I had a program that relied on structures roughly like this:

struct Particle
{
  double x, y, z; // x,y,z positions
  double vx, vy, vz; // x,y,z velocities
};

I knew that the simulations of individual particles were independent from one another, so I decided to use SIMD parallelism to make the simulation faster. If I had continued to rely on Particle structures, I would have had to load every component of velocity and position into an AVX register like this. For the sake of simplicity, I am pretending that my array consists of only four particles, but in fact I was dealing with a heap-allocated array of thousands of particles:

#include <immintrin.h>
#include <array>

void make_one_step(std::array<Particle, 4>& p, double time_step)
{
  __m256d pos_x = _mm256_set_pd(p[3].x, p[2].x, p[1].x, p[0].x);
  // do similar for the y and z components of position and for all velocities

  // compute stuff (bottleneck)
  __m256d new_pos_x = pos_x; // placeholder for the actual computation

  // do this for every component of velocity and position
  alignas(32) double vals[4]; // _mm256_store_pd requires 32-byte alignment
  _mm256_store_pd(vals, new_pos_x);
  for(int i = 0; i < 4; ++i) p[i].x = vals[i];
}

void simulate_movement(std::array<Particle, 4>& p, double time_step)
{
  for( ... lots of steps ... )
  {
    make_one_step(p, time_step); // bottleneck
    // check values of some components and do some cheap operations
  }
}

Truth be told, I had to compute so much in the simulation (some relatively advanced physics) that loading and storing were not the bottleneck at all. But the ugliness of this repacking at every step of the procedure gave me additional motivation to fix things. I did not change the inputs of the algorithm (I still used Particle objects), but inside the algorithm itself I recombined the data from four Particles and stored it inside a structure like this:

struct Quadruple
{
  alignas(32) double pos_x[4]; // 32-byte aligned so _mm256_load_pd is safe
  // and similar for other position/velocity components
};

This is what the simulation looked like after these changes. Long story short, I just modified the layer between the algorithm and the interface. Efficient loading? Check. Input data unchanged? Check.

void make_one_step(Quadruple& p, double time_step)
{
  __m256d pos_x = _mm256_load_pd(p.pos_x); // for every component

  // compute stuff in the same way as before
  __m256d new_pos_x = pos_x; // placeholder for the actual computation

  _mm256_store_pd(p.pos_x, new_pos_x); // for every component
}

void simulate_movement(std::array<Particle, 4>& particles, double time_step)
{
   Quadruple q; // ... fill q with the data from the particles

   for( ... a lot of time steps ... )
   {
     make_one_step(q, time_step); // bottleneck
     // check values of some components and do some cheap operations
   }

   // get the data from the Quadruple and store it back in the array of particles
}
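A rough sketch of the packing and unpacking steps elided above (the helper names pack and unpack are mine):

Quadruple pack(const std::array<Particle, 4>& p)
{
  Quadruple q;
  for(int i = 0; i < 4; ++i) q.pos_x[i] = p[i].x;
  // ... same for the other position/velocity components
  return q;
}

void unpack(const Quadruple& q, std::array<Particle, 4>& p)
{
  for(int i = 0; i < 4; ++i) p[i].x = q.pos_x[i];
  // ... same for the other position/velocity components
}

The point is that this scalar repacking happens once, outside the time-step loop, while the loads and stores inside make_one_step stay contiguous and aligned.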

Unfortunately, I can't tell whether this helps you or not; it depends on what you do. If you need all the data in an array before you start the computation, my advice will not help you, and if recombining the data itself turns out to be the bottleneck, it will be equally useless. :) Good luck.

Nejc
  • "*inside the algorithm itself I recombined the data from four Particles and stored it inside a structure like this*": That's the step the OP is asking how to speed up, probably for exactly the reason your answer discusses. If you can't avoid that SoA -> AoS step, you can at least speed it up with SIMD shuffles, especially if you're extracting multiple "properties" in one pass, which is why I asked the OP if that's how they'd use this. – Peter Cordes Mar 08 '18 at 22:20
  • I know, and I am aware I am not answering his question directly, as I state in the first and last paragraphs. I just thought that the example is a useful reminder that A) "taking advantage of SIMD" does not always mean "you have to shove all input data into a few big arrays to use them" and B) perhaps the problem can be rearranged so that fewer extractions are done (if the extraction operation is a bottleneck, this can represent an important optimization). But of course this depends heavily on the domain/type of problem he is solving. – Nejc Mar 08 '18 at 22:54
  • Thanks. Your problem is very similar to mine. I have this Segment list that I am supposed to do some calculations with, using AVX / SIMD. I like your idea of working on quadruples, but I believe it has the same speed as loading it into an array of properties. Your solution is more elegant, I guess. – Alkin Mar 09 '18 at 21:15
  • Sometimes a smaller intermediate buffer is a good choice. Let's suppose we have an image with dimensions 1k x 1k and 1 MB size. We want to process it consecutively with algorithms A and B; both return images the same size as the input. We are only interested in the end result, not the intermediate result. If the operations on pixels are independent from neighboring pixels in both algorithms, we can give the procedure a nice performance boost by doing it row by row (1000 times); see the sketch after these comments. The intermediate buffer is small, and we can count on the data not leaving the L1 cache between the calls to A and B. – Nejc Mar 09 '18 at 22:09
  • In my case, the Quadruple plays the role of the intermediate buffer: I have to do thousands of operations repeatedly on every one of tens of thousands of particles. If I resort to SIMD parallelism, I can no longer work on one particle at a time. But if I start working on 10k particles in parallel, the intermediate results will be several MB in size. That's why I work on four at a time: a Quadruple is so small it will not leave the L1 cache between time steps. But yes, I am aware that you might not be facing the same problem - I just wanted to illustrate why a smaller array helps. – Nejc Mar 09 '18 at 22:29
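A minimal sketch of the shuffle-based AoS -> SoA transpose Peter Cordes describes in the first comment above, using the Particle struct from the answer (the function name is mine): it builds the x, y, z and vx vectors of 4 particles in one pass with 4 loads plus 8 shuffles instead of 16 scalar loads.

#include <immintrin.h>
#include <array>

void load_transposed(const std::array<Particle, 4>& p,
                     __m256d& x, __m256d& y, __m256d& z, __m256d& vx)
{
    // Each row holds one particle's first four doubles: {x, y, z, vx}.
    __m256d r0 = _mm256_loadu_pd(&p[0].x);
    __m256d r1 = _mm256_loadu_pd(&p[1].x);
    __m256d r2 = _mm256_loadu_pd(&p[2].x);
    __m256d r3 = _mm256_loadu_pd(&p[3].x);

    __m256d t0 = _mm256_unpacklo_pd(r0, r1); // x0 x1 z0 z1
    __m256d t1 = _mm256_unpackhi_pd(r0, r1); // y0 y1 vx0 vx1
    __m256d t2 = _mm256_unpacklo_pd(r2, r3); // x2 x3 z2 z3
    __m256d t3 = _mm256_unpackhi_pd(r2, r3); // y2 y3 vx2 vx3

    x  = _mm256_permute2f128_pd(t0, t2, 0x20); // x0 x1 x2 x3
    y  = _mm256_permute2f128_pd(t1, t3, 0x20); // y0 y1 y2 y3
    z  = _mm256_permute2f128_pd(t0, t2, 0x31); // z0 z1 z2 z3
    vx = _mm256_permute2f128_pd(t1, t3, 0x31); // vx0 vx1 vx2 vx3
}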
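And a sketch of the row-by-row fusion idea from the image-processing comment above (all names and the placeholder per-pixel operations are illustrative, not from the original discussion):

#include <cstddef>
#include <vector>

// Fuse two whole-image passes (A, then B) so each row's intermediate
// result stays in L1 cache instead of round-tripping through memory.
void process_fused(const float* in, float* out,
                   std::size_t width, std::size_t height)
{
    std::vector<float> row_buf(width); // small intermediate buffer

    for (std::size_t r = 0; r < height; ++r)
    {
        const float* src = in + r * width;
        float* dst = out + r * width;

        for (std::size_t c = 0; c < width; ++c)
            row_buf[c] = src[c] * 2.0f;   // stand-in for algorithm A

        for (std::size_t c = 0; c < width; ++c)
            dst[c] = row_buf[c] + 1.0f;   // stand-in for algorithm B
    }
}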