I'd just like to propose a different perspective. My contribution effectively just expands one comment by @PeterCordes below your original post, but it is too long to be posted as a reply to that comment.
I recently optimized an old simulation of mine and ran into a problem similar to yours. I was simulating the movement of particles, and my program relied on structures roughly like this:
struct Particle
{
    double x, y, z;    // x, y, z positions
    double vx, vy, vz; // x, y, z velocities
};
I knew that simulations of individual particles were independent of one another, so I decided to use SIMD parallelism to make the simulation faster. As long as I relied on Particle structures, I had to gather every component of position and velocity into an AVX register like this. For the sake of simplicity I am pretending that my array consisted of only four particles; in reality I was dealing with a heap array of thousands of particles:
#include <immintrin.h> // AVX intrinsics
#include <array>

void make_one_step(std::array<Particle, 4>& p, double time_step)
{
    __m256d pos_x = _mm256_set_pd(p[3].x, p[2].x, p[1].x, p[0].x);
    // do similar for the y and z components of position and for all velocities

    // compute stuff (bottleneck)
    // __m256d new_pos_x = ...

    // do this for every component of position and velocity
    alignas(32) double vals[4]; // _mm256_store_pd requires 32-byte alignment
    _mm256_store_pd(vals, new_pos_x);
    for (int i = 0; i < 4; ++i) p[i].x = vals[i];
}
void simulate_movement(std::array<Particle, 4>& p, double time_step)
{
    for( ... lots of steps ... )
    {
        make_one_step(p, time_step); // bottleneck
        // check values of some components and do some cheap operations
    }
}
Truth be told, I had to compute so much in the simulation (some relatively advanced physics) that loading and storing were not the bottleneck at all. But the ugliness of this repacking at every step of the procedure gave me additional motivation to fix things. I did not change the inputs of the algorithm (I still used Particle objects), but inside the algorithm itself I recombined the data from four Particles and stored it in a structure like this:
struct Quadruple
{
    alignas(32) double pos_x[4]; // 32-byte alignment so _mm256_load_pd/_mm256_store_pd are safe
    // and similar for the other position/velocity components
};
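The repacking layer itself is just a pair of small helpers. Here is a sketch; the names pack and unpack are mine, and I only show the x position, since the other five components follow the same pattern:

Quadruple pack(const std::array<Particle, 4>& p)
{
    Quadruple q;
    for (int i = 0; i < 4; ++i)
    {
        q.pos_x[i] = p[i].x;
        // ... same for pos_y, pos_z and the velocity components
    }
    return q;
}

void unpack(const Quadruple& q, std::array<Particle, 4>& p)
{
    for (int i = 0; i < 4; ++i)
    {
        p[i].x = q.pos_x[i];
        // ... same for the other components
    }
}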
This is what the simulation looked like after these changes. Long story short, I only modified the layer between the algorithm and the interface. Efficient loading? Check. Input data unchanged? Check.
void make_one_step(Quadruple& p, double time_step)
{
    __m256d pos_x = _mm256_load_pd(p.pos_x); // for every component
    // compute stuff in the same way as before
    // __m256d new_pos_x = ...
    _mm256_store_pd(p.pos_x, new_pos_x); // for every component
}
void simulate_movement(std::array<Particle, 4>& particles, double time_step)
{
    // Quadruple q = ... // store the data in a Quadruple
    for( ... lots of steps ... )
    {
        make_one_step(q, time_step); // bottleneck
        // check values of some components and do some cheap operations
    }
    // get the data from the Quadruple and store it back in the array of particles
}
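In my real code the state was a heap array of thousands of particles, so instead of a single Quadruple I kept a vector of them and advanced each one per step. A sketch of that outer loop, with hypothetical names simulate_all and n_steps:

#include <vector>

void simulate_all(std::vector<Quadruple>& quads, double time_step, long n_steps)
{
    for (long step = 0; step < n_steps; ++step)
    {
        for (Quadruple& q : quads)
            make_one_step(q, time_step); // quadruples are mutually independent
        // check values of some components and do some cheap operations
    }
}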
Unfortunately I can't tell whether this helps you or not; it depends on what you do. If you need all the data in one array before you start the computation, my advice will not help you, and if recombining the data itself turns out to be a bottleneck, it will be equally useless. :) Good luck.