Understanding std::transform and how to beat it

Question

I am trying to understand data-oriented design on a simple, specific problem. Apologies in advance to data-oriented design people, if I am doing something very stupid, but I am having a hard time understanding why and where my reasoning fails.

Assume that I have a simple operation, i.e., float_t result = int_t(lhs) / int_t(rhs). If I keep all of the variables in their corresponding containers, e.g., std::vector<float_t> and std::vector<int_t>, and I use std::transform, I get the correct result. Then, for a specific example where using float_t = float and using int_t = int16_t, I assume that packing these variables inside a struct, on a 64-bit architecture, and collecting them inside a container should yield better performance.

I reason that the struct makes up a 64-bit object, and a single memory access to the struct will give me all the variables I need. On the other hand, when all these variables are collected in different containers, I will need three different memory accesses to get the information needed. Below is how I set up the environment:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  constexpr uint32_t M{100};
  for (auto N : {1000, 10000, 100000}) {
    double t1{0}, t2{0};
    for (uint32_t m = 0; m < M; m++) {
      std::vector<my_float> sinvl(N, 0.0);
      std::vector<my_int> s(N, 3), l(N, 2);
      std::vector<Packed<my_float, my_int>> p1(
          N, Packed<my_float, my_int>(0.0, 3, 2));

      // benchmark unpacked
      auto tstart = high_resolution_clock::now();
      std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                     std::divides<my_float>{}); // 3 different memory accesses
      auto tend = high_resolution_clock::now();
      t1 += duration_cast<microseconds>(tend - tstart).count();

      if (m == M - 1)
        std::cout << "sinvl[0]: " << sinvl[0] << '\n';

      // benchmark packed
      tstart = high_resolution_clock::now();
      for (auto &elem : p1) // 1 memory access
        elem.comp();
      tend = high_resolution_clock::now();
      t2 += duration_cast<microseconds>(tend - tstart).count();

      if (m == M - 1)
        std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
    }
    std::cout << "N = " << N << ", unpacked: " << (t1 / M) << " us.\n";
    std::cout << "N = " << N << ", packed: " << (t2 / M) << " us.\n";
  }
  return 0;
}

The compiled code with g++ -O3 yields, on my machine,

sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 1000, unpacked: 0 us.
N = 1000, packed: 1 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 10000, unpacked: 5.06 us.
N = 10000, packed: 12.97 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 100000, unpacked: 52.31 us.
N = 100000, packed: 124.49 us.

Basically, std::transform beats the packed access by 2.5x. I would appreciate if you helped me understand the behaviour. Is the result due to

me not grasping the data-oriented design principles correctly, or,
some artefact of this very simple example such as the memory locations getting allocated very close to each other and in some way getting optimized very efficiently by the compiler?

Finally, is there a way to beat std::transform in this example, or, is it simply good enough to be a go-to solution? I am not an expert neither in compiler optimizations nor in data-oriented design, and thus, I could not answer this question myself.

Thanks!

EDIT. I have changed the way I test both of the methods as per @bolov's suggestion in the comments.

Now the code looks like:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  uint32_t N{1000};
  double t{0};

  if (argc == 2)
    N = std::stoul(argv[1]);

#ifndef _M_PACKED
  std::vector<my_float> sinvl(N, 0.0);
  std::vector<my_int> s(N, 3), l(N, 2);

  // benchmark unpacked
  auto tstart = high_resolution_clock::now();
  std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                 std::divides<my_float>{}); // 3 different memory accesses
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();

  std::cout << "sinvl[0]: " << sinvl[0] << '\n';
  std::cout << "N = " << N << ", unpacked: " << t << " us.\n";
#else
  std::vector<Packed<my_float, my_int>> p1(N,
                                           Packed<my_float, my_int>(0.0, 3, 2));
  // benchmark packed
  auto tstart = high_resolution_clock::now();
  for (auto &elem : p1) // 1 memory access
    elem.comp();
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();

  std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
  std::cout << "N = " << N << ", packed: " << t << " us.\n";
#endif

  return 0;
}

with the corresponding shell (fish) script

g++ -Wall -std=c++11 -O3 transform.cpp -o transform_unpacked.out
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_packed.out -D_M_PACKED
for N in 1000 10000 100000
  echo "Testing unpacked for N = $N"
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  echo "Testing packed for N = $N"
  ./transform_packed.out $N
  ./transform_packed.out $N
  ./transform_packed.out $N
end

which gives the following:

Testing unpacked for N = 1000
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
Testing packed for N = 1000
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
Testing unpacked for N = 10000
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
Testing packed for N = 10000
p1[0].sinvl: 0.666667
N = 10000, packed: 17 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
Testing unpacked for N = 100000
sinvl[0]: 0.666667
N = 100000, unpacked: 64 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
Testing packed for N = 100000
p1[0].sinvl: 0.666667
N = 100000, packed: 180 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 198 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 177 us.

I hope I have understood the proper testing method, correctly. Still, though, the difference is 2-3 folds.

"*[S]hould* yield better performance" is almost always a [red herring](https://en.wikipedia.org/wiki/Red_herring). [Premature optimization](http://wiki.c2.com/?PrematureOptimization) is often bad. — Some programmer dude, Nov 01 '17 at 10:24
also your testing method is highly unreliable. You can't have multiple tests like that because simply running one test will influence the results of the following tests. This is because of data caching. You need to have 1 test per run and run the same test multiple times. — bolov, Nov 01 '17 at 10:28
Thank you for your time, @Someprogrammerdude. I am after whether "_[S]hould_ yield better performance" is a red herring or not. That's why I am asking whether packing the data in a structure in this specific case should be followed or why it is failing to give performance benefit. I can't see clearly why this is a premature optimization, though. This could have been an image processing application where the decision of choice could be either to store R, G, B values of a given pixel at three different containers or to bundle them together in structs and store them in a single container. — Arda Aytekin, Nov 01 '17 at 10:41
Thank you for your time, @bolov. Do you say that I should compile two different programs and run them (from a shell script) for each `N`, in this example, `M` times and get the mean of the timings later on? If so, I will do that and report the new timings. I hope this explains the huge difference, that is, `2.5x`. Thanks for the hint! — Arda Aytekin, Nov 01 '17 at 10:43
you can have 1 program with preprocessor macro or program arguments selecting the method and parameters. And yes, it helps you if you run them from a script. In your case you have 6 tests to run: 2 methods (3 vectors and 1 vector) x 3 sizes. Don't forget to run each test multiple times consecutively (3-5 times should be enough) and get an average. — bolov, Nov 01 '17 at 10:50
@Someprogrammerdude I must have missed the part where OP said that it was a premature optimisation. This looks like an entirely sensible micro-optimisation. The term “red herring” doesn’t make sense here. — Konrad Rudolph, Nov 01 '17 at 11:14
Maybe it is because in the `std::transform` case, compiler is able to emit SIMD instructions. Note: your solution should *not* yield better performance. Cache is able to provide the data in the same time for both solutions in your case. If you've used more data-streams (more than cache-ways), then it could be the case the `std::transform` variant could be slower. — geza, Nov 01 '17 at 12:10
I've checked: for GCC, there is not much difference (and it is much slower than clang). Clang uses SIMD for both cases, but for your `Packed.comp` case, it has to move data around to be able to use SIMD division. Maybe that's the cause of the difference. (but on my computer, with clang 6, the difference is only 60%, not 2x-3x) — geza, Nov 01 '17 at 12:35
Thank you, @geza, for your time and explanation. Actually, I have checked with clang now, and the behaviour is similar for me. Still, the performance difference is disturbing, but again, as you suggest, the behaviour _could_ turn to the packed storage's advantage in more complex situations than this one. When I built this simple example, I just checked `nm a.out | c++filt | grep comp` to see if the function call gets optimized out (which it does), but I do not know how to observe if that thing has to move data around. Want to provide an answer to the question that I can select? — Arda Aytekin, Nov 01 '17 at 12:58
I am impressed. You actually took our advice and followed through with an implementation. Good job. — bolov, Nov 01 '17 at 13:20
Let me introduce you to [Godbolt, the compiler explorer](https://godbolt.org/) I suggest you use it in the future — Mgetz, Nov 01 '17 at 13:40
Thank you, @bolov :) Well, I am asking for help and I have to follow any advice that helps me single out a problem from others. Thanks once more. — Arda Aytekin, Nov 01 '17 at 13:41
@Mgetz, actually I know about that website. Unfortunately, I do not know any assembly language. Maybe I should invest more time on getting the basics so that I can reason about the code better. I will try to get use of it in the future, I promise :) — Arda Aytekin, Nov 01 '17 at 13:43
redo the tests with `-mno-sse`. This will disable SSE instructions (SIMD). It will be interesting to see if this way packed is faster. — bolov, Nov 01 '17 at 13:51
@bolov, uhm `g++` gives `error: SSE register return with SSE disabled`, whereas `clang++` compiles properly. However, the output of `clang++ -mno-sse` compiled code is garbage --- I see `nan`s and zeros. — Arda Aytekin, Nov 01 '17 at 14:01
@ArdaAytekin According to the x86-64 ABI (https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r252.pdf) "Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are in class SSE."; which means passing these as (by-value) arguments to a non-inlined function or returning them from such a function, requires SSE registers and therefore instructions. This makes ``-mno-sse`` tricky to use on this architecture. — Arne Vogel, Nov 01 '17 at 15:48
Wow! To be honest, I wasn't expecting this many perspectives when I asked the question. It has led to different fruitful discussions/comments regarding different areas of computations/programming. Thank you, @ArneVogel, for your comment on SSE's. — Arda Aytekin, Nov 01 '17 at 15:54
The assumption that having all data in one place decreases the cost of memory accesses is flawed. Packing data *can* be very helpful e.g. when accessing "cold" memory. Better to have one cache miss than multiple misses… However, in this particular use case, ``std::transform`` is traversing some medium to large vectors (two input, one output) sequentially, and modern CPUs can easily predict sequential accesses and pre-fetch the corresponding cache lines. There is a limit to the number of vectors, of course, but it's certainly no less than three… — Arne Vogel, Nov 01 '17 at 15:57
@ArneVogel, I can also see that when I need to use such a recursion that needs more than 2 input iterators, I will need to write either (1) a `for` loop based on the same `idx` accessing different memory locations, or (2) packing them in `struct`s like this and use a single memory access. This boils down to, I assume, what @geza was referring to --- SoA vs AoS. However, apparently, this minimal working example is too simple for these purposes :) Thanks, again, for the second explanation on access predictions of CPU's. — Arda Aytekin, Nov 01 '17 at 16:23

score 4 · Accepted Answer · answered Nov 01 '17 at 13:30

4

Here's the compiled loop of the std::transform case:

  400fd0:       f3 41 0f 7e 04 47       movq   xmm0,QWORD PTR [r15+rax*2]
  400fd6:       66 0f 61 c0             punpcklwd xmm0,xmm0
  400fda:       66 0f 72 e0 10          psrad  xmm0,0x10
  400fdf:       0f 5b c0                cvtdq2ps xmm0,xmm0
  400fe2:       f3 0f 7e 0c 43          movq   xmm1,QWORD PTR [rbx+rax*2]
  400fe7:       66 0f 61 c9             punpcklwd xmm1,xmm1
  400feb:       66 0f 72 e1 10          psrad  xmm1,0x10
  400ff0:       0f 5b c9                cvtdq2ps xmm1,xmm1
  400ff3:       0f 5e c1                divps  xmm0,xmm1
  400ff6:       41 0f 11 04 80          movups XMMWORD PTR [r8+rax*4],xmm0
  400ffb:       f3 41 0f 7e 44 47 08    movq   xmm0,QWORD PTR [r15+rax*2+0x8]
  401002:       66 0f 61 c0             punpcklwd xmm0,xmm0
  401006:       66 0f 72 e0 10          psrad  xmm0,0x10
  40100b:       0f 5b c0                cvtdq2ps xmm0,xmm0
  40100e:       f3 0f 7e 4c 43 08       movq   xmm1,QWORD PTR [rbx+rax*2+0x8]
  401014:       66 0f 61 c9             punpcklwd xmm1,xmm1
  401018:       66 0f 72 e1 10          psrad  xmm1,0x10
  40101d:       0f 5b c9                cvtdq2ps xmm1,xmm1
  401020:       0f 5e c1                divps  xmm0,xmm1
  401023:       41 0f 11 44 80 10       movups XMMWORD PTR [r8+rax*4+0x10],xmm0
  401029:       48 83 c0 08             add    rax,0x8
  40102d:       48 83 c1 02             add    rcx,0x2
  401031:       75 9d                   jne    400fd0 <main+0x570>

In each loop cycle, it processes 8 elements (there are two divps instructions, each does 4 divisions).

Here's the other case:

  401190:       f3 0f 6f 42 04          movdqu xmm0,XMMWORD PTR [rdx+0x4]
  401195:       f3 0f 6f 4a 14          movdqu xmm1,XMMWORD PTR [rdx+0x14]
  40119a:       66 0f 70 c9 e8          pshufd xmm1,xmm1,0xe8
  40119f:       66 0f 70 c0 e8          pshufd xmm0,xmm0,0xe8
  4011a4:       f2 0f 70 d0 e8          pshuflw xmm2,xmm0,0xe8
  4011a9:       66 0f 6c c1             punpcklqdq xmm0,xmm1
  4011ad:       66 0f 72 e0 10          psrad  xmm0,0x10
  4011b2:       0f 5b c0                cvtdq2ps xmm0,xmm0
  4011b5:       f2 0f 70 c9 e8          pshuflw xmm1,xmm1,0xe8
  4011ba:       66 0f 62 d1             punpckldq xmm2,xmm1
  4011be:       66 0f 61 ca             punpcklwd xmm1,xmm2
  4011c2:       66 0f 72 e1 10          psrad  xmm1,0x10
  4011c7:       0f 5b c9                cvtdq2ps xmm1,xmm1
  4011ca:       0f 5e c1                divps  xmm0,xmm1
  4011cd:       f3 0f 11 02             movss  DWORD PTR [rdx],xmm0
  4011d1:       0f 28 c8                movaps xmm1,xmm0
  4011d4:       0f c6 c9 e5             shufps xmm1,xmm1,0xe5
  4011d8:       f3 0f 11 4a 08          movss  DWORD PTR [rdx+0x8],xmm1
  4011dd:       0f 28 c8                movaps xmm1,xmm0
  4011e0:       0f 12 c9                movhlps xmm1,xmm1
  4011e3:       f3 0f 11 4a 10          movss  DWORD PTR [rdx+0x10],xmm1
  4011e8:       0f c6 c0 e7             shufps xmm0,xmm0,0xe7
  4011ec:       f3 0f 11 42 18          movss  DWORD PTR [rdx+0x18],xmm0
  4011f1:       48 83 c2 20             add    rdx,0x20
  4011f5:       48 83 c1 fc             add    rcx,0xfffffffffffffffc
  4011f9:       75 95                   jne    401190 <main+0x730>

In each loop cycle, it processes 4 elements (there is one divps instruction).

In the first case, data is in a good format, SIMD instructions can operate on them (almost) without any data-moving, and the result can be written easily (4 results are written with a single instruction).

In the second case, however, this is not the case. The compiler had to emit a lot of data-moving (shuffle) operations, and each result is written with a separate instruction. So the input/output is not in a SIMD friendly format.

I don't have the time to analyse this issue further, but if you just take the fact that both of this snippets has similar size, similar instructions, but the first processes twice as much as elements as the second one, you can have the idea why the second is slower. Sorry about the sloppy explanation.

answered Nov 01 '17 at 13:30

geza

28,403
6
61
135

No, not at all! Thank you, once more, for taking the time to explain the issue. At least, now, I can understand what's going on. When I asked the question, it did not make much sense. But now, I can reason that for problem setups where data locality is more important than this simple example, bundling things _could_ yield better performance. In my opinion, still, bundling things together in a coherent struct is better for more expressive and easy-to-debug code. At the same time, I _should_ simply remember that, for simple cases like this, I'd better choose the first type of solution. Cheers! – Arda Aytekin Nov 01 '17 at 13:38
1

@ArdaAytekin: That's a big topic actually :) Just to scratch the surface: https://en.wikipedia.org/wiki/AOS_and_SOA – geza Nov 01 '17 at 13:45
Well, I am _obviously_ not knowledgeable in the topic, but actually, this question popped out of a discussion on a project with a colleague of mine. I had watched Mike Acton's [talk](https://youtu.be/rX0ItVEVjHc) on CppCon 2014 before, and I have some mild experience on GPU programming and distributed computations. In the case which is mildly represented by this simple example, we were talking about the design decision. I will read some more on the topic, definitely. Thank you, @geza :) – Arda Aytekin Nov 01 '17 at 13:52
1

Even apart from the better SIMD compatibility of the "SoA" format, The idea that bundling would increase data locality isn't really accurate. Data locality is very good in either case: imagine your two values take 8 bytes each for 16 bytes total for the pair. That's 4 pairs in every cache line, if you pack themm in a struct or 8 individual elements if you keep them in separate arrays. So after you have processed 8 pairs, you'll be accessing two cache lines either way: first one cache line for 4 structs then the next for the struct case... – BeeOnRope Nov 04 '17 at 20:32
1

... or two cache lines at once for each element in the other. Then that repeats. Although you could argue that at a microscopic level the first pattern is slightly more local (one cache line at a time) accessing two cache lines at once isn't materially different to accessing one after another on any reasonable architecture, and this isn't what people mean when they say "locality of reference". In particular, prefetching can keep up with two streams at half the speed just about as well as one stream (and sometimes better, due to DRAM page effects). @ArdaAytekin – BeeOnRope Nov 04 '17 at 20:34
1

Also, you say _and a single memory access to the struct will give me all the variables I need_ - well not really: even if we ignore SIMD a compiler is going to use two separate accesses to get the two parts of the struct since it needs them in two different registers, which will be faster than a single read and the shift & masking to "extract" the values. It's not the "access" instructions themselves that are slow (any modern x86 can do 2 per cycle), it's the underlying cache misses, if any: and two versus one access to the same struct isn't changing that. – BeeOnRope Nov 04 '17 at 20:36
1

Finally, the data has is different sizes: 32-bit `float` versus 16-bit `int16_t`. Making a struct of a pair of these will need 8 bytes due to alignment, so 2 bytes are wasted. So "packing" them together ultimately ends up needing more bandwidth and having less locality of reference because there are 2 wasted bytes for every 6 payload bytes. The separate arrays don't have this issue. – BeeOnRope Nov 04 '17 at 20:37
Hey @BeeOnRope. Thank you very much for the insightful explanations. Now, I am not even skeptical --- I should simply avoid packing things in structs, or? Apparently, I need to do some more research/reading on the topic. So when should I use these type of techniques? Is there some specific case, or shall I just profile the code on the specific architecture once the code is totally done, and to these types of micro-optimizations? – Arda Aytekin Nov 15 '17 at 08:48
@ArdaAytekin - well whether to use SoA vs AoS is a whole topic by itself, but safe to say the vast majority of code isn't going to be performance critical and subject to this kind of vectorization, so usually you should just do what's most natural, which is almost always just grouping fields together in a structure as you have done (AoS). Only once you've profiled your application and know the hotspots, then consider data arrangements that might be less natural and have higher engineering and maintenance overhead, but are faster. – BeeOnRope Nov 15 '17 at 18:01
One problem with SoA is that it tends you have a viral effect: if you need SoA in one place for performance, you may be forced to use it in other places where it wouldn't otherwise be needed because it's too slow to transform between the formats. If you are lucky there there is a natural separation point, allowing you to keep most of your code clean, but this isn't always the case. See also [this question](https://stackoverflow.com/questions/40163722/is-my-understanding-of-aos-vs-soa-advantages-disadvantages-correct). – BeeOnRope Nov 15 '17 at 18:02
Thank you, once more, @BeeOnRope. I have bookmarked that question, too. As I stated in the question, I am not coming from a CS background, nor am I a full-time developer. I wasn't knowledgeable about this topic. When I wanted to ask between the two, I hadn't thought that this was on its own that big of a topic, to be honest :) – Arda Aytekin Nov 15 '17 at 18:07
FWIW even most people coming from a CS background won't really be aware of the SoA vs AoS distinction. You'd mostly hear about it in high performance computing, and gaming (where it is especially relevant for GPUs). Mostly you don't have to worry: just use structures as usual, and profile :) – BeeOnRope Nov 15 '17 at 18:42
@ArdaAytekin: here's the rule of thumb I use: If the code has similar complexity for SoA and AoS, and I expect (from experience) that SoA will be faster, then I'll use SoA. Otherwise I use AoS, and I may convert the code to SoA later, if the profiler say so. For example, for a particle system, where each particle has a lot of attributes (position, velocity, size, color, etc.), and each algorithm works on just a small set of attributes, then I'll use SoA, because it is more cache friendly, and code complexity is the same for SoA/AoS. – geza Nov 15 '17 at 19:05

score 3 · Answer 2 · 2017-12-23T04:52:27.180

[...] and collecting them inside a container should yield better performance.

I think your intuition is a little bit backwards for sequential access cases. Just considering sequential loops on inputs of non-trivial size, SoA is almost always going to be as fast or faster than AoS for sequential access.

In many cases the SoA will actually net fewer cache misses total than AoS because it avoids structure padding (doesn't face the alignment requirements of interleaving non-homogeneous fields) and you might be able to avoid loading cold fields for a given loop (ex: a physics simulator might be able to avoid loading the color field of a particle, used only for rendering, when applying physics). Your case doesn't benefit from either of these but it's something to keep in mind.

Where AoS tends to excel is for random access. In that case if you load, say, the 4096th element, you might only require a single cache line with the AoS to get at all three fields. If you use SoA, you would require 3 and it might load in a lot of data for neighboring elements (data for element 4097, 4098, 4099, etc), none of which gets used prior to eviction due to the random-access nature of the memory access pattern. But for sequential access, all such neighboring data will generally be used even with the SoA rep prior to eviction.

So in terms of cache misses in cases where you're sequentially looping over non-trivial input sizes, SoA will generally either tie or win.

But furthermore in such sequential access patterns where all neighboring data will be used prior to eviction, when you consider the dynamics when the data is loaded from cache to SIMD register, the SoA provides homogeneous data. It can load memory in the form of, say, float, float, float, float, ... and int16, int16, int16, int16, ..., not float, int16, int16, float, int16, int16 ... with vertical/symmetrical operations being performed across SIMD registers. Such cases tend to offer far more optimization opportunities for both human and compiler, and that's likely where your particular case is benefiting the most as Geza points out.

Finally, is there a way to beat std::transform in this example, or, is it simply good enough to be a go-to solution?

I think trying to beat std::transform while fulfilling the same requirements is the wrong way of looking at it, but you might be able to squeeze some performance gains by changing them a bit. For example, instead of std::divides, you might be able to prep the data in advance to turn that into a multiplication. The biggest thing I'd look for in your case is to see if the code can be executed in parallel against the faster SoA ("unpacked") rep. You might be able to process a given index range of each SoA container in each separate thread, still using std::transform in each one.

Thank you @DrunkCoder, for your comprehensive answer. That clarified a lot of things, especially when it comes to the dynamics involved with SoA and SIMD operations. However, as for your last paragraph, I am afraid we cannot prepare the data to change division to multiplication, nor are we allowed to benefit from parallelism. Yet, I thank you once more for providing a thorough response considering different perspectives all at once! Cheers and happy holiday period! :) — Arda Aytekin, Dec 25 '17 at 07:42
Cheers! The SoA rep (parallel arrays, i.e.) tends to excel if you use sequential access patterns and the overhead of maintaining these parallel vectors/arrays isn't a hotspot so much as looping through the data sequentially. So I'd keep that SoA representation in your case and maybe see if you can spot other optimization opportunities if needed -- but otherwise a sequential loop doing arithmetic through SoAs is already pretty good if the algorithm cannot be applied in better than linear-time complexity. Conceptually you can't really beat that short of very micro-level optimizations, ... — , Dec 25 '17 at 08:15
... finding opportunities to pre-prep the data if possible for faster processing, or sacrificing some precision of the results in favor of speed.... or using a totally different piece of hardware like GPU if applicable. — , Dec 25 '17 at 08:17
As for when you should and shouldn't use AoS ("packed"), typically I'd lean towards the side of the AoS since it tends to yield more straightforward code that's easier to maintain and is reasonably well-suited for all kinds of access patterns. However, if you know in advance or in hindsight that you're going to be spending the bulk of your performance-critical loops just sequentially looping over large amounts of data, that's when SoA reps can be handy (even though the resulting code can be a bigger PITA to maintain). — , Dec 25 '17 at 08:25
As an example, I would reach for SoAs upfront with particle simulators, since they spend so much time just sequentially looping through each particle: `for each particle, move it around` and the number of particles can span in the millions. So I don't even bother with AoS reps anymore in those cases since it can be a pain to rewrite AoS reps to SoA or vice versa. I've always found the SoA rep to outperform the AoS in those cases enough times after profiling to just go with an SoA upfront when I anticipate a strong requirement for fast sequential processing. — , Dec 25 '17 at 08:28
Another thing to think about is memory usage. For sequential processing, often the less memory required to process, the faster it goes. With AoS reps, you end up wasting a little bit of memory with structure padding if subsequent fields require a different alignment than they would normally have if one data field immediately followed the next in memory. As an example, a 32-bit `float` followed by a `char` followed by another `float` would waste 3 bytes, since `char` only requires 1-byte alignment while `float` tends to require 4, so to align that 3rd data field, the compiler would add 3 bytes. — , Dec 25 '17 at 08:31

Understanding std::transform and how to beat it

2 Answers2