
Which is faster: a boost::multi_array or a std::vector? I will have roughly 17,179,869 elements (the size is not constant) stored in 3 dimensions, which need to be accessed inside a for loop very rapidly and very often. Which of the two would perform best?

(I don't expect it to finish within a second, but I would like it to be as efficient as possible, because even a nanosecond per access adds up to a lot of time.)

Jeroen
    If you care, implement both and measure.... – Tony Delroy Jul 26 '13 at 17:36
  • Depends on how you are iterating, and if you are iterating over POD. std::vector, as a flattened array, will most likely be the fastest iff you are iterating over the smallest dimension and iterating using a pointer rather than the [] or iterator operators. It really won't make that much of a difference if you are doing any significant amount of processing on the elements. – IdeaHat Jul 26 '13 at 17:37
  • @MadScienceDreams: I believe that [Boost.MultiArray](http://www.boost.org/doc/libs/1_54_0/libs/multi_array/doc/reference.html) are also stored in a single contiguous block. From the synopsis of the `data` accessor: *This returns a pointer to the beginning of the contiguous block that contains the array's data. [...]* – Matthieu M. Jul 26 '13 at 18:24
  • @MadScienceDreams Using a pointer instead of plain [] increased the load time from 61 seconds to 188 seconds. – Jeroen Jul 26 '13 at 18:40
  • @Binero That is... odd. If you get the pointer once (i.e. ptr = &arr[0]) and then increment it during iteration (i.e. ptr++), then the performance should at worst be about the same, and prefetching should make it an insignificant amount better. – IdeaHat Jul 26 '13 at 19:07
  • @MatthieuM. Yeah, it's the difference in how you iterate. With a multiarray, M[i][j][k] has a hidden i*size(2)*size(1) + j*size(1) + k on each access, while if you increment the pointer (++) and make sure it is stored in a register, it should be faster; but the speed-up you get from this will probably be insignificant compared to the computation time on the elements. – IdeaHat Jul 26 '13 at 19:12
  • @MadScienceDreams: Not necessarily. 1/ Boost.MultiArray has slices built-in, so you can take a slice in the outer loop and thus iterate over the slice with `slice[k]` in the inner loop; 2/ In release mode, I would expect the compiler to see through and hoist as much stuff as possible outside the loops, it's such an obvious implementation that it's been implemented for ages, of course though this would have to be checked (just to be on the safe side). – Matthieu M. Jul 27 '13 at 09:26

3 Answers


The best advice is to benchmark it yourself.

In any case, since you seem to have a constant size, there are other options:

  • plain C arrays (e.g. `int data[X][Y][Z]`)
  • a plain one-dimensional C array in which you compute the indices yourself, e.g. `x*Y*Z + y*Z + z`; this can be handy in some situations
  • `std::array`, which is basically a C-style array with some syntactic sugar borrowed from the STL containers
  • `std::vector`, which I guess is the first solution to try
  • `boost::multi_array`, which is meant to support N-dimensional arrays, so it may be overkill for your purpose, but it probably has better data locality than a vector of vectors
Jack
  • The problem is that the size isn't constant, that's just the amount of values that I'm testing with, which is 1/8th of the total size. – Jeroen Jul 26 '13 at 17:40
  • 2
    Did you mean `boost::multi_array` (in your last bullet point) instead of `std::multi_array`? The latter doesn't exist. – Cassio Neri Jul 26 '13 at 17:53
  • Personally I would suggest either `std::array` or `std::vector` from the above so that you don't have to worry about memory and all the nasty things that can happen with that. Along with this, I would suggest using the indexing strategy in the second bullet point – wlyles Jul 26 '13 at 18:08
  • @wlyles The indexing strategy in the second point would take up too much performance as I need the current dimension in every single loop. – Jeroen Jul 26 '13 at 18:24
  • @Binero: you can of course cache part of the computation if some of the indexes are not moving! – Matthieu M. Jul 26 '13 at 18:25
  • That'd require an if statement that gets evaluated over a billion times every time I loop. The gain you get from caching only appears when you have multiple dimensions. – Jeroen Jul 26 '13 at 18:29
  • @MatthieuM. And I do actually cache stuff in low level dimensions. – Jeroen Jul 26 '13 at 19:27

Those library vector classes are designed to be easy to use and relatively fail-safe. They are as fast as they can be within their design, but nothing beats doing it yourself (except maybe hand-coded assembly). For the size you're talking about (2e10 elements), I would be more concerned with efficiency than with user-friendliness. If your innermost loop does very little computation per element, the indexing calculations will dominate, which suggests doing some unrolling and pointer-stepping. (Maybe you can count on the compiler to do some unrolling, but I don't care for maybes.)

Mike Dunlavey

The only way to know for sure is to try both and profile the code. However, here are a few thoughts on what I think you'll find.

  1. For the number of elements you are dealing with (2e10+), accessing individual elements will matter less than the cache pressure of loading them into the CPU cache. The prefetcher will spend a large proportion of the time pulling those elements in.
  2. Accessing 2D (or 3D) non-contiguous C arrays means the CPU has to fetch from different parts of memory. boost::multi_array solves that somewhat by storing the data as a contiguous block behind the scenes, but it has its own overheads for doing so. As @Jack said, a plain 1D array with computed indices is best, and even then you can do things to keep the indexing cost minimal (e.g. memoization).
  3. The work you do within the loop will affect your timings significantly; the branch predictor is the biggest contributor. If it's simple math with no if/else statements, you'll likely get the best performance, and the compiler will likely optimise it to SSE instructions. If you have composite types (rather than int/float/char), you'll have to lay them out carefully to optimise access.
  4. I would almost suggest trying both, then coming back with a new SO question that shows your loop and asks how to optimise that part. The compiler can almost always be given hints to make your intentions clear.

At the end of the day, try it and see.

Delta_Fore