
I was looking for a thorough explanation of the meaning of the "Return value" description for std::reduce, which, according to cppreference.com, is:

[image: the "Return value" section of the cppreference page for std::reduce, describing the result as the generalized sum of init and the elements in [first, last) over binary_op, where the operands may be grouped and permuted in arbitrary order]

Maybe after I properly understand this section, I can better determine when I should choose std::reduce over std::accumulate.

Aria Pahlavan
  • "in other words, reduce behaves like std::accumulate except the elements of the range may be grouped and rearranged in arbitrary order" – Justin Nov 06 '17 at 19:14
  • It basically only exists so that we can compute addition in parallel. – AndyG Nov 06 '17 at 19:15
  • `std::accumulate` is a left-fold, guaranteed. That's not something that's parallelizable. `std::reduce` can be parallelized because you can compute `(a + b) + (c + d)` instead of `((a + b) + c) + d` – Justin Nov 06 '17 at 19:16
  • I would think the `ExecutionPolicy` parameter has a lot more to do with the algorithm's existence than the return value does... >_> – ildjarn Nov 06 '17 at 19:17
  • That statement means: std::accumulate works from left to right, std::reduce may work in arbitrary order. This requires the operation to be commutative and associative for stable results, but allows parallel execution. – peterchen Nov 06 '17 at 19:17
  • why they didn't just call these things `foldl` and `fold` I'll never understand. – experquisite Nov 20 '17 at 15:59
  • because software engineers disagree on everything. – Aria Pahlavan Nov 28 '17 at 17:35
  • why not use a type_trait to check associativity? –  Jan 24 '19 at 09:32
  • Left folds for associative non-commutative operations are paralellizable, but not to the same degree as an associative commutative operation. Reasons are related to memory access, at least. Left folds are still important. – cwharris Jun 11 '20 at 02:58

3 Answers


Since you asked for a thorough explanation, and the previous answer covers only basics, I'm taking the liberty of adding a more thorough one.

std::reduce is intended to perform the second major step of the MapReduce programming model. The basic idea is that the platform (in this case, C++ implementation) provides these two primitive operations map and reduce, and the programmer provides callback operations for each of the two that perform the "actual work".

Basically, the callback for the map operation maps an input value to an intermediate value. The callback for reduce combines two intermediate values into one intermediate value. The last intermediate value left becomes the output of the whole MapReduce. It seems to be a very restrictive model per se, but still, it has a large range of applications.

The platform has to do more, of course, such as "shuffling" (distributing the inputs, usually in groups, to different processing units) and scheduling, but this is of little concern to the application programmer.

By the way, in the C++ standard library, the "map" operation is called transform. It has received parallelism support in C++17 as well, but I'll get into parallelism later.
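To show how the two pieces line up syntactically, here is a minimal, purely sequential sketch of my own (the values and the "square each element" map are arbitrary; a more interesting example follows below):

```cpp
#include <algorithm>  // std::transform ("map")
#include <iostream>
#include <numeric>    // std::reduce
#include <vector>

int main() {
    std::vector<int> input{1, 2, 3, 4, 5};

    // "map": each input value becomes an intermediate value (here: its square).
    std::vector<long long> intermediate(input.size());
    std::transform(input.begin(), input.end(), intermediate.begin(),
                   [](int x) { return static_cast<long long>(x) * x; });

    // "reduce": combine two intermediate values at a time until one is left.
    // Addition is associative and commutative, so grouping/order don't matter.
    long long result = std::reduce(intermediate.begin(), intermediate.end(), 0LL);

    std::cout << result << '\n';  // 55
}
```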

Here's a made-up example: Let's say we have a function that converts an integer to a string representation. Now, given a list of integers, we want the textual representation containing the greatest ratio of consonants to vocals. E.g.

  • Input: 1, 17, 22, 4, 8
  • Output: twenty-two

(Try it for yourself if you don't believe this result.)

We could use MapReduce here by using our int-to-text function as the callback to map (resp. std::transform), and a function that counts the number of consonants and vocals and then selects either the left-hand or right-hand argument accordingly. There are some inefficiencies here, in particular, the "ratio" should be cached, but these are details.
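A rough, self-contained sketch of that example (my own code, not part of the original answer; the to_text table covers only the five inputs above, and the ratio helper recomputes the counts on every comparison, which is exactly the caching inefficiency just mentioned):

```cpp
#include <algorithm>
#include <cctype>
#include <iostream>
#include <limits>
#include <numeric>
#include <string>
#include <unordered_map>
#include <vector>

// Toy int-to-text function covering only the example inputs.
std::string to_text(int n) {
    static const std::unordered_map<int, std::string> words{
        {1, "one"}, {17, "seventeen"}, {22, "twenty-two"}, {4, "four"}, {8, "eight"}};
    return words.at(n);
}

// Consonant-to-vowel ratio, recomputed on every call (the inefficiency noted above).
double ratio(const std::string& s) {
    int vowels = 0, consonants = 0;
    for (char ch : s) {
        if (!std::isalpha(static_cast<unsigned char>(ch))) continue;  // skip '-'
        char c = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') ++vowels;
        else ++consonants;
    }
    if (vowels == 0) return consonants == 0 ? -1.0 : std::numeric_limits<double>::max();
    return static_cast<double>(consonants) / vowels;
}

int main() {
    const std::vector<int> input{1, 17, 22, 4, 8};

    // "map": int -> textual representation
    std::vector<std::string> texts(input.size());
    std::transform(input.begin(), input.end(), texts.begin(), to_text);

    // "reduce": keep whichever operand has the higher consonant-to-vowel ratio
    const std::string best = std::reduce(
        texts.begin(), texts.end(), std::string{},
        [](const std::string& a, const std::string& b) {
            return ratio(a) >= ratio(b) ? a : b;
        });

    std::cout << best << '\n';  // twenty-two
}
```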

Now your question may, and possibly should, be: why should I care? After all, so far you didn't gain much from using e.g. std::transform and std::reduce in this way, and you could have used std::accumulate in place of the latter as well. The answer, of course, given a sufficiently large number of input values, is execution policies – the former two standard function templates have overloads that allow implicit parallelism. But that still raises the question of why one would use MapReduce and not a thread pool or std::async and a bunch of hand-written loops. First, especially for "pure" computations on large vectors or other containers, with no I/O, it is often more convenient to write the two MapReduce callbacks because you don't have to deal with all the details of how the input data is spread around to different threads and then combined.
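For illustration, the parallel overloads look like this. This is only a sketch and assumes your standard library actually implements the execution policies (with GCC/libstdc++, for example, this typically also requires linking against TBB); whether you see any speedup depends on the implementation and the data size.

```cpp
#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(10'000'000, 0.1);

    // Parallel "map": square every element in place.
    std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                   [](double x) { return x * x; });

    // Parallel "reduce": sum the intermediates in an unspecified grouping and order.
    const double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);

    std::cout << sum << '\n';
}
```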

Second, MapReduce encourages a discipline of structuring your computations in a way that can be parallelized very effectively. Of course, in a programming language that supports the imperative paradigm, such as C++, you can still mess things up by locking a bunch of mutexes, or whatever way you may have of interfering with other threads. But the MapReduce paradigm encourages writing independent mapping callbacks. As a simple rule of thumb, if these tasks share data at all, then it should be read-only so that copies of it can be stored in multiple CPU caches at the same time. Well-written tasks, combined with an efficient platform implementation of the underlying algorithms, can scale to hundreds or even thousands of CPU cores, as is shown by the MapReduce platforms already in common use (like Apache Hadoop, but take this only as a necessary example and not as gratuitous advertisement).

But the question was mainly about std::reduce – we could still perform this highly scalable mapping and then run std::accumulate on the intermediate results, right? And this is where we get to what François Andrieux previously wrote. accumulate performs what mathematicians call a left fold. If you view the operations and their operands as a binary tree, then this would be a left-leaning tree, e.g. to add 1, 2, 3 and 4:

        +
       / \
      +   4
     / \
    +   3
   / \
  1   2

As you can see, the result of each operation is one of the arguments of the next operation. This means there is a linear chain of data dependencies, and that is the bane of all parallelism. To add a million numbers, you need a million consecutive operations, which will block a single core, and you cannot make any use of additional cores. They will have nothing to do most of the time, and the communication overhead will greatly outweigh the cost of the computations. (It's actually worse than that because modern CPUs can perform multiple simple calculations per clock, but this doesn't work when there are so many data dependencies, so most of the ALUs or FPUs go unused.)
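To make that dependency chain concrete, std::accumulate behaves essentially like the following loop (a simplified sketch of the specified behavior, not any particular library's implementation): each iteration needs the result of the previous one before it can start.

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4};

    // In essence, what std::accumulate is specified to do: one serial chain,
    // where iteration i cannot start before iteration i-1 has produced acc.
    int acc = 0;
    for (int x : v)
        acc = acc + x;  // data dependency on the previous iteration's result

    std::cout << acc << '\n';  // 10
}
```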

By lifting the restriction that a left fold must be used, std::reduce allows the platform to make more efficient use of computational resources. Even if you use the single-threaded overload, the platform could, for example, use SIMD to add a million integers in much less than a million operations, and the number of data dependencies will be greatly reduced. A 10x speedup on a well-written integer addition reduce would not surprise me. Granted, this particular special case could probably be optimized under the as-if rule because the C++ implementation "knows" that integer addition is (almost, see below) associative.

But reduce goes further than that, as was mentioned, by supporting execution policies, i.e. in most cases multi-core parallelism. In theory, a balanced binary tree of operations could be used. (Recall that a binary tree is balanced if its depth is less than two, or if both of its subtrees are balanced and their depths differ by at most 1.) Such a tree has only logarithmic depth. If we have a million integers, the minimum tree depth is 20, so – theoretically – given enough cores and no communication overhead, we could achieve a speedup by a factor of roughly 50,000 over even the optimized single-threaded calculation. Of course, in practice, that's a load of wishful thinking, but we can still expect large speedups.
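The balanced-tree idea can be illustrated with a hand-rolled pairwise reduction (my own sketch, not how any real implementation of std::reduce necessarily works): the number of partial sums halves at every level, the additions within one level are independent of each other, and the depth of the dependency chain is only about log2(N).

```cpp
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Pairwise (balanced-tree) sum: roughly log2(N) levels instead of a chain of
// N-1 dependent additions; the additions within one level are independent.
long long pairwise_sum(std::vector<long long> values) {
    while (values.size() > 1) {
        std::vector<long long> next;
        next.reserve((values.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < values.size(); i += 2)
            next.push_back(values[i] + values[i + 1]);  // independent of its siblings
        if (values.size() % 2 != 0)
            next.push_back(values.back());  // odd element is carried up unchanged
        values = std::move(next);
    }
    return values.empty() ? 0 : values.front();
}

int main() {
    std::cout << pairwise_sum({1, 2, 3, 4, 5, 6, 7, 8}) << '\n';  // 36
}
```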

All that said, let me add a quick disclaimer/reminder that performance is not the same as efficiency. Using 64 cores for 100ms each means much higher performance than using one core for 1,000ms, but much less CPU efficiency. Another way to put it is that performance is efficiency in the sense of minimizing the elapsed time, but there are other measures of efficiency – total CPU time used, RAM used, energy used, and so on. The primary motivation of parallel MapReduce is to provide higher performance. Whether it reduces CPU time or energy consumption is unclear, and it will very likely increase peak RAM usage.

To top it all off, here are some caveats. As was mentioned, reduce is non-deterministic if the operations are not associative or not commutative. Fortunately, the most important arithmetic operations such as addition and multiplication are associative and commutative, right? We all know that integer and floating-point addition, for example, have both of these properties. And of course, I am being facetious. In C++, neither signed integer addition nor floating-point addition is associative. For floating-point numbers, this is due to possible differences in the rounding of intermediate results. This is easily visualized if we, as an example, pick a simple decimal floating-point format with two significant digits and consider the sum 10 + 0.4 + 0.4. If this is done by the normal C++ syntax rules as a left fold, we get (10 + 0.4) + 0.4 = 10 + 0.4 = 10, because each time the result is rounded back down to 10. But if we do it as 10 + (0.4 + 0.4), the first intermediate result is 0.8, and 10 + 0.8 is then rounded up to 11.

Also, rounding errors can become greatly magnified by a high depth of the operation tree, so doing a left fold is actually one of the worst things one could do when it comes to accuracy. There are several ways to deal with this behavior, ranging from sorting and grouping the inputs to using increased intermediate precision, but when it comes to reduce there may simply be no way to get 100% run-for-run consistency.
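The same grouping effect is easy to reproduce with ordinary doubles (my own example, using different numbers from the two-digit decimal illustration above):

```cpp
#include <iostream>

int main() {
    const double big = 1e20, negbig = -1e20, small = 1.0;

    // Left fold, as std::accumulate is required to evaluate it:
    const double left_fold = (big + negbig) + small;   // (0) + 1 == 1

    // Regrouped, as std::reduce is allowed to evaluate it:
    const double regrouped = big + (negbig + small);   // 1 is absorbed into -1e20, result is 0

    std::cout << left_fold << ' ' << regrouped << '\n';  // prints "1 0"
}
```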

The other, possibly more surprising, observation is that signed integer addition is not associative in C++. The reason for this is the possibility of overflow; to put it bluntly: (-1) + 1 + INT_MAX. According to the normal syntax rules, or with accumulate, the result is INT_MAX. But if you use reduce, this might get regrouped as (-1) + (1 + INT_MAX), which contains an integer overflow and hence undefined behavior. In fact, because reduce may also arbitrarily change the order of the operands, this is even true if the inputs are { INT_MAX, -1, 1 }.

My recommendation here is to ensure that the callback to reduce cannot produce an overflow. This could be done by limiting the allowed range of inputs (e.g. if you add 1000 ints, make sure that none of them is greater than INT_MAX / 1000 or less than INT_MIN / 1000, rounded up), or, alternatively, by using a larger integer type, or, as an absolute last resort (because this is both expensive and hard to handle correctly), by putting additional checks into the reduce callback. In most practical cases, however, reduce is only marginally less safe regarding integer overflow than accumulate.
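As a sketch of the "larger integer type" option (my own example, using the problematic inputs from the previous paragraph): note that, as far as I can tell from the specification, just passing a wider init to std::reduce is not quite enough, because the implementation may still combine two int elements directly. Widening each element before it enters the reduction, e.g. with std::transform_reduce, keeps every addition in the wider type.

```cpp
#include <climits>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    const std::vector<int> v{INT_MAX, -1, 1};

    // Widen each element to long long before it takes part in any addition,
    // so every intermediate sum is computed in the wider type.
    const long long sum = std::transform_reduce(
        v.begin(), v.end(), 0LL, std::plus<>{},
        [](int x) { return static_cast<long long>(x); });

    std::cout << sum << '\n';  // INT_MAX, with no signed overflow along the way
}
```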

Arne Vogel
  • Thanks for the very thorough explanation! But let's say I want to map a list of strings to uppercase then concatenate them. `reduce` wouldn't mix up the ordering, would it? The only difference between using reduce and accumulate in this example would be having the luxury of parallelism, correct? – Aria Pahlavan Nov 09 '17 at 17:07
  • @AriaPahlavan no, string concatenation is one of the go-to examples of an operation that is associative but not commutative. "Hello" + " World" is distinctly different to " World" + "Hello" – Caleth Nov 09 '17 at 17:32
  • Indeed, the important part is that the wording allows `reduce` to use "any permutation of" the input values. One could annotate the inputs with an index and flip the arguments of the concatenation if necessary, but the word "shoe-horning" comes to mind. A much cleaner solution would be to also have a `std::stable_reduce` with theoretically a bit less performance but not requiring commutativity. Maybe I'll even get around to writing a proposal… As a side note, I discovered that they added a convenient algorithm `std::transform_reduce`. Doesn't help with non-commutative operations, though. – Arne Vogel Nov 09 '17 at 18:44
  • Well, I just realized that my suggested workaround doesn't work either: E.g. the inputs `{ "Hello"s, "Dear"s, "World"s }` may get reordered to `{ "Hello"s, "World"s, "Dear"s }`, and then if `"Hello"s` and `"World"s` are concatenated first, `"Dear"s` will be misplaced on either side of the result. OTOH, this does help to emphasize that a `stable_reduce` would be much more appropriate here. – Arne Vogel Nov 09 '17 at 19:59
  • Re: "...ratio of consonants to vocals" - I assume you mean **vowels** not **vocals**. – Will May 26 '23 at 06:09

std::accumulate iterates over the container in order, whereas std::reduce might not. Because the order is not guaranteed, std::reduce introduces extra requirements:

The behavior is non-deterministic if binary_op is not associative or not commutative. The behavior is undefined if binary_op modifies any element or invalidates any iterator in [first; last], including the end iterator.

However, std::reduce provides overloads that support parallelization, which are not available with std::accumulate. std::reduce allows you to automatically parallelize your operation, provided it meets the requirements mentioned above.

François Andrieux

  • Allowing parallelism is the main reason for the addition of std::reduce.

  • Also, one needs to make sure that the operation to be used with std::reduce is both associative and commutative.

    • For example, addition is associative and gives the same results when the accumulation is done in parallel using std::reduce:
      100 + 50 + 40 + 10 = 200
      (100 + 40) + (50 + 10) = 200

    • But subtraction, not being associative, may give wrong results with std::reduce:
      100 - 50 - 40 - 10 = 0
      (100 - 40) - (50 - 10) = 20 (NOT the same)

  • Efficiency

    std::vector<double> v(100000, 0.1);
    double result1 = std::accumulate(v.begin(), v.end(), 0.0);             // sequential left fold
    double result2 = std::reduce(std::execution::par, v.begin(), v.end()); // parallel overload, typically faster

kbagal
  • Is there a difference if you use std::execution::par_unseq instead of std::execution::par? As the accepted answer pointed out the reduce is non-deterministic. Does par make it deterministic? – hof.and.or Mar 12 '21 at 19:38