The Threading Building Blocks (TBB) library provides two functions for performing a reduction over a range: parallel_reduce and parallel_deterministic_reduce.
Which of the two should I choose if I want the reduction to run as fast as possible, but still get exactly the same answer regardless of hardware concurrency and of the load from other processes or threads? I am mainly interested in two scenarios:
- Computing the sum of the elements of an integer-valued vector.
- Computing the sum of the elements of a floating-point-valued vector.
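To show why I separate the two scenarios, here is a minimal sketch that does not use TBB at all. The helper chunked_sum is hypothetical: it only imitates a reduction that splits the input into consecutive chunks and then adds the partial sums in order. Integer addition is associative (ignoring overflow), so the chunk size should not matter there, while floating-point addition rounds at every step, so a different split can give a different result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of TBB): sums v in consecutive chunks of
// `chunk` elements, then adds the partial sums from left to right.
// `chunk` must be at least 1.
template <typename T>
T chunked_sum(const std::vector<T>& v, std::size_t chunk) {
    T total = T(0);
    for (std::size_t i = 0; i < v.size(); i += chunk) {
        const std::size_t end = std::min(v.size(), i + chunk);
        T partial = T(0);
        for (std::size_t j = i; j < end; ++j) {
            partial += v[j];  // sequential sum inside one chunk
        }
        total += partial;     // combine partial results in order
    }
    return total;
}
```

For std::vector<double>{1e17, 1.0, -1e17, 1.0} with IEEE 754 doubles, chunked_sum(v, 4) gives 1 but chunked_sum(v, 2) gives 0, whereas for a std::vector<int> every chunk size gives the same sum. This is the kind of variation I want to rule out.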
And a side question. The documentation page for parallel_deterministic_reduce contains this warning:
"Since simple_partitioner does not automatically coarsen ranges, make sure to specify an appropriate grain size."
Does this mean that calling parallel_deterministic_reduce with a range that has no explicitly specified grain size will lead to poor performance? How should the grain size be set in that case?
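To make the concern concrete: blocked_range uses a grain size of 1 by default, and my understanding (an assumption on my part, not taken from the TBB sources) is that simple_partitioner keeps halving a range for as long as it is larger than the grain size. The sketch below is a hypothetical model of that splitting, not TBB code; count_leaves just counts how many leaf subranges such a scheme would produce.

```cpp
#include <cstddef>

// Hypothetical model (not TBB code): keep halving [begin, end) while it
// holds more than `grain` elements, and return the number of leaf
// subranges that remain. `grain` must be at least 1.
std::size_t count_leaves(std::size_t begin, std::size_t end, std::size_t grain) {
    if (end - begin <= grain) {
        return 1;  // small enough: this subrange would be summed serially
    }
    const std::size_t mid = begin + (end - begin) / 2;
    return count_leaves(begin, mid, grain) + count_leaves(mid, end, grain);
}
```

Under this model a range of 1024 elements yields 1024 leaves with a grain size of 1 but only 4 leaves with a grain size of 256, which is why I suspect the default could be costly for a cheap operation like addition.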