
Does anyone have any tips for efficiently parallelizing std::partition using TBB? Has this been done already?

Here is what I'm thinking:

  1. if the array is small, std::partition it (serial) and return
  2. else, treat the array as 2 interleaved arrays using custom iterators (interleave in cache-sized blocks)
  3. start a parallel partition task for each pair of iterators (recurse to step 1)
  4. swap elements between the two partition/middle pointers*
  5. return the merged partition/middle pointer

*I am hoping in the average case this region will be small compared to the length of the array or compared to the swaps required if partitioning the array in contiguous chunks.
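To make this concrete, here is a rough single-level sketch of the idea. It strides by 2 instead of interleaving in cache-sized blocks, and it runs the two sub-partitions serially where step 3 would recurse; `stride_iterator`, `interleaved_partition`, and the cutoff value are all just placeholders:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <tbb/parallel_invoke.h>

// Placeholder stride-2 view over a random-access sequence: element i of the
// view is base[2*i + offset]. Bidirectional is enough for std::partition.
template <typename It>
struct stride_iterator {
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef typename std::iterator_traits<It>::value_type value_type;
    typedef typename std::iterator_traits<It>::difference_type difference_type;
    typedef typename std::iterator_traits<It>::pointer pointer;
    typedef typename std::iterator_traits<It>::reference reference;

    It base;
    std::size_t i, offset;
    stride_iterator(It b, std::size_t idx, std::size_t off)
        : base(b), i(idx), offset(off) {}
    reference operator*() const { return base[2 * i + offset]; }
    stride_iterator& operator++() { ++i; return *this; }
    stride_iterator& operator--() { --i; return *this; }
    bool operator==(const stride_iterator& o) const { return i == o.i; }
    bool operator!=(const stride_iterator& o) const { return i != o.i; }
};

template <typename RandomIt, typename Pred>
RandomIt interleaved_partition(RandomIt first, RandomIt last, Pred pred) {
    const std::size_t n = last - first;
    const std::size_t cutoff = 4096;              // arbitrary tuning value
    if (n < cutoff)
        return std::partition(first, last, pred); // step 1: serial base case

    // Steps 2-3: partition the even- and odd-indexed subsequences in parallel
    // (a full version would recurse here instead of calling std::partition).
    const std::size_t n_even = (n + 1) / 2, n_odd = n / 2;
    stride_iterator<RandomIt> m_even(first, 0, 0), m_odd(first, 0, 1);
    tbb::parallel_invoke(
        [&] { m_even = std::partition(stride_iterator<RandomIt>(first, 0, 0),
                                      stride_iterator<RandomIt>(first, n_even, 0),
                                      pred); },
        [&] { m_odd  = std::partition(stride_iterator<RandomIt>(first, 0, 1),
                                      stride_iterator<RandomIt>(first, n_odd, 1),
                                      pred); });

    // Steps 4-5: everything before min(middles) is already "true" and
    // everything from max(middles) on is already "false"; only the gap
    // between the two middles needs fixing up (re-partitioning the gap is
    // equivalent to the swapping in step 4).
    RandomIt pe = first + std::min(n, 2 * m_even.i);
    RandomIt po = first + std::min(n, 2 * m_odd.i + 1);
    return std::partition(std::min(pe, po), std::max(pe, po), pred);
}
```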

Any thoughts before I try it?

atb
  • If you're using gcc, you could start by compiling with `-fopenmp` and defining `_GLIBCXX_PARALLEL` to use the parallel versions of the standard library that have already been written, tested, etc. If you really want to *only* run `std::partition` in parallel, you can include `parallel/algorithm` and call `__gnu_parallel::partition`. – Jerry Coffin May 29 '14 at 00:13
  • This needs to compile with gcc and clang. Also, the code doing the partition will already be running in a TBB task; will the OpenMP scheduler and the TBB scheduler play nicely together? – atb May 29 '14 at 00:41
  • I'd *expect* them to have put some effort into making them play nicely together, but I haven't tested it, so it's hard to be sure. – Jerry Coffin May 29 '14 at 01:13
  • No, OpenMP usually creates quadratic oversubscription if called inside an outermost parallel loop (of TBB or something else); that's inherent to the OpenMP specification. The level of oversubscription can be reduced via the OMP_DYNAMIC mode, but nesting two different runtimes is still inefficient. – Anton May 29 '14 at 04:26
  • BTW, if you don't want to depend on GCC's extensions but are OK using GPL3 code, you can just bundle their implementation. You can also convert their code to TBB if you really want to avoid OpenMP. It looks to me like they used divide and conquer generalized to an arbitrary thread count. https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a00973_source.html#l00056 – Adam May 29 '14 at 07:00
  • It's extremely hard to translate an OpenMP program that uses plain parallel regions and barriers (and thus relies on the number of threads) to TBB, since TBB is all about hiding threads and optional parallelism. TBB has no barriers; they're outside the law there. Though GCC's implementation looks like an inherently recursive algorithm stretched onto a flat OMP region, so if one can understand it and translate it natively to TBB, it looks like a promising approach. – Anton May 29 '14 at 13:13
  • The interleaved arrangement is likely to create a lot of false-sharing conflicts (i.e., send cache lines bouncing around) unless each item occupies a whole cache line. – Arch D. Robison May 30 '14 at 18:51
  • Here are some results from testing __gnu_parallel::partition on a std::array of 1,000,000,000 pointers: https://software.intel.com/sites/default/files/managed/2d/25/gnu_parallel_partition.jpg – atb Jun 02 '14 at 21:41

4 Answers


I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, do Θ(N) work, and have Θ(P + lg N) span (critical path). The last two quantities are important for analysis, since speedup is limited to work/span.

I'm assuming the input is a random-access sequence. The steps are:

  1. Allocate a temporary array big enough to hold a copy of the input sequence.
  2. Divide the input into K blocks. K is a tuning parameter. For a system with P hardware threads, K=max(4*P,L) might be good, where L is a constant for avoiding ridiculously small blocks. The "4*P" allows some load balancing.
  3. Move each block to its corresponding position in the temporary array and partition it using std::partition. Blocks can be processed in parallel. Remember the offset of the "middle" for each block. You might want to consider writing a custom routine that both moves (in the C++11 sense) and partitions a block.
  4. Compute the offset to where each part of a block should go in the final result. The offsets for the first part of each block can be done using an exclusive prefix sum over the offsets of the middles from step 3. The offsets for the second part of each block can be computed similarly by using the offset of each middle relative to the end of its block. The running sums in the latter case become offsets from the end of the final output sequence. Unless you're dealing with more than 100 hardware threads, I recommend using a serial exclusive scan.
  5. Move the two parts of each block from the temporary array back to the appropriate places in the original sequence. Copying each block can be done in parallel.

There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.

If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.
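Here is one way the five steps might look in code, assuming C++11 and random-access iterators. `parallel_partition` and its internals are illustrative names rather than an existing API; tuning and support for non-default-constructible types are omitted, and K is assumed to be at least 1:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

template <typename RandomIt, typename Pred>
RandomIt parallel_partition(RandomIt first, RandomIt last, Pred pred,
                            std::size_t K) {
    typedef typename std::iterator_traits<RandomIt>::value_type T;
    const std::size_t n = last - first;
    const std::size_t block = (n + K - 1) / K;   // block length, step 2

    std::vector<T> tmp(n);                       // step 1 (assumes T is
                                                 // default-constructible)
    std::vector<std::size_t> mid(K), lo(K + 1), hi(K + 1);

    tbb::affinity_partitioner ap;  // reused so step 5 tends to run on the
                                   // same threads (and caches) as step 3
    // Step 3: move each block into tmp, partition it there, and remember
    // how many of its elements satisfy pred.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, K),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t b = r.begin(); b != r.end(); ++b) {
                const std::size_t lo_i = std::min(n, b * block);
                const std::size_t hi_i = std::min(n, lo_i + block);
                std::move(first + lo_i, first + hi_i, tmp.begin() + lo_i);
                mid[b] = std::partition(tmp.begin() + lo_i,
                                        tmp.begin() + hi_i,
                                        pred) - (tmp.begin() + lo_i);
            }
        }, ap);

    // Step 4: serial exclusive scans. lo[b] is where block b's "true" part
    // goes in the final result; hi[b] is where its "false" part goes.
    lo[0] = 0;
    for (std::size_t b = 0; b < K; ++b)
        lo[b + 1] = lo[b] + mid[b];
    hi[0] = lo[K];                               // total "true" count
    for (std::size_t b = 0; b < K; ++b) {
        const std::size_t lo_i = std::min(n, b * block);
        const std::size_t hi_i = std::min(n, lo_i + block);
        hi[b + 1] = hi[b] + (hi_i - lo_i) - mid[b];
    }

    // Step 5: move both parts of each block back to their final positions.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, K),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t b = r.begin(); b != r.end(); ++b) {
                const std::size_t lo_i = std::min(n, b * block);
                const std::size_t hi_i = std::min(n, lo_i + block);
                std::move(tmp.begin() + lo_i, tmp.begin() + lo_i + mid[b],
                          first + lo[b]);
                std::move(tmp.begin() + lo_i + mid[b], tmp.begin() + hi_i,
                          first + hi[b]);
            }
        }, ap);

    return first + lo[K];                        // the merged middle
}
```

A caller would pick K as described in step 2, for example starting from 4 * tbb::task_scheduler_init::default_num_threads() and clamping it below by a constant L.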

Note that partitioning does only Θ(N) work for Θ(N) memory loads and stores, so memory bandwidth could easily become the limiting resource for speedup.

Arch D. Robison
  • Thank you, this is a very good suggestion. I was also worried about memory bandwidth which is why I was thinking about the interleaved partition idea. I was hoping it might save a lot of copying during the merge step. If I interleave in cache-sized blocks maybe I can avoid the false sharing problem too. I'm going to code it up on Monday, it should only require implementing special interleaved iterators and using them in parallel with std::partition...doesn't seem too hard. – atb May 31 '14 at 12:40
  • I hadn't seen the affinity_partitioner before, it looks useful, thanks. – atb May 31 '14 at 12:50

Why not parallelize something similar to std::partition_copy instead? The reasons are:

  • for std::partition, in-place swaps as in Adam's solution add logarithmic complexity because of the recursive merging of the results.
  • you'll pay memory for parallelism anyway when using threads and tasks.
  • if the objects are heavy, it is more reasonable to swap (shared) pointers anyway
  • if the results can be stored concurrently then threads can work independently.

It's pretty straightforward to apply a tbb::parallel_for (for random-access iterators) or tbb::parallel_for_each (for non-random-access iterators) to start processing the input range. Each task can store its 'true' and 'false' results independently. There are lots of ways to store the results; some off the top of my head:

  • using tbb::parallel_reduce (only for random-access iterators), store the results locally in the task body and move-append them in join() from the other task
  • use tbb::concurrent_vector's grow_by() method to copy local results in a bunch, or just push_back() each result separately on arrival
  • cache thread-local results in a tbb::combinable TLS container and combine them later

The exact semantics of std::partition_copy can be achieved by copying from the temporary storage above, or:

  • (only for random-access output iterators) use atomic<size_t> cursors to synchronize where to store the results (assuming there is enough space)
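For illustration, here is a sketch of the first option above, using tbb::parallel_reduce with task-local storage; `partition_copy_body` and `parallel_partition_copy` are names made up for this example:

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Reduction body: each task collects its "true" and "false" results locally;
// join() move-appends the results of the task that covered the subrange to
// the right, so relative order is preserved.
template <typename RandomIt, typename Pred>
struct partition_copy_body {
    typedef typename std::iterator_traits<RandomIt>::value_type T;
    Pred pred;
    std::vector<T> yes, no;                      // task-local results

    partition_copy_body(Pred p) : pred(p) {}
    partition_copy_body(partition_copy_body& other, tbb::split)
        : pred(other.pred) {}

    void operator()(const tbb::blocked_range<RandomIt>& r) {
        for (RandomIt it = r.begin(); it != r.end(); ++it)
            (pred(*it) ? yes : no).push_back(*it);
    }

    void join(partition_copy_body& rhs) {
        yes.insert(yes.end(), std::make_move_iterator(rhs.yes.begin()),
                   std::make_move_iterator(rhs.yes.end()));
        no.insert(no.end(), std::make_move_iterator(rhs.no.begin()),
                  std::make_move_iterator(rhs.no.end()));
    }
};

template <typename RandomIt, typename OutIt1, typename OutIt2, typename Pred>
std::pair<OutIt1, OutIt2>
parallel_partition_copy(RandomIt first, RandomIt last,
                        OutIt1 out_true, OutIt2 out_false, Pred pred) {
    partition_copy_body<RandomIt, Pred> body(pred);
    tbb::parallel_reduce(tbb::blocked_range<RandomIt>(first, last), body);
    return std::make_pair(std::copy(body.yes.begin(), body.yes.end(), out_true),
                          std::copy(body.no.begin(), body.no.end(), out_false));
}
```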
Anton
  • My "interleaved partition" idea was an attempt to get around the logarithmic complexity of the merge. Ignoring worst cases, two interleaved slices of an array should typically have similar value distributions. If so, the two resulting partition/middle pointers should end up near each other. The end sections of the original array (begin->middle_low, middle_high->end) will be correctly partitioned and won't need any merging. Only the elements in the middle range (middle_low->middle_high) would need swapping. – atb May 29 '14 at 16:35
  • @atb, if objects are aligned to the cache-line size (for an array/vector) or allocated in separate cache lines (for a list), it will not cause (much) false sharing. Otherwise, if the objects are tightly packed in an array, I'm afraid that interleaved access will cause extensive false-sharing penalties, so the 'constant' complexity will not help because the constant is too big. Interleaving may be useful for read-only structures. – Anton May 29 '14 at 18:04

Your approach should be correct, but why not follow the regular divide-and-conquer (or parallel_for) method? For two threads:

  1. split the array in two. Turn your [start, end) into [start, middle), [middle, end).
  2. run std::partition on both ranges in parallel.
  3. merge the partitioned results. This can be done with a parallel_for.

This should make better use of the cache.
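A sketch of this, assuming random-access iterators (`two_way_partition` is an illustrative name, and the serial std::swap_ranges in step 3 could itself be done with a parallel_for over chunks):

```cpp
#include <algorithm>
#include <cstddef>
#include <tbb/parallel_invoke.h>

template <typename RandomIt, typename Pred>
RandomIt two_way_partition(RandomIt first, RandomIt last, Pred pred) {
    RandomIt middle = first + (last - first) / 2;  // step 1: split in two
    RandomIt p1, p2;
    // Step 2: partition both halves in parallel.
    tbb::parallel_invoke(
        [&] { p1 = std::partition(first, middle, pred); },
        [&] { p2 = std::partition(middle, last, pred); });
    // Step 3: merge. The only misplaced elements are the "false" tail of the
    // left half, [p1, middle), and the "true" head of the right half,
    // [middle, p2); swapping min of the two lengths fixes both regions.
    const std::size_t m = std::min(static_cast<std::size_t>(middle - p1),
                                   static_cast<std::size_t>(p2 - middle));
    std::swap_ranges(p1, p1 + m, p2 - m);
    return p1 + (p2 - middle);                     // the merged middle
}
```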

Adam
  • I didn't like this approach because I thought it would require too much swapping during the merge; on average, half the array may need to get swapped. Merging interleaved arrays is much simpler because the order within the partitioned segments is arbitrary, so most of the elements can be left alone; only the middle of the array needs to be cleaned up. But you're right, it might not be cache-friendly... – atb May 28 '14 at 23:59
  • Yes, but that swap is very fast. The interleaving guarantees non-consecutive accesses, which reduce your cache hits. I don't know how you plan to implement a striding iterator, but the only approaches I can think of will make the code uglier and make the compiler's job of optimizing your code harder. A work-efficient approach is not always the fastest. You asked for opinions, and that's mine. – Adam May 29 '14 at 00:12
  • if the merge is needed, `parallel_reduce` fits better than `parallel_for` – Anton May 29 '14 at 04:29
  • @Anton huh? You have two ranges and you need to swap their contents. How is that a reduction? – Adam May 29 '14 at 05:05
  • If you need two subranges how do you get them if not in parallel_reduce? Join operation of it guarantees that no other thread processes the same subrange – Anton May 29 '14 at 05:17
  • @Anton a reduce takes a range and reduces it to a single value. How is that useful for the merge here? When we say merge, we mean we want to combine two partitioned lists into a single partitioned list. – Adam May 29 '14 at 06:43
  • what you need is a parallel `std::swap_ranges` (i.e. a parallel_for). – Adam May 29 '14 at 07:12
  • Ah, if you want to apply parallel_for only for swap_ranges, that's OK (for random-access iterators). I was thinking of a more high-level approach. E.g., divide-and-conquer is done using parallel_reduce (note, it's not only about reducing something; it's generally about binary splitting and joining of tasks in parallel), and swap_ranges is done using a nested parallel_for. The join phase is exactly point (3) in your answer. Point 2 is recursive, thus there will be a lot of overlapping subranges in the join phase (for subsequent joins). – Anton May 29 '14 at 13:50
  • @Anton "divide-and-conquer is done using parallel_reduce" I think you have that backwards. If not, can you provide a link? – Adam May 29 '14 at 18:43
  • @Adam, sorry, link to what? I think documentation is clear enough, just pay attention to `Body::Body( Body&, split )` and `Body::join( Body& rhs )` and the following description. And if you have a specific question, please ask it separately (here or on the TBB forum) – Anton May 30 '14 at 04:58
  • None of which applies to the merge step. But thanks for the condescending tone. Goodbye. – Adam May 30 '14 at 05:18

It seems to me like this should parallelize nicely, any thoughts before I try it?

Well... maybe a few:

  • There's no real reason to create more tasks than you have cores. Since your algorithm is recursive, you also need to take care not to create additional threads after you reach your limit, because it will just be needless effort.
  • Keep in mind that splitting and merging the arrays costs you processing power, so choose the split size in a way that won't actually slow your calculations down. Splitting a 10-element array can be tempting, but won't get you where you want to be. Since the complexity of std::partition is linear, it's fairly easy to overestimate the speed of the task.

Since you asked and gave an algorithm, I hope you actually need parallelization here. If so, there's not much to add; the algorithm itself looks really fine :)

Paweł Stawarz
  • I thought TBB liked having more tasks than threads so it can do better load balancing? But I agree about the lower limit; I tried to hint at that in step 1. – atb May 29 '14 at 00:02
  • @atb the library can't do miracles. If your CPU can handle `X` tasks at once (so, __simplifying__, it has `X` cores), it won't do more, no matter what library you use. I'm not a specialist when it comes to Intel's TBB; maybe it takes care of needless threads for you. That's something to be found in the manuals. – Paweł Stawarz May 29 '14 at 00:07
  • @atb is right, there *is* a reason for creating more tasks: it helps to load-balance the work. Even for what looks like well-balanced work it still makes sense, since cache misses, interrupts, and preemptions create imbalance. Of course, creating too many tasks can induce excessive overheads, so TBB recommends keeping tasks at a minimum of 10K clock cycles. Also, `auto_partitioner` does a good job of finding the golden mean for the number of tasks produced by `parallel_for` or `parallel_reduce`. – Anton May 29 '14 at 04:39