I have a source of files that I need to process. From each file, my code generates a variable number of data objects; call it N. I have K processing objects that can be used to process the N data objects.
I'm thinking of doing the following using TBB's flow graph (tbb::flow):
- Create a function_node with concurrency K and put my K processing objects into a concurrent_queue.
- Use an input_node to read a file, generate the N data objects, and try_put each one into the function_node.
- The function_node body dequeues a processing object, uses it to process a data object, then returns the processing object to the concurrent_queue when done (see the sketch after this list).
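Here is a minimal sketch of what I have in mind, assuming hypothetical DataObject and Processor types and a stub generator in place of the real file reading. I've connected the nodes with make_edge rather than explicit try_put calls, and the generator emits one object per invocation:

```cpp
#include <tbb/flow_graph.h>
#include <tbb/concurrent_queue.h>
#include <memory>
#include <vector>

struct DataObject { /* hypothetical data type */ };
struct Processor  { void process(DataObject&) { /* hypothetical work */ } };

int main() {
    const std::size_t K = 4;                  // number of processing objects
    std::vector<std::unique_ptr<Processor>> owners;
    tbb::concurrent_queue<Processor*> pool;   // idle processing objects
    for (std::size_t i = 0; i < K; ++i) {
        owners.push_back(std::make_unique<Processor>());
        pool.push(owners.back().get());
    }

    tbb::flow::graph g;

    // Stub generator: stands in for reading a file and producing N objects.
    int produced = 0;
    tbb::flow::input_node<DataObject> reader(g,
        [&](tbb::flow_control& fc) -> DataObject {
            if (produced++ >= 100) { fc.stop(); return {}; }
            return DataObject{};
        });

    // Concurrency K guarantees at most K bodies run at once, so try_pop
    // always finds a free processing object.
    tbb::flow::function_node<DataObject> worker(g, K,
        [&](DataObject d) -> tbb::flow::continue_msg {
            Processor* p = nullptr;
            pool.try_pop(p);
            p->process(d);
            pool.push(p);                     // return it for reuse
            return {};
        });

    tbb::flow::make_edge(reader, worker);
    reader.activate();
    g.wait_for_all();
}
```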
Another way I can think of would be:
- Create a function_node with serial concurrency.
- Use an input_node to read a file, generate the N data objects, collect them into a container, and send the whole batch to the function_node.
- At the function_node, partition the N objects into K ranges and use each of the K processing objects to process its range concurrently; I'm not sure if it is possible to customize parallel_for for this purpose (a rough sketch follows this list).
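One way to get this effect without customizing parallel_for is to iterate over the K processor indices instead of the N items, so each task owns one processor and one contiguous slice. A rough sketch, again with hypothetical DataObject and Processor types and a hypothetical process_batch helper:

```cpp
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

struct DataObject { /* hypothetical data type */ };
struct Processor  { void process(DataObject&) { /* hypothetical work */ } };

// Hypothetical helper: process N items with K processors, one per slice.
void process_batch(std::vector<DataObject>& items,
                   std::vector<Processor>& procs) {
    const std::size_t K = procs.size();
    const std::size_t N = items.size();
    tbb::parallel_for(std::size_t(0), K, [&](std::size_t k) {
        // Task k owns processor k and the k-th slice of the data.
        const std::size_t begin = k * N / K;
        const std::size_t end   = (k + 1) * N / K;
        for (std::size_t i = begin; i < end; ++i)
            procs[k].process(items[i]);
    });
}
```

The downside is that the slices are fixed up front, so one slow slice can leave the other processors idle; a concurrent_queue of processors combined with a parallel_for over the N items would load-balance better.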
The advantage of the first method is probably lower latency, because I can start sending data objects through the dataflow the moment they are generated rather than having to wait for all N data objects to be generated.
What do you think is the best way to go about parallelizing this processing?