I have this code:
```raku
# Grab Nutrients.csv from https://data.nal.usda.gov/dataset/usda-branded-food-products-database/resource/c929dc84-1516-4ac7-bbb8-c0c191ca8cec
my @nutrients = "/path/to/Nutrients.csv".IO.lines;
for @nutrients.race {
    # Naive CSV parsing: split each row on the quoted-field separator
    my @data = .split('","');
    # Print rows where the nutrient is Protein, the value exceeds 70,
    # and the unit starts with "g"
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
};
```
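The non-race version I am comparing against is the same loop with `.race` dropped, nothing else changed:

```raku
# Sequential baseline: identical filter, no .race
for @nutrients {
    my @data = .split('","');
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
};
```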
Nutrients.csv is a 174 MB file with lots of rows. Non-trivial work is done on every row, and there is no data dependency between rows. However, the `race` version takes circa 54 s, while the non-race version takes 43 s, about 20% less. Any idea why that happens? Is the per-row work here still too light for data parallelism to pay off? I have only seen it help with very heavy operations, like checking whether a number is prime. If so, is there a ballpark figure for how much work needs to be done per piece of data before data parallelism becomes worthwhile?
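I also wondered whether batching is the problem, since `race` accepts `:batch` and `:degree` named arguments. Here is a sketch of what I mean; the values 2048 and 4 are arbitrary guesses on my part, not measured optima:

```raku
# Same filter, with explicit race tuning knobs.
# batch => items handed to each worker at a time (guess: 2048),
# degree => number of parallel workers (guess: 4).
my @nutrients = "/path/to/Nutrients.csv".IO.lines;
@nutrients.race(batch => 2048, degree => 4).map: {
    my @data = .split('","');
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
};
```

Is tuning these the expected way to make such a lightweight per-row operation profitable, or is the overhead structural?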