
Update:

I tried @Adam's approach, which did the trick: using 8 cores, processing took only 20 minutes. This is great so far.

But I ran into another problem: foreach's %dopar% operator replicates the workspace once per registered worker (n copies for n cores). That isn't problematic per se, but with a 20 GB data set it turns ugly quickly.

So a limitation to the original question: Can this be done without having the whole data set in memory forked n times?
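
One idea (a sketch only, untested on the real data; the names `duplicates`, `myFunction` and `myOtherArgument` refer to the objects defined in the Situation section below) is to bundle the groups into one batch per worker and iterate over the batches, so that foreach serializes only each batch to its worker instead of every worker inheriting the whole 20 GB workspace. This assumes myFunction accepts a single group's tibble as its first argument:

library(dplyr)
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

# one tibble per `name` group, then bundled into one batch per worker
groups  <- duplicates %>% group_by(name) %>% group_split()
batches <- split(groups, cut(seq_along(groups), 8, labels = FALSE))

myTibble <- foreach(batch = batches, .combine = bind_rows,
                    .packages = "dplyr") %dopar% {
    # `batch` is the only large object shipped to this worker
    bind_rows(lapply(batch, myFunction, myOtherArgument = myOtherArgument))
}

stopCluster(cl)

The master process still holds the full data set, but each worker only ever receives its own batch.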

Situation:

I need to process a large tibble (>30 million rows). The tibble is grouped, and a function is called on each group.

As a first step, I reduce the data set by extracting only those rows that share a value in the column `name`, like so:

duplicates <- data[duplicated(data$name, fromLast = FALSE) |
                   duplicated(data$name, fromLast = TRUE), ]
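
The same reduction can also be written with data.table (suggested in the comments below), which tends to be faster at this scale. A sketch, assuming `data` fits in memory as a data.table:

library(data.table)

DT <- as.data.table(data)
# same filter: keep rows whose `name` occurs more than once
duplicates <- DT[duplicated(name) | duplicated(name, fromLast = TRUE)]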

Right after that, I process these duplicates as stated above:

myTibble <- duplicates %>%
    group_by(name) %>%
    myFunction(., myOtherArgument = myOtherArgument) %>%
    ungroup()
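
If myFunction can operate on a plain data frame holding one group's rows and returns something column-like (a data.frame or list), the grouping itself could also be pushed into data.table, along the lines of this sketch:

library(data.table)

DT <- as.data.table(duplicates)
# one myFunction() call per `name` group; .SD holds that group's rows
myTibble <- DT[, myFunction(.SD, myOtherArgument = myOtherArgument), by = name]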

Grouping the duplicates with group_by typically yields more than 500k groups, each of which is processed by myFunction(). Right now this takes about 90 minutes, which is okay, but far from satisfying.

Question:

Is there a way to speed up the processing of these groups so that this task takes significantly less than 90 minutes?

(including but not restricted to multiprocessing)

Christian
  • You can probably speed it up if you use the `data.table` package. You could also try `dtplyr`, which is supposed to use the same syntax as `dplyr` but translates everything to `data.table`. – Fino Dec 03 '19 at 15:24
  • Can also maybe try using `group_split()` to split the `tibble` into a list split on the groups. Then pass the list into a parallel framework such as `foreach` with `%dopar%`. –  Dec 03 '19 at 15:30
  • @Adam This helped, but I ran into another problem. I've updated the question accordingly. Thank you regardless! – Christian Dec 04 '19 at 13:57
  • Maybe try breaking it up into batches? So a loop within a loop? That way only a subset gets passed into each parallel iteration? –  Dec 04 '19 at 15:48
  • Can you confirm that most of the time is taken by running `myFunction()`? Do you use a `summarize()`? – F. Privé Dec 05 '19 at 06:12

0 Answers