
Update:

I tried @Adam's approach, which did the trick: using 8 cores, processing took only 20 minutes. This is great so far.

But I ran into another problem: foreach's %dopar% operator replicates the workspace once per registered worker (n copies for n cores). That isn't problematic per se, but with a 20 GB data set it turns ugly quickly.

So a limitation to the original question: Can this be done without having the whole data set in memory forked n times?
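
One idea (a sketch only, untested on the real data; the names `duplicates`, `myFunction` and `myOtherArgument` refer to the objects defined in the Situation section below) is to bundle the groups into one batch per worker and iterate over the batches, so that foreach serializes only each batch to its worker instead of every worker inheriting the whole 20 GB workspace. This assumes myFunction accepts a single group's tibble as its first argument:

library(dplyr)
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

# one tibble per `name` group, then bundled into one batch per worker
groups  <- duplicates %>% group_by(name) %>% group_split()
batches <- split(groups, cut(seq_along(groups), 8, labels = FALSE))

myTibble <- foreach(batch = batches, .combine = bind_rows,
                    .packages = "dplyr") %dopar% {
    # `batch` is the only large object shipped to this worker
    bind_rows(lapply(batch, myFunction, myOtherArgument = myOtherArgument))
}

stopCluster(cl)

The master process still holds the full data set, but each worker only ever receives its own batch.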

Situation:

I need to process a large tibble (>30 million rows). The tibble is grouped, and a function is called on each group.

As a first step, I reduce the data set by extracting only those rows that share a value in the column `name`, like so:

duplicates <- data[duplicated(data$name, fromLast = FALSE) |
                   duplicated(data$name, fromLast = TRUE), ]
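
The same reduction can also be written with data.table (suggested in the comments below), which tends to be faster at this scale. A sketch, assuming `data` fits in memory as a data.table:

library(data.table)

DT <- as.data.table(data)
# same filter: keep rows whose `name` occurs more than once
duplicates <- DT[duplicated(name) | duplicated(name, fromLast = TRUE)]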

Right after that, I process these duplicates as stated above:

myTibble <- duplicates %>%
    group_by(name) %>%
    myFunction(., myOtherArgument = myOtherArgument) %>%
    ungroup()
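
If myFunction can operate on a plain data frame holding one group's rows and returns something column-like (a data.frame or list), the grouping itself could also be pushed into data.table, along the lines of this sketch:

library(data.table)

DT <- as.data.table(duplicates)
# one myFunction() call per `name` group; .SD holds that group's rows
myTibble <- DT[, myFunction(.SD, myOtherArgument = myOtherArgument), by = name]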

Grouping the duplicates with group_by typically yields more than 500k groups, each of which is processed by myFunction(). Right now this takes about 90 minutes, which is okay, but far from satisfying.

Question:

Is there a way to speed up the processing of these groups so that this task takes significantly less than 90 minutes?

(including but not restricted to multiprocessing)

Christian
  • You can probably speed it up if you use the `data.table` package. You could also try `dtplyr`, which is supposed to use the same syntax as `dplyr` but translates everything to `data.table`. – Fino Dec 03 '19 at 15:24
  • Can also maybe try using `group_split()` to split the `tibble` into a list split on the groups. Then pass the list into a parallel framework such as `foreach` with `%dopar%`. –  Dec 03 '19 at 15:30
  • @Adam This helped, but I ran into another problem. I've updated the question accordingly. Thank you regardless! – Christian Dec 04 '19 at 13:57
  • Maybe try breaking it up into batches? So a loop within a loop? That way only a subset gets passed into each parallel iteration? –  Dec 04 '19 at 15:48
  • Can you confirm that most of the time is taken by running `myFunction()`? Do you use a `summarize()`? – F. Privé Dec 05 '19 at 06:12

0 Answers