Update:
I tried @Adam's approach, which did the trick. Using 8 cores, processing only took 20 minutes. This is great so far.
But I ran into another problem: `foreach`'s `%dopar%` operator forks the workspace n times (once for each registered core). That isn't problematic per se, but with a 20 GB data set this will turn ugly quickly.

So, a limitation to the original question: can this be done without having the whole data set forked in memory n times?
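For reference, a simplified sketch of the kind of `foreach`/`%dopar%` setup in question (the cluster type and the chunking of groups into one task per worker are illustrative choices here, not the exact code from @Adam's answer):

```r
library(foreach)
library(doParallel)
library(dplyr)

n_workers <- 8
cl <- makeCluster(n_workers)
registerDoParallel(cl)

# One tibble per group, bundled round-robin into one chunk per worker, so that
# each task only carries its share of the rows.
groups <- duplicates %>% group_by(name) %>% group_split()
chunks <- split(groups, rep_len(seq_len(n_workers), length(groups)))

result <- foreach(chunk = chunks, .combine = bind_rows,
                  .packages = "dplyr") %dopar% {
  bind_rows(chunk) %>%
    group_by(name) %>%
    myFunction(., myOtherArgument = myOtherArgument) %>%
    ungroup()
}

stopCluster(cl)
```

Even in this form, everything referenced inside the loop body (and, with a fork-based backend, the whole workspace) ends up duplicated per worker, which is exactly the memory issue described above.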
Situation:
I need to process a large tibble (>30 million rows). The tibble is being grouped and a function is called for each group.
As a first step, I reduce the data set by extracting only those rows that share a value in the `name` column, like so:
```r
duplicates <- data[duplicated(data$name, fromLast = FALSE) |
                   duplicated(data$name, fromLast = TRUE), ]
```
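For what it's worth, the same reduction can also be written in dplyr syntax; this is just an equivalent formulation of the `duplicated()` filter above:

```r
library(dplyr)

# Keep every row whose name occurs more than once.
duplicates <- data %>%
  group_by(name) %>%
  filter(n() > 1) %>%
  ungroup()
```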
Right after that, I process these duplicates as stated above:
```r
myTibble <- duplicates %>%
  group_by(name) %>%
  myFunction(., myOtherArgument = myOtherArgument) %>%
  ungroup()
```
Grouping the duplicates with `group_by()` typically results in more than 500k groups, each of which is processed by `myFunction()`. Right now this processing takes about 90 minutes, which is okay but far from satisfying.
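In case it helps to see the shape of the work: `myFunction()` essentially applies a transformation per group, which could equally be spelled with `group_modify()` (`myGroupFunction()` below is just a placeholder for that per-group transformation):

```r
library(dplyr)

# group_modify() calls the supplied function once per group; myGroupFunction()
# stands in for whatever per-group work myFunction() does internally.
myTibble <- duplicates %>%
  group_by(name) %>%
  group_modify(~ myGroupFunction(.x, myOtherArgument = myOtherArgument)) %>%
  ungroup()
```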
Question:
Is there a way to speed up the processing of these groups so that this task takes significantly less than 90 minutes? (Including, but not restricted to, multiprocessing.)