df %>% split(.$x)
becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then call split on each subset, we reduce the time by at least an order of magnitude.
library(dplyr)
library(microbenchmark)
library(caret)   # createFolds()
library(purrr)   # flatten()

N <- 10^6        # number of rows
groups <- 10^5   # number of unique values of x

df <- data.frame(x = sample(1:groups, N, replace = TRUE),
                 y = sample(letters, N, replace = TRUE))

ids <- df$x %>% unique
folds10  <- createFolds(ids, 10)   # partition the unique ids into 10 subsets
folds100 <- createFolds(ids, 100)  # partition the unique ids into 100 subsets
Running microbenchmark on the three approaches gives the following mean times (a sketch of the call appears after the results):
l1 <- df %>% split(.$x)
l2 <- lapply(folds10, function(id) df %>%
        filter(x %in% id) %>% split(.$x)) %>% flatten
l3 <- lapply(folds100, function(id) df %>%
        filter(x %in% id) %>% split(.$x)) %>% flatten

## Unit: seconds
## expr      mean
##   l1 242.11805
##   l2  50.45156
##   l3  12.83866
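For completeness, a microbenchmark call of roughly this shape produces the table above (the times and unit arguments here are only illustrative; more repetitions give more stable means):

microbenchmark(
  l1 = df %>% split(.$x),
  l2 = lapply(folds10,  function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten,
  l3 = lapply(folds100, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten,
  times = 1, unit = "s"
)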
Is split not designed for a large number of groups? Are there any alternatives besides the manual initial subsetting?
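For example, would data.table be the recommended route here? Its split method can split by a column name directly, so something like the sketch below (not benchmarked on this data) would avoid the manual fold step:

library(data.table)
dt <- as.data.table(df)
# split by the x column directly; returns a named list of data.tables
l4 <- split(dt, by = "x")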
My laptop is a MacBook Pro (late 2013), 2.4 GHz, 8 GB RAM.