I have a problem where I need to select and save part of a table based on one column, then eliminate rows from the source table whose values match one of the columns of the saved table.
I've found that dplyr and data.table are both slower than base R for this, and I'm wondering whether I'm doing something wrong (an anti-pattern I'm not aware of?) or whether someone knows a faster way to do it.
I need to scale it up to ~10 million rows in the search df and ~10k iterations of the y_unique search.
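To make the operation concrete, here's a toy sketch (made-up values) of what a single iteration does:

# toy sketch of one iteration: save all rows where y == 1,
# then drop every row whose x appears in that saved set
df    <- data.frame(x = c(1, 2, 2, 3), y = c(1, 1, 2, 3))
saved <- df[df$y == 1, ]            # rows (x = 1, y = 1) and (x = 2, y = 1)
df    <- df[!df$x %in% saved$x, ]   # also drops (x = 2, y = 2); only (x = 3, y = 3) remains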
Here's a reasonable reproducible example...
(Edit: I realized that what I was doing could be achieved with a group filter. I'm leaving an updated reproducible example with some tweaks from the comments below, plus my updated solution. Note that the original question did not include the bind_rows(y_list) detail; in retrospect, I should have included that in this example.)
library(dplyr)
library(data.table)
library(microbenchmark)
microbenchmark(base = {
  # base R: save the rows for this y, then drop rows whose x already appeared
  for (y_check in y_unique) {
    y_list[[as.character(y_check)]] <- df[df$y == y_check, ]
    df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
  }
  out <- bind_rows(y_list)
}, dplyr = {
  # dplyr: same loop, but the row selection uses filter()
  for (y_check in y_unique) {
    y_list[[as.character(y_check)]] <- filter(df, y == y_check)
    df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
  }
  out <- bind_rows(y_list)
}, data.table = {
  # data.table: same loop on a data.table copy of df
  for (y_check in y_unique) {
    y_list[[as.character(y_check)]] <- dt[y == y_check]
    dt <- dt[!x %in% y_list[[as.character(y_check)]]$x]
  }
  out <- do.call(rbind, y_list)
}, alternate = {
  # my updated solution: keep, for each x, only the rows with the smallest y
  df <- group_by(df, x)
  out <- filter(df, y == min(y))
}, times = 10, setup = {
  set.seed(1)
  df <- data.frame(x = sample(1:1000, size = 1000, replace = TRUE),
                   y = sample(1:100, size = 1000, replace = TRUE))
  dt <- data.table(df)
  y_unique <- sort(unique(df$y))
  y_list <- setNames(rep(list(list()), length(y_unique)), y_unique)
})
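For comparison, I think the same group filter can also be written directly in data.table or base R. This is just a sketch, assuming fresh df and dt objects as created in the setup block above; note the row order differs from the loop-based out, which is grouped by y.

# sketch: the group filter without dplyr
out_dt   <- dt[, .SD[y == min(y)], by = x]            # data.table: by-group subset via .SD
out_base <- df[df$y == ave(df$y, df$x, FUN = min), ]  # base R: per-x minimum via ave()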
I get:
Unit: milliseconds
       expr        min        lq       mean     median        uq        max neval
       base  12.939135  13.22883  13.623098  13.500897  13.95468  14.517167    10
      dplyr  41.517351  42.22595  50.041123  45.199978  61.33194  65.927611    10
 data.table 228.014360 233.98309 248.281965 240.172383 263.39943 287.706941    10
  alternate   3.310031   3.42016   3.745013   3.454537   4.17488   4.497455    10
On my real data I see more or less the same pattern: base R is 2+ times faster than dplyr, and data.table is... slow. Any ideas?