So the first thing I'd like to note here is that collapse is not, and probably never will be, a full-blown split-apply-combine computing tool like dplyr or data.table. Its focus is not on optimally executing arbitrary code expressions by groups, but on providing advanced and highly efficient grouped, weighted, time-series and panel-data computations through the broad range of C++-based statistical and data transformation functions it provides. I refer you to the vignette on collapse and data.table for further clarity on these points, as well as for integration examples.
Accordingly, I think it only makes sense to translate data.table code to collapse if:

1. you've come up with an arcane data.table expression to do something statistically complex that it is not good at, such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, or grouped centering or linear / polynomial fitting (see the sketch below);
2. you don't actually need the data.table object but would much rather work with vectors / matrices / data frames / tibbles;
3. you want to write a statistical program and would much prefer standard-evaluation programming over non-standard evaluation and data.table syntax; or
4. collapse is indeed substantially faster for your specific application.
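To illustrate the first point: grouped and weighted statistics that take some effort in data.table are one-liners in collapse. A minimal sketch with made-up data (the vectors x, w and g below are purely illustrative):

library(collapse)
set.seed(1)
x <- rnorm(100)                       # values
w <- runif(100)                       # sampling weights
g <- sample(LETTERS[1:5], 100, TRUE)  # grouping vector

fmean(x, g = g, w = w)    # weighted mean by groups
fmedian(x, g = g, w = w)  # weighted median by groups
fmode(x, g = g, w = w)    # weighted statistical mode by groups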
Now to the specific code you have provided. It mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you 3 solutions, ranging from full non-standard evaluation to fully standard-evaluation base R style programming.
library(data.table)
library(nycflights13)
library(magrittr)
library(collapse)
data("flights")
flights_DT <- as.data.table(flights)
# Defining a function for the second aggregation
myFUN <- function(x) (cumsum(x[1:2])/sum(x))[2L]
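# Given values sorted in decreasing order, myFUN returns the share of the two
# largest values in the total, e.g. myFUN(c(5, 3, 2)) gives (5 + 3) / 10 = 0.8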
# Solution 1: Non-standard evaluation
flights_DT %>%
  fgroup_by(carrier, month, day) %>%
  fsummarise(agg_val_var = fsum(abs(arr_delay))) %>%
  roworder(month, day, -agg_val_var, na.last = NA) %>%
  fgroup_by(month, day) %>%
  fsummarise(value_share = myFUN(agg_val_var)) %>%
  roworder(-value_share)
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note the use of na.last = NA, which actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE).
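A quick demonstration of that difference (the collapse convention is that all-missing input stays missing):

fsum(NA_real_)               # NA: collapse keeps all-missing input as NA
sum(NA_real_, na.rm = TRUE)  # 0: base R drops the NA and sums an empty vector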
Now the hybrid example, which is probably closest to the code you provided:
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
# Solution 2: Hybrid approach with standard eval and magrittr pipes
flights_DT %>%
  get_vars(c(id_var, val_var, by)) %>%
  ftransformv(val_var, abs) %>%
  collapv(c(id_var, by), fsum) %>%
  get_vars(c(by, val_var)) %>%
  roworderv(decreasing = c(FALSE, FALSE, TRUE), na.last = NA) %>%
  collapv(by, myFUN) %>%
  roworderv(val_var, decreasing = TRUE) %>%
  frename(replace, names(.) == val_var, "value_share")
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard evaluation in the same function in collapse; the frename call simply applies base R's replace() to the column names. Finally, a big advantage of collapse is that you can use it for pretty low-level programming:
# Solution 3: Programming
# Select the needed columns and take absolute values of the value variable
data <- get_vars(flights_DT, c(id_var, val_var, by))
data[[val_var]] <- abs(.subset2(data, val_var))
# First aggregation: sum the value variable by carrier, month and day
g <- GRP(data, c(id_var, by))
data <- add_vars(get_vars(g$groups, by),
                 fsum(get_vars(data, val_var), g, use.g.names = FALSE))
# Sort by month and day (ascending) and the aggregated value (descending),
# removing missing values
data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
# Second aggregation: apply myFUN to the value column by month and day
g <- GRP(data, by)
data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
data <- roworderv(data, "value_share", decreasing = TRUE)
data
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
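The key to this style of programming is the GRP object: GRP() computes the grouping once, and g$groups stores the unique groups, so the same object can be passed to any number of fast functions without re-grouping. A tiny sketch:

g <- GRP(flights_DT, c("month", "day"))
str(g$groups)  # one row per group; reusable across fsum(), fmean(), BY(), ...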
I refer you to the blog post on programming with collapse for a more interesting example of how this can benefit the development of statistical code.
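Since solutions 2 and 3 use only standard evaluation, they also wrap naturally into functions taking column names as strings. I don't show the wrappers here; as a minimal sketch, clp_Prog() could simply wrap Solution 3 with the column names as arguments:

clp_Prog <- function(val_var = "arr_delay", id_var = "carrier", by = c("month", "day")) {
  data <- get_vars(flights_DT, c(id_var, val_var, by))
  data[[val_var]] <- abs(.subset2(data, val_var))
  g <- GRP(data, c(id_var, by))
  data <- add_vars(get_vars(g$groups, by),
                   fsum(get_vars(data, val_var), g, use.g.names = FALSE))
  data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
  g <- GRP(data, by)
  data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
  roworderv(data, "value_share", decreasing = TRUE)
}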
Now for the evaluation, I wrapped these solutions in functions (as sketched above for clp_Prog()), where DT() is the data.table code you provided, run with 2 threads on a Windows machine. This checks equality:
all_obj_equal(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> TRUE
Now the benchmark:
library(microbenchmark)
microbenchmark(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq       max neval cld
#>          DT() 85.81079 87.80887 91.82032 89.47025 92.54601 132.26073   100   b
#>     clp_NSE() 13.47535 14.15744 15.99264 14.80606 16.29140  28.16895   100  a
#>  clp_Hybrid() 13.79843 14.23508 16.61606 15.00196 16.83604  32.94648   100  a
#>    clp_Prog() 13.71320 14.17283 16.16281 14.94395 16.16935  39.24706   100  a
If you care about these milliseconds, feel free to optimize further, but for ~340,000 obs all solutions are bloody fast.