
I read about the collapse package recently and tried to translate the following data.table code to collapse to see if it's faster in real-world examples.

Here's my data.table code:

library(data.table)
library(nycflights13)

data("flights")
flights_DT <- as.data.table(flights)

val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")

flights_DT[
  # total absolute arrival delay per carrier, month and day
  j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)), 
  keyby = c(id_var, by)
][
  # per day: carriers enter in order of decreasing delay, cumulative share of the daily total
  i = order(-agg_val_var), 
  j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)), 
  keyby = by
][
  # keep the second row per day, i.e. the share of the two largest carriers
  j = .SD[2L],
  keyby = by
][
  order(-value_share)
]
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---                      
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286

Created on 2021-03-11 by the reprex package (v1.0.0)

I managed to translate the first data.table call, but struggled later on.

It would be great to see how collapse would be used to handle this use case.


1 Answer


So on this the first thing I'd like to note is that collapse is not, and probably never will be, a full-blown split-apply-combine computing tool like dplyr or data.table. Its focus is not on optimally executing arbitrary code expressions by groups, but on providing advanced and highly efficient grouped, weighted, time series and panel data computations through the broad range of C++-based statistical and data transformation functions it provides. I refer to the vignette on collapse and data.table for further clarity on these points as well as integration examples.

Accordingly, I think it only makes sense to translate data.table code to collapse if (1) you've come up with an arcane data.table expression for something complex and statistical that data.table is not good at (such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, grouped centering or linear / polynomial fitting), (2) you actually don't need the data.table object but would much rather work with vectors / matrices / data.frames / tibbles, (3) you want to write a statistical program and would much prefer standard-evaluation programming over NS-eval and data.table syntax, or (4) collapse is indeed substantially faster for your specific application.
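
To give a flavour of what these fast grouped and weighted functions look like, here is a small illustration using the built-in mtcars data; this example is not part of the original question or answer, just a sketch of the kind of call collapse is built for:

library(collapse)

fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt)  # weighted group means
fmode(mtcars$mpg, g = mtcars$cyl)                 # most frequent value per group
head(fwithin(mtcars$mpg, g = mtcars$cyl))         # grouped centering (within-transformation)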

Now to the specific code you have provided: it mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you three solutions, ranging from full non-standard evaluation to fully standard-evaluated, base R style programming.

library(data.table)
library(nycflights13)
library(magrittr)
library(collapse)

data("flights")
flights_DT <- as.data.table(flights)

# Function for the second aggregation: with x sorted in decreasing order,
# this returns the cumulative share of the two largest values in the total
myFUN <- function(x) (cumsum(x[1:2])/sum(x))[2L]

# Solution 1: Non-standard evaluation
flights_DT %>%
  fgroup_by(carrier, month, day) %>% 
  fsummarise(agg_val_var = fsum(abs(arr_delay))) %>% 
  roworder(month, day, -agg_val_var, na.last = NA) %>%
  fgroup_by(month, day) %>%
  fsummarise(value_share = myFUN(agg_val_var)) %>% 
  roworder(-value_share)
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---                      
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286

Created on 2021-03-12 by the reprex package (v0.3.0)

Note the use of na.last = NA, which actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE).
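
A small illustration of that difference (not part of the original answer):

sum(NA_real_, na.rm = TRUE)  # 0: base R drops the NA and sums nothing
fsum(NA_real_)               # NA: an all-NA group stays NA in collapse

Now the hybrid example, which is probably closest to the code you provided: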

val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")

# Solution 2: Hybrid approach with standard eval and magrittr pipes
flights_DT %>%
  get_vars(c(id_var, val_var, by)) %>%                            # select only the needed columns
  ftransformv(val_var, abs) %>%                                   # absolute value of the value column
  collapv(c(id_var, by), fsum) %>%                                # aggregate (sum) by carrier, month, day
  get_vars(c(by, val_var)) %>%                                    # drop the id column
  roworderv(decreasing = c(FALSE, FALSE, TRUE), na.last = NA) %>% # sort, removing missing values
  collapv(by, myFUN) %>%                                          # second aggregation by month, day
  roworderv(val_var, decreasing = TRUE) %>%                       # sort by the computed share
  frename(replace, names(.) == val_var, "value_share")            # rename the result column
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---                      
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286

Created on 2021-03-12 by the reprex package (v0.3.0)

Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard eval in the same function in collapse. Finally, a big advantage of collapse is that you can use it for pretty low-level programming:

# Solution 3: Programming
data <- get_vars(flights_DT, c(id_var, val_var, by))
data[[val_var]] <- abs(.subset2(data, val_var))
# first aggregation: sum the value column by carrier, month and day
g <- GRP(data, c(id_var, by))
data <- add_vars(get_vars(g$groups, by), 
                 fsum(get_vars(data, val_var), g, use.g.names = FALSE))
data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
# second aggregation: add the value share computed by day to the grouping columns
g <- GRP(data, by)
data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
data <- roworderv(data, "value_share", decreasing = TRUE)
data
#>      month day value_share
#>   1:    10   3   0.5263012
#>   2:     1  24   0.5045664
#>   3:     1  20   0.4885145
#>   4:    10  17   0.4870692
#>   5:     3   6   0.4867606
#>  ---                      
#> 361:     5   4   0.3220295
#> 362:     6  15   0.3205974
#> 363:     1  28   0.3197260
#> 364:    11  25   0.3161550
#> 365:     6  14   0.3128286

Created on 2021-03-12 by the reprex package (v0.3.0)

I refer you to the blog post on programming with collapse for a more interesting example of how this can benefit the development of statistical code.

Now for the evaluation, I wrapped these solutions in functions, where DT() is the data.table code you provided, run with 2 threads on a Windows machine.
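
The wrapper functions themselves are not shown; a minimal sketch of how DT() and clp_NSE() might look, simply reusing the code from above (clp_Hybrid() and clp_Prog() wrap Solutions 2 and 3 in the same way), is:

DT <- function() {
  flights_DT[
    j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)), 
    keyby = c(id_var, by)
  ][
    i = order(-agg_val_var), 
    j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)), 
    keyby = by
  ][
    j = .SD[2L],
    keyby = by
  ][
    order(-value_share)
  ]
}

clp_NSE <- function() {
  flights_DT %>%
    fgroup_by(carrier, month, day) %>% 
    fsummarise(agg_val_var = fsum(abs(arr_delay))) %>% 
    roworder(month, day, -agg_val_var, na.last = NA) %>%
    fgroup_by(month, day) %>%
    fsummarise(value_share = myFUN(agg_val_var)) %>% 
    roworder(-value_share)
}

This checks equality: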

all_obj_equal(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> TRUE

Now the benchmark:

library(microbenchmark)
microbenchmark(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq       max neval cld
#>          DT() 85.81079 87.80887 91.82032 89.47025 92.54601 132.26073   100   b
#>     clp_NSE() 13.47535 14.15744 15.99264 14.80606 16.29140  28.16895   100  a 
#>  clp_Hybrid() 13.79843 14.23508 16.61606 15.00196 16.83604  32.94648   100  a 
#>    clp_Prog() 13.71320 14.17283 16.16281 14.94395 16.16935  39.24706   100  a

If you care about these milliseconds, feel free to optimize, but for 340,000 obs all solutions are bloody fast.

Sebastian
  • Thanks for the very comprehensive answer, especially for the explanatory remarks at the beginning. Based on this, I don't think `collapse` is a natural fit for my use case. But I'll keep it in mind and hopefully will give it a try later. – der_grund Mar 13 '21 at 10:17