
dplyr is not liking my large dataset, so I'm trying to convert the following simple code to the most efficient data.table equivalent:

library(tidyverse)
data(iris)
iris$year <- rep(c(2000, 3000), each = 25) 
iris$color <- rep(c("red", "green","blue"), each = 50) 
iris$letter <- as.factor(rep(c("A", "B", "C"), each = 50)) 
head(iris, 3)

iris %>% 
  group_by(Species, year) %>% 
  summarise(across(c(-Sepal.Length, -Sepal.Width), dplyr::first), 
            across(c(Sepal.Length, Sepal.Width), dplyr::last)) %>% 
  ungroup

However, my attempt gives the wrong result and also doesn't name the output columns:

library(data.table)
final <- setDT(iris)[, c(
  lapply(setdiff(names(iris), c("Sepal.Length", "Sepal.Width")), head, 1), 
  lapply(c("Sepal.Length", "Sepal.Width"), tail, 1)
), by = c("Species", "year")]
final
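
I think the issue is that lapply() here loops over the column names (character strings) rather than the columns themselves, so head/tail just return those strings and the result has no names. Subsetting .SD by the names seems to fix it (a rough sketch, with nm1/nm2 just illustrative helper vectors):

# sketch of a fix: subset .SD by the column names so lapply sees the actual
# columns, not the name strings; lapply then keeps the column names
nm1 <- setdiff(names(iris), c("Sepal.Length", "Sepal.Width", "Species", "year"))
nm2 <- c("Sepal.Length", "Sepal.Width")
setDT(iris)[, c(
  lapply(.SD[, nm1, with = FALSE], head, 1),
  lapply(.SD[, nm2, with = FALSE], tail, 1)
), by = c("Species", "year")]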

Maybe there is a quicker/better data.table approach?

thanks

EDIT

When I let the above dplyr code run on my real data (~3 million rows, 80 columns), I ran into memory problems. It ran in RStudio for about 15 hours before aborting. summarise with across has been known to be slower than summarise_at (see here), but I thought they should be the same now. Using the solutions below and taking the first 20000 rows of my real dataset, I ran microbenchmark with times = 10 and got:

#NOTE this is on my real dataset so not reproducible 
microbenchmark(datatable <- as.data.table(real_data)[, c(lapply(.SD[, nm1, with = FALSE], first),
                                                         lapply(.SD[, nm2, with = FALSE], last)), .( Species, year)],
               
               collapse_package <- collap(real_data, ~  Species + year, custom = list(ffirst = nm1, flast = nm2)),
               
               sqldf_df <- fn$sqldf("
 with first_data as (select min(rowid), $byVar, $firstVar from real_data group by $byVar),
      last_data as (select max(rowid), $byVar, $lastVar from real_data group by $byVar)
 select $byVar, $firstVar, $lastVar from first_data left join last_data using($byVar)
", dbname = tempfile()),
               sqldf_df_no_dbname <- fn$sqldf("
 with first_data as (select min(rowid), $byVar, $firstVar from real_data group by $byVar),
      last_data as (select max(rowid), $byVar, $lastVar from real_data group by $byVar)
 select $byVar, $firstVar, $lastVar from first_data left join last_data using($byVar)
"),
               dplyr_sum_across <- real_data %>% 
                 group_by(Species, year) %>%  
                 summarise(
                   across(c(-Sepal.Length, -Sepal.Width), dplyr::first), 
                   across(c(Sepal.Length, Sepal.Width), dplyr::last)), times = 10)

#                expr        min         lq        mean     median         uq        max neval cld
#           datatable  9664.3822  9974.6145 10211.00909 10130.2571 10438.7439 10872.2079    10  b
#    collapse_package     4.9311     5.0039     5.10331     5.0677     5.1597     5.5432    10  a
#            sqldf_df   394.3706   395.7660   403.82425   399.2484   401.9162   450.3884    10  a
#  sqldf_df_no_dbname   374.9822   380.2022   385.52904   382.6653   387.7198   402.9556    10  a
#    dplyr_sum_across 23969.3657 25055.5517 25800.82757 25653.1470 26262.3583 27616.5212    10  c

The library(collapse) run times are very impressive! Nice overview here.
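
For reference, since a single summarise_at() call applies one set of functions to one set of columns, the summarise_at formulation mentioned above would need two grouped passes joined on the grouping variables. A rough sketch (column names assumed to match the iris example above, tidyverse loaded):

# sketch of the superseded summarise_at style: first of one column set,
# last of the other, computed separately and joined back together
first_cols <- c("Petal.Length", "Petal.Width", "color", "letter")
last_cols  <- c("Sepal.Length", "Sepal.Width")

left_join(
  iris %>% group_by(Species, year) %>% summarise_at(first_cols, dplyr::first),
  iris %>% group_by(Species, year) %>% summarise_at(last_cols,  dplyr::last),
  by = c("Species", "year")
)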

user63230

2 Answers

5

With data.table, we can use

nm1 <- c("Petal.Length", "Petal.Width", "color", "letter")
nm2 <- c("Sepal.Length", "Sepal.Width")
as.data.table(iris)[, c(lapply(.SD[, nm1, with = FALSE], first),
     lapply(.SD[, nm2, with = FALSE], last)), .(Species, year)]

-output

#      Species year Petal.Length Petal.Width color letter Sepal.Length Sepal.Width
#1:     setosa 2000          1.4         0.2   red      A          4.8         3.4
#2:     setosa 3000          1.6         0.2   red      A          5.0         3.3
#3: versicolor 2000          4.7         1.4 green      B          6.4         2.9
#4: versicolor 3000          4.4         1.4 green      B          5.7         2.8
#5:  virginica 2000          6.0         2.5  blue      C          6.7         3.3
#6:  virginica 3000          6.0         1.8  blue      C          5.9         3.0

Or another option is collapse

library(collapse)
collap(iris, ~ Species + year, custom = list(ffirst = nm1, flast = nm2))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species year color letter
#1          4.8         3.4          1.4         0.2     setosa 2000   red      A
#2          5.0         3.3          1.6         0.2     setosa 3000   red      A
#3          6.4         2.9          4.7         1.4 versicolor 2000 green      B
#4          5.7         2.8          4.4         1.4 versicolor 3000 green      B
#5          6.7         3.3          6.0         2.5  virginica 2000  blue      C
#6          5.9         3.0          6.0         1.8  virginica 3000  blue      C
 
akrun
  • thanks, I presume there will be no issues if groupings have only one row with this approach? – user63230 Mar 24 '21 at 18:55
  • 1
    @user63230 It is a standard approach to use `lapply` with `by`. Only difference is that we are combining two sets of columns by two calls to lapply and then concatenating (`c`) the output – akrun Mar 24 '21 at 18:57
  • it's still running on my real dataset, which is surprising as it's not _that_ big (~3 million rows, 80 columns). I have a feeling there may be a problem with my data. I have to go now but I will come back to it – user63230 Mar 24 '21 at 19:21
  • @user63230 Try with a single `lapply` and check if it is fast or not i.e. `out1 <- as.data.table(iris)[, lapply(.SD, first), .(Species, year), .SDcols = nm1]` and same with `out2 <- as.data.table(iris)[, lapply(.SD, last), .(Species, year), .SDcols = nm2]` and then do a join `out1[out2, (nm2) := mget(nm2), on = .(Species, year)]` (put together as a runnable sketch below these comments) – akrun Mar 24 '21 at 19:24
  • @user63230 can you also try with `collapse` (updated solution). It could be faster – akrun Mar 24 '21 at 19:28
  • 1
    `collapse` is brilliant, see timings above. I'm surprised the `dplyr` version would not even finish. – user63230 Mar 26 '21 at 16:47
  • @user63230 data.table authors are also contributing to collapse. – akrun Mar 26 '21 at 20:17
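
Assembling the suggestion from that comment into one block (a sketch; nm1/nm2 as defined in the answer above, and the final update join is taken verbatim from the comment):

library(data.table)
nm1 <- c("Petal.Length", "Petal.Width", "color", "letter")
nm2 <- c("Sepal.Length", "Sepal.Width")

# one lapply per column set: first value of nm1 columns, last value of nm2 columns
out1 <- as.data.table(iris)[, lapply(.SD, first), .(Species, year), .SDcols = nm1]
out2 <- as.data.table(iris)[, lapply(.SD, last),  .(Species, year), .SDcols = nm2]

# add the nm2 columns onto out1 with an update join on the grouping columns
out1[out2, (nm2) := mget(nm2), on = .(Species, year)]
out1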
2

1) sqldf: sqldf can perform the calculation outside of R by specifying an external database file (see the dbname= argument used below), so that R memory limitations on intermediate calculations do not affect it. You can also try it without the dbname= argument in case you actually do have enough memory. The query relies on SQLite's rule that, in a grouped select containing min() or max(), the bare (non-aggregated) columns are taken from the row where that min or max occurs, so first_data returns each group's first row and last_data its last.

library(sqldf)

# enclose each argument in [...] and then create comma separated string
varString <- function(...) toString(sprintf("[%s]", c(...)))

firstVar <- varString("Petal.Length", "Petal.Width", "color", "letter")
lastVar <- varString("Sepal.Length", "Sepal.Width")
byVar <- varString("Species", "year")

fn$sqldf("
 with first_data as (select min(rowid), $byVar, $firstVar from iris group by $byVar),
      last_data as (select max(rowid), $byVar, $lastVar from iris group by $byVar)
 select $byVar, $firstVar, $lastVar from first_data left join last_data using($byVar)
", dbname = tempfile())

giving:

     Species year Petal.Length Petal.Width color letter Sepal.Length Sepal.Width
1     setosa 2000          1.4         0.2   red      A          4.8         3.4
2     setosa 3000          1.6         0.2   red      A          5.0         3.3
3 versicolor 2000          4.7         1.4 green      B          6.4         2.9
4 versicolor 3000          4.4         1.4 green      B          5.7         2.8
5  virginica 2000          6.0         2.5  blue      C          6.7         3.3
6  virginica 3000          6.0         1.8  blue      C          5.9         3.0

2) Base R: This solution only uses base R. The line computing a unique key for each combination of Species and year might need to be modified for other data (see the note after the code).

key <- as.integer(iris$Species) + as.integer(iris$year)  # unique per Species/year group here
i <- !duplicated(key)                   # first row of each group
j <- !duplicated(key, fromLast = TRUE)  # last row of each group
data.frame(
  iris[i, c("Species", "year", "Petal.Length", "Petal.Width")], 
  iris[j, c("Sepal.Length", "Sepal.Width")]
)
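
If the integer-sum key could collide on other data (two different Species/year combinations adding to the same number), one alternative is to build the key directly from the observed combinations, e.g. with interaction():

# collision-free group key: one factor level per observed Species/year combination
key <- interaction(iris$Species, iris$year, drop = TRUE)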
G. Grothendieck
  • thanks for this solution, see timings above – user63230 Mar 26 '21 at 16:48
  • Note that the question stated that the problem was the size of the data, not the speed. – G. Grothendieck Mar 26 '21 at 18:54
  • I thought it was the size of the dataset that was causing `dplyr` to be slow. After running for 15+ hours, `rstudio` aborted. When I tried a slightly simpler version, with just one call to `summarize_at`, it slowed my whole computer down and I had to kill it. No such issues with `data.table`, `collapse` or `sqldf` – user63230 Mar 26 '21 at 19:00
  • 1
    Have added a base R solution. – G. Grothendieck Mar 26 '21 at 23:23