dplyr
is not liking my large dataset so I'm try to convert the following simple code to the most efficient data.table
equivalent:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000, 3000), each = 25)
iris$color <- rep(c("red", "green","blue"), each = 50)
iris$letter <- as.factor(rep(c("A", "B", "C"), each = 50))
head(iris, 3)
iris %>%
group_by(Species, year) %>%
summarise(across(c(-Sepal.Length, -Sepal.Width), dplyr::first),
across(c(Sepal.Length, Sepal.Width), dplyr::last)) %>%
ungroup
However my effort is giving me the wrong solution and is also not naming columns:
library(data.table)
final <- setDT(iris)[, c(
lapply(setdiff(names(iris), c("Sepal.Length", "Sepal.Width")), head, 1),
lapply(c("Sepal.Length", "Sepal.Width"), tail, 1)
), by = c("Species", "year")]
final
Maybe there is a quicker/better data.table
approach?
thanks
EDIT
When I let the above dplyr
code run on my real data (~3million rows, 80 columns), I ran into memory problems. It ran in rstudio
for about 15hours before aborting. summarise
with across
has been known to be slower than summarise_at
(see here) but I thought they should be the same now. Using the solutions below and taking the first 20000
rows of my real dataset, I microbenchmark
times = 10
and got:
#NOTE this is on my real dataset so not reproducible
microbenchmark(datatable <- as.data.table(real_data)[, c(lapply(.SD[, nm1, with = FALSE], first),
lapply(.SD[, nm2, with = FALSE], last)), .( Species, year)],
collapse_package <- collap(real_data, ~ Species + year, custom = list(ffirst = nm1, flast = nm2)),
sqldf_df <- fn$sqldf("
with first_data as (select min(rowid), $byVar, $firstVar from real_data group by $byVar),
last_data as (select max(rowid), $byVar, $lastVar from real_data group by $byVar)
select $byVar, $firstVar, $lastVar from first_data left join last_data using($byVar)
", dbname = tempfile()),
sqldf_df_no_dbname <- fn$sqldf("
with first_data as (select min(rowid), $byVar, $firstVar from real_data group by $byVar),
last_data as (select max(rowid), $byVar, $lastVar from real_data group by $byVar)
select $byVar, $firstVar, $lastVar from first_data left join last_data using($byVar)
"),
dplyr_sum_across <- real_data %>%
group_by(Species, year) %>%
summarise(
across(c(-Sepal.Length, -Sepal.Width), dplyr::first),
across(c(Sepal.Length, Sepal.Width), dplyr::last)), times = 10)
# min lq mean median uq max neval cld
# datatable 9664.3822 9974.6145 10211.00909 10130.2571 10438.7439 10872.2079 10 b
# collapse_package 4.9311 5.0039 5.10331 5.0677 5.1597 5.5432 10 a
# sqldf_df 394.3706 395.7660 403.82425 399.2484 401.9162 450.3884 10 a
# sqldf_df_no_dbname 374.9822 380.2022 385.52904 382.6653 387.7198 402.9556 10 a
# dplyr_sum_across 23969.3657 25055.5517 25800.82757 25653.1470 26262.3583 27616.5212 10 c
library(collapse)
run times are very impressive! Nice overview here.