Another option is the collapse package, a C/C++-based package for data manipulation whose syntax is similar to dplyr's. In the following example, fgroup_by, fsummarise, and fndistinct from collapse perform the same tasks as group_by, summarise, and n_distinct from dplyr.
library(dplyr)
library(collapse)
dat <- read.table(textConnection("ID city company
1 NYC ABC
2 BOS ABC
1 NYC DEF
3 SEA GHI"), header = TRUE)
dat %>%
  fgroup_by(ID, city) %>%
  fsummarise(count = fndistinct(company))
#   ID city count
# 1  1  NYC     2
# 2  2  BOS     1
# 3  3  SEA     1
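As a side note, dplyr is not strictly required for the collapse version. A minimal sketch (assuming R >= 4.1 for the native |> pipe) produces the same result with only collapse loaded:
library(collapse)
dat |>
  fgroup_by(ID, city) |>
  fsummarise(count = fndistinct(company))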
Below is a speed comparison; I adapted the example data and code from @Maurits Evers.
library(dplyr)
library(data.table)
library(collapse)
library(ggplot2)
library(microbenchmark)
# Sample data
df <- data.frame(
  ID = sample(1:10, 10000, replace = TRUE),
  city = sample(c("NYC", "BOS", "SEA"), 10000, replace = TRUE),
  company = sample(c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"), 10000, replace = TRUE)
)
dt <- as.data.table(df)
res <- microbenchmark(
  dplyr = df %>%
    group_by(ID, city) %>%
    summarise(count = n_distinct(company), .groups = "drop"),
  datatable = dt[, .(count = uniqueN(company)), by = c("ID", "city")],
  collapse = df %>%
    fgroup_by(ID, city) %>%
    fsummarise(count = fndistinct(company))
)
res
# Unit: microseconds
#       expr      min       lq     mean   median        uq       max neval cld
#      dplyr 4885.501 5776.400 6658.784 6108.901 7073.8015 16177.301   100   c
#  datatable 2705.501 3173.351 3500.120 3332.501 3643.5510  9076.502   100  b
#   collapse  283.001  347.251  482.880  402.901  452.2505  7740.401   100 a
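Since ggplot2 is already loaded, you can also visualize the timing distributions using the autoplot method that microbenchmark provides for ggplot2 (a quick sketch; the plot itself is not shown here):
autoplot(res)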
