
I have a data frame `df` of around 10 million employees. Each employee has an ID, and a city variable and a company variable show where they work:

ID city company
1  NYC  ABC
2  BOS  ABC
1  NYC  DEF
3  SEA  GHI

I want to `group_by` ID and city and then count the number of companies each employee works for:

ID city count
1  NYC  2
2  BOS  1
3  SEA  1

My code is `df %>% group_by(ID, city) %>% summarize(count = n_distinct(company))`. However, this is taking a long time to run. My setup typically handles computationally heavy operations, but I am having a hard time running this chunk of code.

Does anyone have a different (ideally faster) implementation?

Edit: `n_distinct(company)` instead of `n_distinct(ID)`

questionmark
  • Use `data.table` if `dplyr` isn't fast enough. Or `dtplyr` to split the difference (a sketch of the `dtplyr` route follows these comments). – Gregor Thomas Sep 30 '22 at 02:43
  • Also note that if you are grouping by ID each group will only have 1 ID, so `n_distinct(ID)` will always be 1. I assume there's a `company` column missing in your question? – Gregor Thomas Sep 30 '22 at 02:44
  • And lastly, don't use `distinct` unless you need to. Do employees ever have multiple entries for the same company in the same city? If not, you can use `n()`, which will be much faster than `n_distinct()`. – Gregor Thomas Sep 30 '22 at 02:46
  • @GregorThomas you're right, it should read `count = n_distinct(company)`. Unfortunately, there are some duplicates, so I'm afraid I'll need to use `n_distinct()`. How would you use data.table in this case? – questionmark Sep 30 '22 at 02:50
  • I haven't benchmarked this but you could try `uniqueN`, e.g. `library(data.table); setDT(df); df[, .(count = uniqueN(company)), by = c("ID", "city")]` – Maurits Evers Sep 30 '22 at 03:03
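
For reference, here is a minimal sketch of the dtplyr route mentioned in the first comment above: it keeps the dplyr syntax while translating the pipeline to data.table under the hood. This is an illustrative sketch using the column names from the question, not a benchmarked solution.

# Sketch: dtplyr translates dplyr verbs into data.table code
library(dplyr)
library(dtplyr)
library(data.table)

df %>%
    lazy_dt() %>%                                # wrap the data frame for lazy translation
    group_by(ID, city) %>%
    summarise(count = n_distinct(company)) %>%   # translated to a data.table aggregation
    as_tibble()                                  # force evaluation and return a tibble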

2 Answers


Here is a benchmark comparison of a data.table and dplyr solution using a larger (and hopefully representative?) dataset.

# Sample data
df <- data.frame(
    ID = sample(c(1:10), 10000, replace = TRUE),
    city = sample(c("NYC", "BOS", "SEA"), 10000, replace = TRUE),
    company = sample(c("ABC", "DEF", "GHI","JKL", "MNO", "PQR", "STU"), 10000, replace = TRUE))

library(dplyr)
library(data.table)
dt <- as.data.table(df)

library(ggplot2)        # needed for autoplot() below
library(microbenchmark)
res <- microbenchmark(
    dplyr = df %>% 
        group_by(ID, city) %>% 
        summarise(count = n_distinct(company), .groups = "drop"),
    datatable = dt[, .(count = uniqueN(company)), by = c("ID", "city")]
)
res
#Unit: milliseconds
#     expr      min       lq      mean   median        uq      max neval
#    dplyr 6.843204 7.696959 10.110256 9.915225 10.906205 41.35438   100
#datatable 1.893994 2.255023  2.924953 2.738450  3.395504  7.61165   100

autoplot(res)

[plot: autoplot() of the benchmark results]

Maurits Evers
  • Very interesting, I'm glad my hunch was right and it *really* was slower than it should be. Thanks for the benchmark comparison! – questionmark Sep 30 '22 at 03:34
  • It'd be interesting to run this on better & more representative data. I think the sample data I created is too simplistic. Might be that for larger & more complex data the `data.table` solution is even faster (which is what I somehow expected...). A sketch with larger data follows below. – Maurits Evers Sep 30 '22 at 03:38
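
Picking up on the comment above, here is a hedged sketch that generates data closer to the scale described in the question and reruns the same benchmark. The row count, ID range, city set, and company labels are assumptions, not taken from the original data, and no timings are claimed here.

# Sketch: rerun the benchmark on larger, more question-like data (assumed shape)
library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(1)
n <- 1e7
df_big <- data.frame(
    ID      = sample(1e6, n, replace = TRUE),
    city    = sample(c("NYC", "BOS", "SEA", "CHI", "LAX"), n, replace = TRUE),
    company = sample(paste0("company_", 1:5000), n, replace = TRUE))
dt_big <- as.data.table(df_big)

# times kept small because each dplyr run can take a while at this scale
microbenchmark(
    dplyr     = df_big %>%
        group_by(ID, city) %>%
        summarise(count = n_distinct(company), .groups = "drop"),
    datatable = dt_big[, .(count = uniqueN(company)), by = c("ID", "city")],
    times = 5
)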

We have another option: the collapse package, a C/C++-based package for data manipulation whose syntax is similar to dplyr. In the following example, fgroup_by, fsummarise, and fndistinct are from the collapse package and perform the same tasks as group_by, summarise, and n_distinct from dplyr.

library(dplyr)
library(collapse)

dat <- read.table(textConnection("ID city company
1  NYC  ABC
2  BOS  ABC
1  NYC  DEF
3  SEA  GHI"), header = TRUE)

dat %>%
  fgroup_by(ID, city) %>%
  fsummarise(count = fndistinct(company))
#   ID city count
# 1  1  NYC     2
# 2  2  BOS     1
# 3  3  SEA     1  

Below is a speed comparison. I modified the examples and code from @Maurits Evers.

library(dplyr)
library(data.table)
library(collapse)
library(ggplot2)
library(microbenchmark)

# Sample data
df <- data.frame(
  ID = sample(c(1:10), 10000, replace = TRUE),
  city = sample(c("NYC", "BOS", "SEA"), 10000, replace = TRUE),
  company = sample(c("ABC", "DEF", "GHI","JKL", "MNO", "PQR", "STU"), 10000, replace = TRUE))

dt <- as.data.table(df)

res <- microbenchmark(
  dplyr = df %>% 
    group_by(ID, city) %>% 
    summarise(count = n_distinct(company), .groups = "drop"),
  datatable = dt[, .(count = uniqueN(company)), by = c("ID", "city")],
  collapse = df %>%
    fgroup_by(ID, city) %>%
    fsummarise(count = fndistinct(company))
)
res
# Unit: microseconds
#      expr      min       lq     mean   median        uq       max neval cld
#     dplyr 4885.501 5776.400 6658.784 6108.901 7073.8015 16177.301   100   c
# datatable 2705.501 3173.351 3500.120 3332.501 3643.5510  9076.502   100  b 
#  collapse  283.001  347.251  482.880  402.901  452.2505  7740.401   100 a

[plot of the benchmark results]

www