Use `data.table` to get first of subgroup based on a variable

Question

Consider a data set consisting of a grouping variable (here `id`) and an ordered variable (here `date`):

(df <- data.frame(
  id = rep(1:2,2),
  date = 4:1
))
#   id date
# 1  1    4
# 2  2    3
# 3  1    2
# 4  2    1

I'm wondering what the easiest way is in data.table to do the equivalent of this dplyr code:

library(dplyr)
df %>%
  group_by(id) %>%
  filter(min_rank(date)==1)
# Source: local data frame [2 x 2]
# Groups: id
# 
#   id date
# 1  1    2
# 2  2    1

i.e. for each id get the first according to date.

Based on a similar Stack Overflow question (Create an "index" for each element of a group with data.table), I came up with this:

library(data.table)
dt <- data.table(df)
setkey(dt, id, date)
for(k in unique(dt$id)){
  dt[id==k, index := 1:.N]
}
dt[index==1,]

But it seems like there should be a one-liner for this. Being unfamiliar with data.table I thought something like this

dt[,,mult="first", by=id]

should work, but alas! The last bit of code seems like it should group by id and then take the first (which within id would be determined by date since I've set the keys in this way.)

EDIT

Thanks to Ananda Mahto, this one-liner will now be in my data.table repertoire

dt[,.SD[1], by=id]
#    id date
# 1:  1    2
# 2:  2    1

Would something like this work for you: `setkey(as.data.table(df), id, date)[, .SD[1], by = id]`? – A5C1D2H2I1M1N2O1R2T1, Nov 06 '14 at 10:46
@AnandaMahto, Ok, added a benchmark on a huge data set, and you are right, everything performs more or less the same, except that binary search is a bit better – David Arenburg, Nov 06 '14 at 11:29
@DavidArenburg, are you sure it's not just `sequence` that is slowing things down :-) – A5C1D2H2I1M1N2O1R2T1, Nov 06 '14 at 13:34
@AnandaMahto, it may be, because for different `seed`s I'm getting different results too. Either way, it is a good thing to introduce binary search to `data.table` users – David Arenburg, Nov 06 '14 at 13:35

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-11-06T11:09:40.493

Working directly with your source data.frame, you can try:

setkey(as.data.table(df), id, date)[, .SD[1], by = id]
#    id date
# 1:  1    2
# 2:  2    1
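
If setting a key (and therefore physically re-sorting the table) is not wanted, the same idea can be sketched by ordering in `i` instead. This is only a minimal sketch, assuming the `df` from the question; `first_by_order` and `first_by_unique` are illustrative names, not part of the original answer.

```r
library(data.table)

# Example data from the question
df <- data.frame(id = rep(1:2, 2), date = 4:1)

# Sort in `i` (no key is set), then keep the first row of each id
first_by_order <- as.data.table(df)[order(id, date), .SD[1], by = id]

# Alternative: sort first, then keep only the first row per id
first_by_unique <- unique(as.data.table(df)[order(id, date)], by = "id")

first_by_order
```

Both variants leave `df` untouched, which may matter if its original row order is important.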

Extending your original idea, you can just do:

dt <- data.table(df)
setkey(dt, id, date)
dt[, index := sequence(.N), by = id][index == 1]
#    id date index
# 1:  1    2     1
# 2:  2    1     1
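
The helper column can also be avoided. As a sketch (assuming the keyed `dt` built from the question's `df`), `.I[1]` collects the row number of the first row in each group, and those row numbers are then used to subset. `first_rows` is an illustrative name only.

```r
library(data.table)

# Example data from the question, keyed so rows are sorted by id, then date
df <- data.frame(id = rep(1:2, 2), date = 4:1)
dt <- data.table(df)
setkey(dt, id, date)

# .I holds row numbers within dt; .I[1] is the first row of each id group.
# The inner call returns columns id and V1; V1 is used to subset dt.
first_rows <- dt[dt[, .I[1], by = id]$V1]
first_rows
```

This returns only the original columns, since no `index` column is ever added to `dt`.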

It might be that at a certain scale, David is correct about `head` vs `[1]`, but I'm not sure what scale that would be.

set.seed(1)
nrow <- 10000
ncol <- 20

df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))

fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]

library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#  fun1() 12.178189 12.496777 13.400905 12.808523 13.483545 30.28425   100
#  fun2()  4.474345  4.554527  4.948255  4.620596  4.965912  8.17852   100

edited Nov 06 '14 at 11:09

answered Nov 06 '14 at 10:52

A5C1D2H2I1M1N2O1R2T1

Why do you use `.SD` instead of `date`? Why don't you use `setDT`? – Roland Nov 06 '14 at 10:54
@Roland, is it mandatory that I do those things? What if there are more columns other than "date"? What if I don't want to convert my input to a `data.table`? – A5C1D2H2I1M1N2O1R2T1 Nov 06 '14 at 10:56
No, of course not. The question is specific for one column. Why wouldn't you? – Roland Nov 06 '14 at 10:58
`sequence` is an `unlist` and `lapply` wrapper, why not just `seq_len(.N)`? – David Arenburg Nov 06 '14 at 11:00
@Roland, because I strongly feel that SO shouldn't be about just answering a single person's question. I don't always want to be using `setDT` and then `setkey` if I don't have an easy way to go back to the earlier sorting of my `data.frame` (in case that were important). – A5C1D2H2I1M1N2O1R2T1 Nov 06 '14 at 11:00
@DavidArenburg, there is an "edit" link and you are more than welcome to use it. – A5C1D2H2I1M1N2O1R2T1 Nov 06 '14 at 11:02
I don't do that without OPs agreement (unless there is a mistake in the answer) – David Arenburg Nov 06 '14 at 11:03

David Arenburg · Answer 2 · 2014-11-06T11:28:45.607

Here's another option using data.table's binary search:

setkey(dt[, indx := seq_len(.N), by = id], indx)[J(1)]
#    id date indx
# 1:  1    2    1
# 2:  2    1    1
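
For reference, the `mult` argument mentioned in the question applies to joins in `i` rather than to `by` groups. A minimal sketch of a keyed join that uses it, assuming the question's `df` and a key on `id, date` (`first_by_join` is an illustrative name):

```r
library(data.table)

# Example data from the question, keyed so rows are sorted by id, then date
df <- data.frame(id = rep(1:2, 2), date = 4:1)
dt <- data.table(df)
setkey(dt, id, date)

# Join each distinct id against the key (binary search) and keep only the
# first matching row, which has the smallest date because of the key order
first_by_join <- dt[J(unique(dt$id)), mult = "first"]
first_by_join
```

No index column is needed here, because the key already determines which row counts as first within each `id`.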

Some benchmarks: all the methods seem to perform more or less the same, but on a huge data set (1e+06 * 1e+2) binary search wins.

set.seed(1)
nrow <- 1e6
ncol <- 1e2

df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
library(data.table)

funAM1 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
funAM2 <- function() setkey(as.data.table(df), X1, X2)[, index := sequence(.N), by = X1][index == 1]
funDA1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
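# Note: as written, funDA2 keys on X1 (not indx) and skips the X1, X2 sort,
# so J(1) returns the rows where X1 == 1 rather than the first row per group
funDA2 <- function() setkey(as.data.table(df)[, indx := seq_len(.N), by = X1], X1)[J(1)]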

library(microbenchmark)
Res <- microbenchmark(funAM1(), funAM2(), funDA1(), funDA2())
Res
# Unit: milliseconds
#     expr      min       lq   median       uq      max neval
# funAM1() 737.5690 758.3015 771.9344 794.1417 910.1019   100
# funAM2() 631.7822 693.8286 704.6912 729.6960 806.5556   100
# funDA1() 757.0327 772.4353 784.3107 810.0759 938.6344   100
# funDA2() 564.7291 578.1089 587.6470 611.7269 740.4077   100
boxplot(Res)

[Image: boxplot of the benchmark results in `Res`]
