more efficient way to take every nth element in data.table by factors

Question

This thread discussed doing it for a data frame. I want to do something a little more complicated than that:

library(data.table)

dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)), B = rnorm(12, 5, 2))
dt2 <- dt[order(dt$A, dt$B)] # Sorting
# Take the 4th value of B in each group; always shows the factor from A
do.call(rbind, by(
  dt2, dt2$A,
  function(x) data.table(A = x[, A][1], B = x[, B][4])
))
# This is to reply to Vlo's comment below. If I do this, a group with fewer
# than 4 rows comes back with NA in both columns
do.call(rbind, by(dt2, dt2$A, function(x) x[4]))
# Take the max value of B according to each factor A
do.call(rbind, by(dt2, dt2$A, function(x) tail(x, 1)))

What is a more efficient way to do this with native data.table functions?

You should only use `order` within `data.table` scope when you want to sort in decreasing order. In your case, instead of creating `dt2`, just do `setkey(dt, A, B)` and work with your original data — David Arenburg, Aug 07 '14 at 19:46
@DavidArenburg, or if you really want to preserve the original data. Also, there's a new function `setorder` in 1.9.3, with which you can order *by reference* in any order :). — Arun, Aug 07 '14 at 20:13
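A minimal sketch of what the two comments above suggest. The fixed values of B are an assumption for illustration only (the question draws them with rnorm), and `copy()` is used so the two approaches can be shown side by side without altering `dt`.

```r
library(data.table)

# Illustrative data with fixed values (the question uses rnorm instead)
dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)),
                 B = c(5, 3, 4, 8, 6, 7, 5, 2, 9, 1, 4, 6))

# setkey() sorts by reference in ascending order of A, then B,
# and marks those columns as the table's key
by_key <- copy(dt)
setkey(by_key, A, B)

# setorder() also reorders by reference; a leading minus sorts descending
by_order <- copy(dt)
setorder(by_order, A, -B)
```

Both calls modify their table in place, so no second sorted copy like `dt2` is needed unless the original row order has to be preserved.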

Arun · Accepted Answer · 2014-08-07T19:43:28.037

6

In data.table, you can refer to columns as if they are variables within the scope of dt. So, you don't need the $. That is,

dt2 = dt[order(A, B)] # no need for dt$

is sufficient. And if you want the 4th element of B for every group in A:

dt2[, list(B=B[4L]), by=A]
#    A        B
# 1: a       NA
# 2: b 6.579446
# 3: c 6.378689
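The same idea as a reproducible sketch. The fixed values of B and the variable `n` are assumptions made for illustration; the second query, which drops groups that are too short instead of returning NA, is an optional variation and not part of the answer above.

```r
library(data.table)

# Illustrative data with fixed values (the question uses rnorm instead)
dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)),
                 B = c(5, 3, 4, 8, 6, 7, 5, 2, 9, 1, 4, 6))
setkey(dt, A, B)  # sort by A, then B, by reference

n <- 4L

# j is evaluated once per group of A; indexing past the end of B gives NA,
# so group "a" (3 rows) yields NA, "b" yields 8 and "c" yields 6
nth <- dt[, list(B = B[n]), by = A]

# Variation: return nothing for groups with fewer than n rows,
# which removes them from the result
nth_complete <- dt[, if (.N >= n) list(B = B[n]), by = A]
```

Here `.N` is data.table's built-in symbol for the number of rows in the current group.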

Refer to @Vlo's answer for your second question.

From the way you're using data.tables, it seems like you've not gone through any vignettes or talks. It'd be helpful for you to check out the Introduction and the FAQ vignettes or tutorials from the homepage; especially, Matt's @user2014 tutorial amidst others.

edited Aug 07 '14 at 19:43

answered Aug 07 '14 at 19:08

Arun


Thanks for the answer. I went through the FAQ, but did not get it much. The point about applying `list()` command to `j` you showed here clarified lot of things for me. – biocyberman Aug 07 '14 at 19:29
@Arun, why not just `setkey(dt, A, B)`? As the only reason he created `dt2` was in order to sort `dt` – David Arenburg Aug 07 '14 at 19:45
That's true. I have just got that from Matt's presentation that @Arun pointed out. – biocyberman Aug 07 '14 at 19:57

score 3 · Answer 2 · edited Aug 07 '14 at 19:10

The first statement makes no sense to me; here is the second:

# Take the max value of B according to each factor A
dt2[, list(B=max(B)), by=A]
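A reproducible sketch of the grouped maximum, assuming fixed values of B for illustration. It also shows a by-group counterpart to the question's `tail(x, 1)`: once the table is sorted by A and B, the last element of each group is its maximum.

```r
library(data.table)

# Illustrative data with fixed values (the question uses rnorm instead)
dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)),
                 B = c(5, 3, 4, 8, 6, 7, 5, 2, 9, 1, 4, 6))
setkey(dt, A, B)  # sort by A, then B, by reference

# Maximum of B within each group of A; does not depend on the sort order
max_b <- dt[, list(B = max(B)), by = A]

# Last element of B within each group; equals the maximum only because
# the table is sorted ascending by B within A
last_b <- dt[, list(B = B[.N]), by = A]
```

`max(B)` is the safer choice when the sort order is not guaranteed.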

edited Aug 07 '14 at 19:10

Arun


answered Aug 07 '14 at 18:57

Vlo
