This thread has discussed doing this for a data frame. I want to do something a little more complicated than that:

library(data.table)

dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)), B = rnorm(12, 5, 2))
dt2 <- dt[order(dt$A, dt$B)] # Sort by A, then by B

# 4th value of B within each group of A; always shows the factor from A
do.call(rbind, by(
  dt2, dt2$A,
  function(x) data.table(A = x[, A][1], B = x[, B][4])
))

# This is to reply to Vlo's comment below. If I do this, it returns NA in both columns
do.call(rbind, by(dt2, dt2$A, function(x) x[4]))

# Take the max value of B for each factor in A (last row of each sorted group)
do.call(rbind, by(dt2, dt2$A, function(x) tail(x, 1)))

What is a more efficient way to do this with native data.table functions?

biocyberman
  • You should only use `order` within `data.table` scope when you want to sort in decreasing order. In your case, instead of creating `dt2`, just do `setkey(dt, A, B)` and work with your original data – David Arenburg Aug 07 '14 at 19:46
  • @DavidArenburg, or if you really want to preserve the original data. Also, there's a new function `setorder` in 1.9.3, with which you can order *by reference* in any order :). – Arun Aug 07 '14 at 20:13
  • @Arun, you guys never cease to amaze me – David Arenburg Aug 07 '14 at 20:15

2 Answers

In data.table, you can refer to columns as if they are variables within the scope of dt. So, you don't need the $. That is,

dt2 = dt[order(A, B)] # no need for dt$

is sufficient. And if you want the 4th element of B for every group in A:

dt2[, list(B=B[4L]), by=A]
#    A        B
# 1: a       NA
# 2: b 6.579446
# 3: c 6.378689
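
As a minimal sketch of the same idea without the intermediate `dt2`, the table can be sorted by reference and then grouped. This assumes a data.table version that provides `setorder` (mentioned in the comments as new in 1.9.3); the fixed values of `B` below are hypothetical and chosen only so the result is reproducible.

```r
library(data.table)

# Hypothetical fixed data (not the rnorm() values from the question)
dt <- data.table(
  A = c(rep("a", 3), rep("b", 4), rep("c", 5)),
  B = c(3, 1, 2, 8, 5, 7, 6, 12, 9, 11, 10, 13)
)

# Sort by reference on A, then B; no copy such as dt2 is created
setorder(dt, A, B)

# 4th-smallest B in each group of A; a group with fewer than 4 rows yields NA
res <- dt[, list(B = B[4L]), by = A]
print(res)
```

Group `a` has only three rows, so its entry is `NA`, just as in the output above.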

Refer to @Vlo's answer for your second question.

From the way you're using data.tables, it seems you haven't gone through any of the vignettes or talks. It would help to check out the Introduction and FAQ vignettes, or the tutorials linked from the homepage, especially Matt's @user2014 tutorial, among others.

Arun
  • Thanks for the answer. I went through the FAQ, but did not get much out of it. The point about applying the `list()` command to `j` that you showed here clarified a lot of things for me. – biocyberman Aug 07 '14 at 19:29
  • @Arun, why not just `setkey(dt, A, B)`? The only reason he created `dt2` was to sort `dt` – David Arenburg Aug 07 '14 at 19:45
  • That's true. I have just got that from Matt's presentation that @Arun pointed out. – biocyberman Aug 07 '14 at 19:57

The first statement makes no sense to me; here is the second:

# Take the max value of B according to each factor A
dt2[, list(B=max(B)), by=A]
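
A small sketch, using hypothetical data, of how this relates to the `tail(x, 1)` version in the question: `max(B)` does not depend on row order, while taking the last row of each group (written here as `.SD[.N]`) gives the maximum only after the table has been sorted by `A` and `B`.

```r
library(data.table)

# Hypothetical, deliberately unsorted data
dt <- data.table(
  A = c("a", "a", "b", "b", "b"),
  B = c(2, 5, 9, 4, 7)
)

# Group-wise maximum; works regardless of row order
max_b <- dt[, list(B = max(B)), by = A]

# Sort by reference on A, then B, then take the last row of each group
# (.N is the number of rows in the current group)
setkey(dt, A, B)
last_row <- dt[, .SD[.N], by = A]

print(max_b)
print(last_row)
```

The `.SD[.N]` form returns every remaining column of the last row, which matters once the table has more columns than `A` and `B`.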
Vlo