R - data.table by group by - order of keys and missing keys

Question

If I have a data.table

> DT1 <- data.table(A=rep(c('A', 'B'), 3),
                    B=rep(c(1,2,3), 2),
                    val=rnorm(6), key='A,B')
> DT1
   A B        val
1: A 1 -1.6283314
2: B 2  0.5337604
3: A 3  0.9991301
4: B 1  1.1421400
5: A 2  0.1230095
6: B 3  0.4988504

and I want to subset by more than one key, like so:

> DT1[J('A', 1)]                                                               
   A B          val
1: A 1 -0.004898047

However, the join is dependent on the order of the keys, so the value for key A must always come first. This will not work, even if you specify names (either as J() or as a list()):

> DT1[J(1, 'A')]
Error in `[.data.table`(DT1, J(1, "A")) : 
  x.'A' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns.

> DT1[J(B=1, A='A')]
Error in `[.data.table`(DT1, J(B = 1, A = "A")) : 
  x.'A' is a character column being joined to i.'B' which is type 'double'. Character columns must join to factor or character columns.

Is there a syntax where you can do this kind of grouping by i without knowing the order of the keys?

Added: Another use case would be if I wanted to subset by B only and not by A -- is there a way to skip keys in the subsetting? The current answers that create wrapper functions for J don't seem to allow this.

EDIT: Some have mentioned doing it the data.frame way. I know that you can use a vector of logical values to subset, but this is slow as it does a scan of the entire table:

> DT1 <- data.table(A=rep(c(1,2,3), 100000), B=rep(c('A', 'B'), 150000), val=rnorm(300000), key='A,B')
> system.time(DT1[DT1$A==1, DT1$B=="A"])                                       
   user  system elapsed 
  0.080   0.000   0.054 
> system.time(DT1[J(1, 'A')])
   user  system elapsed 
  0.004   0.000   0.004

Some references to related discussions: (1)

Your example does not appear to be reproducible because of `rnorm`. What's wrong with `DT1[DT1$A=='A'&DT1$B==1,]`? — Doctor Dan, Aug 14 '13 at 17:29
Please see my edit -- basically the method is not a join and is much slower. — hgcrpd, Aug 15 '13 at 01:26
to "skip" keys you have to join by *all* previous keys, e.g. `DT1[J(unique(A), 1)]` and I'm guessing at some point the benefits of the join become moot when doing this — eddi, Aug 15 '13 at 12:42

score 2 · Answer 1 · answered Aug 14 '13 at 19:23

2

How about...

myJ <- function(key,...) do.call(data.table,list(...))[,key,with=FALSE]
DT1[myJ(B=1,A='A',key=key(DT1))]

or

myJ1 <- function(...) myJ(key(DT1),...)
DT1[myJ1(B=1,A='A')]

This way, you have to label the items correctly.

answered Aug 14 '13 at 19:23

Frank

66,179
8
96
180

I usually find it convenient to pay attention to the order and save precious keystrokes/clutter by omitting the names. (That is, I use the normal `J` syntax...) – Frank Aug 14 '13 at 19:25

score 2 · Answer 2 · answered Aug 14 '13 at 20:42

2

In the spirit of @Frank's answer, but trying to get the key automagically:

myJ2 = function(...) {
  # 'x' a couple of frames above is where the original data.table sits
  data.table(..., key = key(get('x', parent.frame(n = 3))))
}

DT1[myJ2(B=1, A='A')]
#   A B       val
#1: A 1 0.4328698

answered Aug 14 '13 at 20:42

eddi

49,088
6
104
155

+1. Yeah, that's better. Besides your automagicks, I forgot that "key" could be an argument to `data.table`. – Frank Aug 14 '13 at 21:09

R - data.table by group by - order of keys and missing keys

2 Answers2

Linked