3

I am trying to use data.table for an operation I don't if data.table is designed for. Suppose I have the following

set.seed(1)
id<-rep(LETTERS[1:6], each=3)
event<-rep(letters[1:4], each=3, length.out=18)
order<-rep(1:3, 6)
var<-rnorm(18)

dt<-data.table(id, event, order, var)
setkey(dt, id)

    id event order         var
 1:  A     a     1 -0.62645381
 2:  A     a     2  0.18364332
 3:  A     a     3 -0.83562861
 4:  B     b     1  1.59528080
 5:  B     b     2  0.32950777
 6:  B     b     3 -0.82046838
 7:  C     c     1  0.48742905
 8:  C     c     2  0.73832471
 9:  C     c     3  0.57578135
10:  D     d     1 -0.30538839
11:  D     d     2  1.51178117
12:  D     d     3  0.38984324
13:  E     a     1 -0.62124058
14:  E     a     2 -2.21469989
15:  E     a     3  1.12493092
16:  F     b     1 -0.04493361
17:  F     b     2 -0.01619026
18:  F     b     3  0.94383621

and I need to calculate in a new column the Euclidean distance between the values in var having different id but same event type, maintaining the given order of observations.

I understood that with the grouping by command in the data.table statement I divide the data into chunks which I can use in connection with other chunks from other data.tables but can I do the same thing within the same data.table?

To be clear, this is what I'd like to obtain

    id event order         var   euclid
 1:  A     a     1 -0.62645381 3.097720
 2:  A     a     2  0.18364332 3.097720
 3:  A     a     3 -0.83562861 3.097720
 4:  B     b     1  1.59528080 2.433635
 5:  B     b     2  0.32950777 2.433635
 6:  B     b     3 -0.82046838 2.433635
 7:  C     c     1  0.48742905       NA
 8:  C     c     2  0.73832471       NA
 9:  C     c     3  0.57578135       NA
10:  D     d     1 -0.30538839       NA
11:  D     d     2  1.51178117       NA
12:  D     d     3  0.38984324       NA
13:  E     a     1 -0.62124058 3.097720
14:  E     a     2 -2.21469989 3.097720
15:  E     a     3  1.12493092 3.097720
16:  F     b     1 -0.04493361 2.433635
17:  F     b     2 -0.01619026 2.433635
18:  F     b     3  0.94383621 2.433635

Many thanks!

Riccardo
  • 743
  • 2
  • 5
  • 14
  • What if there's another `event=a`, say: `id=G, event=a, order=1:3`, how would you want to go about it? – Arun Feb 12 '14 at 00:04
  • @Arun Great question. Probably I would like to compute the distance between each possible couple (like A-E, A-G, E-G) and sum all the distances at the end. Do you think it's feasible? – Riccardo Feb 12 '14 at 00:18
  • If you're going to use `rnorm` in your example it's probably wise to use `set.seed` to make it reproducible. I'm also stumped as to how you're calculating your distance in the example. – thelatemail Feb 12 '14 at 00:25
  • @thelatemail You're right, I forgot to `set.seed`. Thank you for pointing that out. To calculate the Euclidean distance, I followed the advice given [here](http://stackoverflow.com/questions/5559384/how-to-find-the-euclidean-distance-of-two-vectors-in-r) – Riccardo Feb 12 '14 at 00:31

1 Answers1

1

Not sure about efficiency, but this should work at a basic level:

dt[, euclid:= dist(xtabs(var ~ id + order, data=.SD)), by=event]

#    id event order         var   euclid
# 1:  A     a     1 -0.62645381 3.097720 # snip
# 4:  B     b     1  1.59528080 2.433635 # snip
# 7:  C     c     1  0.48742905       NA # snip
#10:  D     d     1 -0.30538839       NA # snip
#13:  E     a     1 -0.62124058 3.097720 # snip
#16:  F     b     1 -0.04493361 2.433635 # snip

To account for @Arun's issue where there are 3+ groups, you can sum the dist outputs, though you will get 0 instead of NA's where there is only one group:

dt[, euclid := sum(dist(xtabs(var ~ id + order, data=.SD))), by=event]
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Genius answer. Thank you very much. It also taught me things I hadn't completely understood on the use of `xtabs` – Riccardo Feb 12 '14 at 10:55