
With regard to this post, I have created an example to experiment with linear regression using the data.table package, as follows:

## rm(list=ls()) # anti-social
library(data.table)
set.seed(1011)
DT = data.table(group=c("b","b","b","a","a","a"),
                v1=rnorm(6),v2=rnorm(6), y=rnorm(6))
setkey(DT, group)
ans <- DT[,as.list(coef(lm(y~v1+v2))), by = group]

which returns:

   group (Intercept)        v1        v2
1:     a    1.374942 -2.151953 -1.355995
2:     b   -2.292529  3.029726 -9.894993

I am able to obtain the coefficients of the lm function.

My question is: how can we use predict directly on new observations? Suppose the new observations are as follows:

new <- data.table(group=c("b","b","b","a","a","a"),v1=rnorm(6),v2=rnorm(6))

I have tried:

setkey(new, group)
DT[,predict(lm(y~v1+v2), new), by = group]

but it returns unexpected results (12 rows instead of 6):

    group         V1
 1:     a  -2.525502
 2:     a   3.319445
 3:     a   4.340253
 4:     a   3.512047
 5:     a   2.928245
 6:     a   1.368679
 7:     b  -1.835744
 8:     b  -3.465325
 9:     b  19.984160
10:     b -14.588933
11:     b  11.280766
12:     b  -1.132324

Thank you

newbie

1 Answer


You are predicting on the entire new data set within each group, which is why each group returns six values instead of three. If you want to predict only on the new data for each group, you need to subset the "newdata" by group.
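As a quick check of this diagnosis, the sketch below rebuilds the question's tables and counts, per group, how many rows go into the fit versus how many predictions come back. The column names `n_fit` and `n_pred` and the variable `check` are made up for illustration.

```r
library(data.table)

# Rebuild the question's data (same seed and construction as above)
set.seed(1011)
DT <- data.table(group = c("b", "b", "b", "a", "a", "a"),
                 v1 = rnorm(6), v2 = rnorm(6), y = rnorm(6))
new <- data.table(group = c("b", "b", "b", "a", "a", "a"),
                  v1 = rnorm(6), v2 = rnorm(6))
setkey(DT, group)
setkey(new, group)

# n_fit  : rows used to fit the model in the current group
# n_pred : predictions returned when the whole of `new` is passed as newdata
check <- DT[, .(n_fit = .N,
                n_pred = length(predict(lm(y ~ v1 + v2), new))),
            by = group]
check
```

Each group fits on 3 rows but predicts for all 6 rows of `new`, which accounts for the 12-row result in the question.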

This is an instance where .BY will be useful. Here are two possibilities:

a <- DT[,predict(lm(y ~ v1 + v2), new[.BY]), by = group]

b <- new[,predict(lm(y ~ v1 + v2, data = DT[.BY]), newdata=.SD),by = group]

Both give identical results:

identical(a,b)
# [1] TRUE
a
#   group         V1
#1:     a  -2.525502
#2:     a   3.319445
#3:     a   4.340253
#4:     b -14.588933
#5:     b  11.280766
#6:     b  -1.132324
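To make the role of `.BY` more concrete, here is a minimal sketch on two small made-up tables (`fit_data` and `new_data` are hypothetical names, not part of the original example). It assumes both tables are keyed on the grouping column, as in the code above, so that using `.BY` as `i` performs a keyed join.

```r
library(data.table)

# Hypothetical toy tables, both keyed on the grouping column
fit_data <- data.table(group = c("b", "b", "a", "a"), x = 1:4, key = "group")
new_data <- data.table(group = c("a", "b", "b"), x = c(10, 20, 30), key = "group")

# .BY is a named list holding the current value of each `by` variable
by_values <- fit_data[, .(by_value = .BY$group, n_rows = .N), by = group]
by_values

# Passing .BY as `i` joins on the key of new_data, so only the rows
# belonging to the current group are selected
matched <- fit_data[, .(n_new = nrow(new_data[.BY])), by = group]
matched
```

Here `new_data[.BY]` plays the same role as `new[.BY]` in the answer: within each group it returns only the rows of the other table whose key matches the current group value.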
mnel
  • Nice. I knew that was the problem, I just couldn't sort out how to fix it. `.BY` is a new one for me. – thelatemail May 30 '14 at 06:08
  • @thelatemail - this is the first time I've managed to find a use for `.BY` – mnel May 30 '14 at 06:31
  • @mnel I'm new to data.table. I have read the .BY, but still don't get how it works. Could you explain ? – newbie May 30 '14 at 09:21
  • @newbie `.BY` is described in the help for `data.table` (`?data.table`). `.BY` is a list containing the values of the by variables. This means it can be used to join with other keyed data.tables to select the rows which match the current `BY` grouping. – mnel Jun 01 '14 at 22:54
  • 1
    Andrew Brooks wrote [a great article](http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/#calculate-a-function-over-a-group-using-by-excluding-each-entity-in-a-second-category) on data.table special symbols' usage, including .BY. It's a good read for those wanting to understand those methods better. – altabq Apr 20 '18 at 09:48