2

For a model I'm building, I want to create multiple lag terms for every field/vector in my data table:

For example, with the following data table:

    a<-c('x','x','x','y','y','y')  
    b<-runif(6, min=0, max=20)  
    c<-runif(6, min=50, max=1000)  
    df<-as.data.table(data.frame(a,b,c))    

I can use the following code to create 2 lag terms for variable b within each group a:

    df[,c(paste("b","_L",1:2,sep="")):=lapply(1:2, function(i) c(rep(NA, i),head(b, -i))),by=a]

However, my problem comes when I try to apply this code to a large data table (100+ variables), I would not want to repeat 100+ lines of code (1 line for each variable).

I tried to put the code inside of a loop with a list of variable names, but the variable names in the list cannot seem to be recognized or passed into the code properly:

    looplist <- colnames(df[,!1])  
    for (l in looplist) {
      df[,c(paste(l,"_L",1:2,sep="")):=lapply(1:2, function(i) c(rep(NA, i),head(l, -i))),by=a]
    } 

Any advice on how to make this loop work across variables, or any other methods to accomplish the same objective (create multiple LAG terms for each and every variable in the data table) will be greatly appreciated!

John
  • 23
  • 2

2 Answers2

7

data.table and Map to handle the looping:

vars <- c("b","c")
rpv  <- rep(1:2, each=length(vars))
df[, paste(vars, "lag", rpv, sep="_") := Map(shift, .SD, rpv), by=a, .SDcols=vars]

#   a         b        c   b_lag_1  c_lag_1  b_lag_2  c_lag_2
#1: x 10.863180 393.9568        NA       NA       NA       NA
#2: x  6.139258 537.9199 10.863180 393.9568       NA       NA
#3: x 11.896448 483.8036  6.139258 537.9199 10.86318 393.9568
#4: y 18.079188 509.6136        NA       NA       NA       NA
#5: y  5.463224 233.6991 18.079188 509.6136       NA       NA
#6: y  6.363724 869.8406  5.463224 233.6991 18.07919 509.6136
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • @agstudy can you explain why my answer doesn't satisfy the question? Because it does not programatically scale to `N` lags? – Marius Sep 13 '17 at 00:08
  • @agstudy OK, I thought the issue was more with large numbers of variables, but yes, my answer would not work well with large numbers of lags. – Marius Sep 13 '17 at 02:08
  • Hi @thelatemail, I would like to use your answer here for a different problem: how do I change `shift` function with a power function? (Every element of vars needs to be raised to the power of `rpv`, where `rpv` is a numeric vector - so the function returns b^2, c^2, b^3, c^3, etc) – GNicoletti Jan 06 '21 at 10:46
  • 1
    @joeuk - I think if you replace `shift` with `\`^\`` , including the backticks, it should work. – thelatemail Jan 06 '21 at 11:10
  • yes, it works! Thank you very much! but the question now is: how do you generally define the first element of `Map`? if I have a custom function, how do I apply it into the `Map` structure? do I have to declare it first? Or can I call it straight into it? Eg: test_function <- function(x) {(x+10)/4} - Thank you – GNicoletti Jan 06 '21 at 11:27
  • 1
    @joeuk - you can define it first separately *or* declare it inline with function(x,y) etc. Just got to make sure the number of arguments you supply to Map matches the number of arguments to the function. – thelatemail Jan 06 '21 at 11:30
2

Here's a way to do it with dplyr:

df %>%
    group_by(a) %>%
    mutate_all(funs(lag1 = lag(., 1), lag2 = lag(., 2)))

Output:

       a         b        c    b_lag1   c_lag1   b_lag2   c_lag2
  <fctr>     <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
1      x  6.663691 689.2483        NA       NA       NA       NA
2      x 11.759130 397.8902  6.663691 689.2483       NA       NA
3      x  3.888010 467.9758 11.759130 397.8902 6.663691 689.2483
4      y  6.221436 355.5437        NA       NA       NA       NA
5      y  2.390940 701.2719  6.221436 355.5437       NA       NA
6      y 17.141815 175.4642  2.390940 701.2719 6.221436 355.5437
Marius
  • 58,213
  • 16
  • 107
  • 105