1

I have a data.table with various time series as columns for which I want to add lags, differences and potentially some computed columns. Each grp has a fixed number of rows and there are many grp blocks stacked on top of one another in this data.table. Here is a simple example of the data.

    grp period          y1         y2
 1:   a      1  0.96594315  0.2395888
 2:   a      2 -0.18635737  1.1289055
 3:   a      3  1.78466563  3.5153819
 4:   a      4 -0.53847076  1.6550108
 5:   a      5 -1.10865388 -0.4006585
 6:   b      1 -0.05804824 -0.3389047
 7:   b      2 -1.38495552  1.1148454
 8:   b      3 -0.21258003 -0.6976163
 9:   b      4 -0.79988363  0.4506847
10:   b      5  1.30752737 -0.9655459

I have a function which generates all of the additional columns for a single grp when it is passed the block of data. Now I want to do this operation (using the function!) for every grp without having to specify the new column names on the LHS of the :=, without coping the existing columns back out of the function and without having to create a new data.table The reproducible example is below.

dt = data.table(grp = rep(letters[1:2], each=5), 
                period = rep(1:5,2), 
                y1 = rnorm(10),
                y2 = rnorm(10))
my.fun = function(dt.sub) {
    y1.lag = c(NA, dt.sub$y1[1:(nrow(dt.sub)-1)])
    y2.lag = c(NA, dt.sub$y2[1:(nrow(dt.sub)-1)])
    y1y2 = dt.sub$y1 * dt.sub$y2
    list(y1.lag = y1.lag, y2.lag = y2.lag, y1y2 = y1y2) #
}

The following code produces exactly what I am after, namely the original data with added columns, but I do not want to have to copy the result into another data.table object as this consumes more memory than I have available (yes I need to move to 64-bit pronto!).

EDIT: I am keen to use the existing function rather than hard code the full set of expressions in the j portion. This is because a) I might want to change the columns returned on the fly, b) readability - there are a large number of column added in the real example, c) maintainability - multiple scripts might be required to do this and I'd rather change it just once in the function.

dt = dt[,c(.SD, my.fun(.SD)),by=grp]    # the problem here is using dt =

So I tried using the := (in backticks) to create the new columns within the existing object, but that returns an error as I have not specified the LHS new column names. Given that the above code provides exactly what I want without naming the columns (they are already in the list), I cannot understand why the code below fails.

dt[,`:=`(my.fun(.SD)),by=grp]

Error in `[.data.table`(dt, , `:=`(my.fun(.SD)), by = grp) : 
  In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.

Clearly this is a toy example and my data is > 500k records with > 30 columns. Any help gratefully received, thanks.

EDIT: I have since spotted a question (R data.table grouping for lagged regression) which deals with grouping and roll to calculate lagged values within groups. This does not resolve my specific problem however it does provide a different approach which is very relevant for the matter at hand.

Community
  • 1
  • 1
Matt Weller
  • 2,684
  • 2
  • 21
  • 30
  • My concern is that the repeated use of `dt.sub$` will be calling `[.data.frame` which would reduce the data.table efficiencies. – IRTFM Dec 21 '13 at 16:11

2 Answers2

2

My error when I tried to use := within a list on the RHS produced the advice at the console to read: help(":=") and once I did that I achieved enlightenment:

help(":=")  # advice about  `multiple :=`

dt[ , `:=`(y1.lag = c(NA, head(y1,-1) ),
              y2.lag = c(NA, head(y2,-1) ),
              y1y2 = y1*y2) ,by=grp]
#------------    
dt
    grp period         y1          y2     y1.lag      y2.lag        y1y2
 1:   a      1 -0.2127395  1.33549660         NA          NA -0.28411285
 2:   a      2 -2.2005742 -0.07679158 -0.2127395  1.33549660  0.16898556
 3:   a      3  0.3857444 -0.47996397 -2.2005742 -0.07679158 -0.18514341
 4:   a      4 -1.5117554  0.50728778  0.3857444 -0.47996397 -0.76689506
 5:   a      5  1.7713902 -0.03092824 -1.5117554  0.50728778 -0.05478598
 6:   b      1  0.5033163  0.69815100         NA          NA  0.35139079
 7:   b      2  0.1125835 -2.19959623  0.5033163  0.69815100 -0.24763815
 8:   b      3  1.0252230 -1.76477546  0.1125835 -2.19959623 -1.80928832
 9:   b      4 -0.5484611 -1.35167910  1.0252230 -1.76477546  0.74134341
10:   b      5  1.3801637  0.67293665 -0.5484611 -1.35167910  0.92876276

It's rather neat to see that calling := as a function with a pairlist works like do.call might with a regular list. [Edit} To address the request to "externalize" the lagging specification but still coding to work on particular columns:

lag.y1.y2.expr =  expression(`:=`(
    y1.lag = c(NA, head(y1, -1) ),
    y2.lag = c(NA, head(y2, -1)),
    y1y2 = y1 * y2  ) )

dt[, eval( lag.y1.y2.expr ), , by='grp' ]

I don't see this as having any deficiencies relative to your code, since your code did not allow (or even hint at) programmatic substitution of column names. If you wanted a somewhat more maintainable arrangement with a single entry point for the possible modification of the names of the columns this also succeeds:

my.expr =  substitute(`:=`(
                          y1.lag = c(NA, head(X1, -1) ),
                          y2.lag = c(NA, head(X2, -1) ),
                          y1y2 = y1 * y2 
                           ) ,
                      list(X1=quote(y1),X2=quote(y2) ) )

dt[, eval(my.expr), , by='grp' ]

And I suspect that you could expect success using bquote which sometimes simplifies working with R expression objects..

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Strange I didn't get the same error message with the help pointer. This solution works fine, my question failed to say that for maintainability and readability I'd rather use a function than explicitly list all of the new columns within the j clause. My mistake. I should update the question. I presume this is the most efficient method of all as there is no function call. Like the use of head() btw. – Matt Weller Dec 21 '13 at 11:23
  • This was my effort that produced the "friendly advice": `dt[ , list(y1.lag := c(NA, head(y1,-1) ), y2.lag := c(NA, head(y2,-1) ), y1y2 := y1*y2) ]`. – IRTFM Dec 21 '13 at 15:43
  • Thanks, I've gone with the other solution purely for maintainability/readability despite the performance improvements of doing it inline. Both solutions are SO much quicker than the `ddply` method. My next step would be to utilise `roll` but that's another task! Didn't know about `help(":=")` - thanks again. – Matt Weller Dec 23 '13 at 13:22
1

For the same as IShouldBuyABoat's answer result try:

data.table(dt, dt[, my.fun(.SD), by = grp][, grp := NULL])
   grp period       y1       y2   y1.lag  y2.lag      y1y2
1:   a      1  1.36677 -0.81025       NA      NA -1.107425
2:   a      2  0.43528  1.04277  1.36677 -0.8102  0.453895
3:   a      3 -1.40229  0.66223  0.43528  1.0428 -0.928633
4:   a      4  1.43362  0.10293 -1.40229  0.6622  0.147560
5:   a      5  0.46713  0.72508  1.43362  0.1029  0.338705
6:   b      1 -0.04418 -0.20014       NA      NA  0.008843
7:   b      2  1.32390  0.19651 -0.04418 -0.2001  0.260160
8:   b      3 -0.82543  1.11483  1.32390  0.1965 -0.920215
9:   b      4 -1.26415  0.53213 -0.82543  1.1148 -0.672698
10:  b      5  0.14549  0.04128 -1.26415  0.5321  0.006005

Note: you can use .SDcols for specify a columns passed to my.fun. [, grp := NULL] to supress duplicate grp column.

Artem Klevtsov
  • 9,193
  • 6
  • 52
  • 57
  • Unfortunately there is no `cbind.data.table` function, so this will be falling back onto `cbind data.frame`. It cannot be expected to maintain typical data.table performance – IRTFM Dec 21 '13 at 16:26
  • 1
    The performance is fine for my needs and smashes `ddply` out of the water, so thanks for the answer. – Matt Weller Dec 23 '13 at 13:45
  • The slowness of ddply is well-known, but I understand that hadley is working on an update that performs better in terms of time efficiency. – IRTFM Dec 23 '13 at 16:55
  • @IShouldBuyABoat: `data.table(dt, dt[, my.fun(.SD), by = grp][, grp := NULL])` produce the same result. Answer edited. – Artem Klevtsov Jan 01 '14 at 04:36