I have a data.table
with various time series as columns for which I want to add lags, differences and potentially some computed columns. Each grp
has a fixed number of rows and there are many grp
blocks stacked on top of one another in this data.table
. Here is a simple example of the data.
grp period y1 y2
1: a 1 0.96594315 0.2395888
2: a 2 -0.18635737 1.1289055
3: a 3 1.78466563 3.5153819
4: a 4 -0.53847076 1.6550108
5: a 5 -1.10865388 -0.4006585
6: b 1 -0.05804824 -0.3389047
7: b 2 -1.38495552 1.1148454
8: b 3 -0.21258003 -0.6976163
9: b 4 -0.79988363 0.4506847
10: b 5 1.30752737 -0.9655459
I have a function which generates all of the additional columns for a single grp
when it is passed the block of data. Now I want to do this operation (using the function!) for every grp
without having to specify the new column names on the LHS of the :=
, without coping the existing columns back out of the function and without having to create a new data.table
The reproducible example is below.
dt = data.table(grp = rep(letters[1:2], each=5),
period = rep(1:5,2),
y1 = rnorm(10),
y2 = rnorm(10))
my.fun = function(dt.sub) {
y1.lag = c(NA, dt.sub$y1[1:(nrow(dt.sub)-1)])
y2.lag = c(NA, dt.sub$y2[1:(nrow(dt.sub)-1)])
y1y2 = dt.sub$y1 * dt.sub$y2
list(y1.lag = y1.lag, y2.lag = y2.lag, y1y2 = y1y2) #
}
The following code produces exactly what I am after, namely the original data with added columns, but I do not want to have to copy the result into another data.table
object as this consumes more memory than I have available (yes I need to move to 64-bit pronto!).
EDIT: I am keen to use the existing function rather than hard code the full set of expressions in the j portion. This is because a) I might want to change the columns returned on the fly, b) readability - there are a large number of column added in the real example, c) maintainability - multiple scripts might be required to do this and I'd rather change it just once in the function.
dt = dt[,c(.SD, my.fun(.SD)),by=grp] # the problem here is using dt =
So I tried using the :=
(in backticks) to create the new columns within the existing object, but that returns an error as I have not specified the LHS new column names. Given that the above code provides exactly what I want without naming the columns (they are already in the list
), I cannot understand why the code below fails.
dt[,`:=`(my.fun(.SD)),by=grp]
Error in `[.data.table`(dt, , `:=`(my.fun(.SD)), by = grp) :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
Clearly this is a toy example and my data is > 500k records with > 30 columns. Any help gratefully received, thanks.
EDIT: I have since spotted a question (R data.table grouping for lagged regression) which deals with grouping and roll
to calculate lagged values within groups. This does not resolve my specific problem however it does provide a different approach which is very relevant for the matter at hand.