I have a large data set that has about 100M rows, with 8M groups. I'm trying to do a recursive calculation to generate multiple new columns and I keep getting stuck. This is sample data, grouped by columns A
and D
:
library(data.table)
DT = data.table(A = c(rep(1,5),rep(2,3)),
D = c(rep(1,3),rep(2,2),rep(1,2),rep(2,1)),
Qc = c(10,10,10,15,15,30,30,40),
Qe = c(20,20,20,25,25,10,10,5),
Dc = c(5,0,25,0,20,20,50,10),
De = c(0,30,0,10,0,0,0,0))
I need to create new columns that use previous values of these newly generated columns (except for the first row of each group, which initializes the recursion). This code creates the output I'm looking for, but I'm hoping to do this in a data.table
way. I've also tried an .SD
solution but it is incredibly slow (50+ hours).
# for initial row of each group
DT[c(1,4,6,8),c("Sc","Se"):=list(pmin(Qc,Dc),pmin(Qe,De))]
DT[c(1,4,6,8), c("NQc", "NQe"):=list(Qc-Sc+Se,Qe-Se+Sc)]
# for recursion
for(i in c(2:3,5,7)){
i=as.integer(i)
set(DT,i,"Qc",DT[i-1L,NQc])
set(DT,i,"Qe",DT[i-1L,NQe])
DT[i, c("Sc","Se"):=list(pmin(Qc,Dc),pmin(Qe,De))]
DT[i, c("NQc", "NQe"):=list(Qc-Sc+Se,Qe-Se+Sc)]
}
I'm trying to use shift, but I can't figure out how to do the initialization properly without a loop per A
group, and even this doesn't seem to work:
DT[,c("Se","Sc","NQe","NQc"):=list(min(shift(NQe),De),min(shift(NQe),Dc),
Qe-Se+Sc,Qc-Sc+Se),by=A]
Any ideas would be VERY appreciated.