2

I have a large data set that has about 100M rows, with 8M groups. I'm trying to do a recursive calculation to generate multiple new columns and I keep getting stuck. This is sample data, grouped by columns A and D:

library(data.table)     
DT = data.table(A = c(rep(1,5),rep(2,3)),
                D = c(rep(1,3),rep(2,2),rep(1,2),rep(2,1)),
                Qc = c(10,10,10,15,15,30,30,40),
                Qe = c(20,20,20,25,25,10,10,5),
                Dc = c(5,0,25,0,20,20,50,10),
                De = c(0,30,0,10,0,0,0,0))

I need to create new columns that use previous values of these newly generated columns (except for the first row of each group, which initializes the recursion). This code creates the output I'm looking for, but I'm hoping to do this in a data.table way. I've also tried an .SD solution but it is incredibly slow (50+ hours).

# for initial row of each group
DT[c(1,4,6,8),c("Sc","Se"):=list(pmin(Qc,Dc),pmin(Qe,De))]
DT[c(1,4,6,8), c("NQc", "NQe"):=list(Qc-Sc+Se,Qe-Se+Sc)]

# for recursion 
for(i in c(2:3,5,7)){
i=as.integer(i)
set(DT,i,"Qc",DT[i-1L,NQc])
set(DT,i,"Qe",DT[i-1L,NQe])
DT[i, c("Sc","Se"):=list(pmin(Qc,Dc),pmin(Qe,De))]
DT[i, c("NQc", "NQe"):=list(Qc-Sc+Se,Qe-Se+Sc)]
}

I'm trying to use shift, but I can't figure out how to do the initialization properly without a loop per A group, and even this doesn't seem to work:

DT[,c("Se","Sc","NQe","NQc"):=list(min(shift(NQe),De),min(shift(NQe),Dc),
                               Qe-Se+Sc,Qc-Sc+Se),by=A]

Any ideas would be VERY appreciated.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Karthik
  • 31
  • 3
  • Do you realize that the first 2 lines of your `for` loop don't affect the final values in DT? You're assigning values to the same symbols that exist inside the data.table. When the next 4 lines execute, they take the value from inside the data.table. – Adam Hoelscher Feb 24 '16 at 21:30
  • 1
    Assuming that what I mentioned in my other comment *is* a bug in your code, then you won't be able to use `shift` to do this in a simple way. This is because the calculation is recursive; you cannot set the final value of NQe for any given row until you've computed the value for the row before it. – Adam Hoelscher Feb 24 '16 at 21:35
  • @Adam Thanks for catching this -- I updated the code; your assumption was correct. I'm still a bit confused because it seems from this link [link](http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation/25084650) shows a implementation of a recursive function with shift. I'm not wedded to shift, but it seemed like the only other way I could do this without sending data to a special function with .SD -- this takes too long (50+ hours). Any ideas? Thanks for your help. – Karthik Feb 24 '16 at 21:56
  • @Karthik where exactly is recursion used there? I don't see it. – MichaelChirico Feb 24 '16 at 22:13
  • I don't think the calculation in that question is actually recursive. When they calculate `D` it depends on values of `C` and `B` (which are both fixed). When you're calculating a new value of `NQe` it depends on the `NQe` value from the prior row, and that value has to be calculated before it can be used. Maybe there is some algebraic voodoo you can do to derive a closed form expression, but I can't quickly see it. Also, 50+ hours = YUCK and I wish I had something better to say about it. :-/ – Adam Hoelscher Feb 24 '16 at 22:31

1 Answers1

2

I believe this works, but, as mentioned, recursion seems unavoidable, so I don't know how much of a speed-up this will give you. But at least we avoid looping over all groups by going within-group-index-by-within-group-index:

DT[ , grp_i := 1:.N, by = .(A, D)]
setkey(DT, grp_i)

#since the number of obs. per group varies,
#  will help to basically find out which groups still exist
#  at each index
remain <- lapply(is <- 1L:DT[ , max(grp_i)], function(n) unique(DT[.(n), .(A, D)]))

upd_cols <- c("Se", "Sc", "NQe", "NQc")
pre_cols <- c("NQe", "NQc")

setkey(DT, A, D, grp_i)
DT[.(remain[[1L]][ , grp_i := 1L]),
   (upd_cols)  := 
     .(Se <- pmin(Qe, De), Sc <- pmin(Qc, Dc),
       Qe - (del <- Se - Sc), Qc + del)]

for (ii in is[-1L]){
  rem <- remain[[ii]]
  prev <- DT[.(rem[ , grp_i := ii - 1L]), pre_cols, with = FALSE]
  DT[.(rem[ , grp_i := ii]), (upd_cols) :=
       .(Se <- pmin(TQe <- prev$NQe, De), Sc <- pmin(TQc <- prev$NQc, Dc),
         TQe - (del <- Se - Sc), TQc + del)]
}
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • This is great -- thank you -- I'm eager to try this. One quick question for you: my actual data has two key fields (A,D -- e.g 1,Z is one block of rows; 1,Y is a second block of rows, etc). I'm a bit stuck with altering the code for this. I think the alterations are straightforward with changing by = A to by = .(A,D) and setkey(DT,A,grp_i) to setkey(DT,A,D,grp_i) but I'm stuck on the "unique(A)" code. Any ideas on how to alter this? Many thanks for your help! – Karthik Feb 25 '16 at 04:08
  • Would help to have a working example. First guess is to replace the inner function with `unique(DT[.(n), .(key_1, key_2)], by = c("key_1", "key_2"))` – MichaelChirico Feb 25 '16 at 04:16
  • Thanks much for the response. I tried replacing the function, but still a little stuck in the mud. I updated the question with a working example with two key fields though. – Karthik Feb 25 '16 at 15:06
  • @Karthik updated. Not sure it's the best way to have done so but it works. – MichaelChirico Feb 25 '16 at 15:16
  • This is fantastic -- thank you! This actually ran on my machine in 10 minutes -- WAY better my previous 50 hours. – Karthik Feb 27 '16 at 22:00