R/data.table: Partial rolling join

Question

I have the following data structure:

> dt
   ID MiscInfo       Date Val
1:  A   info_a 2000-01-01   0
2:  A   info_a 2000-01-03   3
3:  B   info_b 2001-01-01   1
4:  B   info_b 2001-01-04   5
5:  B   info_b 2001-01-07  13

Where Date has some missing ID-wise entries where Val == 0 and MiscInfo represents a set of N > 50 attribute variables. My end goal is to fill the missing entries such that the below structure is obtained.

> dt_pref
    ID MiscInfo       Date Val
 1:  A   info_a 2000-01-01   0
 2:  A   info_a 2000-01-02   0
 3:  A   info_a 2000-01-03   3
 4:  B   info_b 2001-01-01   1
 5:  B   info_b 2001-01-02   0
 6:  B   info_b 2001-01-03   0
 7:  B   info_b 2001-01-04   5
 8:  B   info_b 2001-01-05   0
 9:  B   info_b 2001-01-06   0
10:  B   info_b 2001-01-07  13

Judging from similar requests, a rolling join is a decent route in achieving this. The issue I encounter is the inability to select which columns to roll, as described below:

drange = dt[, .(Date = seq(min(Date), max(Date), 1)), ID] %>% setkey(ID, Date)
dt[drange, roll = T]

    ID MiscInfo       Date Val
 1:  A   info_a 2000-01-01   0
 2:  A   info_a 2000-01-02   0
 3:  A   info_a 2000-01-03   3
 4:  B   info_b 2001-01-01   1
 5:  B   info_b 2001-01-02   1
 6:  B   info_b 2001-01-03   1
 7:  B   info_b 2001-01-04   5
 8:  B   info_b 2001-01-05   5
 9:  B   info_b 2001-01-06   5
10:  B   info_b 2001-01-07  13

In this case, the MiscInfo columns are appropriately rolled to my satisfaction, however the Val columns are of course also rolled whilst I wish to set them equal to 0. I can of course also take a step in the other direction by passing roll = 0:

dt[drange, roll = 0]
    ID MiscInfo       Date Val
 1:  A   info_a 2000-01-01   0
 2:  A     <NA> 2000-01-02  NA
 3:  A   info_a 2000-01-03   3
 4:  B   info_b 2001-01-01   1
 5:  B     <NA> 2001-01-02  NA
 6:  B     <NA> 2001-01-03  NA
 7:  B   info_b 2001-01-04   5
 8:  B     <NA> 2001-01-05  NA
 9:  B     <NA> 2001-01-06  NA
10:  B   info_b 2001-01-07  13

In this case, I can of course apply something like dt[is.na(Val), Val := 0], however handling the NA-entries of the MiscInfo array (which is very large) using a similar route is not computationally efficient, and I suspect that there is a join-related way of carrying out this operation. In short, I want to preset Val to 0 for the "filled" entries, and roll the remaining columns in an efficient manner. Any ideas?

Reproducibles:

dt = data.table(
  ID = c('A', 'A', 'B', 'B', 'B'),
  MiscInfo = c(rep('info_a', 2), rep('info_b', 3)),
  Date = as.Date(c('2000-01-01', '2000-01-03', '2001-01-01', '2001-01-04', '2001-01-07')),
  Val = c(0,3,1,5,13)
) %>% setkey(ID, Date)

dt_pref = data.table(
  ID = c(rep('A', 3), rep('B', 7)), 
  MiscInfo = c(rep("info_a", 3), rep("info_b", 7)), 
  Date = as.Date(c(10957, 10958, 10959, 11323, 11324, 11325, 11326, 11327, 11328, 11329), origin = '1970-01-01'),
  Val = c(0, 0, 3, 1, 0, 0, 5, 0, 0, 13)
)

`drange <- dt[drange, -"Val",on = .(ID, Date), roll = TRUE]; drange[, Val := 0]; drange[dt, Val := i.Val, on = .(ID, Date)]` — Roland, Oct 05 '20 at 11:58
Another way that also workes on `drange`: `drange <- dt[, .(Date = seq(min(Date), max(Date), 1), Val = 0), by = .(ID, MiscInfo)]; drange[dt, Val := i.Val, on = .(ID, Date)]` — mt1022, Oct 05 '20 at 12:05
What is the algorithm for filling in `MiscInfo` carry forward from last observation? — s_baldur, Oct 05 '20 at 12:30
Thank you all. Let me complicate it slightly further by introducing another Val-column, hence we now work with `Val1` and `Val2` (both to be set zero). @Roland, I suppose that I just set `drange[dt, ´:=´(Val1=i.Val1, Val2=i.Val2), on = .(ID, Date)]` in the second step. @sindri_baldur, have you omitted the final `)`? I just double-checked and it reproduces for me. — JDG, Oct 05 '20 at 12:38
I suppose it's a pkg version issue, I amended `origin = '1970-01-01'` to `as.Date`, should work now. @sindri_baldur — JDG, Oct 05 '20 at 12:48
At least since version v3.0.3 as.Date() the documentation says it *will accept numeric data (the number of days since an epoch), but only if origin is supplied.* https://www.rdocumentation.org/packages/base/versions/3.0.3/topics/as.Date — s_baldur, Oct 05 '20 at 12:53

score 2 · Accepted Answer · answered Oct 05 '20 at 13:18

2

May be this could be used, even for more complex cases :

merge(dt, 
dt[, .(Date = seq.Date(from = min(Date), to = max(Date), by = 1)), by = c("ID", "MiscInfo") ], 
      by = c("ID", "Date"), 
      all = TRUE)[, .(ID, Date, MiscInfo.y, Val = case_when(is.na(Val) ~ 0, 
                                                      TRUE ~ Val))]

answered Oct 05 '20 at 13:18

MrSmithGoesToWashington

1,048
10
23

Thank you. Whilst this works for the supplied example, `MiscInfo` is in reality a set of ~80 variables. Any ideas for this case? The `j`-statements become tedious having to supply all `MiscInfo` columns. – JDG Oct 06 '20 at 07:44
1

Is it possible to create one variable v for identifying all possible combinaitions of these 80 variables ? Then use only this one, and then, if needed, merge with the table of all 80 variable, according to v ? – MrSmithGoesToWashington Oct 06 '20 at 09:06

R/data.table: Partial rolling join

1 Answers1