
I need to calculate the number of occurrences of specific events over specified time periods. Assume I have the following data:

set.seed(1453)
id = c(1,1,1,1,1,2,2,2,2,2)
x_1 = sample(0:1, 10, TRUE)
x_5 = sample(0:1, 10, TRUE)
date = c('2016-01-01',
         '2016-02-01',
         '2016-02-23',
         '2016-03-04',
         '2016-04-01',
         '2016-01-01',
         '2016-02-01',
         '2016-02-23',
         '2016-03-04',
         '2016-04-01')

df = data.frame(id,date=as.Date(date),snapshot_date = as.Date(date)+1,x_1,x_5)

TABLE 1. (INPUT)

id  date      snapshot_date  x_1 x_5
1 2016-01-01    2016-01-02   1   0
1 2016-02-01    2016-02-02   0   1
1 2016-02-23    2016-02-24   1   1
1 2016-03-04    2016-03-05   0   0
1 2016-04-01    2016-04-02   0   1
2 2016-01-01    2016-01-02   1   1
2 2016-02-01    2016-02-02   1   0
2 2016-02-23    2016-02-24   0   0
2 2016-03-04    2016-03-05   0   0
2 2016-04-01    2016-04-02   1   1

I need to calculate how many times x_1 = 1 and x_5 = 1 occurred within the last 3 months, broken down by month. So I first create logical flags: x_1_n is TRUE if x_1 = 1 and FALSE otherwise, and similarly for x_5_n. I also create the cut-off dates for the three preceding months.

df$x_1_n <- ifelse(df$x_1 == 1, TRUE, FALSE)
df$x_5_n <- ifelse(df$x_5 == 1, TRUE, FALSE)

library(lubridate)

for (i in 1:3) {
  df[, paste0("DATE_MO", i)] <- df$snapshot_date %m-% months(i)
}
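As a side note on why %m-% from lubridate is used here rather than plain month subtraction: %m-% rolls back to the last existing day of the target month instead of producing NA on impossible dates. A quick illustration (not part of the original code):

library(lubridate)

as.Date("2016-03-31") %m-% months(1)   # "2016-02-29" -- rolled back to a valid day
as.Date("2016-03-31") - months(1)      # NA, because 2016-02-31 does not exist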

I have the variables x_1 and x_5. I need to write a loop that goes through each of these variables and calculates the number of occurrences between certain dates. The code below runs and is correct, but I want to see how to simplify it with a for loop so that I don't have to copy and paste the same block for every variable, since the real data has many more x_ columns and date windows.

library(data.table)
df <- data.table(df)

df[,c("x1_dminus_mo1",
         "x1_dminus_mo2",
         "x1_dminus_mo3"


) :=. (df[x_1_n][df[,.(id,DATE_MO1,snapshot_date)], on=.
                  (id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI] $N
   ,

   df[x_1_n][df[,.(id,DATE_MO2, DATE_MO1)], on=.
                  (id, date >= DATE_MO2, date < DATE_MO1), .N, by = .EACHI] $N

   ,

   df[x_1_n][df[,.(id,DATE_MO3, DATE_MO2)], on=.
                  (id, date >= DATE_MO3, date < DATE_MO2), .N, by = .EACHI] $N
)]

df[,c("x5_dminus_mo1",
  "x5_dminus_mo2",
  "x5_dminus_mo3"


) :=. (df[x_5_n][df[,.(id,DATE_MO1,snapshot_date)], on=.
             (id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI] $N
   ,

   df[x_5_n][df[,.(id,DATE_MO2, DATE_MO1)], on=.
             (id, date >= DATE_MO2, date < DATE_MO1), .N, by = .EACHI] $N

   ,

   df[x_5_n][df[,.(id,DATE_MO3, DATE_MO2)], on=.
             (id, date >= DATE_MO3, date < DATE_MO2), .N, by = .EACHI] $N
 )]

I want to get the following table but using the loop.

TABLE 2 (OUTPUT)

df[,c(1,2,4,11,12,13)]
id       date x_1 x1_dminus_mo1 x1_dminus_mo2 x1_dminus_mo3
1 2016-01-01   1             1             0             0
1 2016-02-01   0             0             1             0
1 2016-02-23   1             1             1             0
1 2016-03-04   0             1             0             1
1 2016-04-01   0             0             1             0
2 2016-01-01   1             1             0             0
2 2016-02-01   1             1             1             0
2 2016-02-23   0             1             1             0
2 2016-03-04   0             0             1             1
2 2016-04-01   1             1             0             1
Yelena
  • https://stackoverflow.com/help/mcve – Nathan Werth Mar 14 '18 at 20:17
  • Please reopen this question as more details have been added. – Yelena Mar 15 '18 at 16:30
  • I think you're after `get` command to use variable made of string and variable, but I don't really get what you expect to do here... – Tensibai Mar 15 '18 at 16:51
  • Thank you, Tensibai. I need to do something like this: for (i in c("x_1", "x_5")) { then write code so that I have the loop over x_1 and x_5 } – Yelena Mar 15 '18 at 16:56
  • Maybe something along the lines of `col = "x_1"; DT[, .SD[.(1), on=col][.(d_dn = date - 31, d_up = date), on=.(date >= d_dn, date <= d_up), .N, by=.EACHI]$N]` – Frank Mar 15 '18 at 17:06
  • @Yelena I was not asking about what technical method you want to follow, but more: what do you have as input, what do you want as output and what is in plain english the transformation (keep your current attempt in the post to show your effort). Avoid the XY problem asking how to solve an intermediate step, ask how to solve the problem itself and someone will probably give you an answer. – Tensibai Mar 15 '18 at 17:09
  • We also need to make sure that we take care of different IDs. – Yelena Mar 15 '18 at 17:20
  • @Tensibai. I edited my question. You can see the input and output for x_1 (as an example). I'd like to write a loop for x_1, x_5. – Yelena Mar 15 '18 at 18:03
  • @Frank. Thank you for your help. I ran your code but the results are different from what I'd like to get. What does your code calculate? col = "x_1"; df[, .SD[.(1), on=col][.(d_dn = date - 31, d_up = date), on=.(date >= d_dn, date <= d_up), .N, by=.EACHI]$N] [1] 2 2 2 3 1 – Yelena Mar 15 '18 at 18:21
  • @Frank: This code works: col = "x_1"; df[, .SD[.(1), on=col]][df[,.(id,DATE_MO1,snapshot_date)], on=. (id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI] $N – Yelena Mar 15 '18 at 18:30
  • @Frank: But how to do the loop now? – Yelena Mar 15 '18 at 18:30
  • @Yelena kudos for listening and reviewing your question :) keep it in mind for next one to get a better reception from start – Tensibai Mar 15 '18 at 22:04

2 Answers


Thanks to @Frank, I've found the right path. Here is the solution:

for (i in c(1, 5)) {

  col = paste0("x_", i)

  df[, paste0("new_dminus_mo1_x", i)] <-
    df[, .SD[.(1), on = col]][df[, .(id, DATE_MO1, snapshot_date)],
                              on = .(id, date >= DATE_MO1, date < snapshot_date), .N, by = .EACHI]$N

  df[, paste0("new_dminus_mo2_x", i)] <-
    df[, .SD[.(1), on = col]][df[, .(id, DATE_MO2, DATE_MO1)],
                              on = .(id, date >= DATE_MO2, date < DATE_MO1), .N, by = .EACHI]$N

  df[, paste0("new_dminus_mo3_x", i)] <-
    df[, .SD[.(1), on = col]][df[, .(id, DATE_MO3, DATE_MO2)],
                              on = .(id, date >= DATE_MO3, date < DATE_MO2), .N, by = .EACHI]$N
}
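If the real data has more than three month windows, the same pattern could also be looped over the month index, not just over the x_ columns. Below is a minimal sketch of that idea (not from the original post): it reuses the DATE_MO columns created in the question, follows the same non-equi-join pattern as above, and the names lower, upper, win, lo and hi are hypothetical helpers introduced only for illustration.

library(data.table)

for (i in c(1, 5)) {
  col <- paste0("x_", i)                       # e.g. "x_1", "x_5"
  for (j in 1:3) {
    lower <- paste0("DATE_MO", j)              # start of the j-th window
    upper <- if (j == 1) "snapshot_date" else paste0("DATE_MO", j - 1)  # end of the window
    win <- df[, c("id", lower, upper), with = FALSE]   # one lookup row per row of df
    setnames(win, c("id", "lo", "hi"))
    # count rows with x_i == 1 whose date falls in [lo, hi) for the same id
    df[, (paste0("new_dminus_mo", j, "_x", i)) :=
         df[.(1), on = col][win, on = .(id, date >= lo, date < hi),
                            .N, by = .EACHI]$N]
  }
}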
Yelena

I would create a long-form table so that only one join needs to be done per column:

# DT below is the question's df as a data.table (e.g. DT <- data.table(df)); lubridate is loaded for %m-%
lookup = melt(DT[, lapply(0:3, function(x) snapshot_date %m-% months(x)), by=id],
  id="id", 
  meas = list(2:4, 3:5), 
  value.name = c("d_up", "d_dn"))
lookup[, rn := rowid(variable), by=id]

cols = c("x_1", "x_5")
for (k in cols) lookup[, paste0("n_", k) := 
  DT[.(1), on=k][.SD, on=.(id, date >= d_dn, date < d_up), .N, by=.EACHI]$N][]

    id variable       d_up       d_dn rn n_x_1 n_x_5
 1:  1        1 2016-01-02 2015-12-02  1     1     0
 2:  1        1 2016-02-02 2016-01-02  2     0     1
 3:  1        1 2016-02-24 2016-01-24  3     1     2
 4:  1        1 2016-03-05 2016-02-05  4     1     1
 5:  1        1 2016-04-02 2016-03-02  5     0     1
 6:  2        1 2016-01-02 2015-12-02  1     1     1
 7:  2        1 2016-02-02 2016-01-02  2     1     0
 8:  2        1 2016-02-24 2016-01-24  3     1     0
 9:  2        1 2016-03-05 2016-02-05  4     0     0
10:  2        1 2016-04-02 2016-03-02  5     1     1
11:  1        2 2015-12-02 2015-11-02  1     0     0
12:  1        2 2016-01-02 2015-12-02  2     1     0
13:  1        2 2016-01-24 2015-12-24  3     1     0
14:  1        2 2016-02-05 2016-01-05  4     0     1
15:  1        2 2016-03-02 2016-02-02  5     1     1
16:  2        2 2015-12-02 2015-11-02  1     0     0
17:  2        2 2016-01-02 2015-12-02  2     1     1
18:  2        2 2016-01-24 2015-12-24  3     1     1
19:  2        2 2016-02-05 2016-01-05  4     1     0
20:  2        2 2016-03-02 2016-02-02  5     0     0
21:  1        3 2015-11-02 2015-10-02  1     0     0
22:  1        3 2015-12-02 2015-11-02  2     0     0
23:  1        3 2015-12-24 2015-11-24  3     0     0
24:  1        3 2016-01-05 2015-12-05  4     1     0
25:  1        3 2016-02-02 2016-01-02  5     0     1
26:  2        3 2015-11-02 2015-10-02  1     0     0
27:  2        3 2015-12-02 2015-11-02  2     0     0
28:  2        3 2015-12-24 2015-11-24  3     0     0
29:  2        3 2016-01-05 2015-12-05  4     1     1
30:  2        3 2016-02-02 2016-01-02  5     1     0
    id variable       d_up       d_dn rn n_x_1 n_x_5

meas = list(2:4, 3:5) is short for measure.vars: it "melts" columns 2:4 together into one value column (d_up), and similarly melts columns 3:5 into another (d_dn).
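For intuition, here is a tiny self-contained example (the toy table and the column names a1, a2, b1, b2 are made up for illustration) showing how a list of measure columns produces one value column per group:

library(data.table)

toy <- data.table(id = 1:2, a1 = 1:2, a2 = 3:4, b1 = 5:6, b2 = 7:8)

melt(toy, id.vars = "id",
     measure.vars = list(c("a1", "a2"), c("b1", "b2")),   # same idea as meas = list(2:4, 3:5)
     value.name = c("a", "b"))
#    id variable a b
# 1:  1        1 1 5
# 2:  2        1 2 6
# 3:  1        2 3 7
# 4:  2        2 4 8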

Long form also has the benefit that you don't need to spend too much time working on naming conventions for columns containing similar data ("DATE_MO[x]", "new_dminus_mo[x]_x", etc).

I prefer this format (with separate long-form tables), but an "update join" can recover columns from DT here (with repeating values):

DT[, rn := rowid(id)]
DTcols = setdiff(names(DT), names(lookup))
lookup[DT, on=.(id, rn), (DTcols) := mget(paste0("i.", DTcols))]

    id variable       d_up       d_dn rn n_x_1 n_x_5       date snapshot_date x_1 x_5
 1:  1        1 2016-01-02 2015-12-02  1     1     0 2016-01-01    2016-01-02   1   0
 2:  1        1 2016-02-02 2016-01-02  2     0     1 2016-02-01    2016-02-02   0   1
 3:  1        1 2016-02-24 2016-01-24  3     1     2 2016-02-23    2016-02-24   1   1
 4:  1        1 2016-03-05 2016-02-05  4     1     1 2016-03-04    2016-03-05   0   0
 5:  1        1 2016-04-02 2016-03-02  5     0     1 2016-04-01    2016-04-02   0   1
 6:  2        1 2016-01-02 2015-12-02  1     1     1 2016-01-01    2016-01-02   1   1
 7:  2        1 2016-02-02 2016-01-02  2     1     0 2016-02-01    2016-02-02   1   0
 8:  2        1 2016-02-24 2016-01-24  3     1     0 2016-02-23    2016-02-24   0   0
 9:  2        1 2016-03-05 2016-02-05  4     0     0 2016-03-04    2016-03-05   0   0
10:  2        1 2016-04-02 2016-03-02  5     1     1 2016-04-01    2016-04-02   1   1
11:  1        2 2015-12-02 2015-11-02  1     0     0 2016-01-01    2016-01-02   1   0
12:  1        2 2016-01-02 2015-12-02  2     1     0 2016-02-01    2016-02-02   0   1
13:  1        2 2016-01-24 2015-12-24  3     1     0 2016-02-23    2016-02-24   1   1
14:  1        2 2016-02-05 2016-01-05  4     0     1 2016-03-04    2016-03-05   0   0
15:  1        2 2016-03-02 2016-02-02  5     1     1 2016-04-01    2016-04-02   0   1
16:  2        2 2015-12-02 2015-11-02  1     0     0 2016-01-01    2016-01-02   1   1
17:  2        2 2016-01-02 2015-12-02  2     1     1 2016-02-01    2016-02-02   1   0
18:  2        2 2016-01-24 2015-12-24  3     1     1 2016-02-23    2016-02-24   0   0
19:  2        2 2016-02-05 2016-01-05  4     1     0 2016-03-04    2016-03-05   0   0
20:  2        2 2016-03-02 2016-02-02  5     0     0 2016-04-01    2016-04-02   1   1
21:  1        3 2015-11-02 2015-10-02  1     0     0 2016-01-01    2016-01-02   1   0
22:  1        3 2015-12-02 2015-11-02  2     0     0 2016-02-01    2016-02-02   0   1
23:  1        3 2015-12-24 2015-11-24  3     0     0 2016-02-23    2016-02-24   1   1
24:  1        3 2016-01-05 2015-12-05  4     1     0 2016-03-04    2016-03-05   0   0
25:  1        3 2016-02-02 2016-01-02  5     0     1 2016-04-01    2016-04-02   0   1
26:  2        3 2015-11-02 2015-10-02  1     0     0 2016-01-01    2016-01-02   1   1
27:  2        3 2015-12-02 2015-11-02  2     0     0 2016-02-01    2016-02-02   1   0
28:  2        3 2015-12-24 2015-11-24  3     0     0 2016-02-23    2016-02-24   0   0
29:  2        3 2016-01-05 2015-12-05  4     1     1 2016-03-04    2016-03-05   0   0
30:  2        3 2016-02-02 2016-01-02  5     1     0 2016-04-01    2016-04-02   1   1
    id variable       d_up       d_dn rn n_x_1 n_x_5       date snapshot_date x_1 x_5

Or reshape it to wide format and update-join back to DT:

wDT = dcast(lookup, id + rn ~ variable, value.var = paste0("n_", cols))

    id rn n_x_1_1 n_x_1_2 n_x_1_3 n_x_5_1 n_x_5_2 n_x_5_3
 1:  1  1       1       0       0       0       0       0
 2:  1  2       0       1       0       1       0       0
 3:  1  3       1       1       0       2       0       0
 4:  1  4       1       0       1       1       1       0
 5:  1  5       0       1       0       1       1       1
 6:  2  1       1       0       0       1       0       0
 7:  2  2       1       1       0       0       1       0
 8:  2  3       1       1       0       0       1       0
 9:  2  4       0       1       1       0       0       1
10:  2  5       1       0       1       1       0       0


DT[, rn := rowid(id)]
wDTcols = setdiff(names(wDT), names(DT))
DT[wDT, on=.(id, rn), (wDTcols) := mget(paste0("i.", wDTcols))]

    id       date snapshot_date x_1 x_5 rn n_x_1_1 n_x_1_2 n_x_1_3 n_x_5_1 n_x_5_2 n_x_5_3
 1:  1 2016-01-01    2016-01-02   1   0  1       1       0       0       0       0       0
 2:  1 2016-02-01    2016-02-02   0   1  2       0       1       0       1       0       0
 3:  1 2016-02-23    2016-02-24   1   1  3       1       1       0       2       0       0
 4:  1 2016-03-04    2016-03-05   0   0  4       1       0       1       1       1       0
 5:  1 2016-04-01    2016-04-02   0   1  5       0       1       0       1       1       1
 6:  2 2016-01-01    2016-01-02   1   1  1       1       0       0       1       0       0
 7:  2 2016-02-01    2016-02-02   1   0  2       1       1       0       0       1       0
 8:  2 2016-02-23    2016-02-24   0   0  3       1       1       0       0       1       0
 9:  2 2016-03-04    2016-03-05   0   0  4       0       1       1       0       0       1
10:  2 2016-04-01    2016-04-02   1   1  5       1       0       1       1       0       0
Frank
  • Hi Frank, This is excellent! Thank you very much! And in dcast() function, if I need to bring back all other variables from my original dataset, what should I do? – Yelena Mar 15 '18 at 19:12
  • And what does meas=list(2:4,3:5) mean? – Yelena Mar 15 '18 at 19:33
  • @Yelena Good questions. I've made some edits; hope it makes sense. I think it's often better to work with multiple tables (rather than joining back to have multiple related columns with names you'll need to parse; or having repeating values). On that general topic, if you're interested, Hadley's tidy data paper is good: https://www.jstatsoft.org/article/view/v059i10 – Frank Mar 15 '18 at 20:08
  • 1
    @ Frank. You're awesome! You helped me SO much. I'll take a look at the paper you suggested. Thank you again! – Yelena Mar 15 '18 at 20:32