R: Calculating the number of occurrences within a specific time period in the past for each unique individual in a dataset in R

Question

I'm trying to count how many times an event occurred for a given individual within a fixed look-back window. Specifically, for each new observation (a single scheduling request), I need to know how many trips the individual scheduled during the preceding 60 days (trip_scheduled). Eventually I will also need to count how many times that person cancelled on the same day as the scheduled trip during the preceding 60 days (same_day_cancel). But I'm starting with just the tally in the "moving" 60-day period.

I found some elegant answers to a similar but slightly different problem in this post: R: calculate the number of occurrences of a specific event in a specified time future

My situation differs in two ways. First, I'm looking at a previous time period, and I don't know whether that changes the approach. Second, I need to run the analysis for more than 40,000 individuals, which I've been trying to do with a mix of the code from the other answer, a for loop (which I know is frowned upon), and dplyr grouping. It isn't working at all.

Would anyone be able to help point me in the right direction? I'd love to stick to dplyr and base. I just don't know much about data.table.

This is the code and test data I've been trying to noodle on:

test_set2 <- structure(list(
  tripID = c(
    "20180112-100037-674-101", "20180112-100037-674-201", "20180112-100037-674-301",
    "20180113-100037-676-101", "20180113-100037-676-201", "20180115-100037-675-101",
    "20180115-100037-675-201", "20180116-100037-677-101", "20180116-100037-677-201",
    "20180131-100037-678-101", "20180101-100146-707-101", "20180101-100146-707-201",
    "20180102-100146-708-101", "20180102-100146-708-201", "20180103-100146-709-101",
    "20180103-100146-709-201", "20180104-100146-710-101", "20180104-100146-710-201",
    "20180105-100146-711-101", "20180105-100146-711-201", "20180403-100532-223-101",
    "20180403-100532-223-201", "20180620-100532-224-101", "20180620-100532-224-201",
    "20180704-100532-225-101", "20180704-100532-225-201", "20180926-100532-228-101",
    "20180926-100532-228-201", "20180927-100532-226-101", "20180927-100532-226-201"
  ),
  CUSTOMER_ID = c(
    100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L,
    100146L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L,
    100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L
  ),
  trip_date = structure(c(
    17543, 17543, 17543, 17544, 17544, 17546, 17546, 17547, 17547, 17562,
    17532, 17532, 17533, 17533, 17534, 17534, 17535, 17535, 17536, 17536,
    17624, 17624, 17702, 17702, 17716, 17716, 17800, 17800, 17801, 17801
  ), class = "Date"),
  trip_scheduled = c(
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1
  ),
  same_day_cancel = c(
    1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  )
), row.names = c(NA, -30L), groups = structure(list(
  CUSTOMER_ID = c(100037L, 100146L, 100532L),
  .rows = list(1:10, 11:20, 21:30)
), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"))

running_frame <- test_set2[1,]

unique_customers <- unique(test_set2$CUSTOMER_ID)

for (cust in unique_customers){
  temp_events <- test_set2 %>% filter(CUSTOMER_ID == cust)
  cs = cumsum(temp_events$trip_scheduled) # cumulative number of trips of individual
  output_temp <- data.frame(temp_events, 
                            trips_minus_60 = cs[findInterval(temp_events$trip_date - 60, temp_events$trip_date, left.open = TRUE)] - cs)
  new_table <- rbind(new_table,output_temp)

}

This is the error I generated most recently:

Error in data.frame(temp_events, trips_minus_60 = cs[findInterval(temp_events$trip_date - : arguments imply differing number of rows: 10, 0
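
For anyone hitting the same message, the zero-row argument can be reproduced in isolation. The sketch below is one reading of why the error appears, not a fix taken from this thread. It uses plain day numbers equal to customer 100037's trip_date values; `older_safe` and the zero-padding trick are illustrative assumptions.

```r
# Day numbers of customer 100037's trips (same values as trip_date in test_set2)
trip_day <- c(17543, 17543, 17543, 17544, 17544, 17546, 17546, 17547, 17547, 17562)
cs <- cumsum(rep(1, length(trip_day)))  # cumulative trip count, as in the loop

# Number of trips dated strictly earlier than (trip day - 60).
# All ten trips lie within 19 days of each other, so every lookup returns 0.
idx <- findInterval(trip_day - 60, trip_day, left.open = TRUE)

# Indexing with 0 selects nothing, so this vector has length 0 rather than 10;
# data.frame() then cannot combine a 10-row table with a 0-row column.
older <- cs[idx]

# Hypothetical workaround: pad with a leading 0 so that index 0 means "no earlier trips"
older_safe <- c(0, cs)[idx + 1]
```

With the padding, `older_safe` has one entry per trip, so it can be placed in a column next to the original rows.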

score 1 · Accepted Answer · answered Feb 07 '20 at 20:43

I'm not sure this meets your needs, but it is based on @Axeman's tidyverse solution that you linked to. After grouping by CUSTOMER_ID, you can count, for each row, the rows where trip_scheduled is 1 and the date falls between the current date and 60 days prior. I would expect you could do something similar for same_day_cancel as well.

library(tidyverse)

test_set2 %>% 
  group_by(CUSTOMER_ID) %>%
  mutate(schedule_60 = unlist(map(trip_date, ~sum(trip_scheduled == 1 & between(trip_date, . - 60, .))))) %>%
  print(n=30)

# A tibble: 30 x 6
# Groups:   CUSTOMER_ID [3]
   tripID                  CUSTOMER_ID trip_date  trip_scheduled same_day_cancel schedule_60
   <chr>                         <int> <date>              <dbl>           <dbl>       <int>
 1 20180112-100037-674-101      100037 2018-01-12              1               1           3
 2 20180112-100037-674-201      100037 2018-01-12              1               1           3
 3 20180112-100037-674-301      100037 2018-01-12              1               1           3
 4 20180113-100037-676-101      100037 2018-01-13              1               0           5
 5 20180113-100037-676-201      100037 2018-01-13              1               0           5
 6 20180115-100037-675-101      100037 2018-01-15              1               1           7
 7 20180115-100037-675-201      100037 2018-01-15              1               1           7
 8 20180116-100037-677-101      100037 2018-01-16              1               0           9
 9 20180116-100037-677-201      100037 2018-01-16              1               0           9
10 20180131-100037-678-101      100037 2018-01-31              1               0          10
11 20180101-100146-707-101      100146 2018-01-01              1               1           2
12 20180101-100146-707-201      100146 2018-01-01              1               1           2
13 20180102-100146-708-101      100146 2018-01-02              1               1           4
14 20180102-100146-708-201      100146 2018-01-02              1               1           4
15 20180103-100146-709-101      100146 2018-01-03              1               1           6
16 20180103-100146-709-201      100146 2018-01-03              1               1           6
17 20180104-100146-710-101      100146 2018-01-04              1               1           8
18 20180104-100146-710-201      100146 2018-01-04              1               1           8
19 20180105-100146-711-101      100146 2018-01-05              1               1          10
20 20180105-100146-711-201      100146 2018-01-05              1               1          10
21 20180403-100532-223-101      100532 2018-04-03              1               0           2
22 20180403-100532-223-201      100532 2018-04-03              1               0           2
23 20180620-100532-224-101      100532 2018-06-20              1               0           2
24 20180620-100532-224-201      100532 2018-06-20              1               0           2
25 20180704-100532-225-101      100532 2018-07-04              1               0           4
26 20180704-100532-225-201      100532 2018-07-04              1               0           4
27 20180926-100532-228-101      100532 2018-09-26              1               0           2
28 20180926-100532-228-201      100532 2018-09-26              1               0           2
29 20180927-100532-226-101      100532 2018-09-27              1               0           4
30 20180927-100532-226-201      100532 2018-09-27              1               0           4
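
The same pattern might carry over to same_day_cancel roughly as follows. This is a minimal sketch on a small made-up sample rather than the question's data; the `trips` table, the `cancel_60` column name, and the switch from `unlist(map(...))` to `map_int()` are assumptions for illustration.

```r
library(dplyr)
library(purrr)

# Made-up sample for one hypothetical customer (not the question's data)
trips <- tibble(
  CUSTOMER_ID = c(1L, 1L, 1L, 1L),
  trip_date = as.Date(c("2018-01-01", "2018-01-10", "2018-02-15", "2018-06-01")),
  trip_scheduled = c(1, 1, 1, 1),
  same_day_cancel = c(1, 0, 1, 0)
)

result <- trips %>%
  group_by(CUSTOMER_ID) %>%
  mutate(
    # For each row's date (.x), count this customer's same-day cancellations
    # dated from 60 days earlier up to and including that date
    cancel_60 = map_int(
      trip_date,
      ~ sum(same_day_cancel == 1 & between(trip_date, .x - 60, .x))
    )
  ) %>%
  ungroup()

print(result)
```

Here `cancel_60` comes out as 1, 1, 2, 0: the June trip has no cancellations within its 60-day window, while the February trip sees both its own cancellation and the one on January 1.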

This is so concise (and fast). I think it gets me almost to what I need. Do you have any idea how I could work it so that I wouldn't include anything from the date of the observation? For example, for rows 1-3, I'd want a value of 0, since there's nothing before that. For rows 4-5, the value would be 3 (just the trips on Jan. 12). And so on? I'm guessing it would be something like tallying the sum for just the day and subtracting it, but I'm struggling to figure out how. (It's been a long day). — James DeWeese, Feb 07 '20 at 21:21
try changing `between(trip_date, . - 60, .)` to `between(trip_date, . - 60, . - 1)` — Ben, Feb 07 '20 at 21:23
I tried this (and it worked) but I don't actually understand why ... test_set2 %>% group_by(CUSTOMER_ID) %>% mutate(schedule_60 = unlist(map(trip_date, ~sum(trip_scheduled == 1 & between((trip_date+1), . - 60, .))))) %>% print(n=30) — James DeWeese, Feb 07 '20 at 21:24
if you subtract 1 from the second date (`. - 1`) then you would exclude the same date of the observation. — Ben, Feb 07 '20 at 21:26
That worked! You're a lifesaver. I'm going to read up a bit more on the syntax so I understand better why it works. Thanks again! — James DeWeese, Feb 07 '20 at 21:28
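
The difference between the two variants in the comments can be checked on a tiny made-up example. This is only a sketch: the three dates and the column names `inclusive`, `prior_only` and `shifted` are invented for illustration. Because `between(trip_date + 1, . - 60, .)` tests the shifted dates, it amounts to keeping trips from 61 days to 1 day before the observation, whereas `between(trip_date, . - 60, . - 1)` keeps trips from 60 days to 1 day before.

```r
library(dplyr)
library(purrr)

# Three made-up trips for one hypothetical customer: day 0, day 60 and day 61
start <- as.Date("2018-01-01")
trips <- tibble(
  CUSTOMER_ID = 1L,
  trip_date = start + c(0, 60, 61),
  trip_scheduled = 1
)

windows <- trips %>%
  group_by(CUSTOMER_ID) %>%
  mutate(
    # Original answer: window includes the observation's own date
    inclusive = map_int(trip_date, ~ sum(trip_scheduled == 1 & between(trip_date, .x - 60, .x))),
    # Suggested change: window ends the day before the observation
    prior_only = map_int(trip_date, ~ sum(trip_scheduled == 1 & between(trip_date, .x - 60, .x - 1))),
    # Variant from the comment: dates are shifted forward by one day before the test
    shifted = map_int(trip_date, ~ sum(trip_scheduled == 1 & between(trip_date + 1, .x - 60, .x)))
  ) %>%
  ungroup()

print(windows)
```

With these dates, `prior_only` comes out as 0, 1, 1 and `shifted` as 0, 1, 2. The two agree except where a trip sits exactly 61 days back, which only the shifted variant counts.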
