I'm attempting to tally the number of times an event occurred for a given individual within a specific period of past time. In this particular case, I need to know, for each new observation (which reflects a single scheduling request), how many times the individual has scheduled a trip during the preceding 60 days (trip_scheduled). Eventually I will need to tally the number of times that person cancelled on the same day as the scheduled trip for the preceding 60 days. But I'm starting with just the tally in the "moving" 60-day period.
I found some elegant answers to a similar but slightly different problem in this post: R: calculate the number of occurrences of a specific event in a specified time future
My situation differs in a few ways: First, I'm trying to look at a previous time period, and I don't know if that will change my approach, and, two, I need to do the analysis for more than 40,000 individuals, which I've been trying to accomplish through a mix of the code I found in the other answer, a for loop (which I know is frowned upon) and dplyr grouping. It isn't working at all.
Would anyone be able to help point me in the right direction? I'd love to stick to dplyr and base. I just don't know much about data.table.
This is the code and test data I've been trying to noodle on:
test_set2 <- structure(list(tripID = c("20180112-100037-674-101", "20180112-100037-674-201",
"20180112-100037-674-301", "20180113-100037-676-101", "20180113-100037-676-201",
"20180115-100037-675-101", "20180115-100037-675-201", "20180116-100037-677-101",
"20180116-100037-677-201", "20180131-100037-678-101", "20180101-100146-707-101",
"20180101-100146-707-201", "20180102-100146-708-101", "20180102-100146-708-201",
"20180103-100146-709-101", "20180103-100146-709-201", "20180104-100146-710-101",
"20180104-100146-710-201", "20180105-100146-711-101", "20180105-100146-711-201",
"20180403-100532-223-101", "20180403-100532-223-201", "20180620-100532-224-101",
"20180620-100532-224-201", "20180704-100532-225-101", "20180704-100532-225-201",
"20180926-100532-228-101", "20180926-100532-228-201", "20180927-100532-226-101",
"20180927-100532-226-201"), CUSTOMER_ID = c(100037L, 100037L,
100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L,
100037L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L,
100146L, 100146L, 100146L, 100146L, 100532L, 100532L, 100532L,
100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L
), trip_date = structure(c(17543, 17543, 17543, 17544, 17544,
17546, 17546, 17547, 17547, 17562, 17532, 17532, 17533, 17533,
17534, 17534, 17535, 17535, 17536, 17536, 17624, 17624, 17702,
17702, 17716, 17716, 17800, 17800, 17801, 17801), class = "Date"),
trip_scheduled = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), same_day_cancel = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -30L), groups = structure(list(
CUSTOMER_ID = c(100037L, 100146L, 100532L), .rows = list(
1:10, 11:20, 21:30)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
running_frame <- test_set2[1,]
unique_customers <- unique(test_set2$CUSTOMER_ID)
for (cust in unique_customers){
temp_events <- test_set2 %>% filter(CUSTOMER_ID == i)
cs = cumsum(temp_events$trip_scheduled) # cumulative number of trips of individual
output_temp <- data.frame(temp_events,
trips_minus_60 = cs[findInterval(temp_events$trip_date - 60, temp_events$trip_date, left.open = TRUE)] - cs)
new_table <- rbind(new_table,output_temp)
}
This is the error I generated most recently:
Error in data.frame(temp_events, trips_minus_60 = cs[findInterval(temp_events$trip_date - : arguments imply differing number of rows: 10, 0