I have a large df with around 40.000.000 rows , covering in total a time period of 2 years and more than 400k unique users. The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user over several points in time.
Each row is therefore a unqiue combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable(df$category_var) and the time variable (df$time_var) I now want to calculate 3 new variables on a user_id level on a rolling time window over the previous 30 days. So in each row, the new variable should be calculated over the values of the previous 30 days of the input variables.
I do not observe all users over the same time period, some enter later some leave earlier, also the distances between times are not equal, therefore I can not calculate the variables just by number of rows.
So far I only managed to calculate my new variables per user_id over the whole observation period, but I couldn’t achieve to calculate the variables for the previous 30 days rolling window per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr the attempt of calculating these variables on the rolling time window on a groupey_by user_id level has taken more than a week without any results. I would be so grateful for your support!
My df basically looks like :
user_id <- c(1,1,1,1,1,2,2,2,2,3,3,3,3,3)
time_var <- c(“,2,3,4,5, 1.5, 2, 3, 4.5, 1,2.5,3,4,5)
category_var <- c(“A”, “A”, “B”, “B”, “A”, “A”, “C”, “C”, “A”, …)
v1 <- c(0,1,0,0,1,0,1,1,1,0,1,…)
v2 <- c(1,1,0,1,0,1,1,0,...)
- My first needed new variable (new_x1) is basically a cumulative sum based on a condition in dummy variable v1. What I achieved so far:
df <- df %>% group_by(user_id) %>% mutate(new_x1=cumsum(v1==1))
What I need: That variables only counting over the previoues 30 days per user
- Needed new variable (new_x2): Basically cumulative count of v1 if v2 has a (so far) unique value. So for each new value in v2 given v1==1, count.
What I achieved so far:
df <- df %>%
group_by(user_id, category_var) %>%
mutate(new_x2 = cumsum(!duplicated(v2 )& v1==1))
I also need this based on the previous 30 days and not the whole observation period per user.
- My third variable of interest (new__x3): The time between two observations given a certain condition (v1==1)
#Interevent Time
df2 <- df%>% group_by(user_id) %>% filter(v1==1) %>% mutate(time_between_events=time-lag(time))
I would also need this on the previoues 30 days.
Thank you so much!
Edit after John Springs Post:
My potential solution would then be
setDT(df)[, `:=`(new_x1= cumsum(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]),
new_x2= cumsum(!duplicated(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]))),
by = eval(c("user_id", "time"))]
I really not familiar with data.table and not sure, if I can nest my conditions on cumsum in on data.table like that.
Any suggestions?