
I have a large df with around 40 million rows, covering in total a time period of 2 years and more than 400k unique users. The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user at several points in time.

Each row is therefore a unique combination of user_id, time and a set of variables.

Based on a set of dummy variables (df$v1, df$v2), a category variable (df$category_var) and the time variable (df$time_var), I now want to calculate 3 new variables at the user_id level over a rolling time window of the previous 30 days. So in each row, the new variable should be calculated over the values of the input variables from the previous 30 days.

I do not observe all users over the same time period (some enter later, some leave earlier), and the gaps between observations are not equal either, so I cannot calculate the variables simply by number of rows.

So far I have only managed to calculate my new variables per user_id over the whole observation period, but I could not work out how to calculate them over the previous 30 days as a rolling window per user.

After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr, my attempts at calculating these variables over a rolling time window at a group_by(user_id) level have taken more than a week without any results. I would be so grateful for your support!

My df basically looks like:

user_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
time_var <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", …)
v1 <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …)
v2 <- c(1, 1, 0, 1, 0, 1, 1, 0, …)
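For a fully reproducible starting point, the sketch above can be completed like this (the values replacing each "…" are invented purely for illustration and are not from my real data):

```r
# Hypothetical reproducible toy data; the values that replace the
# elided "…" entries above are made up for illustration only.
user_id      <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
time_var     <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A",
                  "B", "B", "A", "C", "A")
v1           <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1)
v2           <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1)

df <- data.frame(user_id, time_var, category_var, v1, v2)
```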

  1. My first needed new variable (new_x1) is basically a cumulative sum based on a condition on dummy variable v1. What I have achieved so far:

df <- df %>% group_by(user_id) %>% mutate(new_x1 = cumsum(v1 == 1))

What I need: that the variable only counts over the previous 30 days per user.
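One direction that might work while staying in dplyr: the slider package computes windows over an irregular time index. A minimal sketch, assuming `time_var` is numeric in days; for POSIXct, `before = as.difftime(30, units = "days")` should be used instead (hypothetical toy data):

```r
library(dplyr)
library(slider)  # slide_index_*(): rolling windows over an irregular time index

# Hypothetical toy data, time_var in days
df <- data.frame(user_id  = c(1, 1, 1, 2, 2),
                 time_var = c(1, 20, 40, 5, 50),
                 v1       = c(1, 1, 1, 1, 1))

df <- df %>%
  group_by(user_id) %>%
  arrange(time_var, .by_group = TRUE) %>%
  # sum of the dummy v1 over [time_var - 30, time_var], i.e. the
  # previous 30 days including the current observation
  mutate(new_x1 = slide_index_sum(v1, time_var, before = 30)) %>%
  ungroup()
```

Whether this scales to 40 million rows is a separate question; for the full data a data.table approach will likely be faster.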

  2. Needed new variable (new_x2): basically a cumulative count of v1 if v2 has a (so far) unique value. So for each new value of v2 given v1 == 1, count.

What I achieved so far:

df <- df %>%
  group_by(user_id, category_var) %>%
  mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))

I also need this based on the previous 30 days and not the whole observation period per user.
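A hedged sketch of how the 30-day version of this distinct count could look in data.table, using a non-equi self-join: for each row, the join collects the same user's rows from the previous 30 days and `uniqueN()` counts the distinct v2 values where v1 == 1. This assumes `time_var` is numeric in days (for POSIXct, subtract 30 * 86400 seconds), and uses hypothetical toy data:

```r
library(data.table)

# Hypothetical toy data, time_var in days
df <- data.table(user_id  = c(1, 1, 1, 1),
                 time_var = c(1, 10, 20, 60),
                 v1       = c(1, 1, 1, 1),
                 v2       = c(1, 2, 2, 3))

# Helper columns defining each row's 30-day look-back window
df[, `:=`(t_start = time_var - 30, t_end = time_var)]

# Non-equi self-join: for each row (by = .EACHI), restrict to the same
# user's rows with time_var in [t_start, t_end] and count distinct v2
# where v1 == 1. Add category_var to `on` to mirror
# group_by(user_id, category_var).
df[, new_x2 := df[df, on = .(user_id, time_var >= t_start, time_var <= t_end),
                  uniqueN(v2[v1 == 1]), by = .EACHI]$V1]

df[, c("t_start", "t_end") := NULL]
```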

  3. My third variable of interest (new_x3): the time between two observations given a certain condition (v1 == 1).
# Interevent time
df2 <- df %>%
  group_by(user_id) %>%
  filter(v1 == 1) %>%
  mutate(time_between_events = time_var - lag(time_var))


I would also need this over the previous 30 days.
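For this one, restricting to the previous 30 days could simply mean blanking out gaps longer than 30 days. A sketch, assuming `time_var` is numeric in days (for POSIXct, take `as.numeric(difftime(time_var, lag(time_var), units = "days"))`), with hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data, time_var in days
df <- data.frame(user_id  = c(1, 1, 1, 2, 2),
                 time_var = c(1, 10, 50, 2, 20),
                 v1       = c(1, 1, 1, 1, 1))

df2 <- df %>%
  group_by(user_id) %>%
  filter(v1 == 1) %>%
  arrange(time_var, .by_group = TRUE) %>%
  mutate(
    time_between_events = time_var - lag(time_var),
    # only keep gaps that fall inside the 30-day window
    time_between_events = if_else(time_between_events > 30,
                                  NA_real_, time_between_events)
  ) %>%
  ungroup()
```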

Thank you so much!

Edit after Jon Spring's post:

My potential solution would then be

setDT(df)[, `:=`(new_x1= cumsum(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]), 
                         new_x2= cumsum(!duplicated(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]))),
                  by = eval(c("user_id", "time"))]

I'm really not familiar with data.table and not sure if I can nest my conditions on cumsum in one data.table call like that.
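The per-row `df$...` subsetting inside `:=` above probably won't do what I hope. A possibly cleaner alternative (untested on the real data) would be a single non-equi self-join that computes both rolling variables at once; the sketch below assumes `time_var` is numeric in days (for POSIXct, subtract 30 * 86400 seconds) and uses hypothetical toy data:

```r
library(data.table)

# Hypothetical toy data, time_var in days
df <- data.table(user_id  = c(1, 1, 1, 2),
                 time_var = c(1, 10, 50, 5),
                 v1       = c(1, 1, 1, 0),
                 v2       = c(1, 2, 2, 1))

df[, `:=`(t_start = time_var - 30, t_end = time_var)]

# One non-equi self-join computes both rolling variables: for each row
# (by = .EACHI), the matched rows are the same user's observations with
# time_var in the previous 30 days.
win <- df[df, on = .(user_id, time_var >= t_start, time_var <= t_end),
          .(new_x1 = sum(v1 == 1),           # rolling count of v1 events
            new_x2 = uniqueN(v2[v1 == 1])),  # rolling distinct v2 given v1 == 1
          by = .EACHI]

df[, `:=`(new_x1 = win$new_x1, new_x2 = win$new_x2)]
df[, c("t_start", "t_end") := NULL]
```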

Any suggestions?

ReLa
  • Have you calculated a time difference between each user visit? You can do this by group_by(user_id) %>% arrange(date) %>% mutate(daysbetween = as.integer(difftime(date, lag(date), units = "days"))) %>% ungroup(). You can then cumsum row-wise on that column and filter by rows < 30? – hisspott Apr 23 '19 at 22:48
  • For so many separate groups, I expect data.table will be much faster and crisper than dplyr. I don't know a dplyr solution that wouldn't be unwieldy for that much data. For smaller data you might use `padr::pad` to fill in all the days, and then use cumsum and lag to get the 30d avg. https://stackoverflow.com/questions/41719929/rolling-sums-for-groups-with-uneven-time-gaps – Jon Spring Apr 24 '19 at 00:01
  • @hisspott: Wouldn't I then need the difftime of each row to each previous row in the data set, not only between 2 consecutive rows? I do want to calculate over all rows within the previous 30 days. – ReLa Apr 24 '19 at 08:10
  • @Jon Spring Thanks Jon. I am still really unfamiliar with data.table, but I do see the need here. I will post a suggested solution. – ReLa Apr 24 '19 at 08:16
  • make your code reproducible – jangorecki Oct 22 '22 at 20:40

0 Answers