0

I have a dateset like this dummy sample. It contains two months data for digital user activity.

df <- tibble::tribble(
     ~date,      ~user_id, ~app_id, ~total_usage,
  20190701, "18120439-aa",     383,          223,
  20190702, "18120439-aa",     383,          147,
  20190701, "18120439-ab",     382,           10,
  20190701, "18120439-ab",     383,          395,
  20190702, "18120439-ab",     383,          495,
  20190703, "18120439-ab",     383,          347,
  20190807, "18160102-aa",     262,            6,
  20190808, "18160102-aa",     262,           24,
  20190711, "18160102-aa",     382,            3,
  20190725, "18160102-aa",     382,           11,
  20190727, "18160102-aa",     382,           48,
  20190702, "18160102-aa",     383,            6,
  20190705, "18160102-aa",     383,           42,
  20190706, "18160102-aa",     383,          104,
  20190708, "18160102-aa",     383,           12,
  20190709, "18160102-aa",     383,           13,
  20190710, "18160102-aa",     383,          114,
  20190712, "18160102-aa",     383,          187,
  20190713, "18160102-aa",     383,           37,
  20190715, "18160102-aa",     383,           58,
  20190716, "18160102-aa",     383,           40,
  20190717, "18160102-aa",     383,           40,
  20190718, "18160102-aa",     383,           35,
  20190719, "18160102-aa",     383,           19,
  20190720, "18160102-aa",     383,           63,
  20190723, "18160102-aa",     383,            2,
  20190726, "18160102-aa",     383,           69,
  20190729, "18160102-aa",     383,           31,
  20190730, "18160102-aa",     383,           26,
  20190731, "18160102-aa",     383,           41,
  20190802, "18160102-aa",     383,           50,
  20190805, "18160102-aa",     383,           34,
  20190806, "18160102-aa",     383,           23,
  20190807, "18160102-aa",     383,           14
  )

First Task

I made a tsibble object with the key=c(user_id, app_id) and date as a index. First I tried to get the average usage per user per app_id in the whole period. The user can be online any number of days , 1 day, 10 days, 30 or every day == 60 days. I wanted to calculate rollapply mean based on user number of activity day.

example : user 18160102-aa were only active two days and used app_id 262 , first day use 6 minutes, second day 24 so avg usage for this user during the whole period for app_id = 262 is 15 minutes.

I used tsibble::tile_dbl to calculate average of window size 60 ( 2 months = 60 days) for the entire period and then convert it back to tibble and remove duplicate row. like this :

library(tidyverse)
library(tsibble)

df %>% 
#create a tsibble object
  as_tsibble(key = c(user_id, app_id), index = date) %>% 
  group_by_key() %>%
# calculate average during the activity period
  mutate(Avg_period = tile_dbl(total_usage, ~ mean(., na.rm = TRUE), .size = 60)) %>% 
  as_tibble() %>%
  select(-total_usage, -date) %>% 
  distinct()

The output seems correct however I am wondering if there is a better way to do this without making a duplicate?

Second Task

How can I set window size if I want to calculate average and rolling up usage for last 7 days, 14 and 21 day from a specific days?

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
DanG
  • 689
  • 1
  • 16
  • 39

1 Answers1

2

This is my take on the first question. We don't need a tsibble object to perform this task. tile_dbl() creates a temporary grouping variable date60 and use group_by() + summarise() to compute the averages.

library(tidyverse)
df %>% 
  group_by(user_id, app_id) %>% 
  mutate(date60 = tsibble::tile_dbl(date, ~ .[1], .size = 60)) %>% 
  group_by(date60, add = TRUE) %>% 
  summarise(avg_period = mean(total_usage, na.rm = TRUE))
#> # A tibble: 6 x 4
#> # Groups:   user_id, app_id [6]
#>   user_id     app_id   date60 avg_period
#>   <chr>        <dbl>    <dbl>      <dbl>
#> 1 18120439-aa    383 20190701      185  
#> 2 18120439-ab    382 20190701       10  
#> 3 18120439-ab    383 20190701      412. 
#> 4 18160102-aa    262 20190807       15  
#> 5 18160102-aa    382 20190711       20.7
#> 6 18160102-aa    383 20190702       46.1

Created on 2019-10-11 by the reprex package (v0.3.0)

Regarding the second question, I'd suggest to look at the slide package merely for rolling window calculation, including days from a specific days with slide_index(). Tsibble will gradually deprecate the rolling window functions in favour of slide. You may like to look for alternative solutions using slide.

Earo Wang
  • 789
  • 5
  • 8