0

I have a data frame that consists of one row for each user. For each user, I have information on when they first logged on to a website (row 1), and for each day after that (each day is a column) how many times per day they logged in.

I would like to create a new column that tells me the number of times, within the 7 days after they first logged on, a user logged on to a website. This 7-day period is different for every user.

I would like to do a for-loop, but don't know how to make changes to include varying column names for that summation.

tibble(id=c(1:4), first_log = c("18-12-01", "18-12-02", "18-12-02",
"18-12-05"), X18_12_01 = c(NA,1,1,2), X18_12_02 = c(5,2,1,1))

Of course, the data set has many more columns and rows just like that.

Nina
  • 91
  • 7

2 Answers2

0

With any number of date columns after first_login, we can do the following.

  1. gather the date columns into a single column so we can actually operate on the conveniently. This is a tidy format, with each row as a user-day combination.
  2. Convert the dates into Date objects, and figure out the days between the observation and the first_log. We use an Interval to do this to ensure sensible behaviour around timeline irregularities.
  3. filter to keep only the observations which occurred in the desired week range
  4. summarise to calculate the total logins in this period, and right_join the values back onto the original. Note that without more sample data it's hard to be more illustrative, since only one cell actually gets summed here, and so only user with id = 1 actually has any values to sum.
library(tidyverse)
library(lubridate)

tbl <- tibble(id = c(1:4), first_log = c("18-12-01", "18-12-02", "18-12-02", "18-12-05"), X18_12_01 = c(NA, 1, 1, 2), X18_12_02 = c(5, 2, 1, 1))

tbl %>%
  gather(day, num_logins, -id, -first_log) %>%
  mutate(
    first_log = ymd(first_log),
    day = day %>% str_remove("^X") %>% ymd(),
    days_since_event = as.period(first_log %--% day, "day"),
  ) %>%
  filter(days_since_event > days(0) & days_since_event <= days(7)) %>%
  group_by(id) %>%
  summarise(total_logins = sum(num_logins, na.rm = TRUE)) %>%
  right_join(tbl, by = "id")
#> # A tibble: 4 x 5
#>      id total_logins first_log X18_12_01 X18_12_02
#>   <int>        <dbl> <chr>         <dbl>     <dbl>
#> 1     1            5 18-12-01         NA         5
#> 2     2           NA 18-12-02          1         2
#> 3     3           NA 18-12-02          1         1
#> 4     4           NA 18-12-05          2         1

Created on 2019-03-04 by the reprex package (v0.2.1)

Calum You
  • 14,687
  • 4
  • 23
  • 42
-1

Assuming that you only have 7 columns after first_log then this will sum the total amount of times logged in within the span of time you have across columns. In the below example, it only shows total times user logged in over the course of 2 days.

library(tidyverse)


tibble(id=c(1:4), 
       first_log = c("18-12-01", 
                     "18-12-02", 
                     "18-12-02",
                     "18-12-05"), 
       X18_12_01 = c(NA,1,1,2), 
       X18_12_02 = c(5,2,1,1)) %>% 
  gather(key = "days", value = "times_visted", 3:length(.)) %>% 
  group_by(id) %>% 
  summarise(total_visits = sum(times_visted, na.rm = T))
#> # A tibble: 4 x 2
#>      id total_visits
#>   <int>        <dbl>
#> 1     1            5
#> 2     2            3
#> 3     3            2
#> 4     4            3

Created on 2019-03-04 by the reprex package (v0.2.1)

dylanjm
  • 2,011
  • 9
  • 21