R: calculate number of distinct categories in the specified time frame

Question

here's some dummy data:

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

I'd like to calculate for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order

The solution would look like this:

 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!

Are you sure the values in `distinct_7` are correct? If I look at 2016-01-10, should it start as a new group. Also, if you look at the value of `distinct_7` for `user_id` 11, it starts at 0. — akrun, Jan 17 '17 at 09:21
in `distinct_7`, between `2016-01-10` and `2016-01-03` there were in total 3 categories for `user 27` and 2 for `user 11`. Does it make sense now? — Kasia Kulma, Jan 17 '17 at 09:25
You can iterate over `date`, i.e. `library(tidyverse); df %>% group_by(user_id) %>% mutate(distinct_7 = map_int(date, ~n_distinct(category[date >= .x - 7 & date <= .x])), distinct_14 = map_int(date, ~n_distinct(category[date >= .x - 14 & date <= .x])))`, though I think there's probably a more elegant way to do this with `zoo::rollapply` or the like. — alistaire, Jan 17 '17 at 09:40
[related](http://stackoverflow.com/questions/41593453/r-calculate-the-number-of-occurrences-of-a-specific-event-in-a-specified-time-f/41594675#41594675) — Axeman, Jan 17 '17 at 11:58
@alistaire: this works very well for me, although it takes time to process. Still, if you write an answer, I'll happily accept it. — Kasia Kulma, Jan 17 '17 at 14:18
@alistaire: I just tried to use your solution to calculate cumulative sums of numerical values, e.g.: `order_value_minus7 = map_int(order_date, ~cumsum(order_value[order_date >= .x - 7 & order_date <= .x]))`, but it returns `Error: object 'order_value' not found`. Why can't the same approach work universally, or can it? Thanks again for your help — Kasia Kulma, Jan 17 '17 at 14:42
Fwiw, the data.table analogue of `n_distinct` (in alistaire's comment-answer) is `uniqueN`. — Frank, Jan 17 '17 at 17:32
You don't have a variable called `order_value` in what you show above. I'm not posting an answer because the data doesn't match your desired result, I believe because the desired result is miscalculated, if I understand what you want. Also, I'm pretty sure there's a better way to do it, anyway. — alistaire, Jan 18 '17 at 01:59
@alistaire: sure, that's absolutely understandable. I asked about `order_value` because your solution looked pretty universal and I thought it would require only a small tweak to use it for numerical variable. Anyway, why do you think that the desired solution is miscalculated? — Kasia Kulma, Jan 18 '17 at 08:25

GoGonzo · Answer 1 · 2021-03-23T08:51:17.687

I recommend using runner package. You can use any R function on running windows with runner function. Code below obtains desided output, which is past 7-days + current and past 14-days + current (current 8 and 15 days):

df <- read.table(
  text = "  user_id       date category
       27 2016-01-01    apple
  27 2016-01-03    apple
  27 2016-01-05     pear
  27 2016-01-07     plum
  27 2016-01-10    apple
  27 2016-01-14     pear
  27 2016-01-16     plum
  11 2016-01-01    apple
  11 2016-01-03     pear
  11 2016-01-05     pear
  11 2016-01-07     pear
  11 2016-01-10    apple
  11 2016-01-14    apple
  11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))



library(dplyr)
library(runner)
df %>%
  group_by(user_id) %>%
  mutate(distinct_7  = runner(category, k = 7 + 1, idx = date, 
                              f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date, 
                              f = function(x) length(unique(x))))

More informations in package and function documentation.

score 3 · Answer 2 · edited Jan 18 '21 at 12:37

Here are two data.table solutions, one with two nested lapplyand the other using non-equi joins.

The first one is a rather clumsy data.table solution but it reproduces the expected answer. And it would work for an arbitrary number of time frames. (Although @alistaire's concise tidyverse solution he had suggested in his comment could be modified as well).

It uses two nested lapply. The first one loops over the time frames, the second one over the dates. The tempory result is joined with the original data and then reshaped from long to wide format so that we will end with a separate column for each of the time frames.

library(data.table)
tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) rbindlist(
           lapply(unique(dt$date), 
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate), 
                       .(distinct = sprintf("distinct_%02i", ldays), 
                         date = ldate, 
                         N = uniqueN(category)), 
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2

Here is a variant following a suggestion by @Frank which uses data.table's non-equi joins instead of the second lapply:

tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) {
           dt[.(user_id = user_id, dago = date - ldays, d = date), 
              on=.(user_id, date >= dago, date <= d), 
              .(distinct = sprintf("distinct_%02i", ldays),
                N = uniqueN(category)), 
              by = .EACHI]
         }
  )
)[, date := NULL]
# 
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)]

Data:

dt <- fread("user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple")
dt[, date := as.IDate(date)]

BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.

score 1 · Accepted Answer · answered Jan 20 '17 at 00:53

In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.

library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2

R: calculate number of distinct categories in the specified time frame

3 Answers3

Linked