Running count of rows between dates in R

Question

I have a dataset which looks like this:

date, brandname, status, case number
2017-01-01, x1, closed, 12345
2017-01-01, x2, closed, 12345
2017-01-01, x3, closed, 12345
2017-01-02, x4, open, 7864
2017-01-03, x5, open, 78642
...

There can be double dates and there can be gaps between dates (missing days)

Per day I want to count the number of rows (cases) that have occurred in the past 30 days.

I succeeded in doing this the following way:

dataframe <- 
  structure(list(date = structure(c(17167, 17167, 17167, 17168, 
  17169), class = "Date"), brandname = c("x1", "x2", "x3", "x4", 
  "x5"), status = c("closed", "closed", "closed", "open", "open"
  ), `case number` = c(12345L, 12345L, 12345L, 7864L, 78642L)), .Names = c("date", 
  "brandname", "status", "case number"), row.names = c(NA, -5L), class = "data.frame")

for(i in 1:nrow(dataframe)) {
   frame <- subset(dataframe, date > dataframe$date[i] - 30L & 
                              date < dataframe$date[i])
   dataframe$count[i] <- nrow(frame)
}

This gives me the exact result I need:

dataframe
        date brandname status case number count
1 2017-01-01        x1 closed       12345     0
2 2017-01-01        x2 closed       12345     0
3 2017-01-01        x3 closed       12345     0
4 2017-01-02        x4   open        7864     3
5 2017-01-03        x5   open       78642     4

But I am sure there is a more efficient way of doing this. Can anyone help?

"Per day I want to count the number of rows (cases) that have occurred in the past 30 days." This doesn't quite make sense to me. Do you mean the last 30 _records_ per day? Assuming that each record occurred at a different time throughout the day and that they are in chronological order (the time component is missing). — acylam, Sep 28 '17 at 21:02
Something like this? `ave(rep(1, length(dataframe$date)), dataframe$date, FUN = cumsum)`. — Rui Barradas, Sep 28 '17 at 21:11
Franck wants rolling sum of frequencies with previous 30-days window for each day. — Kota Mori, Sep 28 '17 at 21:44
Thanks Kota Mori. I didn't know that what I am doing is called a rolling sum of frequencies. Using this term I found additional information on this subject. This link may be useful for others aswell: https://stackoverflow.com/questions/41007099/relative-windowed-running-sum-through-data-table-non-equi-join — Franck, Sep 29 '17 at 08:22

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

0

This can be solved using a non-equi join:

library(data.table)
DT <- as.data.table(dataframe)
DT[, count := DT[DT[, .(date, dm30 = date - 30)], on = .(date >= dm30, date < date), 
                 .N, by = .EACHI]$N][]

         date brandname status case number count
1: 2017-01-01        x1 closed       12345     0
2: 2017-01-01        x2 closed       12345     0
3: 2017-01-01        x3 closed       12345     0
4: 2017-01-02        x4   open        7864     3
5: 2017-01-03        x5   open       78642     4

Explanation

The solution essentially consists of three steps:

DT[, .(date, dm30 = date - 30)] creates a helper table with the date ranges to be aggregated.
Then, this is right joined with DT using the non-equi join conditions and simultaneously aggregating by the join conditions. .N is a special symbol containing the number of rows in each group.
Finally, the counts are appended as new column to DF.

edited Jun 20 '20 at 09:12

Community

1
1

answered Sep 29 '17 at 03:35

Uwe

41,420
11
90
134

Thank you for the structured explanation of a working solution! I was wondering (I am not familiar with the DT syntax yet): Can your formula be edited so it joins the data on a sequential list of dates? (2017-09-01, 2017-09-02, 2017-09-03, ...). What I want to end up with is a dataframe with all (unique) dates and a count for each date. – Franck Sep 29 '17 at 08:20
If found solution for this. I will add it as an answer – Franck Sep 29 '17 at 09:23
@Franck I have a solution but I am hesitating to post it as an answer to *this* question because the expected result is very different to your original question. Also, new sample data are required to demonstrate the effect. I suggest that you ask a new question. – Uwe Sep 29 '17 at 09:24
Our comments just crossed. Please, consider to post a new question as changing the objective of a question (beyond clarification) is strongly discouraged on SO. Thank you. – Uwe Sep 29 '17 at 09:27
@Franck I have set up a chat room https://chat.stackoverflow.com/rooms/155575/discussion-between-uwe-and-franck and posted new sample data, code and result for the time being. Please, have a look. – Uwe Sep 29 '17 at 09:40
Good one! I will create an additional post for this today. – Franck Sep 29 '17 at 09:40

score 0 · Answer 2 · answered Sep 29 '17 at 09:35

This a solution: It is slightly different than the question. This version uses the date from a dataframe to select the subset in another dataframe.

For each date in a date range it counts the number of occurrences in another dataframe (between each date and date - 30 days). The result of the count is then stored in a new column next to the list of sequential dates.

Data:

-Dataframe-
          date
1   2016-07-12
2   2016-08-03
3   2016-08-24
4   2016-09-27
...
265 2017-09-27
266 2017-09-28
267 2017-09-28
268 2017-09-28

Code:

churn_frame <- as.data.frame(dates) #this is a sequential list of dates (calendar)

churn_frame$new <- sapply(churn_frame$dates, function(x){
    sum(dataframe$date <= x & dataframe$date >= x - 29)} #-29 days to cover exactly 30 days.
    )

Result:

         date new
1  2017-07-31  10
2  2017-08-01  10
3  2017-08-02  11
4  2017-08-03  10
5  2017-08-04  12
...
58 2017-09-26  11
59 2017-09-27  12
60 2017-09-28  14
61 2017-09-29  12

As discussed below, this post is answering a different question. In addition, it is not reproducible because you did not post all input data (Please, remember to post a reproducible expression when posting a new question). Furthermore, you have changed the conditions. In your original question the data range was past 30 days ending one day before the actual day ("yesterday"). This answer is using past 29 days plus the actual date ("today"). — Uwe, Sep 29 '17 at 09:46

Running count of rows between dates in R

2 Answers2

Explanation