R: calculate the number of occurrences of a specific event in a specified time future

Question

my simplified data looks like this:

set.seed(1453); x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07',  '2016-01-12',  '2016-01-16',  '2016-01-20',
             '2016-01-20',  '2016-01-25',  '2016-01-26',  '2016-01-31')


df = data.frame(x, date = as.Date(date))


df 
x       date
1 2016-01-01
0 2016-01-05
1 2016-01-07
0 2016-01-12
0 2016-01-16
1 2016-01-20
1 2016-01-20
0 2016-01-25
0 2016-01-26
1 2016-01-31

I'd like to calculate the number of occurrences for x == 1 within a specified time period, e.g. 14 and 30 days from the current date (but excluding the current entry, if it is x == 1. The desired output would look like this:

solution
x       date x_plus14 x_plus30
1 2016-01-01        1        3
0 2016-01-05        1        4
1 2016-01-07        2        3
0 2016-01-12        2        3
0 2016-01-16        2        3
1 2016-01-20        2        2
1 2016-01-20        1        1
0 2016-01-25        1        1
0 2016-01-26        1        1
1 2016-01-31        0        0

Ideally, I'd like this to be in dplyr, but it is not a must. Any ideas how to achieve this? Thanks a lot for your help!

Two things. 1) Try not to name objects after functions (sample is also a function). 2) It's unclear what you are asking — Sotos, Jan 11 '17 at 14:38
do you need today + 14 observations (rows) or today + 14 days? the latter is much more complicated. — ℕʘʘḆḽḘ, Jan 11 '17 at 14:40
Not sure where is `as_date` from but you could simply do `sample = data.frame(x, date = as.Date(date))` while both avoiding a dependency and converting to matrix and back using `cbind` — David Arenburg, Jan 11 '17 at 14:51
thanks all for useful comments. @Sotos: now I've changed df names — Kasia Kulma, Jan 11 '17 at 16:02
As I said, don't use `cbind` as it converts everything back to a matrix. Why coudn't you just copy/paste my line? — David Arenburg, Jan 11 '17 at 16:05

score 5 · Accepted Answer · answered Jan 11 '17 at 17:48

5

Adding another approach based on findInterval:

cs = cumsum(df$x) # cumulative number of occurences
data.frame(df, 
           plus14 = cs[findInterval(df$date + 14, df$date, left.open = TRUE)] - cs, 
           plus30 = cs[findInterval(df$date + 30, df$date, left.open = TRUE)] - cs)
#   x       date plus14 plus30
#1  1 2016-01-01      1      3
#2  0 2016-01-05      1      4
#3  1 2016-01-07      2      3
#4  0 2016-01-12      2      3
#5  0 2016-01-16      2      3
#6  1 2016-01-20      2      2
#7  1 2016-01-20      1      1
#8  0 2016-01-25      1      1
#9  0 2016-01-26      1      1
#10 1 2016-01-31      0      0

answered Jan 11 '17 at 17:48

alexis_laz

12,884
4
27
37

thanks, @alexis_laz for your answer and help (again)! I really like how simple and elegant it is and it is based on base R. Thanks a million! – Kasia Kulma Jan 12 '17 at 09:23
@KasiaKulma : You're welcome. `findInterval` is, indeed, handy on such situtations and it, also, avoids comparing all dates with all dates to return the number of entries `<` from a specific date. Have a nice day! – alexis_laz Jan 12 '17 at 12:04
This is really elegant! – Kamil S Jaron Jan 18 '17 at 20:48

score 4 · Answer 2 · edited Jan 11 '17 at 19:28

4

Earlier I wasn't including the present date and so numbers didn't match.

library(data.table)
setDT(df)[, `:=`(x14 = sum(df$x[between(df$date, date, date + 14, incbounds = FALSE)]), 
                 x30 = sum(df$x[between(df$date, date, date + 30, incbounds = FALSE)])),
              by = date]

#     x       date x14 x30
#  1: 1 2016-01-01   1   3
#  2: 0 2016-01-05   1   4
#  3: 1 2016-01-07   2   3
#  4: 0 2016-01-12   2   3
#  5: 0 2016-01-16   2   3
#  6: 1 2016-01-20   1   1
#  7: 1 2016-01-20   1   1
#  8: 0 2016-01-25   1   1
#  9: 0 2016-01-26   1   1
# 10: 1 2016-01-31   0   0

Or a general solution that will work for any desired range

vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) := 
            lapply(vec, function(i) sum(df$x[between(df$date, 
                                                     date, 
                                                     date + i, 
                                                     incbounds = FALSE)])),
            by = date]

edited Jan 11 '17 at 19:28

David Arenburg

91,361
17
137
196

answered Jan 11 '17 at 14:54

joel.wilson

8,243
5
28
48

got the issue!! I was including the present date. Just have `date+1` instead of `date` in `between()` – joel.wilson Jan 11 '17 at 15:13
@KasiaKulma need to clarify on whether to include the current date or not though she has mentioned in the question. Because the number didn't match for the first row – joel.wilson Jan 11 '17 at 15:38
thanks for that; and yes, now I edited the post so that it clarifies that the current occurrence, if it is `x==1' should not be included in the calculation. I would not mess with dates, though, as in my original dataset you can find multiple occurrences with the same date, so I can always extract 1 if the solution includes the current record – Kasia Kulma Jan 11 '17 at 16:06
In order to get the desired result in both lines just specify `incbounds = FALSE` and leave the original ranges, for instance `between(sample$date, date, date+30, incbounds = FALSE)` (this will have 3 instead of 4 in the first line) – David Arenburg Jan 11 '17 at 16:16
I'm afraid this gives me inconsistent results: expected_output for entries `x==1` and expected_output - 1 for entries where `x == 0` – Kasia Kulma Jan 11 '17 at 16:21
@KasiaKulma I haven't yet updated my answer based on the edit you made.. Right now no access to RStudio. Shall let you know. Sorry – joel.wilson Jan 11 '17 at 17:01
@KasiaKulma I just think you need to have `date+x` instead of `date+1` to accomodate your change – joel.wilson Jan 11 '17 at 17:35
@KasiaKulma See edit, that should match your desired output – David Arenburg Jan 11 '17 at 19:21
@DavidArenburg, @joel.wilson: thanks a lot for including a more universal solution. I'd like to test it on a different dataset with a similar structure to the one in my example, but including multiple users. I don't know `data.table`. How could use your solution with `dplyr`'s equivalent of `group_by(user_id)`? Thanks again for your help! – Kasia Kulma Jan 12 '17 at 12:34

Axeman · Answer 3 · 2017-01-11T19:23:18.293

3

A concise dplyr and purrr solution:

library(tidyverse)

sample %>% 
  mutate(x_plus14 = map(date, ~sum(x == 1 & between(date, . + 1, . + 14))),
         x_plus30 = map(date, ~sum(x == 1 & between(date, . + 1, . + 30))))

   x       date x_plus14 x_plus30
1  1 2016-01-01        1        4
2  0 2016-01-05        1        4
3  1 2016-01-07        2        3
4  0 2016-01-12        2        3
5  0 2016-01-16        2        3
6  1 2016-01-20        1        1
7  1 2016-01-20        1        1
8  0 2016-01-25        1        1
9  0 2016-01-26        1        1
10 1 2016-01-31        0        0

edited Jan 11 '17 at 19:23

answered Jan 11 '17 at 15:26

Axeman

32,068
8
81
94

thanks, @Axeman, for that, I really like clarity and compactness of your solution. However, I don't understand why your (and other) solution gives `x_plus14 == 1` and gives `x_plus30 == 1` at row 6, as from then to the end of the month there are 2 `x == 1` occurrences (excluding the current one). It's even more surprising that all the other instances have been calculated correctly! – Kasia Kulma Jan 11 '17 at 15:58
It's because of the duplicate dates in rows 6 and 7. You can use `between(sample$date, ., . + 14)` but then the dates themselves will also be included, and we end up with 3. – Axeman Jan 11 '17 at 16:02
that's fine, I can always extract 1 from it. If you edit your answer this way, I'll happily accept it – Kasia Kulma Jan 11 '17 at 16:07
I realised this approach gives me inconsistent results, depending wether the entry is `x==1` or `x==0'. Any ideas how to get around it? – Kasia Kulma Jan 11 '17 at 16:22

davidski · Answer 4 · 2017-01-11T16:52:32.980

2

Here's my stab at it with some dplyr+purrr help. I got slightly different counts due to the <= and >= in the helper function x_next() if you adjust them properly i think you should be able to get what you want. hth.

library("tidyverse")
library("lubridate")
set.seed(1453)

x = sample(0:1, 10, TRUE)
dates = c('2016-01-01', '2016-01-05', '2016-01-07',  '2016-01-12',  '2016-01-16',  '2016-01-20',
         '2016-01-20',  '2016-01-25',  '2016-01-26',  '2016-01-31')

df = data_frame(x = x, dates = lubridate::as_date(dates))

# helper function to calculate the sum of xs in the next days_in_future
x_next <- function(d, days_in_future) {

  df %>% 
    # subset on days of interest
    filter(dates > d & dates <= d + days(days_in_future)) %>% 
    # sum up xs
    summarise(sum = sum(x)) %>% 
    # have to unlist them so that the (following) call to mutate works
    unlist(use.names=F)
  }

# mutate your df
df %>% 
  mutate(xplus14 = map(dates, x_next, 14),
         xplus30 = map(dates, x_next, 30))

edited Jan 11 '17 at 16:52

answered Jan 11 '17 at 15:11

davidski

561
1
4
16

I don't think you are taking into account that `x` should be 1. Also, you should be using `map_dbl`. – Axeman Jan 11 '17 at 15:31
hey @Axeman, don't understand why the downvote...i am taking what you said into account - `sum(x)` will obviously sum over all `x=1` only. using `map_dbl` is _a_ choice but in this case not a wrong one... – davidski Jan 11 '17 at 16:31
Sorry, I missed the `sum(x)`. But you are still getting counts that are too high? – Axeman Jan 11 '17 at 16:33
no worries but there's no way to undo a downvote right?as for the results - i will edit my answer with a strict inequality on the right side and i think it should be correct. note that the original question has a mistake in the example output (row 1 for `x_plus14` is actually 2 if you look at the data). – davidski Jan 11 '17 at 16:42
update: just re-read all the comments and the desired solution seems to _exclude_ the current date. i updated my answer and now i get the same results as your proposal. imo (and its just a matter of taste) - extracting the anonymous function separately lets you re-use it more often without copy-pasting code. – davidski Jan 11 '17 at 16:52

score 1 · Answer 5 · answered Jan 11 '17 at 15:06

As other already mentioned, it is strange that you do not count the day from and you should avoid naming objects by names of functions (sample). However, the code bellow reproduce your desired output:

set.seed(1453); 
x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07',  '2016-01-12',  '2016-01-16',  '2016-01-20',
             '2016-01-20',  '2016-01-25',  '2016-01-26',  '2016-01-31')


sample = data.frame(x = x, date = as.Date(sample$date))

getOccurences <- function(one_row, sample_data, date_range){
  one_date <- as.Date(one_row[2])
  sum(sample$x[sample_data$date > one_date & 
               sample_data$date < one_date + date_range])
}

sample$x_plus14 <- apply(sample,1,getOccurences, sample, 14)
sample$x_plus30 <- apply(sample,1,getOccurences, sample, 30)

sample

   x       date x_plus14 x_plus30
1  1 2016-01-01        1        3
2  0 2016-01-05        1        4
3  1 2016-01-07        2        3
4  0 2016-01-12        2        3
5  0 2016-01-16        2        3
6  1 2016-01-20        1        1
7  1 2016-01-20        1        1
8  0 2016-01-25        1        1
9  0 2016-01-26        1        1
10 1 2016-01-31        0        0

R: calculate the number of occurrences of a specific event in a specified time future

5 Answers5

Linked