I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.

The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.

So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.

# A tibble: 15 x 3
# Groups:   Title.ID [2]
   REQUEST.DATE Title.ID ID.Index
   <date>          <int>    <int>
 1 2013-07-09          2        1
 2 2013-08-07          2        2
 3 2013-08-20          2        3
 4 2013-09-08          2        4
 5 2013-09-28          2        5
 6 2013-12-27          2        6
 7 2014-02-10          2        7
 8 2014-03-12          2        8
 9 2014-03-14          2        9
10 2014-08-27          2       10
11 2014-04-27          6        1
12 2014-08-01          6        2
13 2014-11-13          6        3
14 2015-02-14          6        4
15 2015-05-14          6        5

The tricky part is that whether a new copy is needed depends not only on the request number (ID.Index) and the REQUEST.DATE of some previous loan, but also on the running total of copies up to that point.

For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.

The desired output ("Copies") would take each new request (ID.Index) and look back to the relevant REQUEST.DATE, where "relevant" depends on the number of copies available at that point, i.e., on the running total from the preceding step. (Note: The max number of copies would be 10.)

I've provided the desired output for the sample below ("Copies").

# A tibble: 15 x 4
# Groups:   Title.ID [2]
   REQUEST.DATE Title.ID ID.Index Copies
   <date>          <int>    <int>  <dbl>
 1 2013-07-09          2        1      1
 2 2013-08-07          2        2      2
 3 2013-08-20          2        3      3
 4 2013-09-08          2        4      4
 5 2013-09-28          2        5      5
 6 2013-12-27          2        6      5
 7 2014-02-10          2        7      5
 8 2014-03-12          2        8      5
 9 2014-03-14          2        9      5
10 2014-08-27          2       10      5
11 2014-04-27          6        1      1
12 2014-08-01          6        2      2
13 2014-11-13          6        3      2
14 2015-02-14          6        4      2
15 2015-05-14          6        5      2

I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.

Thanks a million!

*4/19 update: I've added new examples where a copy may be added after a delay, i.e., not in an unbroken sequence. I've also added columns showing the days since each of the previous requests, which helps check whether a new copy should be added given how many copies there are.

Sample 2: a new copy should be added at the third request, since only 96 days have passed since the previous request (and there is only one copy).

REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
  <fct>          <date>          <int>    <int> <drtn>     <drtn>      <drtn>      <drtn>      <drtn>       <int>
1 BRO-10680332   2013-10-17          6        1  NA days    NA days     NA days     NA days     NA days         1
2 PEN-10835735   2014-04-27          6        2 192 days    NA days     NA days     NA days     NA days         1
3 PEN-10873506   2014-08-01          6        3  96 days   288 days     NA days     NA days     NA days         1
4 PEN-10951264   2014-11-13          6        4 104 days   200 days    392 days     NA days     NA days         1
5 PEN-11029526   2015-02-14          6        5  93 days   197 days    293 days    485 days     NA days         1
6 PEN-11106581   2015-05-14          6        6  89 days   182 days    286 days    382 days    574 days         1

Sample 3: a new copy should be added at the last request, since there are two copies and the request two back was only 45 days earlier (Days.Since2).

REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
  <fct>          <date>          <int>    <int> <drtn>     <drtn>      <drtn>      <drtn>      <drtn>       <int>
1 BRO-10999392   2015-01-20         76        1  NA days    NA days     NA days     NA days     NA days         1
2 YAL-11004302   2015-01-22         76        2   2 days    NA days     NA days     NA days     NA days         2
3 COR-11108471   2015-05-18         76        3 116 days   118 days     NA days     NA days     NA days         2
4 HVD-11136632   2015-07-27         76        4  70 days   186 days    188 days     NA days     NA days         2
5 MIT-11164843   2015-09-09         76        5  44 days   114 days    230 days    232 days     NA days         2
6 HVD-11166239   2015-09-10         76        6   1 days    45 days    115 days    231 days    233 days         2
  • are you trying to count the number of titles within the last 112 days for each row? – chinsoon12 Apr 20 '20 at 01:03
  • no, I'm trying to calculate the total number of copies of a given title necessary to meet need (up to a certain point, say 10). so that requires looking back to previous request dates depending on how many copies there are, which is dependent on each previous request. so if request 2 is within 112 days, it means there are now 2 copies. if then request 3 is after 112 days of the first request (first copy), then nothing is added. but a copy could be added for a later request, if within 112 days of two requests ago (or however many, based on the number of copies). – Matthew Sullivan Apr 20 '20 at 01:35
  • I don’t understand the last sentence. How is it different from number of requests within the last 112 days? – chinsoon12 Apr 20 '20 at 01:42
  • This calculation wouldn't provide the answer by itself. Taking Sample 2 above, you'd get 0 for the first line (although we'd want the number of copies to be 1), then 0 for the second request (so no new copies added), 1 for the third request (so one copy added)--but then for the fourth request, you get 1 again, but we don't want to add a copy because there are two copies, and the first one was loaned 200 days ago, so it is available. So when there's only one copy, looking to 112 days works; but when there's more than 1 copy, you need to look back 112 days x the number of copies. Thanks. – Matthew Sullivan Apr 20 '20 at 14:09

2 Answers


You can use the runner package to apply any R function over a cumulative window. Here the function f is applied to x = REQUEST.DATE and counts the observations that fall on or before min(x) + 112, i.e., within 112 days of the title's first request.

library(dplyr)
library(runner)

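data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = runner(
      x = REQUEST.DATE,
      # x is the cumulative window: all requests for this title so far.
      # Count those made within 112 days of the title's first request.
      f = function(x) {
        sum(x <= min(x) + 112)
      }
    )
  )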

# # A tibble: 15 x 4
# # Groups:   Title.ID [2]
#        REQUEST.DATE Title.ID ID.Index Copies
#        <date>          <int>    <int>  <int>
#    1 2013-07-09          2        1      1
#    2 2013-08-07          2        2      2
#    3 2013-08-20          2        3      3
#    4 2013-09-08          2        4      4
#    5 2013-09-28          2        5      5
#    6 2013-12-27          2        6      5
#    7 2014-02-10          2        7      5
#    8 2014-03-12          2        8      5
#    9 2014-03-14          2        9      5
#   10 2014-08-27          2       10      5
#   11 2014-04-27          6        1      1
#   12 2014-08-01          6        2      2
#   13 2014-11-13          6        3      2
#   14 2015-02-14          6        4      2
#   15 2015-05-14          6        5      2


data

data <- read.table(
  text = "   REQUEST.DATE Title.ID ID.Index
 1 2013-07-09          2        1
 2 2013-08-07          2        2
 3 2013-08-20          2        3
 4 2013-09-08          2        4
 5 2013-09-28          2        5
 6 2013-12-27          2        6
 7 2014-02-10          2        7
 8 2014-03-12          2        8
 9 2014-03-14          2        9
10 2014-08-27          2       10
11 2014-04-27          6        1
12 2014-08-01          6        2
13 2014-11-13          6        3
14 2015-02-14          6        4
15 2015-05-14          6        5", 
  header = TRUE)

data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))

GoGonzo
  • Thank you so much for this quick response. It definitely seems to work for the examples I provided, but on further inspection, I think one place where it gets tripped up is when there is not a continuous sequence of adding copies. For instance, if the second request does not trigger a new copy, but the third or fourth should, this does not seem to happen. I'll add two more examples to the original post. – Matthew Sullivan Apr 19 '20 at 19:55

I was able to find a workable solution: give each request a return date 112 days later, then, for each request, count how many loans of the same title are out on that date. The maximum of that count for a title is the number of copies it needs.

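library(dplyr)

# Treat each loan as lasting 112 days from the request date.
data$RETURN.DATE <- data$REQUEST.DATE + 112

data <- data %>%
  group_by(Title.ID) %>%
  mutate(
    # For each request, count the loans of the same title that are out on
    # that date (requested on or before it and not yet returned).
    Copies = sapply(REQUEST.DATE, function(x)
      sum(REQUEST.DATE <= x & RETURN.DATE >= x))
  )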

I then reduced this to one row per title, keeping the maximum Copies value for each title, and joined that back onto the original data.
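That last step could look roughly like the sketch below. The `loans` tibble and its values are a toy stand-in for the real data, and `per_title` and `Max.Copies` are names invented here for illustration.

```r
library(dplyr)

# Toy stand-in for the per-request result; the values are illustrative only.
loans <- tibble(
  Title.ID = c(2L, 2L, 2L, 6L, 6L),
  Copies   = c(1L, 2L, 3L, 1L, 2L)
)

# One row per title, keeping the largest count seen for that title.
per_title <- loans %>%
  group_by(Title.ID) %>%
  summarise(Max.Copies = max(Copies), .groups = "drop")

# Attach the per-title maximum to every original request row.
loans <- left_join(loans, per_title, by = "Title.ID")
```

If the one-row-per-title table is not needed on its own, `group_by(Title.ID) %>% mutate(Max.Copies = max(Copies))` should give the same column in a single step.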

I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.
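One way to read the rule in the question is as a sequential pass over each title's requests: with k copies on hand, compare the new request with the one k positions back, and add a copy if that earlier request was made 112 days ago or less. The sketch below is a minimal base-R version under that reading, not a tested solution for the full data set. `count_copies`, `loan_days` and `max_copies` are names made up here, and counting exactly 112 days as "within" is an assumption.

```r
# Sketch: running copy count for ONE title.
# Assumes `dates` is a Date vector sorted in ascending order.
count_copies <- function(dates, loan_days = 112, max_copies = 10) {
  n <- length(dates)
  copies <- integer(n)
  if (n == 0) return(copies)
  copies[1] <- 1L                      # the first request always needs one copy
  if (n == 1) return(copies)
  for (i in 2:n) {
    k <- copies[i - 1]                 # copies available before this request
    j <- i - k                         # the request k positions back
    gap <- as.numeric(difftime(dates[i], dates[j], units = "days"))
    # Add a copy only if that earlier loan is still out and the cap allows it.
    need_new <- k < max_copies && j >= 1 && gap <= loan_days
    copies[i] <- k + as.integer(need_new)
  }
  copies
}

# Sample 3 from the question (Title.ID 76).
title76 <- as.Date(c("2015-01-20", "2015-01-22", "2015-05-18",
                     "2015-07-27", "2015-09-09", "2015-09-10"))
count_copies(title76)
# [1] 1 2 2 2 2 3
```

On the full data this could be applied per title with something like `data %>% group_by(Title.ID) %>% arrange(REQUEST.DATE, .by_group = TRUE) %>% mutate(Copies = count_copies(REQUEST.DATE))`. Because each row depends on the previous row's result, a plain loop per title is used rather than a vectorised window function.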