R Count continuous number of a date variable by id

Question

I have a database like this:

dat = data.frame(id = c(rep("Adam", 5), rep("Bob", 10)), 
                 card_id = c(rep("0001", 2), rep("0002", 3), rep("0003", 5), rep("0004", 5)), 
                 bill_date = c("2017-01", "2017-02", 
                               "2017-01", "2017-02", "2017-03", 
                               "2017-01", "2017-02", "2017-03", "2017-05", "2017-06", 
                               "2017-01", "2017-02", "2017-03", "2017-04", "2017-05"), stringsAsFactors = F)

#      id card_id bill_date
# 1  Adam    0001   2017-01
# 2  Adam    0001   2017-02
# 3  Adam    0002   2017-01
# 4  Adam    0002   2017-02
# 5  Adam    0002   2017-03
# 6   Bob    0003   2017-01
# 7   Bob    0003   2017-02
# 8   Bob    0003   2017-03
# 9   Bob    0003   2017-05
# 10  Bob    0003   2017-06
# 11  Bob    0004   2017-01
# 12  Bob    0004   2017-02
# 13  Bob    0004   2017-03
# 14  Bob    0004   2017-04
# 15  Bob    0004   2017-05

I want to get the max continuous bill number of each card_id for each id.

Expected result:

#      id card_id max_cont_bill_num
# 1  Adam    0001                 2
# 2  Adam    0002                 3
# 3   Bob    0003                 3
# 4   Bob    0004                 5

See my revised answer below with a full solution to the problem that should be full-proof. — iod, Oct 11 '18 at 18:56

akrun · Answer 1 · 2018-10-11T04:40:31.633

1

We can try with difftime and rle. Create a function to convert the 'bill_date' to Date, class, get the difference of the adjacent dates, convert as an integer, then get the run-length-id of the values, select the maximum length. Apply that function after grouping by 'id', 'card_id'

f1 <- function(x)  {
                x1 <- as.Date(paste0(x, "-01") )
                x2 <- as.integer(difftime(x1[-1], x1[-length(x1)], unit = "weeks"))
                  max(rle(c(x2[1], x2))$lengths) 
                  }

dat %>% 
   group_by(id, card_id) %>%
    summarise(max_count = f1(bill_date))        
# A tibble: 4 x 3
# Groups:   id [?]
#  id    card_id max_count
#  <chr> <chr>       <int>
#1 Adam  0001            2
#2 Adam  0002            3
#3 Bob   0003            3
#4 Bob   0004            5

edited Oct 11 '18 at 04:40

answered Oct 11 '18 at 04:10

akrun

874,273
37
540
662

Cool. Can you explain the rle() step a bit further? What are we measuring there? Is it the number of times 1 is the difference in time consecutively? What if there's a series of skip-by-2 in the data? – iod Oct 11 '18 at 12:19
@iod That part is not clear to me about the sequence of 1s or 2s. I assume it to be a unit of 1 difference – akrun Oct 11 '18 at 15:55

iod · Answer 2 · 2018-10-11T18:55:43.787

dat %>% mutate(numb=as.integer(gsub("^[0-9]*-","",bill_date))) %>% 
  group_by(id, card_id) %>% summarize(maxc=numb[sum(numb==numb[1]:(numb[n()]+numb[1]-1))])

Result:

# A tibble: 4 x 3
# Groups:   id [?]
  id    card_id  maxc
  <chr> <chr>   <int>
1 Adam  0001        2
2 Adam  0002        3
3 Bob   0003        3
4 Bob   0004        5

Explanation: The mutate creates a column with just the month numbers. We then group by id and card_id, and create a summary that compares the vector of numbers in a group with a vector of equal length that runs from the first month in the group to the last, sums all the cases where the answer is "TRUE", and then returns the number in the row that corresponds to that sum (i.e., the last number that was consecutive).

Note: this will break when the bill_date spans more than one year continuously.

OK, I was trying to think how to get around that issue with multi-year spanning cases, and I came up with this solution.

First, new data, with one case spanning two years:

dat = data.frame(id = c(rep("Adam", 5), rep("Bob", 10)), 
                  card_id = c(rep("0001", 2), rep("0002", 3), rep("0003", 5), rep("0004", 5)), 
                  bill_date = c("2017-01", "2017-02", 
                                "2017-12", "2018-01", "2018-03", 
                                "2017-01", "2017-02", "2017-03", "2017-05", "2017-06", 
                                "2017-01", "2017-02", "2017-03", "2017-04", "2017-05"), stringsAsFactors = F)

Now, we'll change the bill-date into actual dates, and use cut.POSIXt to create a sequence of months, and the apply the previous solution to that sequence:

dat$monthseq<-cut.POSIXt(as.POSIXct(paste0(dat$bill_date,"-01")),breaks="month",labels=FALSE)
dat%>% 
  group_by(id, card_id) %>% 
  summarize(maxc=bill_date[sum(monthseq==monthseq[1]:(monthseq[n()]))])

Result:

# A tibble: 4 x 3
# Groups:   id [?]
  id    card_id maxc   
  <chr> <chr>   <chr>  
1 Adam  0001    2017-02
2 Adam  0002    2018-01
3 Bob   0003    2017-03
4 Bob   0004    2017-05

score 0 · Answer 3 · answered Oct 11 '18 at 19:23

I apologize, I did not read the OP answer closely enough the first time.

library(tidyverse); library(lubridate)
dat %>%
  # new group for each gap in time series or new id / card_id
  mutate(date = ymd(bill_date,truncated = 2),
         gap = id  != lag(id, default = "") | 
           card_id != lag(card_id, default = "") |
              date != lag(date) %m+% months(1),
         group = cumsum(gap)) %>%
  # How many in each group?
  count(id, card_id, group) %>%
  # Just keep each id / card_id's longest streak
  group_by(id, card_id) %>%
  top_n(1, wt = n)

R Count continuous number of a date variable by id

3 Answers3