1

I have a dataset, where I have different provider attending a site multiple times.

I want to create columns to show start and stop dates where they are present.

Here's a sample dataset:

x <- tibble(
  "site" = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  "provider" = c("A", "B", "C", "A", "A", "C", "C", "A", "C", "A"),
  "date" = seq(from = as.Date("2019-01-01"), to = as.Date("2019-01-10"), by = 'day'),
)

I've been unable to create a bit of code that adequately captures the fact that a provider might be present several times. The best I got so far is:

x %>%
  group_by(site, provider) %>%
  mutate("start" = min(date),
         "end" = max(date))

Which creates this:

# A tibble: 10 x 5
# Groups:   site, provider [5]
    site provider date       start      end       
   <dbl> <chr>    <date>     <date>     <date>    
 1     1 A        2019-01-01 2019-01-01 2019-01-05
 2     1 B        2019-01-02 2019-01-02 2019-01-02
 3     1 C        2019-01-03 2019-01-03 2019-01-03
 4     1 A        2019-01-04 2019-01-01 2019-01-05
 5     1 A        2019-01-05 2019-01-01 2019-01-05
 6     2 C        2019-01-06 2019-01-06 2019-01-09
 7     2 C        2019-01-07 2019-01-06 2019-01-09
 8     2 A        2019-01-08 2019-01-08 2019-01-10
 9     2 C        2019-01-09 2019-01-06 2019-01-09
10     2 A        2019-01-10 2019-01-08 2019-01-10

However, this only process the individual provider once per site.

Here is what I would like the final dataset to look like:

# A tibble: 10 x 5
# Groups:   site, provider [5]
    site provider date       start      end       
   <dbl> <chr>    <date>     <date>     <date>    
 1     1 A        2019-01-01 2019-01-01 2019-01-01  # A stops at 2019-01-01
 2     1 B        2019-01-02 2019-01-02 2019-01-02
 3     1 C        2019-01-03 2019-01-03 2019-01-03
 4     1 A        2019-01-04 2019-01-04 2019-01-05  # A restarts from 2019-01-04
 5     1 A        2019-01-05 2019-01-04 2019-01-05
 6     2 C        2019-01-06 2019-01-06 2019-01-07
 7     2 C        2019-01-07 2019-01-06 2019-01-07  # C stops at 2019-01-07
 8     2 A        2019-01-08 2019-01-08 2019-01-08  # A stops at 2019-01-08
 9     2 C        2019-01-09 2019-01-09 2019-01-09  # C restarts at 2019-01-09
10     2 A        2019-01-10 2019-01-10 2019-01-10  # A restarts at 2019-01-10
SorenK
  • 155
  • 9

2 Answers2

1

Use data.table::rleid to create an additional group

library(dplyr)

x %>%
  group_by(site, provider, group = data.table::rleid(provider)) %>%
  mutate(start = min(date), end = max(date)) %>%
  ungroup() %>%
  select(-group)

 # A tibble: 10 x 5
#    site provider date       start      end       
#   <dbl> <chr>    <date>     <date>     <date>    
# 1     1 A        2019-01-01 2019-01-01 2019-01-01
# 2     1 B        2019-01-02 2019-01-02 2019-01-02
# 3     1 C        2019-01-03 2019-01-03 2019-01-03
# 4     1 A        2019-01-04 2019-01-04 2019-01-05
# 5     1 A        2019-01-05 2019-01-04 2019-01-05
# 6     2 C        2019-01-06 2019-01-06 2019-01-07
# 7     2 C        2019-01-07 2019-01-06 2019-01-07
# 8     2 A        2019-01-08 2019-01-08 2019-01-08
# 9     2 C        2019-01-09 2019-01-09 2019-01-09
#10     2 A        2019-01-10 2019-01-10 2019-01-10
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you Ronak, while this works perfectly in this sample dataset, for some reason it does not work in my larger dataset (10 years, 55 different sites, 3 different providers). Instead, when I use your code in my dataset, I only get the first `provider` for the entire period per `site` (I tried leaving the `group` column, and true enough, there is only one run length group ID per `site`). Any thoughts as to why? – SorenK Jun 05 '19 at 10:44
  • @SorenK Are they arranged properly? Try to `arrange` them first `x %>% arrange(site, date) %>% group_by(site, group = data.table::rleid(provider)) %>% mutate(start = min(date), end = max(date)) %>% ungroup() %>% select(-group) ` – Ronak Shah Jun 05 '19 at 11:00
  • yes I tried this, and still it only gives a `start` and `end` date for the first `provider` at the `site`. I've had a look around stackoverflow to see if alternatives to rleid could be used (as discussed [here](https://stackoverflow.com/questions/33507868/is-there-a-dplyr-equivalent-to-data-tablerleid)) but haven't found a working solution. – SorenK Jun 05 '19 at 11:09
  • 1
    @SorenK I don't think it should make a difference but try `x %>% group_by(site, provider, group = data.table::rleid(provider)) %>% mutate(start = min(date), end = max(date)) %>% ungroup() %>% select(-group)` – Ronak Shah Jun 05 '19 at 11:12
  • That did it! Adding in `provider` to the `group_by` pipe sorted out the code neatly. thank you, that's really helpful. – SorenK Jun 05 '19 at 11:16
  • 1
    Now, I am confused as to in which case `provider` and `rleid(provider)` differ. :D Anyway, I am glad you could figure this out in the end. :) – Ronak Shah Jun 05 '19 at 11:22
1

An option using only dplyr where the 'grp' is created by comparing adjacent elements of 'provider', take the cumulative sum of the logical vector and find the min and max of 'date' after grouping by 'site', 'provider' and 'grp'

library(dplyr)
x %>%
   group_by(site, provider, 
      grp = cumsum(provider != lag(provider, default = first(provider))))  %>% 
   mutate(start = min(date), end = max(date)) %>% 
   ungroup %>%
   select(-grp)
# A tibble: 10 x 5
#    site provider date       start      end       
#   <dbl> <chr>    <date>     <date>     <date>    
# 1     1 A        2019-01-01 2019-01-01 2019-01-01
# 2     1 B        2019-01-02 2019-01-02 2019-01-02
3 3     1 C        2019-01-03 2019-01-03 2019-01-03
# 4     1 A        2019-01-04 2019-01-04 2019-01-05
# 5     1 A        2019-01-05 2019-01-04 2019-01-05
# 6     2 C        2019-01-06 2019-01-06 2019-01-07
# 7     2 C        2019-01-07 2019-01-06 2019-01-07
# 8     2 A        2019-01-08 2019-01-08 2019-01-08
# 9     2 C        2019-01-09 2019-01-09 2019-01-09
#10     2 A        2019-01-10 2019-01-10 2019-01-10
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Cheers @akrun, this version is neat and works on my larger dataset. I like that it's a pure `dplyr` and appreciate the explanation of how it produces the output. – SorenK Jun 06 '19 at 15:08
  • 1
    @SorenK Thanks, I just wanted to provide the solution without any additional dependencies. It may be easier to have lots of packages, but on a production environment, every additional package maintenance is its own headache (with versions and related things) – akrun Jun 06 '19 at 15:09