0

My task is to count the length of periods from given start/end date that were extracted from the large dataset.

Here is sample data.

library(tidyverse)
data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
               start = ymd(c("2022-03-03", "2022-03-03", "2022-03-04", "2022-03-20", "2022-03-22")),
               end = ymd(c("2022-03-03", "2022-03-04", "2022-03-07", "2022-03-22", "2022-03-23"))) 

data
# A tibble: 5 × 3
     ID start      end       
  <dbl> <date>     <date>    
1     1 2022-03-03 2022-03-03
2     1 2022-03-03 2022-03-04
3     1 2022-03-04 2022-03-07
4     2 2022-03-20 2022-03-22
5     2 2022-03-22 2022-03-23

I've figured out this with the method introduced here.

data2 <- data %>% 
  rowwise() %>%
  do(tibble(ID = .$ID, 
            Date = seq(.$start, .$end, by = 1))) %>% 
  distinct() %>%
  ungroup() %>%
  count(ID) 

data2
# A tibble: 2 × 2
     ID     n
  <dbl> <int>
1     1     5
2     2     4

However, occasionally, all the observations in the extracted start/end columns are NA.

Then the method above stops at the function seq() because no data is there.

like

na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                  start = ymd(NA),
                  end = ymd(NA)) 
na_data    
A tibble: 5 × 3
     ID start  end   
  <dbl> <date> <date>
1     1 NA     NA    
2     1 NA     NA    
3     1 NA     NA    
4     2 NA     NA    
5     2 NA     NA    

na_data %>% 
  rowwise() %>%
  do(tibble(ID = .$ID, 
            Date = seq(.$start, .$end, by = 1))) %>% 
  distinct() %>%
  ungroup() %>%
  count(ID) 
*Error in seq.int(0, to0 - from, by) : 'to' must be a finite number*

It is difficult for me to check if all the data in selected columns are NA beforehand, because I have a lot of this kind of process to run simultaneously with the data from the same dataset.

To run the process, I usually select entire scripts in Rstudio with [ctrl + A] and then start. But the error message interrupts in the middle of my tasks.

Does Anyone have a solution to achieve this process with a whole NA data, or to avoid interruption by the error message and proceed to the next code?

Thank you.

KintensT
  • 7
  • 2
  • Can you clarify the logic for getting your final values? Is it simply `max(end) - min(start)` within each `ID`? Or `sum(end - start)` for each `ID`? Or something else? – zephryl Nov 25 '22 at 21:40

1 Answers1

0

This solution (1) creates lubridate Intervals for each row; (2) merges them by group using a modification of @AllanCameron's int_merge() function to handle NAs; and (3) sums days per Interval within each group.

To fully test it, I made two additional example datasets -- one including discontinuous date intervals, and one where only some values are NA.

library(lubridate)
library(dplyr)

int_merge <- function(x, na.rm = FALSE) {
  if (na.rm) {
    if(all(is.na(x))) return(interval(NA, NA))
    if(any(is.na(x))) x <- x[!is.na(x)]
  } else {
    if(any(is.na(x))) return(interval(NA, NA))
  }
  if(length(x) == 1) return(x)
  x <- x[order(int_start(x))]
  y <- x[1]
  for(i in 2:length(x)){
    if(int_overlaps(y[length(y)], x[i]))
      y[length(y)] <- interval(start = min(int_start(c(y[length(y)], x[i]))),
                               end = max(int_end(c(y[length(y)], x[i]))))
    else
      y <- c(y, x[i])
  }
  return(y)
}

data %>% 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        5
#> 2     2        4

discontinuous_data %>% 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        8
#> 2     2        4

na_data %>% 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1       NA
#> 2     2       NA

partial_na_data %>%
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1       NA
#> 2     2        4

partial_na_data %>%  # with `na.rm = TRUE` 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval, na.rm = TRUE), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        7
#> 2     2        4

Created on 2022-11-25 with reprex v2.0.2

Additional example data:

discontinuous_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
               start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
               end = ymd(c("2022-03-03", "2022-03-04", "2022-03-15", "2022-03-22", "2022-03-23"))) 

partial_na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                             start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
                             end = ymd(c("2022-03-03", NA, "2022-03-15", "2022-03-22", "2022-03-23"))) 
zephryl
  • 14,633
  • 3
  • 11
  • 30
  • Thank you for the suggestion. But the start/end dates not necessarily given in the continuous manner. When I got them with blank, like "2022-03-03 to 2022-03-04" and "2022-03-10 to 2022-03-15", I have to ignore from "-03-05" to "-03-09". Your "first-last" method may include the blank period. – KintensT Nov 25 '22 at 22:09
  • Got it, see my edited solution. In the future, it's best to include example data / outputs that fully illustrate your requirements. – zephryl Nov 25 '22 at 23:26