R:Avoid error when counting sequential number of date with data that all the observations are NA

Question

My task is to count the length of periods from given start/end date that were extracted from the large dataset.

Here is sample data.

library(tidyverse)
data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
               start = ymd(c("2022-03-03", "2022-03-03", "2022-03-04", "2022-03-20", "2022-03-22")),
               end = ymd(c("2022-03-03", "2022-03-04", "2022-03-07", "2022-03-22", "2022-03-23"))) 

data
# A tibble: 5 × 3
     ID start      end       
  <dbl> <date>     <date>    
1     1 2022-03-03 2022-03-03
2     1 2022-03-03 2022-03-04
3     1 2022-03-04 2022-03-07
4     2 2022-03-20 2022-03-22
5     2 2022-03-22 2022-03-23

I've figured out this with the method introduced here.

data2 <- data %>% 
  rowwise() %>%
  do(tibble(ID = .$ID, 
            Date = seq(.$start, .$end, by = 1))) %>% 
  distinct() %>%
  ungroup() %>%
  count(ID) 

data2
# A tibble: 2 × 2
     ID     n
  <dbl> <int>
1     1     5
2     2     4

However, occasionally, all the observations in the extracted start/end columns are NA.

Then the method above stops at the function seq() because no data is there.

like

na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                  start = ymd(NA),
                  end = ymd(NA)) 
na_data    
A tibble: 5 × 3
     ID start  end   
  <dbl> <date> <date>
1     1 NA     NA    
2     1 NA     NA    
3     1 NA     NA    
4     2 NA     NA    
5     2 NA     NA    

na_data %>% 
  rowwise() %>%
  do(tibble(ID = .$ID, 
            Date = seq(.$start, .$end, by = 1))) %>% 
  distinct() %>%
  ungroup() %>%
  count(ID) 
*Error in seq.int(0, to0 - from, by) : 'to' must be a finite number*

It is difficult for me to check if all the data in selected columns are NA beforehand, because I have a lot of this kind of process to run simultaneously with the data from the same dataset.

To run the process, I usually select entire scripts in Rstudio with [ctrl + A] and then start. But the error message interrupts in the middle of my tasks.

Does Anyone have a solution to achieve this process with a whole NA data, or to avoid interruption by the error message and proceed to the next code?

Thank you.

Can you clarify the logic for getting your final values? Is it simply `max(end) - min(start)` within each `ID`? Or `sum(end - start)` for each `ID`? Or something else? — zephryl, Nov 25 '22 at 21:40

zephryl · Answer 1 · 2022-11-25T23:24:49.737

This solution (1) creates lubridate Intervals for each row; (2) merges them by group using a modification of @AllanCameron's int_merge() function to handle NAs; and (3) sums days per Interval within each group.

To fully test it, I made two additional example datasets -- one including discontinuous date intervals, and one where only some values are NA.

library(lubridate)
library(dplyr)

int_merge <- function(x, na.rm = FALSE) {
  if (na.rm) {
    if(all(is.na(x))) return(interval(NA, NA))
    if(any(is.na(x))) x <- x[!is.na(x)]
  } else {
    if(any(is.na(x))) return(interval(NA, NA))
  }
  if(length(x) == 1) return(x)
  x <- x[order(int_start(x))]
  y <- x[1]
  for(i in 2:length(x)){
    if(int_overlaps(y[length(y)], x[i]))
      y[length(y)] <- interval(start = min(int_start(c(y[length(y)], x[i]))),
                               end = max(int_end(c(y[length(y)], x[i]))))
    else
      y <- c(y, x[i])
  }
  return(y)
}

data %>% 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        5
#> 2     2        4

discontinuous_data %>% 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        8
#> 2     2        4

na_data %>% 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1       NA
#> 2     2       NA

partial_na_data %>%
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1       NA
#> 2     2        4

partial_na_data %>%  # with `na.rm = TRUE` 
  mutate(interval = interval(start, end)) %>% 
  group_by(ID) %>% 
  summarize(
    interval = sum(as.numeric(int_merge(interval, na.rm = TRUE), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        7
#> 2     2        4

^{Created on 2022-11-25 with reprex v2.0.2}

Additional example data:

discontinuous_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
               start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
               end = ymd(c("2022-03-03", "2022-03-04", "2022-03-15", "2022-03-22", "2022-03-23"))) 

partial_na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                             start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
                             end = ymd(c("2022-03-03", NA, "2022-03-15", "2022-03-22", "2022-03-23")))

Thank you for the suggestion. But the start/end dates not necessarily given in the continuous manner. When I got them with blank, like "2022-03-03 to 2022-03-04" and "2022-03-10 to 2022-03-15", I have to ignore from "-03-05" to "-03-09". Your "first-last" method may include the blank period. — KintensT, Nov 25 '22 at 22:09
Got it, see my edited solution. In the future, it's best to include example data / outputs that fully illustrate your requirements. — zephryl, Nov 25 '22 at 23:26

R:Avoid error when counting sequential number of date with data that all the observations are NA

1 Answers1