When I use the lag function, how do I "skip" dates in previous rows that are the same as the date in the current row?

Question

I have a dataframe with columns 'person', 'NoShow', and 'date'.

  Person NoShow       date
1    334      1 2019-07-07
2    334      1 2019-10-11
3    334      1 2020-07-07
4    334      1 2021-01-04
5    999      1 2021-06-03
6    999      1 2022-04-17
7    999      1 2022-05-03
8    999      0 2022-05-03

Code to create the dataframe:

df <- data.frame(Person = c('334','334','334','334','999','999','999','999'), 
                 NoShow = c(1,1,1,1,1,1,1,0), 
                 date = c('2019-07-07','2019-10-11','2020-07-07','2021-01-04','2021-06-03','2022-04-17','2022-05-03','2022-05-03')
)

I run the following code to create three new columns: (1) 'prior_noshow', which indicates whether a person missed any previous appointments; (2) 'prior_noshow_f', which indicates the number of previous appointments a person missed; and (3) 'last_noshow', which indicates the last time somebody missed a past appointment.

library(tidyverse)
df <- df %>% 
  group_by(Person) %>%
  mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
         prior_noshow_f = lag(cumsum(NoShow)))

df$prior_noshow_f[is.na(df$prior_noshow_f)] <- 0

df <- df %>% 
  group_by(Person) %>%
  mutate(last_noshow = ifelse(c(FALSE, diff(prior_noshow_f))!= 0, lag(date), NA)) %>%
  fill(last_noshow)

This creates the following output:

  Person NoShow date       prior_noshow prior_noshow_f last_noshow
  <chr>   <dbl> <chr>             <dbl>          <dbl> <chr>      
1 334         1 2019-07-07           NA              0 NA         
2 334         1 2019-10-11            1              1 2019-07-07 
3 334         1 2020-07-07            1              2 2019-10-11 
4 334         1 2021-01-04            1              3 2020-07-07 
5 999         1 2021-06-03           NA              0 NA         
6 999         1 2022-04-17            1              1 2021-06-03 
7 999         1 2022-05-03            1              2 2022-04-17 
8 999         0 2022-05-03            1              3 2022-05-03

However, I only want to consider it a "prior no-show" when the current date is later than the date in the previous row. So, row 8 should have 'prior_noshow_f'==2 and 'last_noshow'==2022-04-17, because rows 7 and 8 have the same date (2022-05-03).

How do I do that?

Rui Barradas · Answer 1 · 2022-08-23T19:20:49.337

Try the following.
Multiply the value of prior_noshow_f by a logical telling whether the date is different from the previous date. This will put zeros where they are equal. Then substitute the zeros by the previous date and be careful with NA's.

As for the date last_noshow, if each group of Person and prior_noshow_f has more than one row, the date should be the date of the previous group. This is handled by assigning NA and then filling down with the previous non-NA value.

df <- data.frame(Person = c('334','334','334','334','999','999','999','999'), 
                 NoShow = c(1,1,1,1,1,1,1,0), 
                 date = c('2019-07-07','2019-10-11','2020-07-07','2021-01-04','2021-06-03','2022-04-17','2022-05-03','2022-05-03')
                 )

suppressPackageStartupMessages(
  library(tidyverse)
)

df %>% 
  group_by(Person) %>%
  mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
         prior_noshow_f = lag(cumsum(NoShow), default = 0)*(date != lag(date)),
         prior_noshow_f = case_when(
           is.na(prior_noshow_f) ~ 0,
           prior_noshow_f == 0 ~ lag(prior_noshow_f),
           TRUE ~ prior_noshow_f
         )) %>%
  group_by(Person, prior_noshow_f) %>%
  mutate(last_noshow = ifelse(n() > 1, NA, date)) %>%
  group_by(Person) %>%
  mutate(last_noshow = zoo::na.locf(last_noshow)) %>%
  ungroup()
#> # A tibble: 8 × 6
#>   Person NoShow date       prior_noshow prior_noshow_f last_noshow
#>   <chr>   <dbl> <chr>             <dbl>          <dbl> <chr>      
#> 1 334         1 2019-07-07           NA              0 2019-07-07 
#> 2 334         1 2019-10-11            1              1 2019-10-11 
#> 3 334         1 2020-07-07            1              2 2020-07-07 
#> 4 334         1 2021-01-04            1              3 2021-01-04 
#> 5 999         1 2021-06-03           NA              0 2021-06-03 
#> 6 999         1 2022-04-17            1              1 2022-04-17 
#> 7 999         1 2022-05-03            1              2 2022-04-17 
#> 8 999         0 2022-05-03            1              2 2022-04-17

^{Created on 2022-08-23 by the reprex package (v2.0.1)}

When I use the lag function, how do I "skip" dates in previous rows that are the same as the date in the current row?

1 Answers1