I have a dataframe with columns 'person', 'NoShow', and 'date'.
Person NoShow date
1 334 1 2019-07-07
2 334 1 2019-10-11
3 334 1 2020-07-07
4 334 1 2021-01-04
5 999 1 2021-06-03
6 999 1 2022-04-17
7 999 1 2022-05-03
8 999 0 2022-05-03
Code to create the dataframe:
df <- data.frame(Person = c('334','334','334','334','999','999','999','999'),
NoShow = c(1,1,1,1,1,1,1,0),
date = c('2019-07-07','2019-10-11','2020-07-07','2021-01-04','2021-06-03','2022-04-17','2022-05-03','2022-05-03')
)
I run the following code to create three new columns: (1) 'prior_noshow', which indicates whether a person missed any previous appointments; (2) 'prior_noshow_f', which indicates the number of previous appointments a person missed; and (3) 'last_noshow', which indicates the last time somebody missed a past appointment.
library(tidyverse)
df <- df %>%
group_by(Person) %>%
mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
prior_noshow_f = lag(cumsum(NoShow)))
df$prior_noshow_f[is.na(df$prior_noshow_f)] <- 0
df <- df %>%
group_by(Person) %>%
mutate(last_noshow = ifelse(c(FALSE, diff(prior_noshow_f))!= 0, lag(date), NA)) %>%
fill(last_noshow)
This creates the following output:
Person NoShow date prior_noshow prior_noshow_f last_noshow
<chr> <dbl> <chr> <dbl> <dbl> <chr>
1 334 1 2019-07-07 NA 0 NA
2 334 1 2019-10-11 1 1 2019-07-07
3 334 1 2020-07-07 1 2 2019-10-11
4 334 1 2021-01-04 1 3 2020-07-07
5 999 1 2021-06-03 NA 0 NA
6 999 1 2022-04-17 1 1 2021-06-03
7 999 1 2022-05-03 1 2 2022-04-17
8 999 0 2022-05-03 1 3 2022-05-03
However, I only want to consider it a "prior no-show" when the current date is later than the date in the previous row. So, row 8 should have 'prior_noshow_f'==2 and 'last_noshow'==2022-04-17, because rows 7 and 8 have the same date (2022-05-03).
How do I do that?