How can I use the lag function to skip over rows and provide number of previous occurrences

Question

I have a dataframe with the first three columns 'Person', 'NoShow', and 'date'. I want to create the two additional columns 'prior_noshow' and 'prior_noshow_f', but I cannot figure out how to do this using the lag function.

   Person NoShow       date prior_noshow prior_noshow_f
1     123      0 2019-01-01           NA             NA
2     123      1 2019-04-01            0              0
3     123      0 2020-01-01            1              1
4     123      1 2021-01-01            1              1
5     123      0 2021-04-30            1              2
6     123      0 2022-01-05            1              2
7     334      0 2019-07-07           NA             NA
8     334      1 2019-10-11            0              0
9     334      1 2020-07-07            1              1
10    334      0 2021-01-04            1              2

Column 'prior_noshow' should show whether a person missed a previous appointment. So, person 123 had their first noshow on 2019-04-01, and I want every following row for person 123 to have prior_noshow==1.

Column 'prior_noshow_f" should show the frequency of previously missed appointments. So, person 123 should have prior_noshow_f==1 for dates 2020-01-01 and 2021-01-01 and prior_noshow_f==2 for dates 2021-04-30 and 2022-01-05.

Code to create the dataset:

df <- data.frame(Person = c('123','123','123','123','123','123','334','334','334','334'), 
                   NoShow = c(0,1,0,1,0,0,0,1,1,0), 
                   date = c('2019-01-01','2019-04-01','2020-01-01','2021-01-01','2021-04-30','2022-01-05','2019-07-07','2019-10-11','2020-07-07','2021-01-04')
                   )

I've tried different approaches I saw in previous discussions here regarding the lag-function, but I haven't able to figure this out.

Any help would be greatly appreciated!

Allan Cameron · Accepted Answer · 2022-08-22T16:28:38.713

You need to lag the cumsum of NoShow to get prior_noshow_f, and prior_noshow is just as.numeric(prior_noshow_f > 0), so you could do:

library(tidyverse)

df %>% 
  group_by(Person) %>%
  mutate(prior_noshow = lag(as.numeric(cumsum(NoShow) > 0)),
         prior_noshow_f = lag(cumsum(NoShow)))
#> # A tibble: 10 x 5
#> # Groups:   Person [2]
#>    Person NoShow date       prior_noshow prior_noshow_f
#>     <int>  <int> <chr>             <dbl>          <int>
#>  1    123      0 2019-01-01           NA             NA
#>  2    123      1 2019-04-01            0              0
#>  3    123      0 2020-01-01            1              1
#>  4    123      1 2021-01-01            1              1
#>  5    123      0 2021-04-30            1              2
#>  6    123      0 2022-01-05            1              2
#>  7    334      0 2019-07-07           NA             NA
#>  8    334      1 2019-10-11            0              0
#>  9    334      1 2020-07-07            1              1
#> 10    334      0 2021-01-04            1              2

^{Created on 2022-08-22 with reprex v2.0.2}

Thank you!! I have one follow-up question: How would I also add a column 'prior_noshow_date" that shows the date associated with the most recent NoShow? — Rene, Aug 22 '22 at 16:58

score 0 · Answer 2 · answered Aug 22 '22 at 16:30

Here is a data.table approach.

library(data.table)
setDT(df)[, (c("prior_noshow", "prior_noshow_f")):= {
  p = c(0, shift(NoShow)[-1]);
  ns_f = cumsum(p); ns = 1*(ns_f>0);list(ns,ns_f)
}, Person]

Output:

    Person NoShow       date prior_noshow prior_noshow_f
 1:    123      0 2019-01-01            0              0
 2:    123      1 2019-04-01            0              0
 3:    123      0 2020-01-01            1              1
 4:    123      1 2021-01-01            1              1
 5:    123      0 2021-04-30            1              2
 6:    123      0 2022-01-05            1              2
 7:    334      0 2019-07-07            0              0
 8:    334      1 2019-10-11            0              0
 9:    334      1 2020-07-07            1              1
10:    334      0 2021-01-04            1              2

Input (from @Allan Cameron)

df <- structure(list(Person = c(123L, 123L, 123L, 123L, 123L, 123L, 
334L, 334L, 334L, 334L), NoShow = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 
1L, 1L, 0L), date = structure(c(17897, 17987, 18262, 18628, 18747, 
18997, 18084, 18180, 18450, 18631), class = "Date")), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")

How can I use the lag function to skip over rows and provide number of previous occurrences

2 Answers2