1

I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The below code that I am using is very close to the results that I wish to attain. However, using the sum function returns the sum of the whole group. I would like the sum function to only sum the current row and only the rows that have a '1' above it. Basically, I'm trying to count the consecutive amount of times a patient noshows their appointment for each row and then reset to 0 when they do show. It seems like only some tweaks need to be made to my below code. However, I cannot seem to find the answer anywhere on this site.

transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))

The above code produces the below output:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
#   PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>   
#1          1      0             0
#2          1      1             4
#3          1      0             0
#4          1      1             4
#5          1      1             4
#6          1      1             4
#7          2      0             0
#8          2      0             0
#9          2      1             3
#10         2      1             3
#11         2      0             0
#12         2      1             3

This is what I desire:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
#   PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>   
#1          1      0             0
#2          1      1             0
#3          1      0             1
#4          1      1             0
#5          1      1             1
#6          1      1             2
#7          2      0             0
#8          2      0             0
#9          2      1             0
#10         2      1             1
#11         2      0             2
#12         2      1             0

[UPDATE] I would like the consecutive count to be offset by one row down.

Thank you for any help you can offer in advance!

James Marquez
  • 365
  • 4
  • 12
  • I wish I could checkmark all your solutions as they all provided exactly as I wanted. Thanks so much for all of your help! – James Marquez Aug 01 '16 at 17:31

4 Answers4

4

And here's another (similar) data.table approach

library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
#     PatientID Noshow ConsecNoshows
#  1:         1      0             0
#  2:         1      1             1
#  3:         1      0             0
#  4:         1      1             1
#  5:         1      1             2
#  6:         1      1             3
#  7:         2      0             0
#  8:         2      0             0
#  9:         2      1             1
# 10:         2      1             2
# 11:         2      0             0
# 12:         2      1             1

This is basically groups by PatientID and "run-length-encoding" of Noshow and creates sequences using the group sizes while multiplying by Noshow in order to keep only the values when Noshow == 1

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • David, I checked your answer as correct because you provided the shortest code to get the job done. – James Marquez Aug 01 '16 at 17:34
  • After performing more work on my model, I found that I actually need the consecutive count result to be offset by one. So if the first two lines are counted as a noshow, the third line should have a 2 in the ConsecNoshows column. Is this possible? – James Marquez Sep 28 '16 at 15:23
  • 1
    You could do `df[, ConsecNoshows := shift(ConsecNoshows, fill = 0), by = PatientID]` or similar – David Arenburg Sep 28 '16 at 15:30
  • I updated the table in my original question to show the results I'm trying to get. Thanks in advance. – James Marquez Sep 28 '16 at 15:31
  • I did and it worked perfectly! Thanks so much. I'm going to start studying the data.table functions and they are proving to be extremely helpful. Thank you for your help. – James Marquez Sep 28 '16 at 15:39
4

We can use rle from base R (No packages used). Using ave, we group by 'PatientID', get the rle of 'Noshow', multiply the sequence of 'lengths' by the 'values' replicated by 'lengths' to get the expected output.

helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows 
#[1] 0 1 0 1 2 3 0 0 1 2 0 1

As the OP seems to be using 'tbl_df', a solution in dplyr would be

library(dplyr)
df %>%
   group_by(PatientID) %>%
   mutate(ConsecNoshows = helperfn(Noshow))
#     PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>
#1          1      0             0
#2          1      1             1
#3          1      0             0
#4          1      1             1
#5          1      1             2
#6          1      1             3
#7          2      0             0
#8          2      0             0
#9          2      1             1
#10         2      1             2
#11         2      0             0
#12         2      1             1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Akrun, thank you so much for your solutions. They all worked exactly as intended. I especially like the base R solution you provided. – James Marquez Aug 01 '16 at 17:37
3

I would create a helper function to then use whatever implementation you're most comfortable with:

sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}

#base R
transform(df1, Consec = ave(Noshow, PatientID, FUN=sum0))

#dplyr
library(dplyr)
df1 %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))

#data.table
library(data.table)
setDT(df1)[, Consec := sum0(Noshow), by = PatientID]
  #    PatientID Noshow Consec
  #        <int>  <int>  <int>
  # 1          1      0      0
  # 2          1      1      1
  # 3          1      0      0
  # 4          1      1      1
  # 5          1      1      2
  # 6          1      1      3
  # 7          2      0      0
  # 8          2      0      0
  # 9          2      1      1
  # 10         2      1      2
  # 11         2      0      0
  # 12         2      1      1
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • Pierre, your solution is really helpful as it gives me multiple options to use. Thank you for enlightening me to the use of creating a helper function. – James Marquez Aug 01 '16 at 17:41
2

The most straight forward way to group consecutive values is to use rleid from data.table, here is an option from data.table package, where you group data by the PatientID as well as rleid of Noshow variable. And also you need the cumsum function to get a cumulative sum of the Noshow variable instead of sum:

library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
#    PatientID Noshow ConsecNoshows
# 1:         1      0             0
# 2:         1      1             1
# 3:         1      0             0
# 4:         1      1             1
# 5:         1      1             2
# 6:         1      1             3
# 7:         2      0             0
# 8:         2      0             0
# 9:         2      1             1
#10:         2      1             2
#11:         2      0             0
#12:         2      1             1
Psidom
  • 209,562
  • 33
  • 339
  • 356