I have a large set of data that contains pathology test data for a number of individuals. I present a scaled down data set describing the types of cases.
library(plyr)
library(tidyr)
library(dplyr)
library(lubridate)
options(stringsAsFactors = FALSE)
dat <- structure(list(PersID = c("am1", "am2", "am2", "am3", "am3", "am4", "am4", "am4", "am4", "am4", "am4"), Sex = c("M", "F","F", "M", "M", "F", "F", "F", "F", "F", "F"), DateTested = c("21/10/2015", "9/07/2010", "24/09/2010", "23/10/2013", "25/10/2013", "28/04/2010", "23/06/2010", "21/07/2010", "20/10/2010", "4/03/2011", "2/12/2011"), Res = c("NR", "R", "R", "NR", "R", "R", "R", "R", "R", "R", "R"), Status = c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No"), DateOrder = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("PersID", "Sex", "DateTested", "Res", "Status", "DateOrder"), class = "data.frame", row.names = c(NA, -11L))
The data describes three types of person (1)those with a single result only (2) those with 2 results, and (3) those with many results.
My goal is to come up with a script that will only include rows for individuals according to a set of criteria. Technically it is a method to only count rows for individuals if their subsequent results are within a specified reinfection period (30 days).
I have converted my data to a list and passed a number of functions to it to start processing the data.
dat$DateTested <- dmy(dat$DateTested)
datList <- dlply(.data=dat, .variables=c('PersID'))
What I have done so far is:
Select all rows where there is a single result per person
fnSingleTests <- function(y){
y <- y[length(y$DateOrder)==1,]
}
singleTests <- ldply(datList, fnSingleTests, .id = NULL)
Convert the data frame to a list and pass a function that determines if (a) there are two rows per person within the 30-day reinfection period, then select the first one, and (b) if there are more than two rows per person, and the last record and the first record are within 30 days, only keep the first one.
fnMultiTests <- function(y){
y <- y[length(y$DateOrder) > 1,]
}
multiTests <- llply(datList, fnMultiTests)
fnMultiTestsSplit <- function(y){
test <- difftime(y$DateTested[length(y$DateTested)], y$DateTested[1], units='days')
if (nrow(y) <=2){
if (test < 31){
y <- y[y$DateOrder == 1, ]
y <- y[!is.na(y$PerdID), ]
} else {
y <- y[y$DateOrder %in% 1:2, ]
y <- y[!is.na(y$PersID), ]
}
} else {
if (test < 31){
y <- y[y$DateOrder == 1, ]
y <- y[!is.na(y$PersID), ]
} else {
break()
}
}
}
finalTests <- ldply(multiTests, failwith(NULL, fnMultiTestsSplit, quiet = TRUE), .id = NULL)
I can then combine data frames with rbind:
allFinalTests <- rbind(singleTests, finalTests)
Where I am stuck is for cases where there are more than two rows per person, and within sequential rows there may be cases of a period of time greater than the 30-day reinfection period.
Can anyone suggest how I could extend this code to include only cases where there are more than two PersID
and then only include results where there are subsequent cases occur outside the 30 day reinfection period.
Specifically, start from the oldest case and if the next case is within 30 days then exclude the second cases, or if the second case is more than 30 days since the previous case, then include both cases. It should do this for all cases for the same PersID
In this example the final output I am looking for is:
PersID Sex DateTested Res Status DateOrder
am1 M 21/10/2015 NR Yes 1
am2 F 9/07/2010 R No 1
am2 F 24/09/2010 R No 2
am3 M 23/10/2013 NR Yes 1
am4 F 28/04/2010 R No 1
am4 F 23/06/2010 R No 2
am4 F 20/10/2010 R No 4
am4 F 4/03/2011 R No 5
am4 F 2/12/2011 R No 6