Inclusion/Exclusion of rows in a DataFrame, based on specific criteria

Question

I have a large set of data that contains pathology test data for a number of individuals. I present a scaled down data set describing the types of cases.

library(plyr)
library(tidyr)
library(dplyr)
library(lubridate)

options(stringsAsFactors = FALSE)
dat <- structure(list(PersID = c("am1", "am2", "am2", "am3", "am3", "am4", "am4", "am4", "am4", "am4", "am4"), Sex = c("M", "F","F", "M", "M", "F", "F", "F", "F", "F", "F"), DateTested = c("21/10/2015", "9/07/2010", "24/09/2010", "23/10/2013", "25/10/2013", "28/04/2010", "23/06/2010", "21/07/2010", "20/10/2010", "4/03/2011", "2/12/2011"), Res = c("NR", "R", "R", "NR", "R", "R", "R", "R", "R", "R", "R"), Status = c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No"), DateOrder = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("PersID", "Sex", "DateTested", "Res", "Status", "DateOrder"), class = "data.frame", row.names = c(NA, -11L))

The data describes three types of person (1)those with a single result only (2) those with 2 results, and (3) those with many results.

My goal is to come up with a script that will only include rows for individuals according to a set of criteria. Technically it is a method to only count rows for individuals if their subsequent results are within a specified reinfection period (30 days).

I have converted my data to a list and passed a number of functions to it to start processing the data.

dat$DateTested <- dmy(dat$DateTested)
datList <- dlply(.data=dat, .variables=c('PersID'))

What I have done so far is:

Select all rows where there is a single result per person

fnSingleTests <- function(y){
    y <- y[length(y$DateOrder)==1,]
}

singleTests <- ldply(datList, fnSingleTests, .id = NULL)

Convert the data frame to a list and pass a function that determines if (a) there are two rows per person within the 30-day reinfection period, then select the first one, and (b) if there are more than two rows per person, and the last record and the first record are within 30 days, only keep the first one.

fnMultiTests <- function(y){
    y <- y[length(y$DateOrder) > 1,]
}

multiTests <- llply(datList, fnMultiTests)

fnMultiTestsSplit <- function(y){

    test <- difftime(y$DateTested[length(y$DateTested)], y$DateTested[1], units='days')


    if (nrow(y) <=2){

        if (test < 31){
            y <- y[y$DateOrder == 1, ]
            y <- y[!is.na(y$PerdID), ]
        } else {
            y <- y[y$DateOrder %in% 1:2, ]
            y <- y[!is.na(y$PersID), ]
        }

    } else  {
        if (test < 31){
            y <- y[y$DateOrder == 1, ]
            y <- y[!is.na(y$PersID), ]
        } else {
            break()
        }

    }
}

finalTests <-  ldply(multiTests, failwith(NULL, fnMultiTestsSplit, quiet = TRUE), .id = NULL)

I can then combine data frames with rbind:

allFinalTests <- rbind(singleTests, finalTests)

Where I am stuck is for cases where there are more than two rows per person, and within sequential rows there may be cases of a period of time greater than the 30-day reinfection period.

Can anyone suggest how I could extend this code to include only cases where there are more than two PersID and then only include results where there are subsequent cases occur outside the 30 day reinfection period.

Specifically, start from the oldest case and if the next case is within 30 days then exclude the second cases, or if the second case is more than 30 days since the previous case, then include both cases. It should do this for all cases for the same PersID

In this example the final output I am looking for is:

PersID  Sex DateTested  Res Status  DateOrder
am1 M   21/10/2015  NR  Yes 1
am2 F   9/07/2010   R   No  1
am2 F   24/09/2010  R   No  2
am3 M   23/10/2013  NR  Yes 1
am4 F   28/04/2010  R   No  1
am4 F   23/06/2010  R   No  2
am4 F   20/10/2010  R   No  4
am4 F   4/03/2011   R   No  5
am4 F   2/12/2011   R   No  6

Jaap · Accepted Answer · 2016-01-25T14:45:45.487

In base R, I would approach it as follows:

# convert the 'DateTested' column to a date-format
dat$DateTested <- as.Date(dat$DateTested, format = "%d/%m/%Y")
# calculate the difference in days with the previous observation in the group
dat$tdiff <- unlist(tapply(dat$DateTested, INDEX = dat$PersID,
                           FUN = function(x) c(0, `units<-`(diff(x), "days"))))
# filter the observations that have either a timedifference of zero or more 
dat[(dat[,"tdiff"]==0 | dat[,"tdiff"] > 30),]

which gives:

   PersID Sex DateTested Res Status DateOrder tdiff
1     am1   M 2015-10-21  NR    Yes         1     0
2     am2   F 2010-07-09   R     No         1     0
3     am2   F 2010-09-24   R     No         2    77
4     am3   M 2013-10-23  NR    Yes         1     0
6     am4   F 2010-04-28   R     No         1     0
7     am4   F 2010-06-23   R     No         2    56
9     am4   F 2010-10-20   R     No         4    91
10    am4   F 2011-03-04   R     No         5   135
11    am4   F 2011-12-02   R     No         6   273

Using the data.table package:

library(data.table)
# convert the 'data.frame' to a 'data.table'
# and convert the 'DateTested' column to a date-format
setDT(dat)[, DateTested := as.Date(DateTested, format = "%d/%m/%Y")]
# calculate the difference in days with the previous observation in the group
dat[, tdiff := c(0, `units<-`(diff(DateTested), "days")), PersID]
# filter the observations that have either a timedifference of zero or more than 30 days
dat[(tdiff==0 | tdiff > 30)]

which will give you the same result. You can also chain this together as follows:

setDT(dat)[, DateTested := as.Date(DateTested, format = "%d/%m/%Y")
           ][, tdiff := c(0, `units<-`(diff(DateTested), "days")), by = PersID
             ][(tdiff==0 | tdiff > 30)]

And using dplyr:

library(dplyr)
dat %>% 
  mutate(DateTested = as.Date(DateTested, format = "%d/%m/%Y")) %>%
  group_by(PersID) %>%
  mutate(tdiff = c(0, `units<-`(diff(DateTested), "days"))) %>%
  filter(tdiff == 0 | tdiff > 30)

which will also give you the same result.

This is an extremely elegant and thorough solution. I have gone with the dplyr version. I really appreciate this. — John, Jan 16 '16 at 18:15
SBista's [proposal](https://stackoverflow.com/a/46954730/152860) is, I think more succinct, and elegant. The OP is clearly using `dplyr` so there is a fair amount of cruft in this answer. — Hedgehog, Dec 21 '17 at 01:28

score 0 · Answer 2 · answered Oct 28 '17 at 08:28

With version 1.9.8 (on CRAN 25 Nov 2016), the data.table package has gained the inrange() function which performs a range join making use of non-equi joins.

With inrange() or the %inrange% operator, resp., the expected result can be achieved with

library(data.table) # CRAN version 1.10.4-2 used
data.table(dat)[, DateTested := as.IDate(DateTested, "%d/%m/%Y")][
  , .SD[!DateTested %inrange% list(DateTested + 1L, DateTested + 30L)], by = PersID]

   PersID Sex DateTested Res Status DateOrder
1:    am1   M 2015-10-21  NR    Yes         1
2:    am2   F 2010-07-09   R     No         1
3:    am2   F 2010-09-24   R     No         2
4:    am3   M 2013-10-23  NR    Yes         1
5:    am4   F 2010-04-28   R     No         1
6:    am4   F 2010-06-23   R     No         2
7:    am4   F 2010-10-20   R     No         4
8:    am4   F 2011-03-04   R     No         5
9:    am4   F 2011-12-02   R     No         6

For each PersID, it is looked any other entries which fall in the date range [next day, 30 days later]. These are excluded from the result.

The excluded rows can be shown by:

data.table(dat)[, DateTested := as.IDate(DateTested, "%d/%m/%Y")][
  , .SD[DateTested %inrange% list(DateTested + 1L, DateTested + 30L)], by = PersID]

   PersID Sex DateTested Res Status DateOrder
1:    am3   M 2013-10-25   R    Yes         2
2:    am4   F 2010-07-21   R     No         3

Inclusion/Exclusion of rows in a DataFrame, based on specific criteria

2 Answers2

Linked