1

This is actually an adaptation from another question I posted previously. I hope other users' can offer feedback on my code or offer better alternatives to it. Thank you!

I have a dataset that contains lab tests, and I only want lab tests that are 365 days (+/- 90 days) from the first lab test.

  1. If a patient does not have any values that fall within the 365 +/- 90 days from the first date, we only output the first date, like in PATIENT_ID == 1.
  2. I want to output the (i) first date, the (ii) next date within the 365 +/- 90 days range and closest to the 365-day point from the first date, then the (iii) next date within the 365 +/- 90 days range and closest to the 365-day point from the second date and so on. For PATIENT_ID == 2, both 30/05/2016 and 01/08/2016 are within the 365 +/- 90 day-range from the first date, but only the latter is chosen as it is closer to the 365-day mark. The third date 27/07/2017 is chosen because it is within the 365 +/- 90 day-range from the second date and so on.

Data:

    PATIENT_ID LAB_TEST_DATE LAB_TEST
 1:          1    2012-11-19       31
 2:          1    2012-11-21       30
 3:          1    2012-11-23       31
 4:          1    2012-11-26       30
 5:          1    2012-11-28       30
 6:          1    2012-12-01       30
 7:          1    2012-12-05       29
 8:          1    2012-12-06       30
 9:          2    2015-07-23       43
10:          2    2015-08-05       41
11:          2    2015-08-19       44
12:          2    2015-09-02       41
13:          2    2015-09-30       40
14:          2    2015-12-23       45
15:          2    2016-03-16       46
16:          2    2016-05-30       40
17:          2    2016-08-01       46
18:          2    2017-07-27       44
19:          2    2018-10-15       49
20:          3    2011-08-11       30

...trunc...

Desired Output:

PATIENT_ID  LAB_TEST_DATE   LAB_TEST
         1     19/11/2012   31
         2     23/07/2015   43
         2     01/08/2016   46
         2     27/07/2017   44
         2     15/10/2018   49
         3     11/08/2011   30
         3     13/08/2012   36
         4     01/10/2014   41
         4     26/08/2015   42

dput data:

df <- structure(list(PATIENT_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
LAB_TEST_DATE = structure(c(15663, 15665, 15667, 15670, 15672, 
15675, 15679, 15680, 16639, 16652, 16666, 16680, 16708, 16792, 
16876, 16951, 17014, 17374, 17819, 15197, 15202, 15217, 15300, 
15335, 15357, 15405, 15413, 15434, 15453, 15565, 16344, 16352, 
16364, 16379, 16414, 16442, 16505, 16589, 16673), class = "Date"), 
LAB_TEST = c(31L, 30L, 31L, 30L, 30L, 30L, 29L, 30L, 43L, 
41L, 44L, 41L, 40L, 45L, 46L, 40L, 46L, 44L, 49L, 30L, 31L, 
34L, 34L, 36L, 36L, 33L, 36L, 33L, 35L, 36L, 41L, 43L, 43L, 
40L, 39L, 42L, 40L, 40L, 42L)), class = "data.frame", .Names = c("PATIENT_ID", 
"LAB_TEST_DATE", "LAB_TEST"), row.names = c(NA, -39L))

Code:

I wrote a recursive function such that if the date is within the range and is the closest to the 365-day mark, then I will filter that date.

f <- function(d, ind = 1) {

  datediff <- difftime(d, d[ind], units = "days")
  ind.range <- which(datediff >= 275 & datediff <= 455)
  ind.min <- which.min(abs(datediff - 365))
  ind.next <- first(intersect(ind.range, ind.min))

  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}

df %>% group_by(PATIENT_ID) %>% slice(f(LAB_TEST_DATE))
HNSKD
  • 1,614
  • 2
  • 14
  • 25
  • I am quite confused with `365 days (+/- 90 days)` terminology. Can you explain the output for Patient ID = 1 where all the other dates falls in the range , right? – Ronak Shah May 25 '18 at 06:59
  • @RonakShah For Patient ID == 1, we only filter the first date since the other dates are outside of the (365-90, 365+90)-range from the first date. For instance, the first date is 19/11/2012, and the last date is 06/12/2012, and they are only 17 days apart (i.e. not inside (365-90, 365+90)-range). As for Patient ID = 2, both 30/05/2016 and 01/08/2016 are 312 and 375 days apart from the first date (i.e. 23/07/2015), so they are within the range. I filter the later since it is closer to the 365 day-mark (only 10 days away) – HNSKD May 25 '18 at 07:06
  • 1
    Why does your example have more than one per patient when you say "we only output the date closest to the 365-day point" – zacdav May 25 '18 at 07:12
  • @zacdav I made the changes so that it is clearer. – HNSKD May 25 '18 at 07:33
  • What's wrong with your existing code? Too slow? Fails in some cases? –  May 25 '18 at 07:53

1 Answers1

0

Here is a solution using data.table. Explanation inline.

library(data.table)
setDT(df)

#extract the first visit for each patient
firstDates <- df[, .SD[1L], by=PATIENT_ID]

#create the period for each lab test
df[, ':=' (STARTDATE=LAB_TEST_DATE+365-90, ENDDATE=LAB_TEST_DATE+365+90)]

#for each lab test, find the lab tests that are within 365 +/- 90 days 
#after that lab test by performing a non-equi self join
withinPeriod <- df[
    df, 
    .(PATIENT_ID, x.LAB_TEST, x.LAB_TEST_DATE, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST), 
    by=.EACHI,
    on=.(PATIENT_ID, LAB_TEST_DATE >= STARTDATE, LAB_TEST_DATE <= ENDDATE)][
        !is.na(x.LAB_TEST), -3L:-1L]

#find the lab test that is closest to the 365 days after that lab test 
#and extract only relevant columns
selected <- withinPeriod[, .SD[which.min(abs(i.LAB_TEST_DATE + 365 - x.LAB_TEST_DATE))], 
    by=.(PATIENT_ID, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST)][, 
        .(PATIENT_ID, LAB_TEST_DATE=x.LAB_TEST_DATE, LAB_TEST=x.LAB_TEST)]

#cbind first dates with those selected
ans <- rbindlist(list(firstDates, unique(selected)), use.names=TRUE)
setorder(ans, PATIENT_ID, LAB_TEST_DATE)
ans

#   PATIENT_ID LAB_TEST_DATE LAB_TEST
#1:          1    2012-11-19       31
#2:          2    2015-07-23       43
#3:          2    2016-08-01       46
#4:          2    2017-07-27       44
#5:          2    2018-10-15       49
#6:          3    2011-08-11       30
#7:          3    2012-08-13       36
#8:          4    2014-10-01       41
#9:          4    2015-08-26       42
chinsoon12
  • 25,005
  • 4
  • 25
  • 35