This is actually an adaptation from another question I posted previously. I hope other users' can offer feedback on my code or offer better alternatives to it. Thank you!
I have a dataset that contains lab tests, and I only want lab tests that are 365 days (+/- 90 days) from the first lab test.
- If a patient does not have any values that fall within the 365 +/- 90 days from the first date, we only output the first date, like in
PATIENT_ID == 1
. - I want to output the (i) first date, the (ii) next date within the 365 +/- 90 days range and closest to the 365-day point from the first date, then the (iii) next date within the 365 +/- 90 days range and closest to the 365-day point from the second date and so on. For
PATIENT_ID == 2
, both30/05/2016
and01/08/2016
are within the 365 +/- 90 day-range from the first date, but only the latter is chosen as it is closer to the 365-day mark. The third date27/07/2017
is chosen because it is within the 365 +/- 90 day-range from the second date and so on.
Data:
PATIENT_ID LAB_TEST_DATE LAB_TEST
1: 1 2012-11-19 31
2: 1 2012-11-21 30
3: 1 2012-11-23 31
4: 1 2012-11-26 30
5: 1 2012-11-28 30
6: 1 2012-12-01 30
7: 1 2012-12-05 29
8: 1 2012-12-06 30
9: 2 2015-07-23 43
10: 2 2015-08-05 41
11: 2 2015-08-19 44
12: 2 2015-09-02 41
13: 2 2015-09-30 40
14: 2 2015-12-23 45
15: 2 2016-03-16 46
16: 2 2016-05-30 40
17: 2 2016-08-01 46
18: 2 2017-07-27 44
19: 2 2018-10-15 49
20: 3 2011-08-11 30
...trunc...
Desired Output:
PATIENT_ID LAB_TEST_DATE LAB_TEST
1 19/11/2012 31
2 23/07/2015 43
2 01/08/2016 46
2 27/07/2017 44
2 15/10/2018 49
3 11/08/2011 30
3 13/08/2012 36
4 01/10/2014 41
4 26/08/2015 42
dput data:
df <- structure(list(PATIENT_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
LAB_TEST_DATE = structure(c(15663, 15665, 15667, 15670, 15672,
15675, 15679, 15680, 16639, 16652, 16666, 16680, 16708, 16792,
16876, 16951, 17014, 17374, 17819, 15197, 15202, 15217, 15300,
15335, 15357, 15405, 15413, 15434, 15453, 15565, 16344, 16352,
16364, 16379, 16414, 16442, 16505, 16589, 16673), class = "Date"),
LAB_TEST = c(31L, 30L, 31L, 30L, 30L, 30L, 29L, 30L, 43L,
41L, 44L, 41L, 40L, 45L, 46L, 40L, 46L, 44L, 49L, 30L, 31L,
34L, 34L, 36L, 36L, 33L, 36L, 33L, 35L, 36L, 41L, 43L, 43L,
40L, 39L, 42L, 40L, 40L, 42L)), class = "data.frame", .Names = c("PATIENT_ID",
"LAB_TEST_DATE", "LAB_TEST"), row.names = c(NA, -39L))
Code:
I wrote a recursive function such that if the date is within the range and is the closest to the 365-day mark, then I will filter that date.
f <- function(d, ind = 1) {
datediff <- difftime(d, d[ind], units = "days")
ind.range <- which(datediff >= 275 & datediff <= 455)
ind.min <- which.min(abs(datediff - 365))
ind.next <- first(intersect(ind.range, ind.min))
if (is.na(ind.next))
return(ind)
else
return(c(ind, f(d, ind.next)))
}
df %>% group_by(PATIENT_ID) %>% slice(f(LAB_TEST_DATE))