Removing people from a dataframe in R if their results don't span a specific period of time

Question

Dataset of blood results:

        id result date    
        A1 80     01/01/2006
        A1 70     02/10/2006
        A1 61     01/01/2007
        A1 30     01/01/2008
        A1 28     03/06/2008
        B2 40     01/01/2006
        B2 30     01/10/2006
        B2 25     01/01/2015
        B2 10     01/01/2020
        G3 28     01/01/2009
        G3 27     01/01/2014
        G3 25     01/01/2013
        G3 24     01/01/2011
        G3 22     01/01/2019
        U7 20     01/01/2005
        U7 19     01/01/2006
        U7 18     01/04/2006
        U7 18     01/08/2006

I would like to only keep those individuals who have blood results spanning at least a three year period.

I can convert their dates to just years along the line of:

df %>% 
  # Create a column holding year for each ID
  mutate(date = dmy(date)) %>% 
  mutate(year = year(date)) %>% 
  # group by ID
  group_by(ID, year) %>% 
  # find max diff
  summarise(max_diff = max(year) - min(year))

How would I then continue the pipe to remove those with a max diff <3. The desired output from the above example would be:

    id result date    
    B2 40     2006
    B2 30     2006
    B2 25     2015
    B2 10     2020
    G3 28     2009
    G3 27     2014
    G3 25     2013
    G3 24     2011
    G3 22     2019

I would then pipe these people into the answer for this question: Predicting when an output might happen in time in R

Many thanks

score 2 · Accepted Answer · answered Sep 13 '22 at 12:06

Cutting down to year only is not a great move because you'll under/overestimate the range depending on whether people had their first/last blood test early or late in the year (case in point, A1 has blood tests in 3 separate calendar years, but less than 3 years apart overall).

To actually work out the range between first and last blood test, you can use diff(range()). In this case, I'm using tapply.

df <- structure(list(id = c("A1", "A1", "A1", "A1", "A1", "B2", "B2", 
"B2", "B2", "G3", "G3", "G3", "G3", "G3", "U7", "U7", "U7", "U7"
), result = c(80L, 70L, 61L, 30L, 28L, 40L, 30L, 25L, 10L, 28L, 
27L, 25L, 24L, 22L, 20L, 19L, 18L, 18L), date = structure(c(13149, 
13423, 13514, 13879, 14033, 13149, 13422, 16436, 18262, 14245, 
16071, 15706, 14975, 17897, 12784, 13149, 13239, 13361), class = "Date")), 
row.names = c(NA, -18L), class = "data.frame")

ranges <- tapply(df$date, df$id, function(dates) diff(range(dates))/365.25)
ranges

       A1        B2        G3        U7 
 2.420260 13.998631  9.998631  1.579740 

keep <- names(ranges)[ranges>=3]
keep
[1] "B2" "G3"

You can flip this to use pipes if that is your preferred syntax.

Many thanks, I have accepted your answer based of your point regarding the 3 year cutoff. This worked well. Just to check when you call "ranges" using tapply, does the df$id call group them by id? — tacrolimus, Sep 13 '22 at 13:28
Yes, the `tapply` call means "take column df$date, group by df$id, for each id apply the function `diff(range(dates))/365.25`" - see `?tapply`. You can do similar things with `aggregate` or dplyr's `group_by` followed by a summarise function or similar. — Ottie, Sep 13 '22 at 13:37

score 1 · Answer 2 · answered Sep 13 '22 at 11:22

Does this work:

library(dplyr)
library(lubridate)

df %>% mutate(year = year(dmy(date))) %>% group_by(id) %>% 
                                     filter(length(unique(year)) > 2)
# A tibble: 14 × 4
# Groups:   id [3]
   id    result date        year
   <chr>  <dbl> <chr>      <dbl>
 1 A1        80 01/01/2006  2006
 2 A1        70 02/10/2006  2006
 3 A1        61 01/01/2007  2007
 4 A1        30 01/01/2008  2008
 5 A1        28 03/06/2008  2008
 6 B2        40 01/01/2006  2006
 7 B2        30 01/10/2006  2006
 8 B2        25 01/01/2015  2015
 9 B2        10 01/01/2020  2020
10 G3        28 01/01/2009  2009
11 G3        27 01/01/2014  2014
12 G3        25 01/01/2013  2013
13 G3        24 01/01/2011  2011
14 G3        22 01/01/2019  2019

Removing people from a dataframe in R if their results don't span a specific period of time

2 Answers2