2
ID Date
101 10-17-2021
101 10-19-2021
101 10-20-2021
101 10-31-2021
101 11-01-2021

For each ID I want to remove observations that are within 7 days of each other. I want to keep the earliest date of the dates that are within 7 days of each other. So in this case I would want to keep "10-17-2021" and "10-31-2021". This process would continue until I have unique dates for each ID that are at least 7 days apart and do not contain other dates in between.

3 Answers3

1

You can do it using group_by() and slice() functions. But first the Date column must be formatted using as.Date() function. Here is the code to remove observations within 7-day interval and keep only the earliest ID:

library(tidyverse)

df$Date <- as.Date(df$Date, format = "%m-%d-%Y")

df %>% 
  group_by(ID) %>% 
  slice(c(1, which(c(0, diff(Date)) >= 7)))

output

ID    Date      
 101 2021-10-17
 101 2021-10-31
S-SHAAF
  • 1,863
  • 2
  • 5
  • 14
1

In your example, you can't evaluate every observation independently because some of them may be removed when compared to the first value. Perhaps I'm not thinking about it the right way, but I think you need a loop to do this. Here's what I came up with (note: I made the sequence of dates longer to make sure it works):

library(dplyr)

d <- tibble(
  ID = 101, 
  Date = seq(lubridate::mdy("01-01-2023"), 
             lubridate::mdy("02-07-2023"), by="days")
)  

i <- 1
while(i < nrow(d)){
  d <- d %>% mutate(diff = Date - d$Date[i])
  d <- d %>% filter(diff <= 0 | diff > 7)
  if(i < nrow(d)){
    i <- i+1
  }
}
d <- d %>% select(-diff)
d
#> # A tibble: 5 × 2
#>      ID Date      
#>   <dbl> <date>    
#> 1   101 2023-01-01
#> 2   101 2023-01-09
#> 3   101 2023-01-17
#> 4   101 2023-01-25
#> 5   101 2023-02-02

Created on 2023-02-08 by the reprex package (v2.0.1)

Essentially, what happens is that the loop initializes with the first observation and removes every observation within seven days. If more observations remain, it increments the counter and moves to the next day and evaluates all subsequent dates from there, keeping everything that came before.

These loops are difficult to do in the tidyverse, but you could split the data by group, run the loop on each group and then put the groups back together. Here's an example:

library(dplyr)

d <- tibble(
  ID = 101, 
  Date = seq(lubridate::mdy("01-01-2023"), 
             lubridate::mdy("02-07-2023"), by="days")
)  

d2 <- d %>% mutate(ID = 102)
alldat <- bind_rows(d, d2)

split_dat <- alldat %>% 
  group_by(ID) %>% 
  group_split()


result <- purrr::map(split_dat, function(d){
  i <- 1
  while(i < nrow(d)){
    d <- d %>% mutate(diff = Date - d$Date[i])
    d <- d %>% filter(diff <= 0 | diff > 7)
    if(i < nrow(d)){
      i <- i+1
    }
  }
  d <- d %>% select(-diff)
  d
  
})

result <- bind_rows(result)

result
#> # A tibble: 10 × 2
#>       ID Date      
#>    <dbl> <date>    
#>  1   101 2023-01-01
#>  2   101 2023-01-09
#>  3   101 2023-01-17
#>  4   101 2023-01-25
#>  5   101 2023-02-02
#>  6   102 2023-01-01
#>  7   102 2023-01-09
#>  8   102 2023-01-17
#>  9   102 2023-01-25
#> 10   102 2023-02-02

Created on 2023-02-08 by the reprex package (v2.0.1)

DaveArmstrong
  • 18,377
  • 2
  • 13
  • 25
0

You can try using a recursive function as in this answer.

f <- function(d, ind = 1) {
  ind.next <- dplyr::first(which(difftime(d, d[ind], units="days") > 7))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}

After the first date, the function will get the next index ind.next where the date is more than 7 days away. Recursively, add that index and get the next date after that. In the end, just return all the row indexes.

The code to use this function can group_by(ID) and slice to retain those rows based on indexes returned.

library(dplyr)

df %>% 
  group_by(ID) %>% 
  slice(f(Date))
Ben
  • 28,684
  • 5
  • 23
  • 45