Recursive filtering in R

Question

I have data I need to clean, but I am not sure how. I need to remove every record that occurred less than 7 days after the last kept observation. Records that are themselves removed should not count when measuring that gap.

Data example:

library(dplyr)
library(lubridate)
df = data.frame(id = c(rep(1,5), rep(2,3)),
                date = c(ymd("2022-01-01"), ymd("2022-01-03"), ymd("2022-01-05"), ymd("2022-01-09"), ymd("2022-01-20"),
                         ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
  arrange(id, date)

  id       date
1  1 2022-01-01
2  1 2022-01-03
3  1 2022-01-05
4  1 2022-01-09
5  1 2022-01-20
6  2 2022-01-02
7  2 2022-01-03
8  2 2022-01-09

And I want it to look like this:

  id       date
1  1 2022-01-01
2  1 2022-01-09
3  1 2022-01-20
4  2 2022-01-02
5  2 2022-01-09

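I tried using filter() and lag(), but on their own they do not quite do it, because each row is compared with the row immediately before it, even if that row should have been removed: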

df %>% 
  group_by(id) %>% 
  mutate(prev = lag(date + days(7))) %>% 
  ungroup() %>% 
  filter(is.na(prev) | (date - prev >= 0))

     id       date      
1     1 2022-01-01
2     1 2022-01-20
3     2 2022-01-02
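
One way to see why the lag() attempt drops 2022-01-09 for id 1 is to put the gap to the immediately preceding row next to the gap to the last kept row. The following is only a sketch in base R; the names `gap_prev`, `gap_kept` and `last_kept` are made up for illustration and are not part of the original question.

```r
# Dates for id 1 from the question's example
dates <- as.Date(c("2022-01-01", "2022-01-03", "2022-01-05",
                   "2022-01-09", "2022-01-20"))

# Gap (in days) to the immediately preceding row: what lag() compares against
gap_prev <- c(NA, as.numeric(diff(dates)))

# Gap (in days) to the most recent *kept* row, tracked with a running reference
gap_kept <- rep(NA_real_, length(dates))
last_kept <- dates[1]
for (i in seq_along(dates)[-1]) {
  gap_kept[i] <- as.numeric(dates[i] - last_kept)
  # A row is kept (and becomes the new reference) only if the gap is >= 7 days
  if (gap_kept[i] >= 7) last_kept <- dates[i]
}

data.frame(date = dates, gap_prev, gap_kept)
```

For 2022-01-09, `gap_prev` is 4 (measured from the discarded 2022-01-05) while `gap_kept` is 8 (measured from 2022-01-01), so the row should survive even though the lag-based filter rejects it.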

Do you want the first day per id, and after that every seventh day? — cliffhanger-be, Aug 05 '22 at 12:18
Are you looking for something similar to [this](https://stackoverflow.com/questions/39317354/how-to-filter-rows-based-on-difference-in-dates-between-rows-in-r)? — Ben, Aug 05 '22 at 12:32
@cliffhanger-be Not necessarily every seventh day, but the time difference should be at least x days — Darmist, Aug 05 '22 at 12:41
@Ben Not quite. I do not need to compare everything. Observations come in order, and those that occur less than x days after the last one are trash. The point is that the x-day gap should not take this trash into account — Darmist, Aug 05 '22 at 12:47
@Ben Sorry, was in a hurry and skimmed it on the phone, it does look pretty similar now, I will try that, thanks — Darmist, Aug 05 '22 at 13:17

score 0 · Answer 1 · answered Aug 05 '22 at 12:45

The following code worked well for me.

library(dplyr)
library(lubridate)

df = data.frame(id = c(rep(1,5), rep(2,3)),
                date = c(seq.Date(from = ymd("2022-01-01"), to = ymd("2022-01-15"), by = "weeks"), ymd("2022-01-03"), ymd("2022-01-05"),
                         ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
  arrange(id, date)


df %>% 
  group_by(id) %>% 
  mutate(k = first(date)) %>% 
  ungroup() %>% 
  mutate(l = as.numeric(date-k)) %>% 
  filter(l%%7 == 0) %>% 
  select(id, date)

Melih

It does not quite do that; my example was probably a bit misleading. I do not need to get all data that falls a whole number of weeks after the first observation. I need to remove observations that happened less than 7 days after the previous one, not counting observations that have already been removed. I have updated the example data – Darmist Aug 05 '22 at 13:07

Darmist · Accepted Answer · 2022-08-05T14:07:26.307

Ben suggested looking at the question he linked in the comments, and it does contain an answer. It did not work for me unchanged, though, so I am posting a slightly modified version of aichao's code here:

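library(dplyr)

# Return the indices of the rows to keep, starting from row `ind`.
# Assumes `d` is sorted in ascending order.
f <- function(d, ind = 1, minDiff = 7) {
  # First row at least `minDiff` days after the last kept row (NA if none)
  ind.next <- first(which(difftime(d, d[ind], units = "days") >= minDiff))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next, minDiff)))
}

result <- df %>% 
  group_by(id) %>% 
  slice(f(date)) %>% 
  ungroup()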
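
The same idea can also be written as a plain loop instead of a recursive function, which may be easier to follow and avoids deep recursion for groups with many kept rows. This is a minimal sketch, not the code from the linked answer: `keep_spaced`, `min_diff` and `last_kept` are hypothetical names, and the dates are assumed to be sorted in ascending order within each id.

```r
library(dplyr)

# Hypothetical helper: returns the indices of the rows to keep.
# Assumes `d` is a Date vector sorted in ascending order.
keep_spaced <- function(d, min_diff = 7) {
  keep <- logical(length(d))
  last_kept <- NULL
  for (i in seq_along(d)) {
    # Keep the first row, and any row at least `min_diff` days
    # after the most recently kept row
    if (is.null(last_kept) || as.numeric(d[i] - last_kept) >= min_diff) {
      keep[i] <- TRUE
      last_kept <- d[i]
    }
  }
  which(keep)
}

# Example data from the question
df <- data.frame(
  id = c(rep(1, 5), rep(2, 3)),
  date = as.Date(c("2022-01-01", "2022-01-03", "2022-01-05", "2022-01-09",
                   "2022-01-20", "2022-01-02", "2022-01-03", "2022-01-09"))
)

result <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  slice(keep_spaced(date)) %>%
  ungroup()
```

On the example data this keeps 2022-01-01, 2022-01-09 and 2022-01-20 for id 1, and 2022-01-02 and 2022-01-09 for id 2, which matches the desired output in the question.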
