I have data that I need to clean, but I am not sure how. Within each id, I need to remove every record that occurred less than 7 days after the last kept observation. Records that are themselves removed should not count as the last observation.

Data example:

library(dplyr)
library(lubridate)
df = data.frame(id = c(rep(1,5), rep(2,3)),
                date = c(ymd("2022-01-01"), ymd("2022-01-03"), ymd("2022-01-05"), ymd("2022-01-09"), ymd("2022-01-20"),
                         ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
  arrange(id, date)

  id       date
1  1 2022-01-01
2  1 2022-01-03
3  1 2022-01-05
4  1 2022-01-09
5  1 2022-01-20
6  2 2022-01-02
7  2 2022-01-03
8  2 2022-01-09

And I want it to look like this:

  id       date
1  1 2022-01-01
2  1 2022-01-09
3  1 2022-01-20
4  2 2022-01-02
5  2 2022-01-09

I tried using filter() and lag(), but they alone do not quite do it:

df %>% 
  group_by(id) %>% 
  mutate(prev = lag(date + days(7))) %>% 
  ungroup() %>% 
  filter(is.na(prev) | (date - prev >= 0))

     id       date      
1     1 2022-01-01
2     1 2022-01-20
3     2 2022-01-02
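
A minimal sketch of why the lag() comparison falls short, using the same example data (the column name `gap` is only illustrative): lag() always looks at the immediately preceding row, even when that row is one that should be dropped.

```r
library(dplyr)

df <- data.frame(
  id = c(rep(1, 5), rep(2, 3)),
  date = as.Date(c("2022-01-01", "2022-01-03", "2022-01-05", "2022-01-09", "2022-01-20",
                   "2022-01-02", "2022-01-03", "2022-01-09"))
)

# Gap in days to the immediately preceding row within each id,
# regardless of whether that preceding row would itself be dropped.
gaps <- df %>%
  group_by(id) %>%
  mutate(gap = as.numeric(date - lag(date))) %>%
  ungroup()

print(gaps)
```

For id 1, 2022-01-09 is only 4 days after the preceding row (2022-01-05), so the lag() filter drops it, although it is 8 days after the last row that should be kept (2022-01-01).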
Darmist
  • Do you want the first day per id, and after that every seventh day? – cliffhanger-be Aug 05 '22 at 12:18
  • Are you looking for something similar to [this](https://stackoverflow.com/questions/39317354/how-to-filter-rows-based-on-difference-in-dates-between-rows-in-r)? – Ben Aug 05 '22 at 12:32
  • @cliffhanger-be Not necessarily every seventh, but the time difference should be at least x days – Darmist Aug 05 '22 at 12:41
  • @Ben Not quite. I do not need to compare everything. Observations come in order, and those that occur less than x days after the last one are trash. The point is that the x-day window should not take this trash into account – Darmist Aug 05 '22 at 12:47
  • @Ben Sorry, was in a hurry and skimmed it on the phone, it does look pretty similar now, I will try that, thanks – Darmist Aug 05 '22 at 13:17

2 Answers

The following code worked well for me.

library(dplyr)
library(lubridate)

df = data.frame(id = c(rep(1,5), rep(2,3)),
                date = c(seq.Date(from = ymd("2022-01-01"), to = ymd("2022-01-15"), by = "weeks"), ymd("2022-01-03"), ymd("2022-01-05"),
                         ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
  arrange(id, date)


df %>% 
  group_by(id) %>% 
  mutate(k = first(date)) %>% 
  ungroup() %>% 
  mutate(l = as.numeric(date-k)) %>% 
  filter(l%%7 == 0) %>% 
  select(id, date)
Melih
  • It does not quite do that; my example was probably a bit misleading. I do not need all the data that falls x weeks from the first observation. I need to remove observations that happened less than 7 days after the previous one, not counting observations that were themselves removed. I updated the example of my data – Darmist Aug 05 '22 at 13:07

The question Ben linked does contain an answer. It did not work for me unchanged, though, so I am posting a slightly modified version of aichao's code here.

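library(dplyr)

# Return the indices of the rows to keep. Starting from `ind`, find the first
# date at least `minDiff` days after d[ind], keep it, and repeat from there.
# Assumes `d` is sorted in ascending order.
f <- function(d, ind = 1, minDiff = 7) {
  # which(...)[1] is NA when no later date is far enough away
  ind.next <- which(difftime(d, d[ind], units = "days") >= minDiff)[1]
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next, minDiff)))
}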

result <- df %>% 
  group_by(id) %>% 
  slice(f(date)) %>% 
  ungroup()
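
For long groups, the same rule can also be written as a loop instead of recursion. The following is only a sketch under the assumption that dates are sorted within each id; `keep_spaced()` and `min_diff` are hypothetical names, not part of dplyr or of aichao's answer.

```r
library(dplyr)

# Hypothetical helper: returns TRUE for each date that is at least `min_diff`
# days after the last kept date. Assumes `dates` is sorted in ascending order.
keep_spaced <- function(dates, min_diff = 7) {
  keep <- logical(length(dates))
  last_kept <- NULL
  for (i in seq_along(dates)) {
    # The first date is always kept; later ones are compared to the last kept date.
    if (is.null(last_kept) || as.numeric(dates[i] - last_kept) >= min_diff) {
      keep[i] <- TRUE
      last_kept <- dates[i]
    }
  }
  keep
}

# Example data from the question
df <- data.frame(
  id = c(rep(1, 5), rep(2, 3)),
  date = as.Date(c("2022-01-01", "2022-01-03", "2022-01-05", "2022-01-09", "2022-01-20",
                   "2022-01-02", "2022-01-03", "2022-01-09"))
)

result <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  filter(keep_spaced(date)) %>%
  ungroup()

print(result)
```

Because dropped rows never update `last_kept`, they are ignored when later rows are checked, which is the behaviour asked for in the question.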

Darmist