Remove the first row from each group if the second row meets a condition

Question

Here's a sample of my dataset:

df=data.frame(id=c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
  Date=c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019","2/13/2019","6/7/2018",
    "6/15/2018","6/20/2018","8/17/2018","8/20/2018","12/25/2018","12/25/2018"), 
  Buyer= c("John","John","John","Maria","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"))

I need to calculate the difference between dates which I have already done and the dataset then looks like:

| id |    Date    | Buyer | diff |
|----|:----------:|------:|------|
| 9  | 11/29/2018 |  John | NA   |
| 9  | 11/29/2018 |  John | 0    |
| 9  | 11/29/2018 |  John | 0    |
| 5  | 5/25/2018  | Maria | -188 |
| 5  | 2/13/2019  | Maria | 264  |
| 5  | 2/13/2019  | Maria | 0    |
| 4  | 6/7/2018   | Sandy | -251 |
| 4  | 6/15/2018  | Sandy | 8    |
| 4  | 6/20/2018  | Sandy | 5    |
| 4  | 8/17/2018  | Sandy | 58   |
| 4  | 8/20/2018  | Sandy | 3    |
| 20 | 12/25/2018 | Paul  | 127  |
| 20 | 12/25/2018 | Paul  | 0    |

Now, if the value of second row within each group of column 'diff' is greater than or equal to 5, then I need to delete the first row of each group. For example, the diff value 264 is greater than 5 for Buyer 'Maria' having id '5', so I would want to delete the first row within that group which would be the buyer 'Maria' having id '5', Date as '5/25/2018', and diff as '-188'

Below is a sample of my code:

df1=df %>% group_by(Buyer,id) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  filter(!(diff >=5 & row_number() == 1))

The problem is that the above code selects the first row instead of the second row and I don't know how to specify the row to be 2nd for each group where the diff value should be greater than or equal to 5.

My expected output should look like:

| id |    Date    | Buyer | diff |
|----|:----------:|------:|------|
| 9  | 11/29/2018 |  John | NA   |
| 9  | 11/29/2018 |  John | 0    |
| 9  | 11/29/2018 |  John | 0    |
| 5  | 2/13/2019  | Maria | 264  |
| 5  | 2/13/2019  | Maria | 0    |
| 4  | 6/15/2018  | Sandy | 8    |
| 4  | 6/20/2018  | Sandy | 5    |
| 4  | 8/17/2018  | Sandy | 58   |
| 4  | 8/20/2018  | Sandy | 3    |
| 20 | 12/25/2018 | Paul  | 127  |
| 20 | 12/25/2018 | Paul  | 0    |

Dates in `df` does not match with your printed data. Also add `diff` column to `df`. Maybe name it `diffs` while you are at it to avoid conflict with `diff()`. — Shree, Sep 13 '19 at 14:57
@Shree I have corrected that. My original dataset did not have the diff column. I need to creat that later, which I have specified in the code. — hk2, Sep 13 '19 at 15:05
From our perspective it is just input which you have already calculated so you should include in the data or include code to calculate it. Anyways, doesn't matter now. See if the answers below help. — Shree, Sep 13 '19 at 15:07

Shree · Accepted Answer · 2019-09-13T15:16:21.913

I think you forgot to provide the diff column in df. I created one called diffs so that it doesn't conflict with the function diff(). -

library(dplyr)

df1 %>% 
  group_by(id) %>% 
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y")))) %>% 
  filter(
    n() == 1 |         # always keep if only one row in group
    row_number() > 1 | # always keep all row_number() > 1
    diffs[2] < 5       # keep 1st row only if 2nd row diffs < 5
  ) %>% 
  ungroup()

# A tibble: 11 x 4
   id    Date       Buyer diffs
   <chr> <chr>      <chr> <dbl>
 1 9     11/29/2018 John     NA
 2 9     11/29/2018 John      0
 3 9     11/29/2018 John      0
 4 5     2/13/2019  Maria   264
 5 5     2/13/2019  Maria     0
 6 4     6/15/2018  Sandy     8
 7 4     6/20/2018  Sandy     5
 8 4     8/17/2018  Sandy    58
 9 4     8/20/2018  Sandy     3
10 20    12/25/2018 Paul     NA
11 20    12/25/2018 Paul      0

Data -

I added stringsAsFactors = FALSE

df1 <- data.frame(id=c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
  Date=c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019","2/13/2019","6/7/2018",
    "6/15/2018","6/20/2018","8/17/2018","8/20/2018","12/25/2018","12/25/2018"), 
  Buyer= c("John","John","John","Maria","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul")
  , stringsAsFactors = F)

@Shree Could you please explain me the logic used in your code? — hk2, Sep 13 '19 at 15:15
@hk2 just added explanation as code comments. To visualize, create three columns with one filter condition each and apply `OR` along them. — Shree, Sep 13 '19 at 15:21

score 0 · Answer 2 · answered Sep 13 '19 at 14:58

Maybe I overthought it, but here is one idea,

df8 %>% 
 mutate(Date = as.Date(Date, format = '%m/%d/%Y')) %>% 
 mutate(diff = c(NA, diff(Date))) %>% 
 group_by(id) %>% 
 mutate(diff1 = as.integer(diff >= 5) + row_number()) %>% 
 filter(diff1 != 1 | lead(diff1) != 3) %>% 
 select(-diff1)

which gives,

# A tibble: 11 x 4
# Groups:   id [4]
   id    Date       Buyer  diff
   <fct> <date>     <fct> <dbl>
 1 9     2018-11-29 John     NA
 2 9     2018-11-29 John      0
 3 9     2018-11-29 John      0
 4 5     2019-02-13 Maria   264
 5 5     2019-02-13 Maria     0
 6 4     2018-06-15 Sandy     8
 7 4     2018-06-20 Sandy     5
 8 4     2018-08-17 Sandy    58
 9 4     2018-08-20 Sandy     3
10 20    2018-12-25 Paul    127
11 20    2018-12-25 Paul      0

Remove the first row from each group if the second row meets a condition

2 Answers2

Linked