R: Collapse partial duplicate rows into one row by ID but with different conditions per column

Question

I'm trying to merge or collapse rows which are partial duplicates based on ID and with varying conditions based on the column. I'd like to end up with a single row per ID in a new dataframe.

This is what my data looks like.

   ID Var1      Date1 Var2      Date2
1   1    A 2021-05-01    X 2021-05-01
2   1    B 2021-06-01    Y 2021-04-01
3   2    B 2021-03-01    X       <NA>
4   2 <NA>       <NA>    X 2021-03-01
5   3    A       <NA>    Y 2021-04-01
6   3 <NA> 2021-02-01 <NA> 2021-05-01
7   4 <NA> 2021-04-01    X       <NA>
8   4 <NA> 2021-03-01    Z       <NA>
9   5    A 2021-02-01    X 2021-03-01
10  5    B 2021-02-01    Y 2021-04-01

I want to group by ID and take the earliest date for each date variable. Then, I want to extract the corresponding values for Var1 based on Date1 and Var2 based on Date 2.

   ID Var1      Date1 Var2      Date2
1   1    A 2021-05-01    Y 2021-04-01
2   2    B 2021-03-01    X 2021-03-01
3   3 <NA> 2021-02-01    Y 2021-04-01
4   4 <NA> 2021-03-01 <NA>       <NA>
5   5    A 2021-02-01    X 2021-03-01
6   5    B 2021-02-01    X 2021-03-01

Here's the sample data.

df <- data.frame (ID = c(1,1,2,2,3,3,4,4,5,5), Var1 = c("A", "B", "B", NA, "A", NA, NA, NA, "A", "B"), Date1 = c("2021-05-01", "2021-06-01", "2021-03-01", NA, NA, "2021-02-01", "2021-04-01", "2021-03-01", "2021-02-01", "2021-02-01"), Var2 = c("X", "Y", "X", "X", "Y", NA, "X", "Z", "X", "Y"), Date2 = c("2021-05-01", "2021-04-01", NA, "2021-03-01", "2021-04-01","2021-05-01", NA, NA, "2021-03-01", "2021-04-01"))

and converting to date variables.

df.test1$P_date <- as.Date(df.test1$P_date, format = "%Y-%m-%d")
df.test1$R_date <- as.Date(df.test1$R_date, format = "%Y-%m-%d")

For ID no.5 I'd like to keep both rows if there if the dates match but the Var1 or Var2 don't match but if this is too hard that's ok.

I got as far as getting the earliest date for each date variable using

df <- df %>% group_by(ID) %>% mutate(earliest_date1 = last(na.omit(Date1, order_by = Date1)
df <- df %>% group_by(ID) %>% mutate(earliest_date2 = last(na.omit(Date2, order_by = Date2)

But I haven't been able to get get the corresponding values from Var1 and Var 2 into a single row in a new dataframe. Tried various things but don't think I'm on the right track at all so any help would be greatly appreciated!

score 0 · Answer 1 · answered May 26 '21 at 14:01

This might be helpful to start with. You can use pivot_longer to put into long format. Then grouping by ID and either 1 or 2, sort by Date and take the first Var and Date.

It is still unclear what the complete logic is for row #5. Perhaps you can explain further?

library(tidyverse)

df %>%
  mutate(across(starts_with("Date"), as.Date)) %>%
  pivot_longer(cols = -ID, names_to = c(".value", "Num"), names_pattern = "(Var|Date)(\\d+)") %>%
  group_by(ID, Num) %>%
  arrange(Date) %>%
  summarise(earliest_date = first(Date), Var = first(Var)) %>%
  pivot_wider(id_cols = ID, names_from = Num, values_from = c(Var, earliest_date))

Output

     ID Var_1 Var_2 earliest_date_1 earliest_date_2
  <dbl> <chr> <chr> <date>          <date>         
1     1 A     Y     2021-05-01      2021-04-01     
2     2 B     X     2021-03-01      2021-03-01     
3     3 NA    Y     2021-02-01      2021-04-01     
4     4 NA    X     2021-03-01      NA             
5     5 A     X     2021-02-01      2021-03-01

R: Collapse partial duplicate rows into one row by ID but with different conditions per column

1 Answers1

Linked