0

I'm trying to merge or collapse rows which are partial duplicates based on ID and with varying conditions based on the column. I'd like to end up with a single row per ID in a new dataframe.

This is what my data looks like.

   ID Var1      Date1 Var2      Date2
1   1    A 2021-05-01    X 2021-05-01
2   1    B 2021-06-01    Y 2021-04-01
3   2    B 2021-03-01    X       <NA>
4   2 <NA>       <NA>    X 2021-03-01
5   3    A       <NA>    Y 2021-04-01
6   3 <NA> 2021-02-01 <NA> 2021-05-01
7   4 <NA> 2021-04-01    X       <NA>
8   4 <NA> 2021-03-01    Z       <NA>
9   5    A 2021-02-01    X 2021-03-01
10  5    B 2021-02-01    Y 2021-04-01

I want to group by ID and take the earliest date for each date variable. Then, I want to extract the corresponding values for Var1 based on Date1 and Var2 based on Date 2.

   ID Var1      Date1 Var2      Date2
1   1    A 2021-05-01    Y 2021-04-01
2   2    B 2021-03-01    X 2021-03-01
3   3 <NA> 2021-02-01    Y 2021-04-01
4   4 <NA> 2021-03-01 <NA>       <NA>
5   5    A 2021-02-01    X 2021-03-01
6   5    B 2021-02-01    X 2021-03-01

Here's the sample data.

df <- data.frame (ID = c(1,1,2,2,3,3,4,4,5,5), Var1 = c("A", "B", "B", NA, "A", NA, NA, NA, "A", "B"), Date1 = c("2021-05-01", "2021-06-01", "2021-03-01", NA, NA, "2021-02-01", "2021-04-01", "2021-03-01", "2021-02-01", "2021-02-01"), Var2 = c("X", "Y", "X", "X", "Y", NA, "X", "Z", "X", "Y"), Date2 = c("2021-05-01", "2021-04-01", NA, "2021-03-01", "2021-04-01","2021-05-01", NA, NA, "2021-03-01", "2021-04-01"))

and converting to date variables.

df.test1$P_date <- as.Date(df.test1$P_date, format = "%Y-%m-%d")
df.test1$R_date <- as.Date(df.test1$R_date, format = "%Y-%m-%d")

For ID no.5 I'd like to keep both rows if there if the dates match but the Var1 or Var2 don't match but if this is too hard that's ok.

I got as far as getting the earliest date for each date variable using

df <- df %>% group_by(ID) %>% mutate(earliest_date1 = last(na.omit(Date1, order_by = Date1)
df <- df %>% group_by(ID) %>% mutate(earliest_date2 = last(na.omit(Date2, order_by = Date2)

But I haven't been able to get get the corresponding values from Var1 and Var 2 into a single row in a new dataframe. Tried various things but don't think I'm on the right track at all so any help would be greatly appreciated!

dan1st
  • 12,568
  • 8
  • 34
  • 67
Selena
  • 1

1 Answers1

0

This might be helpful to start with. You can use pivot_longer to put into long format. Then grouping by ID and either 1 or 2, sort by Date and take the first Var and Date.

It is still unclear what the complete logic is for row #5. Perhaps you can explain further?

library(tidyverse)

df %>%
  mutate(across(starts_with("Date"), as.Date)) %>%
  pivot_longer(cols = -ID, names_to = c(".value", "Num"), names_pattern = "(Var|Date)(\\d+)") %>%
  group_by(ID, Num) %>%
  arrange(Date) %>%
  summarise(earliest_date = first(Date), Var = first(Var)) %>%
  pivot_wider(id_cols = ID, names_from = Num, values_from = c(Var, earliest_date))

Output

     ID Var_1 Var_2 earliest_date_1 earliest_date_2
  <dbl> <chr> <chr> <date>          <date>         
1     1 A     Y     2021-05-01      2021-04-01     
2     2 B     X     2021-03-01      2021-03-01     
3     3 NA    Y     2021-02-01      2021-04-01     
4     4 NA    X     2021-03-01      NA             
5     5 A     X     2021-02-01      2021-03-01
Ben
  • 28,684
  • 5
  • 23
  • 45