Is there a way to remove duplicates but add the value if it appears at least once in the n selected columns?

Question

For example imagine there is a dataset that looks like this

Edit:Added Date and Num column for extra context

ID|Date  |Col1|Col2|Col3|Num
 1  10-10  Y             5    
 1  10-10  Y    Y        5 
 1  10-10            Y   5
 2  09-17  Y             6
 2  09-17            Y   6
 3  12-14       Y        7
 3  12-14  Y             7 
 4  06-06  Y             8
 4  06-06

Is there a way to have an output that looks like this?

ID|Date  |Col1|Col2|Col3|Num
 1  10-10  Y    Y    Y   5    
 2  09-17  Y         Y   6
 3  12-14  Y    Y        7
 4  06-06  Y             8

structure(list(ID = c("000001", "000001", "000001", "000001", 
"000001", "000001", "000001", "000001", "000001", "000001", "000002", 
"000002", "000002", "000003", "000003", "000003", "000003", "000003", 
"000003", "000003"), Date = structure(c(1570492800, 1570492800, 
1570492800, 1570492800, 1570492800, 1570492800, 1570492800, 1570492800, 
1570492800, 1570492800, 1570665600, 1570665600, 1570665600, 1570838400, 
1570838400, 1570838400, 1570838400, 1570838400, 1570838400, 1570838400
), tzone = "UTC", class = c("POSIXct", "POSIXt")), Col1 = c(NA, 
NA, NA, NA, "Y", NA, NA, "Y", NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA), Col2 = c("Y", "Y", "Y", "Y", "Y", "Y", "Y", 
"Y", "Y", "Y", "Y", "Y", "Y", NA, NA, NA, NA, "Y", "Y", "Y"), 
    Col3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, "Y", NA, NA, "Y", "Y", "Y"), Num1 = c(1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 861, 861, 861, 497, 497, 497, 497, 497, 
    497, 497)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

akrun · Accepted Answer · 2022-09-02T17:34:03.703

2

We could group by 'ID', 'Date' and summarise the rest of the columns (everything()) by looping across them and extracting the first non-NA element

library(dplyr)
df2 %>%  
   group_by(ID, Date) %>%
   summarise(across(everything(),  ~.x[!is.na(.x)][1]),
     .groups = 'drop')

-output

# A tibble: 3 × 6
  ID     Date                Col1  Col2  Col3   Num1
  <chr>  <dttm>              <chr> <chr> <chr> <dbl>
1 000001 2019-10-08 00:00:00 Y     Y     <NA>      1
2 000002 2019-10-10 00:00:00 <NA>  Y     <NA>    861
3 000003 2019-10-12 00:00:00 <NA>  Y     Y       497

edited Sep 02 '22 at 17:34

answered Sep 02 '22 at 16:00

akrun

874,273
37
540
662

So my example didn't take into account that there are actually more columns in the data frame. I apologize for that. I basically want it to do the same as shown, but take into account the Date and Num column. I tried `summarise(across(df[,c"col1,col2,col3")]))` but I got error message saying character string is not in a standard unambigous format. – user35131 Sep 02 '22 at 16:28
same with the new code, but don't know what date should affect anything. – user35131 Sep 02 '22 at 16:32
it has nothing to do with date being type posixct? – user35131 Sep 02 '22 at 16:42
Note the actual dataset I'm working in could not be shared so I created a dummy version of it that is slightly different than the example shown before. However same idea. – user35131 Sep 02 '22 at 16:55
The dput I added is a much smaller subset of the actual dataset. – user35131 Sep 02 '22 at 16:59
@user35131 I think it is `na_if("")` causing the problem. If you remove it, works – akrun Sep 02 '22 at 17:00
@user35131 with your dput data, it is working fine though after removing the `na_if` – akrun Sep 02 '22 at 17:02

score 1 · Answer 2 · answered Sep 02 '22 at 18:37

1

A base R option with aggregate

      ID       Date Col1 Col2 Col3 Num1
1 000001 2019-10-08    Y    Y         1
2 000002 2019-10-10         Y       861
3 000003 2019-10-12         Y    Y  497

answered Sep 02 '22 at 18:37

ThomasIsCoding

96,636
9
24
81

Is there a way to remove duplicates but add the value if it appears at least once in the n selected columns?

2 Answers2