very grateful for your help trying to group/collapse rows of all columns (all but the two columns I use to group_by) and would like to exclude duplicated strings in the merge (only keep distinct strings and numbers). The df has many many more columns but here is a toy example.
Is there a way to only keep distinct strings when using summarize_all?
df = data.frame(
+ variant = c("1-12345-GT", "1-12345-GT", "2-3456-C-T", "3-45567-C-A", "4-566879-C-T", "4-566879-C-T"),
ID=c("A", "A", "B", "C", "D", "D"),
value1=c( "GJB1", "GJB1", "TBC", "TZY", "FBY", "FBY"),
value2=c( 0.5, 1, 0.5, 0.5, 0.5, 0.5))
df
variant ID value1 value2
1 1-12345-GT A GJB1 0.5
2 1-12345-GT A GJB1 1.0
3 2-3456-C-T B TBC 0.5
4 3-45567-C-A C TZY 0.5
5 4-566879-C-T D FBY 0.5
6 4-566879-C-T D FBY 0.5
I tried
df %>% group_by(variant, ID) %>% summarise_all(~(toString(na.omit(.))))
A tibble: 4 × 4
Groups: variant [4]
variant ID value1 value2
<chr> <chr> <chr> <chr>
1 1-12345-GT A GJB1, GJB1 0.5, 1
2 2-3456-C-T B TBC 0.5
3 3-45567-C-A C TZY 0.5
4 4-566879-C-T D FBY, FBY 0.5, 0.5
I would like row 1 GJB1 and row 4 FBY and 0.5 to not be duplicated. But I would like to keep both unique values at row 1: 0.5, 1.
I tried splitting the rows. But this does not eliminate GJB1 duplicates since the rows will still be unique.
df %>% separate_rows(value1, value2) %>% unique() %>% group_by(variant, ID) %>% summarise_all(~(toString(na.omit(.)))) %>% ungroup()
A tibble: 4 × 4
variant ID value1 value2
<chr> <chr> <chr> <chr>
1 1-12345-GT A GJB1, GJB1 0.5, 1
2 2-3456-C-T B TBC 0.5
3 3-45567-C-A C TZY 0.5
4 4-566879-C-T D FBY 0.5
Thinking there must be a function only to keep distinct strings when using summarize_all. Thank you!