1

very grateful for your help trying to group/collapse rows of all columns (all but the two columns I use to group_by) and would like to exclude duplicated strings in the merge (only keep distinct strings and numbers). The df has many many more columns but here is a toy example.

Is there a way to only keep distinct strings when using summarize_all?

df = data.frame(
+     variant = c("1-12345-GT", "1-12345-GT", "2-3456-C-T", "3-45567-C-A", "4-566879-C-T", "4-566879-C-T"),
ID=c("A", "A", "B", "C", "D", "D"),
value1=c( "GJB1", "GJB1", "TBC", "TZY", "FBY", "FBY"),
value2=c( 0.5, 1, 0.5, 0.5, 0.5, 0.5))

df
       variant ID value1 value2
1   1-12345-GT  A   GJB1    0.5
2   1-12345-GT  A   GJB1    1.0
3   2-3456-C-T  B    TBC    0.5
4  3-45567-C-A  C    TZY    0.5
5 4-566879-C-T  D    FBY    0.5
6 4-566879-C-T  D    FBY    0.5

I tried

df %>% group_by(variant, ID) %>% summarise_all(~(toString(na.omit(.)))) 


A tibble: 4 × 4
Groups:   variant [4]
  variant      ID    value1     value2  
  <chr>        <chr> <chr>      <chr>   
1 1-12345-GT   A     GJB1, GJB1 0.5, 1  
2 2-3456-C-T   B     TBC        0.5     
3 3-45567-C-A  C     TZY        0.5     
4 4-566879-C-T D     FBY, FBY   0.5, 0.5

I would like row 1 GJB1 and row 4 FBY and 0.5 to not be duplicated. But I would like to keep both unique values at row 1: 0.5, 1.

I tried splitting the rows. But this does not eliminate GJB1 duplicates since the rows will still be unique.

df %>% separate_rows(value1, value2) %>% unique() %>% group_by(variant, ID) %>% summarise_all(~(toString(na.omit(.)))) %>% ungroup()


A tibble: 4 × 4
  variant      ID    value1     value2
  <chr>        <chr> <chr>      <chr> 
1 1-12345-GT   A     GJB1, GJB1 0.5, 1
2 2-3456-C-T   B     TBC        0.5   
3 3-45567-C-A  C     TZY        0.5   
4 4-566879-C-T D     FBY        0.5   

Thinking there must be a function only to keep distinct strings when using summarize_all. Thank you!

Sanna
  • 11
  • 2

1 Answers1

0

You could use unique():

library(dplyr, warn = FALSE)

df |>
  group_by(variant, ID) |>
  summarise(
    across(everything(), ~ toString(na.omit(unique(.x))))
  )
#> `summarise()` has grouped output by 'variant'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups:   variant [4]
#>   variant      ID    value1 value2
#>   <chr>        <chr> <chr>  <chr> 
#> 1 1-12345-GT   A     GJB1   0.5, 1
#> 2 2-3456-C-T   B     TBC    0.5   
#> 3 3-45567-C-A  C     TZY    0.5   
#> 4 4-566879-C-T D     FBY    0.5
stefan
  • 90,330
  • 6
  • 25
  • 51
  • Thank you so much! This worked perfectly. I tried adding unique but I did not manage to figure out that there needed to be unique(**.x**) - can you explain to me what the .x is referring to? Thank you! – Sanna Jun 02 '23 at 12:34
  • Hi Sanna. `.x` works the same way as `.` which you used, i.e. you can use one or the other. Personally I prefer the `.x`. And both options are special symbols to refer to the function argument when using lambda functions ie. when doing `~ f(.x)` instead of `function(x) f(x)`. – stefan Jun 02 '23 at 12:49