2

Lets assume, my data is like

  group_id    col1
1        1     A,B
2        1     B,C
3        2     A,C
4        2     B,D
5        3     A,D
6        3 A,B,C,D

I would like to summarise/mutate col1 where its elements are intersected within same group (over group_id). My required output is like (if summarising)

  group_id col1
1        1    B
2        2 <NA>
3        3  A,D

or like this (if mutating)

  group_id col1
1        1    B
2        1    B
3        2 <NA>
4        2 <NA>
5        3  A,D
6        3  A,D

I can easily create a union by using function toString but scratching my head for how to have common elements in the output. Basically intersect requires at least two arguments and is therefore not working here.

dput(df) is as under

df <-  structure(list(group_id = c(1L, 1L, 2L, 2L, 3L, 3L), col1 = c("A,B", 
"B,C", "A,C", "B,D", "A,D", "A,B,C,D")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45

3 Answers3

1

You can split col1 on comma and use Reduce + intersect to get common values in each group_id.

library(dplyr)
df %>%
  group_by(group_id) %>%
  summarise(col1 = toString(Reduce(intersect, strsplit(col1, ','))))

#  group_id col1  
#*    <int> <chr> 
#1        1 "B"   
#2        2 ""    
#3        3 "A, D"
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks for the response @Ronak. It is working as expected. Let me check other answers before accepting one. – AnilGoyal Mar 05 '21 at 08:04
1

Would this work:

library(dplyr)
library(tidyr)
df %>% separate_rows(col1) %>% 
   group_by(group_id, col1) %>% filter(n()>1) %>% 
   distinct() %>% group_by(group_id) %>% summarise(col1 = toString(col1)) %>% 
   right_join(df %>% select(group_id) %>% distinct()) %>% 
   arrange(group_id)
`summarise()` ungrouping output (override with `.groups` argument)
Joining, by = "group_id"
# A tibble: 3 x 2
  group_id col1 
     <int> <chr>
1        1 B    
2        2 NA   
3        3 A, D 
Karthik S
  • 11,348
  • 2
  • 11
  • 25
1

One option using dplyr and tidyr could be:

df %>%
 separate_rows(col1) %>%
 count(group_id, col1) %>%
 group_by(group_id) %>%
 summarise(col1 = if_else(all(n == 1), NA_character_, paste(col1[n == 2], collapse = ",")))

  group_id col1 
     <int> <chr>
1        1 B    
2        2 <NA> 
3        3 A,D  
tmfmnk
  • 38,881
  • 4
  • 47
  • 67