I have a dataset of genes organized into numbered groups. A gene can appear in more than one group, and I want to remove a duplicate occurrence of a gene when it also appears in another group and is the only gene in that group.

For example my data looks like:

Group       Gene
1           Gene1
1           Gene2
1           Gene3
2           Gene3
2           Gene4
3           Gene1
4           Gene10
4           Gene11

Gene1 appears in groups 1 and 3, but in group 3 it is the only gene in that group, so I want to remove that row. How can I code this condition and then remove duplicates using that filter?

I've seen some similar questions and I've tried repurposing code but I haven't got very far. At the moment I am trying:

library(dplyr)

df %>% 
   group_by(Gene) %>% 
   mutate(duplicates = filter(n() > 1))
   mutate(to_remove = duplicates == TRUE & Group = filter(n() == 1))

But I'm probably not getting the syntax right at all; I don't use dplyr much.

Example input data:

df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), Gene = c("Gene1", 
"Gene2", "Gene3", "Gene3", "Gene4", "Gene1", "Gene10", "Gene11"
)), row.names = c(NA, -8L), class = c("data.table", "data.frame"
))
ThomasIsCoding
LN3

1 Answer

Maybe you can try subset + ave

> subset(
+   df,
+   ave(seq_along(Gene), Group, FUN = length) > 1
+ )
  Group   Gene
1     1  Gene1
2     1  Gene2
3     1  Gene3
4     2  Gene3
5     2  Gene4
7     4 Gene10
8     4 Gene11

or a dplyr option

> df %>%
+   group_by(Group) %>%
+   filter(n_distinct(Gene) > 1) %>%
+   ungroup()
# A tibble: 7 x 2
  Group Gene  
  <int> <chr>
1     1 Gene1
2     1 Gene2
3     1 Gene3
4     2 Gene3
5     2 Gene4
6     4 Gene10
7     4 Gene11
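
Note that both options above drop every group that contains a single gene, whether or not that gene appears elsewhere. If you only want to drop a single-gene group when its gene is also present in another group (as the question describes), a minimal base R sketch could look like the following. The helper names group_size and gene_groups are just illustrative, and the data is rebuilt as a plain data.frame so the snippet is self-contained.

```r
# Example data from the question, rebuilt as a plain data.frame
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L),
  Gene  = c("Gene1", "Gene2", "Gene3", "Gene3", "Gene4", "Gene1", "Gene10", "Gene11"),
  stringsAsFactors = FALSE
)

# Number of rows in the group each row belongs to (illustrative helper name)
group_size <- ave(seq_along(df$Group), df$Group, FUN = length)

# Number of distinct groups in which each row's gene appears (illustrative helper name)
gene_groups <- ave(df$Group, df$Gene, FUN = function(g) length(unique(g)))

# Drop a row only when its group holds a single gene AND that gene
# also occurs in at least one other group
result <- df[!(group_size == 1 & gene_groups > 1), ]
result
```

On the example data this removes only the Gene1 row in group 3. A single-gene group whose gene appears nowhere else would be kept, which is where this differs from the two options above.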
ThomasIsCoding