I have a dataset of genes arranged in numbered groups. A gene can appear in more than one group, and I want to remove a duplicate row for a gene when it also appears in another group and is the only gene in its own group.
For example, my data looks like this:
Group Gene
1 Gene1
1 Gene2
1 Gene3
2 Gene3
2 Gene4
3 Gene1
4 Gene10
4 Gene11
Gene1 appears in groups 1 and 3, but in group 3 it is the only gene in that group, so I want to remove that row. How can I code this condition and then remove duplicates using that filter?
I've seen some similar questions and I've tried repurposing code but I haven't got very far. At the moment I am trying:
library(dplyr)
df %>%
group_by(Gene) %>%
mutate(duplicates = filter(n() > 1))
mutate(to_remove = duplicates == TRUE & Group = filter(n() == 1))
But I'm probably not getting the syntax right at all; I don't use dplyr a lot.
Example input data:
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), Gene = c("Gene1",
"Gene2", "Gene3", "Gene3", "Gene4", "Gene1", "Gene10", "Gene11"
)), row.names = c(NA, -8L), class = c("data.table", "data.frame"
))
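For reference, here is a minimal sketch of how the condition described above might be expressed in dplyr. It reads the rule as "drop a row when its group contains exactly one row and its gene occurs more than once in the whole table". The column names `group_size` and `gene_count` and the object `result` are hypothetical helpers introduced only for illustration, and the example data is rebuilt as a plain data.frame so that the block runs on its own.

```r
library(dplyr)

# Same example data as above, rebuilt as a plain data.frame
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L),
  Gene  = c("Gene1", "Gene2", "Gene3", "Gene3", "Gene4",
            "Gene1", "Gene10", "Gene11")
)

result <- df %>%
  # number of rows in each group (hypothetical helper column)
  add_count(Group, name = "group_size") %>%
  # number of times each gene occurs overall (hypothetical helper column)
  add_count(Gene, name = "gene_count") %>%
  # drop rows where the gene is duplicated AND is alone in its group
  filter(!(group_size == 1 & gene_count > 1)) %>%
  # remove the helper columns again
  select(-group_size, -gene_count)
```

With the example data, this drops only the Group 3 / Gene1 row and keeps the other seven rows. One caveat of this reading: if a gene occurred only in single-gene groups (for instance, alone in both group 3 and a further group), every copy would be removed, so that case would need an extra rule.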