Keep common rows among groups based on a column in dplyr

Question

My data frame looks like this:

df <- data.frame(gene=c("A","B","C","A","B","D"), 
                 origin=rep(c("old","new"),each=3),
                 value=sample(rnorm(10,2),6))

  gene origin     value
1    A    old 1.5566908
2    B    old 1.3000358
3    C    old 0.7668213
4    A    new 2.5274712
5    B    new 2.2434525
6    D    new 2.0758326

I want to find the genes that are common to the two groups of origin (old and new).

I want my data to look like this:

  gene origin     value
1    A    old 1.5566908
2    B    old 1.3000358
4    A    new 2.5274712
5    B    new 2.2434525

Any help is appreciated. Ideally, I would like to find common rows among groups using multiple columns.

score 4 · Accepted Answer · answered Aug 02 '21 at 13:56

You can use split to get the genes for each origin, reduce with intersect to keep only the genes present in every group, and then use that vector in filter.

library(dplyr)
library(purrr)

df %>% filter(gene %in% (split(df$gene, df$origin) %>% reduce(intersect)))

#  gene origin value
#1    A    old 1.271
#2    B    old 2.838
#3    A    new 0.974
#4    B    new 1.375

Or, staying in base R:

subset(df, gene %in% Reduce(intersect, split(df$gene, df$origin)))
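
To see what the one-liner does, here is a minimal step-by-step sketch in base R. The value column uses fixed numbers rounded from the question's printout (an assumption, since the original data were drawn without a seed), and the names by_origin, common and result are only illustrative.

```r
# Fixed copy of the question's data (values rounded; illustrative only)
df <- data.frame(
  gene   = c("A", "B", "C", "A", "B", "D"),
  origin = rep(c("old", "new"), each = 3),
  value  = c(1.56, 1.30, 0.77, 2.53, 2.24, 2.08)
)

# Step 1: one character vector of genes per origin group
by_origin <- split(df$gene, df$origin)

# Step 2: fold intersect() over the list, so only genes found in
# every group survive (works for any number of groups)
common <- Reduce(intersect, by_origin)

# Step 3: keep the rows whose gene is in the common set
result <- subset(df, gene %in% common)
result
```

Because Reduce folds over the whole list, the same code should carry over unchanged if origin had more than two levels.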

score 4 · Answer 2 · answered Aug 02 '21 at 14:00

A base R option using ave + subset

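# ave() runs FUN once per gene and recycles the result over that gene's rows;
# with a character `origin` the result comes back as "TRUE"/"FALSE", hence as.logical()
subset(
  df,
  as.logical(ave(origin, gene, FUN = function(x) all(c("old", "new") %in% x)))
)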

gives

  gene origin     value
1    A    old 0.5994593
2    B    old 4.0449345
4    A    new 3.2478612
5    B    new 0.2673525

ThomasIsCoding

score 3 · Answer 3 · answered Aug 02 '21 at 13:54

One possibility could be:

library(dplyr)

df %>%
  group_by(gene) %>%
  filter(all(c("old", "new") %in% origin))

  gene  origin value
  <chr> <chr>  <dbl>
1 A     old    1.63 
2 B     old    0.904
3 A     new    2.18 
4 B     new    1.24

tmfmnk

Dear tmfmnk, this is a super cool and neat solution. Can you explain how it works, please? It looks amazing. – LDT Aug 03 '21 at 08:29
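
As a rough explanation of the grouped filter above: inside group_by(gene), the condition is evaluated once per gene, and because it yields a single TRUE or FALSE, filter keeps or drops all rows of that gene together. The sketch below makes that per-gene check visible with summarise, then shows a variant that does not hard-code the origin levels, and finally one that groups on more than one column. The data values, the chrom column and the names per_gene, kept and kept2 are assumptions made up for illustration.

```r
library(dplyr)

# Fixed copy of the question's data (values rounded; illustrative only)
df <- data.frame(
  gene   = c("A", "B", "C", "A", "B", "D"),
  origin = rep(c("old", "new"), each = 3),
  value  = c(1.56, 1.30, 0.77, 2.53, 2.24, 2.08)
)

# The per-gene test that filter() applies: one logical value per group
per_gene <- df %>%
  group_by(gene) %>%
  summarise(in_both = all(c("old", "new") %in% origin))

# Variant without hard-coded levels: a gene is kept when it occurs
# in as many distinct origins as exist in the whole data frame
n_origins <- n_distinct(df$origin)
kept <- df %>%
  group_by(gene) %>%
  filter(n_distinct(origin) == n_origins) %>%
  ungroup()

# Multiple key columns: add a hypothetical second key and group on both
df2 <- df %>% mutate(chrom = c("1", "2", "3", "1", "X", "4"))
kept2 <- df2 %>%
  group_by(gene, chrom) %>%
  filter(n_distinct(origin) == n_origins) %>%
  ungroup()
```

In the last step gene B is dropped because its old and new rows differ in the hypothetical chrom column, so only the two A rows remain.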

score 3 · Answer 4 · answered Aug 02 '21 at 13:55

I would filter on duplicated genes, scanning from both the first and the last occurrence.

library(tidyverse)

df %>% filter(
        duplicated(gene, fromLast = TRUE) | duplicated(gene, fromLast = FALSE)
)

  gene origin    value
1    A    old 2.665606
2    B    old 1.565466
3    A    new 4.025450
4    B    new 2.647110

Note: I can't replicate your data as you didn't provide a seed!
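
One thing worth checking before relying on the duplicated approach: it keeps any gene that occurs more than once, so it matches the desired result only when each gene appears at most once per origin. The sketch below uses a made-up data frame (df_rep, with gene C repeated inside the old group) to show the difference, and a possible workaround that de-duplicates gene/origin pairs first. The workaround as written assumes exactly two origins.

```r
# Hypothetical data: gene "C" occurs twice, but only within "old"
df_rep <- data.frame(
  gene   = c("A", "B", "C", "C", "A", "B", "D"),
  origin = c("old", "old", "old", "old", "new", "new", "new")
)

# Flag every row whose gene occurs more than once anywhere
dup_either <- duplicated(df_rep$gene) | duplicated(df_rep$gene, fromLast = TRUE)
flagged <- unique(df_rep$gene[dup_either])  # includes "C"

# Workaround: collapse to unique gene/origin pairs first, so a gene is
# duplicated only if it shows up under more than one origin
pairs  <- unique(df_rep[, c("gene", "origin")])
common <- pairs$gene[duplicated(pairs$gene)]

subset(df_rep, gene %in% common)
```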

edited Aug 02 '21 at 14:02

answered Aug 02 '21 at 13:55

Serkan

score 3 · Answer 5 · answered Aug 02 '21 at 17:07

Using subset with table in base R

subset(df, gene %in% names(which(rowSums(table(gene, origin) > 0) == 2)))
  gene origin     value
1    A    old 3.0536642
2    B    old 2.0796124
4    A    new 0.1621484
5    B    new 2.3587338

akrun