7

My data frame looks like this:

df <- data.frame(gene=c("A","B","C","A","B","D"), 
                 origin=rep(c("old","new"),each=3),
                 value=sample(rnorm(10,2),6))

  gene origin     value
1    A    old 1.5566908
2    B    old 1.3000358
3    C    old 0.7668213
4    A    new 2.5274712
5    B    new 2.2434525
6    D    new 2.0758326

I want to find the genes that are common to both origin groups (old and new).

I want my data to look like this:

  gene origin     value
1    A    old 1.5566908
2    B    old 1.3000358
4    A    new 2.5274712
5    B    new 2.2434525

Any help is appreciated. Ideally, I would like to find common rows among groups using multiple columns.

ThomasIsCoding
LDT

5 Answers

4

You can split gene by origin, reduce the resulting groups with intersect to get the common genes, and use that vector in filter.

library(dplyr)
library(purrr)

df %>% filter(gene %in% (split(df$gene, df$origin) %>% reduce(intersect)))

#  gene origin value
#1    A    old 1.271
#2    B    old 2.838
#3    A    new 0.974
#4    B    new 1.375

Or, staying in base R:

subset(df, gene %in% Reduce(intersect, split(df$gene, df$origin)))
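
To make the one-liner easier to follow, here is a minimal sketch that breaks it into steps. The fixed values in the value column and the names genes_by_origin, common_genes and result are assumptions made for illustration; they are not part of the original question or answer.

```r
# Fixed values replace the random ones so the result is reproducible (assumption).
df <- data.frame(gene = c("A", "B", "C", "A", "B", "D"),
                 origin = rep(c("old", "new"), each = 3),
                 value = c(1.5, 1.3, 0.8, 2.5, 2.2, 2.1),
                 stringsAsFactors = FALSE)

# One character vector of genes per origin group.
genes_by_origin <- split(df$gene, df$origin)

# Fold intersect() over the groups; this works for any number of groups.
common_genes <- Reduce(intersect, genes_by_origin)

# Keep only the rows whose gene appears in every group.
result <- subset(df, gene %in% common_genes)
print(result)
```

Because Reduce folds intersect over the whole list, the same code should carry over unchanged if origin had three or more levels.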
Ronak Shah
4

A base R option using ave + subset

subset(
  df,
  as.logical(ave(origin, gene, FUN = function(x) all(c("old", "new") %in% x)))
)

gives

  gene origin     value
1    A    old 0.5994593
2    B    old 4.0449345
4    A    new 3.2478612
5    B    new 0.2673525
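
The as.logical wrapper may look surprising, so here is a small sketch of what ave produces on its own. The fixed values and the names flags and keep are assumptions for illustration. ave writes each group's result back into a copy of its first argument, which is the character column origin here, so the logical results come back as the strings "TRUE" and "FALSE".

```r
# Fixed values replace the random ones so the result is reproducible (assumption).
df <- data.frame(gene = c("A", "B", "C", "A", "B", "D"),
                 origin = rep(c("old", "new"), each = 3),
                 value = c(1.5, 1.3, 0.8, 2.5, 2.2, 2.1),
                 stringsAsFactors = FALSE)

# For each gene, test whether its origins include both labels.
# The result is stored in a copy of df$origin, hence it is character.
flags <- ave(df$origin, df$gene,
             FUN = function(x) all(c("old", "new") %in% x))

# Convert the "TRUE"/"FALSE" strings back to logical for row selection.
keep <- as.logical(flags)

print(df[keep, ])
```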
ThomasIsCoding
3

One possibility could be:

library(dplyr)

df %>%
    group_by(gene) %>%
    filter(all(c("old", "new") %in% origin))

  gene  origin value
  <chr> <chr>  <dbl>
1 A     old    1.63 
2 B     old    0.904
3 A     new    2.18 
4 B     new    1.24 
tmfmnk
  • Dear tmfmnk, this is a super cool and neat solution. Can you explain how it works, please? It looks amazing – LDT Aug 03 '21 at 08:29
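
As a rough illustration of how the grouped filter works, the sketch below first computes the per-gene test with summarise and then applies it with filter. The fixed values and the names per_gene, result, n_origins and result_general are assumptions for illustration, not part of the answer. Within each gene group, the expression yields a single TRUE or FALSE, and filter applies that one value to every row of the group.

```r
library(dplyr)

# Fixed values replace the random ones so the result is reproducible (assumption).
df <- data.frame(gene = c("A", "B", "C", "A", "B", "D"),
                 origin = rep(c("old", "new"), each = 3),
                 value = c(1.5, 1.3, 0.8, 2.5, 2.2, 2.1),
                 stringsAsFactors = FALSE)

# Step 1: one TRUE/FALSE per gene, TRUE when the gene has both origins.
per_gene <- df %>%
  group_by(gene) %>%
  summarise(in_both = all(c("old", "new") %in% origin))

# Step 2: the same test inside filter() keeps or drops whole groups.
result <- df %>%
  group_by(gene) %>%
  filter(all(c("old", "new") %in% origin)) %>%
  ungroup()

# A variant that avoids hard-coding the labels: keep groups that cover
# every distinct origin. To match on several columns, list them all in
# group_by(), e.g. group_by(gene, some_other_key) (hypothetical column).
n_origins <- n_distinct(df$origin)
result_general <- df %>%
  group_by(gene) %>%
  filter(n_distinct(origin) == n_origins) %>%
  ungroup()

print(per_gene)
print(result)
```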
3

I would keep every row whose gene is duplicated, scanning from both the first and the last row. This assumes each gene appears at most once per origin.

library(tidyverse)

df %>% filter(
  duplicated(gene, fromLast = TRUE) | duplicated(gene, fromLast = FALSE)
)
  gene origin    value
1    A    old 2.665606
2    B    old 1.565466
3    A    new 4.025450
4    B    new 2.647110

Note: I can't replicate your data, as you didn't provide a seed!
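
One caveat worth sketching: the duplicate check looks only at gene, so a gene repeated within a single origin is also flagged. The data frame df2 below is a made-up example (an assumption, not the asker's data) in which gene A occurs twice in old and never in new.

```r
# Hypothetical data: gene A is repeated within "old" and absent from "new".
df2 <- data.frame(gene = c("A", "A", "B", "C"),
                  origin = c("old", "old", "old", "new"),
                  stringsAsFactors = FALSE)

# Duplicate-based flag: TRUE for any gene that occurs more than once.
dup <- duplicated(df2$gene) | duplicated(df2$gene, fromLast = TRUE)

# Group-aware flag: TRUE only for genes present in every origin.
in_both <- df2$gene %in% Reduce(intersect, split(df2$gene, df2$origin))

print(data.frame(df2, dup = dup, in_both = in_both))
```

On this input the two flags disagree for gene A, so the duplicate-based filter is best reserved for data where each gene occurs at most once per group.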

Serkan
3

Using subset with table in base R:

subset(df, gene %in% names(which(rowSums(table(gene, origin) > 0) == 2)))
  gene origin     value
1    A    old 3.0536642
2    B    old 2.0796124
4    A    new 0.1621484
5    B    new 2.3587338
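
The nested call can be unpacked as in the sketch below. The fixed values and the names presence, n_groups and common are assumptions for illustration. Comparing against ncol(presence) instead of the literal 2 is a small variation that should also cover more than two origins.

```r
# Fixed values replace the random ones so the result is reproducible (assumption).
df <- data.frame(gene = c("A", "B", "C", "A", "B", "D"),
                 origin = rep(c("old", "new"), each = 3),
                 value = c(1.5, 1.3, 0.8, 2.5, 2.2, 2.1),
                 stringsAsFactors = FALSE)

# Gene-by-origin presence matrix: TRUE where the gene occurs in that origin.
presence <- table(df$gene, df$origin) > 0

# Number of origins in which each gene occurs.
n_groups <- rowSums(presence)

# Genes that occur in every origin.
common <- names(which(n_groups == ncol(presence)))

print(subset(df, gene %in% common))
```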
akrun