1

I'm trying to extract a set of genes (row names) from my large data set based on another data matrix that contains a list of my genes of interest. I've read about that I should use the filter and %in% command, but am unsure as to how to write it properly.

example: my large database:

Gene        Week1         Week 2.        Week 3
A.           20.           14.            5
B.           5.            10.            15
C.           2.            4.             6
D.           20.           18.            19

my small data base:

Gene
A
C
D

And I want my result to be:

Gene        Week1         Week 2.        Week 3
A.           20.           14.            5
C.           2.            4.             6
D.           20.           18.            19

Could anybody please help out? I'd really appreciate it and my apologies for the rather simple question :)

dario
  • 6,415
  • 2
  • 12
  • 26
  • Did you take a look at this topic? https://stackoverflow.com/questions/18933187/how-to-select-some-rows-with-specific-rownames-from-a-dataframe – Lieke Jun 04 '20 at 14:21

1 Answers1

2

Using logical row indexes:

large_database[large_database$Gene %in% unique(small_data_base$Gene), ]

Explanation:

large_database$Gene %in% unique(small_data_base$Gene)

Checks for each entry (i.e. row) in large_database$Gene if it appears in unique(small_database$Gene) i.e. the list of unique values in the column Gene of small_data_base and returns a boolean vector (a vector of TRUE and FALSE).

We then can use this vector as a row 'index' to selecet only rows where the vector is TRUE (i.e. the value of large_database$Gene was in unique(small_database$Gene)

dario
  • 6,415
  • 2
  • 12
  • 26
  • In case `Gene` is in both unique, `match` could be used: `large_database[match(small_data_base$Gene, large_database$Gene),]` – GKi Jun 04 '20 at 14:24
  • Actually for `match` or `tidyr::join` etc. to work it would be enough if `small_data_base$Gene` has only unique values (not both). But since the question asked explicitly for a solution using `%` suggested the solution above ;) – dario Jun 04 '20 at 14:28