Filtering a Dataset by another Dataset in R

Question

The task I am trying to accomplish is essentially filtering one dataset by the entries in another dataset by entries in an "id" column. The data sets I am working with are quite large having 10 of thousands of entries and 30 or so variables. I have made toy datasets to help explain what I want to do.

The first dataset contains a list of entries and each entry has their own unique accession number(this is the id).

Data1 = data.frame(accession_number = c('a','b','c','d','e','f'), values =c('1','3','4','2','3','12'))
>Data1
      accession_number values
1                a      1
2                b      3
3                c      4
4                d      2
5                e      3
6                f     12

I am only interested in the entries that have the accession number 'c', 'd', and 'e'. (In reality though my list is around 100 unique accession numbers). Next, I created a dataframe with the only the unique accession numbers and no other values.

>SubsetData1
  accession_number
1                c
2                d
3                e

The second data set, which i am looking to filter, contains multiple entries some which have the same accession number.

>Data2
   accession_number values  Intensity col4       col6
1                 a      1 -0.0251304    a -0.4816370
2                 a      2 -0.4308735    b -1.0335971
3                 c      3 -1.9001321    c  0.6416735
4                 c      4  0.1163934    d -0.4489048
5                 c      5  0.7586820    e  0.5408650
6                 b      6  0.4294415    f  0.6828412
7                 b      7 -0.8045201    g  0.6677730
8                 b      8 -0.9898947    h  0.3948412
9                 c      9 -0.6004642    i -0.3323932
10                c     10  1.1367578    j  0.9151915
11                c     11  0.7084980    k -0.3424039
12                c     12 -0.9618102    l  0.2386307
13                c     13  0.2693441    m -1.3861064
14                d     14  1.6059971    n  1.3801924
15                e     15  2.4166472    o -1.1806929
16                e     16 -0.7834619    p  0.1880451
17                e     17  1.3856535    q -0.7826357
18                f     18 -0.6660976    r  0.6159731
19                f     19  0.2089186    s -0.8222399
20                f     20 -1.5809582    t  1.5567113
21                f     21  0.3610700    u  0.3264431
22                f     22  1.2923324    v  0.9636267

What im looking to do is compare the subsetted list of the first data set(SubsetData1), with the second dataset (Data2) to create a filtered dataset that only contains the entries that have the same accession numbers defined in the subsetted list. The filtered dataset should look something like this.

   accession_number values  Intensity col4       col6
9                 c      9 -0.6004642    i -0.3323932
10                c     10  1.1367578    j  0.9151915
11                c     11  0.7084980    k -0.3424039
12                c     12 -0.9618102    l  0.2386307
13                c     13  0.2693441    m -1.3861064
14                d     14  1.6059971    n  1.3801924
15                e     15  2.4166472    o -1.1806929
16                e     16 -0.7834619    p  0.1880451
17                e     17  1.3856535    q -0.7826357

I don't know if I need to start making loops in order to tackle this problem, or if there is a simple R command that would help me accomplish this task. Any help is much appreciated.

Thank You

So `Data1` doesn't matter at all here? I don't understand why row 3 of `Data2` isn't in the desired output since it has an accession_number of "c". — MrFlick, May 06 '15 at 04:19
Oh wow. I didn't realize that, and that was totally a typo on my part. Sorry about that Mr.Flick. — SnakeSnake, May 06 '15 at 04:41

score 4 · Accepted Answer · answered May 06 '15 at 04:19

4

Try this

WantedData=Data2[Data2$ccession_number %in% SubsetData1$accession_number, ]

answered May 06 '15 at 04:19

BICube

4,451
1
23
44

score 2 · Answer 2 · answered Jun 09 '21 at 17:42

2

You can also use inner_join of dplyr package.

dat = inter_join(Data2, SubsetData1)

answered Jun 09 '21 at 17:42

Nad Pat

3,129
3
10
20

score 1 · Answer 3 · answered May 06 '15 at 04:19

1

The subset function is designed for basic subsetting:

subset(Data2,accession_number %in% SubsetData1$accession_number)

Alternately, here you could merge:

merge(Data2,SubsetData1)

answered May 06 '15 at 04:19

Frank

66,179
8
96
180

score 0 · Answer 4 · answered May 06 '15 at 05:08

0

The other solutions seem fine, but I like the readability of dplyr, so here's a dplyr solution.

library(dplyr)
new_dataset <- Data2 %>%
    filter(accession_number %in% SubsetData1$accession_number)

answered May 06 '15 at 05:08

goodtimeslim

880
7
13

5

If you are going to use dplyr then `semi_join()` might be a better choice than `filter()` – MrFlick May 06 '15 at 05:13

Filtering a Dataset by another Dataset in R

4 Answers4

Linked