Filter lines of a table that match with your set values

Question

I have converted the elements of a column into a set:

set_genes = set(df['genes'].unique())

I also have a table (a TSV file) in which one column contains values that match my set. I want to extract the lines of this table whose value in that column is in the set.

Example

print(set_genes)
{'IDA'}

print(file)

1    1      10  IDA     ID1
1    10     20  IDA     ID2
1    20     30  IDA     ID3
2    1      10  IDB     ID1
2    20     20  IDB     ID2
2    30     30  IDB     ID3

Results

1    1      10  IDA     ID1
1    10     20  IDA     ID2
1    20     30  IDA     ID3

Abhyuday Vaish · Accepted Answer · 2022-04-27T11:38:44.210

If your TSV file is loaded into a dataframe called df, use the following, where column_name is the name of the column that holds the values found in set_genes:

df.loc[df['column_name'].isin(set_genes)]

Sample example:

import pandas as pd

df = pd.DataFrame({
    'C1': [1, 1, 1, 2, 2, 2],
    'C2': [1, 10, 20, 1, 10, 30],
    'C3': [10, 20, 30, 10, 20, 30],
    'C4': ['IDA', 'IDA', 'IDA', 'IDB', 'IDB', 'IDB'],
    'C5': ['ID1', 'ID2', 'ID3', 'ID1', 'ID2', 'ID3'],
})
df
   C1  C2  C3   C4   C5
0   1   1  10  IDA  ID1
1   1  10  20  IDA  ID2
2   1  20  30  IDA  ID3
3   2   1  10  IDB  ID1
4   2  10  20  IDB  ID2
5   2  30  30  IDB  ID3
set_genes = {'IDA'}
df2 = df.loc[df['C4'].isin(set_genes)]
df2
   C1  C2  C3   C4   C5
0   1   1  10  IDA  ID1
1   1  10  20  IDA  ID2
2   1  20  30  IDA  ID3
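
If the table is still on disk, one way to get it into a dataframe is pandas.read_csv with a tab separator. This is a minimal sketch, assuming the file has no header row and the gene values sit in the fourth column. The column names and the in-memory sample standing in for the file are made up for illustration:

```python
import io

import pandas as pd

# Stand-in for the real TSV file (assumption: tab-separated, no header row).
tsv_text = (
    "1\t1\t10\tIDA\tID1\n"
    "1\t10\t20\tIDA\tID2\n"
    "1\t20\t30\tIDA\tID3\n"
    "2\t1\t10\tIDB\tID1\n"
    "2\t20\t20\tIDB\tID2\n"
    "2\t30\t30\tIDB\tID3\n"
)

set_genes = {"IDA"}

# Column names are hypothetical; pass the file path instead of io.StringIO(...).
table = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "gene", "id"],
)

# Keep only the rows whose gene value is in the set.
matches = table[table["gene"].isin(set_genes)]
print(matches)
```

The boolean mask from isin does the same job as the df.loc[...] form above; either spelling should work.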

score 0 · Answer 2 · answered Apr 27 '22 at 11:23

You can try something like this:

import pandas as pd
  
data = {
    'A': ['d', 'q', 's', 'a', 'a'],
    'genes': ['ID1', 'ID2', 'ID3', 'ID4', 'ID4'],
}
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
  
# print(df)
# Get the unique values of the 'genes' column
df.genes.unique()

The output is:

array(['ID1', 'ID2', 'ID3', 'ID4'], dtype=object)

answered Apr 27 '22 at 11:23

bara-elba


Thanks for your answers, Abhyuday Vaish and bara-elba, but I think that importing the file into pandas is not a good idea, as the file is massive and the application will be running this many times. I was thinking of something like reading it line by line. Is this approach correct? – Manolo Dominguez Becerra Apr 27 '22 at 11:26
@ManuelDominguezBecerra Line by line would be slow as well. How big is your file? – Abhyuday Vaish Apr 27 '22 at 11:28
600K lines and 49.2 MB – Manolo Dominguez Becerra Apr 27 '22 at 11:29
@ManuelDominguezBecerra No problem, it would work. – Abhyuday Vaish Apr 27 '22 at 11:30
Great! I will try after my lunch and I will come back to you soon. Thanks!! – Manolo Dominguez Becerra Apr 27 '22 at 11:31
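
For reference, the line-by-line idea raised in the comments could be sketched with the standard csv module as below. This is only an illustration, not something either answerer posted: it assumes a tab-separated file with no header and the gene value in the fourth field (index 3), and filter_rows is a hypothetical helper name. An in-memory sample stands in for the file.

```python
import csv
import io


def filter_rows(lines, wanted, column=3):
    """Yield rows (lists of strings) whose field at `column` is in `wanted`."""
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        # Skip short or empty lines instead of raising IndexError.
        if len(row) > column and row[column] in wanted:
            yield row


# Stand-in for the real file (assumption: tab-separated, no header row).
sample = io.StringIO(
    "1\t1\t10\tIDA\tID1\n"
    "1\t10\t20\tIDA\tID2\n"
    "1\t20\t30\tIDA\tID3\n"
    "2\t1\t10\tIDB\tID1\n"
    "2\t20\t20\tIDB\tID2\n"
    "2\t30\t30\tIDB\tID3\n"
)

set_genes = {"IDA"}
kept = list(filter_rows(sample, set_genes))
for row in kept:
    print("\t".join(row))
```

With a real file, the same function can be given an open file handle (opened with newline="") in place of the StringIO object, so only one line is held in memory at a time. Set membership is a constant-time lookup on average, so the cost is dominated by reading the file.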
