How do I make more efficient code for a search for multiple strings in column in pandas

Question

I am a newly self taught (minus 1 class on the very basics) programmer working for a bio lab. I have a script that goes though RNAseq data from two different cell types and runs a ttest if in another dataset. It worked for this application but the code feels very brutish and I know I will be writing similar scripts a lot.

How can I better write the following code to make it more efficient?

Goal of program:

compare a list of genes to a rnaseq library of two cell types, if the library contains that gene, run a ttest of cell type 1 vs cell type 2
output results.

:

import pandas as pd
from scipy.stats import ttest_ind
rnatest = {'Gene symbol':["GeneA","GeneB"],"rnaseq1A":[1,1.5],"rnaseq1B":[1.3,1.2],"rnaseq2A":[2.3,2.7],"rnaseq2B":[2,2.6]} 
df = pd.DataFrame(rnatest)
GOIlist = ["GeneA","GeneB"]
GOI = []
mu = [] 
pval = []
for index, row in df.iterrows():
  if row['Gene symbol'] in GOIlist:
    t, p = ttest_ind([row["rnaseq1A"],row["rnaseq1B"]],[row["rnaseq2A"],row["rnaseq2B"]])
    GOI.append(row['Gene symbol'])
    mu.append(t)
    pval.append(p)
df2 = {'Gene symbol':GOI,"tVAL":mu, "pVAL":pval}
df2 = pd.DataFrame(df2)
print(df2)

Do you decide on the RNA library format or is that fixed (and in the format you gave)? — Mark Snyder, Jan 22 '20 at 21:12
Please be aware that statistical treatment of RNA-seq data may have specificities. Suggested keywords to look up: differential expression analysis, correction for multiple testing — bli, Jan 25 '20 at 15:54

BioGeek · Answer 1 · 2020-01-27T10:02:29.963

The advantage of using pandas is that you can do columnwise operations. These are generally more efficient then iterating over the DataFrame with a for loop.

I slightly modified your df to show you the effect of filtering out the rows that we need.

>>> import pandas as pd
>>> from scipy.stats import ttest_ind
>>> GOIlist = ["GeneA","GeneB"]
>>> rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"],"rnaseq1A":[1,1.5,2],"rnaseq1B":[1.3,1.2,1.1],"rnaseq2A":[2.3,2.7,3.1],"rnaseq2B":[2,2.6,3.2]} 
>>> df = pd.DataFrame(rnatest)
>>> print(df)

    Gene symbol     rnaseq1A    rnaseq1B    rnaseq2A    rnaseq2B
0   GeneA           1.0         1.3         2.3         2.0
1   GeneB           1.5         1.2         2.7         2.6
2   GeneC           2.0         1.1         3.1         3.2

Now how I would rewrite your code:

Use set_index to make the Gene symbol row an index, this speeds up the lookup time (especially if you have large DataFrames)
Use loc to filter out the rows that have a Gene symbol that is in GOIlist
Create two new columns pVal and tVal to which you assign the output of ttest_ind. Note that we don't have to iterate over the rows anymore.
Optionally, drop the rnaseq* columns if you don't want to see them in your output.

In code:

>>> df3 = df.set_index(['Gene symbol'])
>>> df3 = df3.loc[GOIlist]
>>> df3['tVal'], df3['pVal'] = ttest_ind([df3["rnaseq1A"], df3["rnaseq1B"]], [df3["rnaseq2A"], df3["rnaseq2B"]])
>>> df3 = df3.drop(['rnaseq1A', 'rnaseq1B', 'rnaseq2A', 'rnaseq2B'], axis=1)
>>> print(df3)
            tVal        pVal
Gene symbol         
GeneA       -4.714045   0.042174
GeneB       -8.221922   0.014473

So, how much more efficient is this code now?

If I artificially increase the size of our DataFrame 10.000 times (so in total 30.000 rows instead of 3)

n = 10_000
rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"]*n, "rnaseq1A":[1,1.5,2]*n, "rnaseq1B":[1.3,1.2,1.1]*n, "rnaseq2A":[2.3,2.7,3.1]*n, "rnaseq2B":[2,2.6,3.2]*n} 
df = pd.DataFrame(rnatest)

then I can use timeit to measure the execution time of the code. For your original approach I get the result:

13.7 s ± 555 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while my approach finishes in

45.2 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

so that is a more than 300 times speedup!

How do I make more efficient code for a search for multiple strings in column in pandas

1 Answers1