5

I am a newly self taught (minus 1 class on the very basics) programmer working for a bio lab. I have a script that goes though RNAseq data from two different cell types and runs a ttest if in another dataset. It worked for this application but the code feels very brutish and I know I will be writing similar scripts a lot.

How can I better write the following code to make it more efficient?

Goal of program:

  1. compare a list of genes to a rnaseq library of two cell types, if the library contains that gene, run a ttest of cell type 1 vs cell type 2
  2. output results.

:

import pandas as pd
from scipy.stats import ttest_ind
rnatest = {'Gene symbol':["GeneA","GeneB"],"rnaseq1A":[1,1.5],"rnaseq1B":[1.3,1.2],"rnaseq2A":[2.3,2.7],"rnaseq2B":[2,2.6]} 
df = pd.DataFrame(rnatest)
GOIlist = ["GeneA","GeneB"]
GOI = []
mu = [] 
pval = []
for index, row in df.iterrows():
  if row['Gene symbol'] in GOIlist:
    t, p = ttest_ind([row["rnaseq1A"],row["rnaseq1B"]],[row["rnaseq2A"],row["rnaseq2B"]])
    GOI.append(row['Gene symbol'])
    mu.append(t)
    pval.append(p)
df2 = {'Gene symbol':GOI,"tVAL":mu, "pVAL":pval}
df2 = pd.DataFrame(df2)
print(df2)  
BioGeek
  • 21,897
  • 23
  • 83
  • 145
  • Do you decide on the RNA library format or is that fixed (and in the format you gave)? – Mark Snyder Jan 22 '20 at 21:12
  • Please be aware that statistical treatment of RNA-seq data may have specificities. Suggested keywords to look up: differential expression analysis, correction for multiple testing – bli Jan 25 '20 at 15:54
  • Maybe a question for codereview.stackexchange.com? – Jir Feb 12 '20 at 07:59

1 Answers1

5

The advantage of using pandas is that you can do columnwise operations. These are generally more efficient then iterating over the DataFrame with a for loop.

I slightly modified your df to show you the effect of filtering out the rows that we need.

>>> import pandas as pd
>>> from scipy.stats import ttest_ind
>>> GOIlist = ["GeneA","GeneB"]
>>> rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"],"rnaseq1A":[1,1.5,2],"rnaseq1B":[1.3,1.2,1.1],"rnaseq2A":[2.3,2.7,3.1],"rnaseq2B":[2,2.6,3.2]} 
>>> df = pd.DataFrame(rnatest)
>>> print(df)

    Gene symbol     rnaseq1A    rnaseq1B    rnaseq2A    rnaseq2B
0   GeneA           1.0         1.3         2.3         2.0
1   GeneB           1.5         1.2         2.7         2.6
2   GeneC           2.0         1.1         3.1         3.2

Now how I would rewrite your code:

  1. Use set_index to make the Gene symbol row an index, this speeds up the lookup time (especially if you have large DataFrames)
  2. Use loc to filter out the rows that have a Gene symbol that is in GOIlist
  3. Create two new columns pVal and tVal to which you assign the output of ttest_ind. Note that we don't have to iterate over the rows anymore.
  4. Optionally, drop the rnaseq* columns if you don't want to see them in your output.

In code:

>>> df3 = df.set_index(['Gene symbol'])
>>> df3 = df3.loc[GOIlist]
>>> df3['tVal'], df3['pVal'] = ttest_ind([df3["rnaseq1A"], df3["rnaseq1B"]], [df3["rnaseq2A"], df3["rnaseq2B"]])
>>> df3 = df3.drop(['rnaseq1A', 'rnaseq1B', 'rnaseq2A', 'rnaseq2B'], axis=1)
>>> print(df3)
            tVal        pVal
Gene symbol         
GeneA       -4.714045   0.042174
GeneB       -8.221922   0.014473

So, how much more efficient is this code now?

If I artificially increase the size of our DataFrame 10.000 times (so in total 30.000 rows instead of 3)

n = 10_000
rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"]*n, "rnaseq1A":[1,1.5,2]*n, "rnaseq1B":[1.3,1.2,1.1]*n, "rnaseq2A":[2.3,2.7,3.1]*n, "rnaseq2B":[2,2.6,3.2]*n} 
df = pd.DataFrame(rnatest)

then I can use timeit to measure the execution time of the code. For your original approach I get the result:

13.7 s ± 555 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while my approach finishes in

45.2 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

so that is a more than 300 times speedup!

BioGeek
  • 21,897
  • 23
  • 83
  • 145