
I want to use Python to read in a FASTA sequence file and convert it into a pandas DataFrame. I use the following script:

from Bio import SeqIO
import pandas as pd

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # print(desp)
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
        seq_df = pd.DataFrame(seqList)
        print(seq_df.shape)
        seq_df.columns=['strainName']+list(range(1, seq_df.shape[1]))
    return seq_df


if __name__ == "__main__":
    path = 'path/to/the/fasta/file'
    input = path + 'GISAIDspikeprot0119.selection.fasta'
    df = fasta2df(input)

The 'GISAIDspikeprot0119.selection.fasta' file can be found at https://drive.google.com/file/d/1F5Ir5S6h9rFsVUQkDdZpomiWo9_bXtaW/view?usp=sharing

The script runs on my Linux workstation with only one CPU core, but is it possible to run it with more cores (multiple processes) so that it finishes much faster? What would the code for that look like?

With many thanks!

  • Take a look at the `multiprocessing` module. Since your code is wrapped in a function, you can try to experiment with multiprocessing map. Also, if you have to apply this code to multiple files, you can start many scripts at the same time (use & in a Linux shell to start a process in the background) – Edo98 Jan 25 '21 at 01:20
  • Yes, I can try it, but I am not familiar with it. Could you also give some example code here for reference? And can multiprocessing use a specified number of cores, or does it just use all available cores on the computer? – Yeping Sun Jan 25 '21 at 01:24
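
For reference, here is a minimal sketch of the multiprocessing.Pool pattern suggested in the comment above. It assumes several FASTA files (the file names below are placeholders) and reuses the fasta2df function from the question. The processes argument also answers the second question: it sets exactly how many worker processes are started; if it is omitted, Pool uses all available cores (os.cpu_count()).

from multiprocessing import Pool

if __name__ == "__main__":
    # placeholder file names; each worker parses one file independently
    files = ['a.fasta', 'b.fasta', 'c.fasta']
    # processes=4 limits the pool to 4 worker processes; omit it to use all cores
    with Pool(processes=4) as pool:
        dfs = pool.map(fasta2df, files)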

1 Answer


Before throwing more CPUs at your problem, you should invest some time in inspecting which parts of your code are slow.
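
For example, here is a minimal sketch using the standard-library cProfile module, assuming fasta2df and the input path are defined as in the question:

import cProfile

# profile the whole conversion and sort the report by cumulative time
cProfile.run('fasta2df(input)', sort='cumulative')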

In your case, you are executing the expensive conversion seq_df = pd.DataFrame(seqList) in every loop iteration. This is just wasting CPU time as the result seq_df is overwritten in the next iteration.

Your code took over 15 minutes on my machine. After moving pd.DataFrame(seqList) and the print statement out of the loop, it is down to ~15 seconds.

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # note: ._data is a private Biopython attribute; str(record.seq) is the public equivalent
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
    # build the DataFrame once, after the loop, instead of in every iteration
    seq_df = pd.DataFrame(seqList)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df

In fact, almost all of the time is spent in the line seq_df = pd.DataFrame(seqList) - about 13 seconds for me. By setting the dtype explicitly to string, we can bring it down to ~7 seconds:

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
    # an explicit dtype spares pandas the costly per-column type inference
    seq_df = pd.DataFrame(seqList, dtype="string")
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df
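
A quick way to confirm that the explicit dtype took effect (hypothetical usage, with the file from the question in the working directory):

df = fasta2df('GISAIDspikeprot0119.selection.fasta')
print(df.shape)            # (number of sequences, 1 + sequence length)
print(df.dtypes.unique())  # every column should report the pandas "string" dtype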

With this new performance, I highly doubt that you can improve the speed any further through parallel processing.
