How to create a list holding multiple fasta sequences and ids in Python

Question

I am new to Python. I am trying to take genome fasta file containing 8 chromosome sequences as input, blast it against a query sequence and extract the top 50 hits. Hre's my code:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Blast import NCBIXML
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
import os

db = list(SeqIO.parse('genome.fasta', 'fasta'))
for i in range(len(db)):
    print "Chromosome_"+str(i+1)
    print('Doing the BLAST and retrieving the results...')
    output_handle = open("dbn.fasta", "w")
    nseq=db[i]
    SeqIO.write(nseq, output_handle, "fasta")
    output_handle.close()
    os.system("makeblastdb -in dbn.fasta -dbtype nucl -out dbn")
    result_handle= NcbiblastnCommandline(query="sequence.fasta", db = "dbn", outfmt= 5, out="my_blast_"+str(i)+".xml", evalue = 0.00001, task = "megablast")
    stderr, stdout = result_handle()
    E_VALUE_THRESH = 1e-100
    c=0
    print "Extracting hits for Chromosome"+str(i+1)+"in another file"
    for record in NCBIXML.parse(open("my_blast_"+str(i)+".xml")):
        if record.alignments:
            for align in record.alignments:
                for hsp in align.hsps:
                    if hsp.expect < E_VALUE_THRESH:
                        if c>50: break
                        start=hsp.sbjct_start
                        end=hsp.sbjct_end
                        newSeq = db[i].seq[:end]
                        newSeq = newSeq[start:]
                        ids=db[i].id+str(c)
                        myseq[c]=SeqRecord(Seq(newSeq,IUPAC.DNA))
                        myseq[c].id=str(ids)                        
                        c=c+1

    output_handler = open("example_"+str(i)+".fasta", "w")
    SeqIO.write(myseq, output_handler, "fasta")
    output_handler.close()

It gives me an error on line 33:

AttributeError: 'module' object has no attribute 'DNA'

I tried to remove IUPAC.DNA, making the line:

myseq[c]=SeqRecord(Seq(newSeq)

It gives me a type error:

TypeError: The sequence data given to a Seq object should be a string (not another Seq object etc)

Is there a way to solve this, and is there any other way to create a multi-fasta list variable?

newSeq is a Seq object already. You cannot convert a Seq object into a Seq Object. turn the line into SeqRecord(newSeq) — Stylize, Apr 14 '15 at 20:16
Thanks, I made the changes. However, the code is giving me another error on line 34: `NameError: name 'myseq' is not defined` — RRN, Apr 15 '15 at 14:02
I tried to define it as: `myseq = []` , but it doesnt work. I don't know exactly how this list is defined in Python. — RRN, Apr 15 '15 at 14:02
You try to access `myseq[c]` before actually initializing it anywhere. You need to initialize it before entering the loop. Furthermore, if you intialize it to an empty list, you will need to append elements instead of indexing them with the `c` variable, otherwise you will get an index out of range error. — cnluzon, Apr 15 '15 at 15:34
Do you have to parse the input files within python? In other words, can you just leave the fasta and db file as is, and invoke blast from command line. Then you can read the results file into python and parse the results. — Malonge, Apr 16 '15 at 23:04

score 0 · Answer 1 · answered Apr 16 '15 at 23:20

0

Maybe you have to parse the fasta within python for some reason, but if not, why not just leave the file as is and invoke blast from the command line? Then parse the results from within python.

import subprocess

subprocess.call('makeblastdb -dbtype nucl -in dbn.fasta -out dbn.DATABASE')
subprocess.call('blastn -db dbn.DATABASE -query genome.fasta -outfmt 7 -out blast_results')

for line in open('blast_results') # outfmt 7 is tabular output
    if not line.startswith('#'):
        # Get top 50 hits (per chromosome I assume)

answered Apr 16 '15 at 23:20

Malonge

1,980
5
23
33

Thanks for your reply. I have to parse the fasta, for instance, to get hits in another file, and also to get the flanking regions in another file. I was able to extract the sequences from the chromosomes, and wrote them directly to another file starting with with '>'. – RRN Apr 20 '15 at 10:14

score 0 · Answer 2 · answered Apr 20 '15 at 10:19

0

Thanks all for your suggestions. I was able to extract the sequences from the chromosomes, and wrote them directly to another file starting with with '>'

ids=db[i]+"_"+str(hsp.sbjct_start)+"_"+str(hsp.sbjct_end)
f.write(">"+str(ids) + "\n")
f.write(str(newSeq) + "\n")

answered Apr 20 '15 at 10:19

RRN

3
3

Why didn't you use `SeqIO` to write the `Seq` objects to a file? – cnluzon Apr 20 '15 at 15:21

How to create a list holding multiple fasta sequences and ids in Python

2 Answers2