I'm using Biopython to open a large single entry fasta file (514 mega bases) so I can pull out the DNA sequence from specific coordinates. It's reasonably slow to return the sequence and I'm just wondering if there's a faster way to perform this task that I haven't figured out. Speed wouldn't be an issue for just one or two hits, but I'm iterating over a list of 145,000 coordinates and it's gonna take days :/
import sys
from Bio import SeqIO
from Bio.Seq import Seq
def get_seq(fasta, cont_start, cont_end, strand):
f = fasta
start_pos = cont_start
end_pos = cont_end
for seq_record in SeqIO.parse(f, "fasta"):
if strand == '-' :
return seq_record.seq[int(start_pos):int(end_pos)].reverse_complement()
elif strand == '+':
return seq_record.seq[int(start_pos):int(end_pos)]
else :
print ' Invalid syntax!
sys.exit(1)