Repeatedly Accessing LARGE fasta files. Most mem efficifent method?

Question

I'm using Biopython to open a large single entry fasta file (514 mega bases) so I can pull out the DNA sequence from specific coordinates. It's reasonably slow to return the sequence and I'm just wondering if there's a faster way to perform this task that I haven't figured out. Speed wouldn't be an issue for just one or two hits, but I'm iterating over a list of 145,000 coordinates and it's gonna take days :/

import sys
from Bio import SeqIO
from Bio.Seq import Seq

def get_seq(fasta, cont_start, cont_end, strand):
  f = fasta
  start_pos = cont_start
  end_pos = cont_end
  for seq_record in SeqIO.parse(f, "fasta"):
   if strand == '-' :
    return seq_record.seq[int(start_pos):int(end_pos)].reverse_complement()
   elif strand == '+':
    return seq_record.seq[int(start_pos):int(end_pos)]
   else :
    print ' Invalid syntax! 
    sys.exit(1)

Do you mean to yield a sequence at those coordinates for _all_ records in the FASTA file? If so, you probably want to use `yield`- this will just return the part of the sequence for the first `seq_record`. — David Cain, Jun 10 '13 at 05:01
@JoachimIsaksson: no, that's correct syntax. The second parameter is a [string identifying the input format](http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html#parse) (Also, `f=fasta` here, so that'd just result in the same value being passed twice). — David Cain, Jun 10 '13 at 15:37
@davidcain Ah, oops, that's what reading code before coffee in the morning works out like apparently :) — Joachim Isaksson, Jun 10 '13 at 16:31

David Cain · Answer 1 · 2013-06-10T04:57:34.633

Your function is parsing the entire file each time you want to find a single strand. There's no need to do that- the "better way" would be to parse all the sequences once, and store them into memory to be accessed later. The easiest way to do this is to just convert the generator returned by SeqIO.parse to a list or similar data structure.

Alternatively, you could store the parsed SeqRecord objects in a database: ZODB, shelve, or simple use of pickle could achieve this.

However, from the looks of your function, you always just return results from the first SeqRecord found in the file (the first iteration through SeqIO.parse(f, "fasta") will either return or call sys.exit(1)). Do you mean to yield instead (I'm assuming you do?).

Here's how I would handle it:

# Parse once, store in a list (alternatively, place DB "load" command here
all_seq_records = list(SeqIO.parse(f, "fasta"))

def get_seq(cont_start, cont_end, strand):
    assert strand in ["+", "-"], "Invalid strand parameter '%s'" % strand
    for seq_record in all_seq_records:
        segment = seq_record.seq[int(start_pos):int(end_pos)]
        yield segment if strand == "+" else segment.reverse_complement()

Thanks for your comments and suggested code David! I am only parsing a single fasta file, and I'm pulling the sequence specified by coordinates from parsing a gff annotation file. So I'm calling get_seq to grab the sequence for start and end pos. I will play around with your suggestion. There's definately some new stuff in there for me to think about :) — user1995839, Jun 10 '13 at 06:02

Repeatedly Accessing LARGE fasta files. Most mem efficifent method?

1 Answers1