3

In python this code, where I directly call the function SeqIO.parse() , runs fine:

from Bio import SeqIO
a = SeqIO.parse("a.fasta", "fasta")
records = list(a)

for asq in SeqIO.parse("a.fasta", "fasta"):
    print("Q")

But this, where I first store the output of SeqIO.parse() in a variable(?) called a, and then try to use it in my loop, it doesn't run:

from Bio import SeqIO
a = SeqIO.parse("a.fasta", "fasta")
records = list(a)

for asq in a:
    print("Q")

Is this because a the output from the function || SeqIO.parse("a.fasta", "fasta") || is being stored in 'a' differently from when I directly call it? What exactly is the identity of 'a' here. Is it a variable? Is it an object? What does the function actually return?

Chris_Rands
  • 38,994
  • 14
  • 83
  • 119
  • 2
    I don't know about biopython or python in general, but I know a thing or two about a thing or two and I have a suspicion. What happens if you don't run `list(a)` before `for asq in a`? -- I'm guessing that the function `SeqIO.parse` returns some sort of iterator that has its own state. I assume that calling `list(...)` iterates it as far as possible, and it is never reset in any manner, so further attempts to use it just try to continue iterating from that finished state and so don't receive any more results. This is just speculation on my part though. – moreON Feb 21 '19 at 02:36
  • Holy jesus, It works. But I have no idea why. How do you reset the state of an iterator? Why does such an interaction even exist? Is this kind of behavior common in other high level languages? – Abraham Ahmad Feb 21 '19 at 02:42
  • It's entirely plausible that you can't. It depends entirely on the implementation of the iterator - this is where my no-specific-knowledge wall is apparent. On that note, I expect that you expect that `SeqIO.parse()` will return the same result for each of those identical calls, so why not just iterate over the list that you have already created the second time? I assume that the point of saving into the list is to avoid repeated calls to `SeqIO.parse` because it's expensive for some reason (i.e. complex calculations, disk/network reads, etc) – moreON Feb 21 '19 at 02:47
  • 1
    For the other part of that follow-up. Yeah, something that reads from (file/network/database/etc) and is only able to go forwards is pretty common and leads to things like this. If you can guarantee only to need to read something forward, once (and you often can) then it can be significantly cheaper to create something that is only capable of providing that capability without being able to go backwards or jump to arbitrary points. – moreON Feb 21 '19 at 02:50
  • @moreON had a correct suspicion, as explained in the answer by Chris_Rands. When you do `a = SeqIO.parse("a.fasta", "fasta")`, `a` is just ready to iterate over the fasta file and produce records, but has not done it yet. When you do `records = list(a)`, then the iteration happens, in order to fill the list. And when you try to loop over `a` again, it has already reached the end of the fasta file, and produces no more records. – bli Feb 27 '19 at 16:42

2 Answers2

4

SeqIO.parse() returns a normal python generator. This part of the Biopython module is written in pure python:

>>> from Bio import SeqIO
>>> a = SeqIO.parse("a.fasta", "fasta")
>>> type(a)
<class 'generator'>

Once a generator is iterated over it is exhausted as you discovered. You can't rewind a generator but you can store the contents in a list or dict if you don't mind putting it all in memory (useful if you need random access). You can use SeqIO.to_dict(a) to store in a dictionary with the record ids as the keys and sequences as the values. Simply re-building the generator calling SeqIO.parse() again will avoid dumping the file contents into memory of course.

Chris_Rands
  • 38,994
  • 14
  • 83
  • 119
0

I have a similar issue that the parsed sequence file doesn't work inside a for-loop. Code below:

genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None, names=['anonymous_gsa_id', 'genome_id'])
# sample_f = SeqIO.parse('SAMPLE.fasta', 'fasta')

for i, r in genomes_l.iterrows():
    genome_name = r['anonymous_gsa_id']
    genome_ids = r['genome_id'].split(',')
    genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]
    with open(f'out_dir/{genome_name}_contigs.fasta', 'w') as handle:
        SeqIO.write(genome_contigs, handle, 'fasta')

Originally, I read the file in as sample_f, however inside the loop it wouldn't work. Would appreciate any help to avoid having to read the file over and over again. Specifically the below line:

genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]

Thank you!

Susheel Busi
  • 163
  • 8