I'm fairly new to Python and I'm attempting to split a text file, in which each entry consists of two lines, into batches of at most 400 objects.

The data I'm working with consists of thousands of sequences in FASTA format (plain text with a header line, used in bioinformatics), where entries look like this:

>HORVU6Hr1G000325.5
PIPPPASHFHPHHQNPSAATQPLCAAMAPAAKKPPLKSSSSHNSAAGDAA
>HORVU6Hr1G000326.1
MVKFTAEELRGIMDKKNNIRNMSVIAHVD
...

In Biopython, there is a parser, SeqIO.parse, which lets me access these entries as objects consisting of an ID and a sequence string. I need those objects in later parts of my code, and since I also need to be memory efficient, I'd like to avoid reading/parsing the source file more times than necessary.
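
For reference, this is roughly how I access the records (the file name here is just an example):

from Bio import SeqIO

# Each record has an ID (the text after ">") and the sequence itself
for record in SeqIO.parse("protein.fasta", "fasta"):
    print(record.id, len(record.seq))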

In the Biopython wiki, there's a recommended way to do this via a generator, which I'm using: https://biopython.org/wiki/Split_large_file

However, I'm using Python 3.7 while the code there is written for Python 2.x, so some changes are definitely necessary. I've changed the

entry = iterator.next()

into

entry = next(iterator)

but I'm not sure if that's all I need to change.

Here's the code:

def batch_iterator(iterator, batch_size=400):
    """Returns lists of length batch_size."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None

            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

while True:
    bsequence = input("Please enter the full path to your FASTA file(e.g. c:\\folder1\\folder2\\protein.fasta):\n")
    try:
        fastafile = open(bsequence)
        break
    except:
        print("File not found!\n")            


record_iter = SeqIO.parse(fastafile,"fasta")
num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

print("num=%i" % (num,))
if num > 400:
    print("The specified file contains %i sequences. It's recommended to split the FASTA file into batches of max. 400 sequences.\n" % (num,))
    while True:
        decision = input("Do you wish to create batch files? (Original file will not be overwritten)\n(Y/N):")
        if (decision == 'Y' or 'y'):
            for i, batch in enumerate(batch_iterator(record_iter, 400), 1):
                filename = "group_%i.fasta" % (i + 1)
                with open(filename, "w") as handle:
                    count = SeqIO.write(batch, handle, "fasta")
                print("Wrote %i records to %s" % (count, filename))
            break
        elif (decision == 'N' or 'n'):
            break
        else:
            print('Invalid input\n')

...next part of the code

When I run this, after the Y/N prompt, even if I type Y, the program just skips over to the next part of my code without creating any new file. Debugger shows the following:

Do you wish to create batch files? (Original file will not be overwritten)
(Y/N):Y
Traceback (most recent call last):
  File "\Biopython\mainscript.py", line 32, in batch_iterator
    entry = next(iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1569, in _trace
    return self._trace_and_catch(frame, event, arg)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1611, in _trace_and_catch
    frame.f_back, event, marker_function_args, node

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1656, in _handle_progress_event
    self._save_current_state(frame, event, args, node)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1738, in _save_current_state
    exception_info = self._export_exception_info()

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1371, in _export_exception_info
    "affected_frame_ids": exc[1]._affected_frame_ids_,

AttributeError: 'StopIteration' object has no attribute '_affected_frame_ids_'

Is there some difference between Python 2.x and 3.x that I'm overlooking? Is the problem somewhere else? Is this approach completely wrong? Thanks in advance!

  • `(decision == 'Y' or 'y')` -> `decision in 'Yy'` – G_M Apr 24 '19 at 21:50
  • Thanks for the input, I didn't know it's possible to write it this way. It still doesn't solve the problem, though. – Ludolph314 Apr 24 '19 at 21:55
  • I don't see anything immediately wrong with the batch portion. Can you confirm `record_iter` does have values? – MyNameIsCaleb Apr 24 '19 at 22:19
  • Perhaps it makes sense to open an issue on [biopython's github page](https://github.com/biopython/biopython), at least regarding the out-of-date documentation – Asmus Apr 24 '19 at 22:24
  • I can confirm, adding the following code after `record_iter = SeqIO.parse(fastafile,"fasta")` produces a list of IDs that seems to be in order: `for record in record_iter: print(record.id)` – Ludolph314 Apr 24 '19 at 22:26
  • @Asmus I didn't have enough time to do that yet, and I suspect this is a problem with either my code or some Py-2.x code that slipped through rather than the Biopython package itself. I will open an issue regarding the outdated manual, hopefully once I can also suggest the proper way it should be written in 3.x. Thanks for your suggestion! – Ludolph314 Apr 24 '19 at 22:34

1 Answer


I can't check your whole code since you've omitted part of it, but I can see two things that are wrong here:

num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

These lines exhaust your file object `fastafile`. Remove them entirely (and remember to fix the indentation below, remove the `if num > 400:` check, etc.).

if (decision == 'Y' or 'y'):

This does not do what you think it does. Change it to `if decision in ('Y', 'y'):` or `if decision.lower() == 'y':`. You repeat this pattern below in the line `if (decision == 'N' or 'n'):`, so change that, too.

Make the changes and try to run the code again.

Explanation

1st issue: in Python, a file object (i.e. what `open('filename.txt', 'r')` returns) behaves like a generator: it is an iterator that can only be consumed once. This may seem a bit weird at first, but that's the whole point. Iterating over a file object lets you read the file line by line without ever loading the whole content into memory at once; the file object simply keeps track of which line comes next.

The flip side is that it can't go backwards, so your `for line in fastafile` loop exhausts it. When you later call `batch_iterator(record_iter, 400)`, the iterator behind `record_iter` is already exhausted, which is why you run into an error later on: `batch_iterator` cannot parse the FASTA sequences if there's nothing left to parse.
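
If you do want that header count before deciding whether to split, one workaround (just a sketch, reusing your variable names) is to keep the cheap counting pass but rewind the handle with seek(0) before handing it to SeqIO.parse, so the parser sees the whole file again:

fastafile = open(bsequence)

# First pass: count only the header lines
num = sum(1 for line in fastafile if line.startswith(">"))

# Rewind to the beginning of the file before parsing
fastafile.seek(0)
record_iter = SeqIO.parse(fastafile, "fasta")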

2nd issue: in a conditional with a boolean operator such as `if (decision == 'Y' or 'y'):`, Python evaluates each side of the `or` as its own expression. So Python actually sees `if (bool(decision == 'Y') or bool('y')):`.

Since `bool('y')` evaluates to `True` (just like any non-empty string), your expression becomes `if (bool(decision == 'Y') or True):`, which is obviously always true.
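
You can see this directly in the interpreter (here I'm pretending you typed N at the prompt):

>>> decision = 'N'
>>> decision == 'Y' or 'y'    # left side is False, so `or` returns the right operand
'y'
>>> decision in ('Y', 'y')    # compares decision against both values, as intended
False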

Use one of the methods I suggested in order to compare a variable to more than one value in a conditional.
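
Putting both fixes together, the prompt section could look something like this (just a sketch; I've kept your variable names and left the rest of your script as is):

while True:
    decision = input("Do you wish to create batch files? (Original file will not be overwritten)\n(Y/N):")
    if decision.lower() == 'y':
        for i, batch in enumerate(batch_iterator(record_iter, 400), 1):
            # enumerate() starts at 1 here, so i can be used directly in the file name
            filename = "group_%i.fasta" % i
            with open(filename, "w") as handle:
                count = SeqIO.write(batch, handle, "fasta")
            print("Wrote %i records to %s" % (count, filename))
        break
    elif decision.lower() == 'n':
        break
    else:
        print('Invalid input\n')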

  • Thanks! Changing the if statements didn't do anything as far as I could notice, but removing the counter for the number of entries did indeed solve the issue. I'm still struggling with the concept of generators a bit, could you please suggest a way to count the entries in such a manner that I could avoid parsing the file twice (if that's possible at all)? //EDIT: Now I also see why changing the if statements didn't seem to do much - it was because even though it was always True just as you wrote, it still skipped the batch file generation due to the issue no. 1 – Ludolph314 Apr 24 '19 at 22:46
  • Glad I could help! Unfortunately, if you want to count the number of sequences (that is, `>` occurrences), you will have to go through the file twice: once to count the `>` lines and once to parse the actual sequences (although the first loop will be much faster than the second, depending on what you do with each sequence). To loop over the file a second time, you'll need to move back to its start, either by explicitly closing and reopening it or by using the [`seek` method](https://www.tutorialspoint.com/python/file_seek.htm) on the file object. – jfaccioni Apr 24 '19 at 23:18
  • Thank you for your reply. I was afraid of that, and so I checked Biopython manual entry for SeqIO. Turns out it should allow you to index the entries while parsing like this: `for index, record in enumerate(SeqIO.parse("ls_orchid.fasta", "fasta")): print("index %i, ID = %s, length %i" % (index, record.id, len(record.seq)))` – Ludolph314 Apr 24 '19 at 23:36