I'm fairly new to Python and I'm trying to split a text file, in which each entry spans two lines, into batches of at most 400 entries.
The data I'm working with are thousands of sequences in FASTA format (plain text with a header line, used in bioinformatics), where entries look like this:
>HORVU6Hr1G000325.5
PIPPPASHFHPHHQNPSAATQPLCAAMAPAAKKPPLKSSSSHNSAAGDAA
>HORVU6Hr1G000326.1
MVKFTAEELRGIMDKKNNIRNMSVIAHVD
...
In Biopython there is a parser, SeqIO.parse, which gives access to these entries as an iterator of record objects (an ID plus the sequence string). I need those records in later parts of my code, and since I have to be memory efficient, I'd like to avoid reading/parsing the source file more times than necessary.
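For reference, the basic usage of SeqIO.parse looks roughly like this (the file name here is just a placeholder):

from Bio import SeqIO

# SeqIO.parse yields SeqRecord objects one at a time (record.id, record.seq),
# so the whole file doesn't have to be held in memory at once
for record in SeqIO.parse("protein.fasta", "fasta"):
    print(record.id, len(record.seq))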
The Biopython wiki recommends doing the split with a generator, which is what I'm using: https://biopython.org/wiki/Split_large_file
However, I'm using Python 3.7, whereas the code there is written for Python 2.x, so some changes are definitely necessary. I've changed
entry = iterator.next()
into
entry = next(iterator)
but I'm not sure if that's all I need to change.
Here's the code:
def batch_iterator(iterator, batch_size=400):
    """Returns lists of length batch_size."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch
while True:
    bsequence = input("Please enter the full path to your FASTA file (e.g. c:\\folder1\\folder2\\protein.fasta):\n")
    try:
        fastafile = open(bsequence)
        break
    except:
        print("File not found!\n")

record_iter = SeqIO.parse(fastafile, "fasta")
num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1
print("num=%i" % (num,))
if num > 400:
    print("The specified file contains %i sequences. It's recommended to split the FASTA file into batches of max. 400 sequences.\n" % (num,))
    while True:
        decision = input("Do you wish to create batch files? (Original file will not be overwritten)\n(Y/N):")
        if (decision == 'Y' or 'y'):
            for i, batch in enumerate(batch_iterator(record_iter, 400), 1):
                filename = "group_%i.fasta" % (i + 1)
                with open(filename, "w") as handle:
                    count = SeqIO.write(batch, handle, "fasta")
                print("Wrote %i records to %s" % (count, filename))
            break
        elif (decision == 'N' or 'n'):
            break
        else:
            print('Invalid input\n')

...next part of the code
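In case it helps, batch_iterator can be exercised on its own with a plain Python iterator (no file handle or Biopython involved), along these lines:

# Standalone check of the generator, feeding it a plain iterator instead of SeqIO records
for i, batch in enumerate(batch_iterator(iter(range(10)), batch_size=4), 1):
    print("batch %i: %s" % (i, batch))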
When I run this, after the Y/N prompt, even if I type Y, the program just skips ahead to the next part of my code without creating any new files. The debugger shows the following:
Do you wish to create batch files? (Original file will not be overwritten)
(Y/N):Y
Traceback (most recent call last):
File "\Biopython\mainscript.py", line 32, in batch_iterator
entry = next(iterator)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1569, in _trace
return self._trace_and_catch(frame, event, arg)
File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1611, in _trace_and_catch
frame.f_back, event, marker_function_args, node
File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1656, in _handle_progress_event
self._save_current_state(frame, event, args, node)
File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1738, in _save_current_state
exception_info = self._export_exception_info()
File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1371, in _export_exception_info
"affected_frame_ids": exc[1]._affected_frame_ids_,
AttributeError: 'StopIteration' object has no attribute '_affected_frame_ids_'
Is there some difference between Python 2.x and 3.x that I'm overlooking? Is the problem somewhere else? Is this approach completely wrong? Thanks in advance!