
I wrote the following code to split a text file into blocks of 4 lines and output a block unless its 2nd line is composed of only one type of character. It is assumed (and previously verified) that the 2nd line is always a string of 36 characters.

# filter out homogeneous reads

import sys
import collections
from collections import Counter

filename1 = sys.argv[1] # file to process

with open(filename1,'r') as input_file:
    for line1 in input_file:
        line2, line3, line4 = [next(input_file) for line in xrange(3)]
        c = Counter(line2).values() # count characters in line2
        c.sort(reverse=True) # sort values in descending order
        if c[0] < 36:
            print line1 + line2 + line3 + line4.rstrip()

However, I am getting a StopIteration error as follows. I would appreciate it if someone could tell me why.

$ python code.py test.file > testout.file
Traceback (most recent call last):
  File "code.py", line 11, in <module>
    line2, line3, line4 = [next(input_file) for line in xrange(3)]
StopIteration

Any help would be appreciated, especially of the kind that explains what is wrong with my specific code and how to fix it. Here is an example of input:

@1:1:1323:1032:Y
AGCAGCATTGTACAGGGCTATCATGGAATTCTCGGG
+1:1:1323:1032:Y
HHHBHHBHBHGBGGGH8HHHGGGGFHBHHHHBHHHH
@1:1:1610:1033:Y
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+1:1:1610:1033:Y
HHEHHHHHHHHHHHBGGD>GGD@G8GGGGDHBHH4C
@1:1:1679:1032:Y
CGGTGGATCACTCGGCTCGTGCGTCGATGAAGAACG
biohazard
  • Both the `for` loop and list comprehension are iterating over the file, perhaps you should rationalise that down to a single loop? – jonrsharpe Dec 23 '15 at 11:03
  • You already have an implicit `next(input_file)` in your for loop; are you accounting for the off-by-one? – Burhan Khalid Dec 23 '15 at 11:05
  • Is the number of lines in your file divisible by `4`? – Mike Müller Dec 23 '15 at 11:06
  • Damn... Yes, I produced that test input with the `head` command default settings... However, @poke's answer below was very instructive, and points to other shortcomings in my code, which is quite nice :) – biohazard Dec 23 '15 at 11:20

3 Answers


Your example input already shows the problem: it has 10 lines, which is not divisible by 4. When you read the very last block, you get line1 and line2, but by the next() call for line3 the input is exhausted, so next() raises StopIteration.

It’s likely that you have the same issue in your full input file as well: The number of lines is simply not divisible by 4.
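A quick way to check this before processing is to count the lines and look at the remainder. The following is a minimal sketch written for Python 3; the helper name `count_blocks` is made up for illustration and is not part of the original code. For a 10-line input like the example, it reports 2 complete blocks and 2 leftover lines.

```python
def count_blocks(lines, block_size=4):
    """Return (complete_blocks, leftover_lines) for an iterable of lines."""
    total = sum(1 for _ in lines)
    return divmod(total, block_size)


# Ten dummy lines, standing in for the 10-line sample input in the question.
sample = ["line %d\n" % i for i in range(10)]
complete, leftover = count_blocks(sample)
print(complete, leftover)
```

An open file object can be passed in place of the list, since it is also an iterable of lines. A non-zero leftover count means the last block is incomplete.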

There are a few ways to overcome this. The best is probably to fix your input: since you seem to expect four lines per block all the way through, there is a content problem if the input file does not provide that.

Another very simple fix would be to specify a default value with next():

line2, line3, line4 = [next(input_file, '') for line in xrange(3)]

Now, when next() would fail, the default value '' is instead returned. So even if the file is exhausted, you still get some content back.
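To see the default value in action without a file, here is a small sketch for Python 3 (so `range` stands in for `xrange`). The helper name `read_rest_of_block` is hypothetical and only wraps the list comprehension above.

```python
def read_rest_of_block(iterator, size=3, default=''):
    """Read `size` more items, substituting `default` once the iterator is exhausted."""
    return [next(iterator, default) for _ in range(size)]


# Only one line is left, so the other two slots are filled with ''.
remaining = iter(["ACGT\n"])
print(read_rest_of_block(remaining))
```

Note that a padded block still reaches the rest of the loop body, so with this approach an incomplete final block is processed like any other.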

A probably better solution, however, would be to fix the way you iterate the file. You access the same file iterator in two locations: once in the outer for loop and three times in the list comprehension. It may seem simple enough that you won’t run into other problems, but you should really change this so that only a single location walks through the iterator, or only ever use next() calls; mixing them with a for loop seems like a bad idea.

You could for example use the grouper itertools recipe to cleanly iterate the file in groups of four:

with open(filename1, 'r') as input_file:
    for line1, line2, line3, line4 in grouper(input_file, 4, fillvalue=''):
        # do things with the lines
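Note that grouper is not importable from itertools; it is a recipe from the itertools documentation that has to be copied into the script. A sketch of it for Python 3 could look like this (in Python 2 the import is `izip_longest` instead of `zip_longest`). The sample list stands in for the open file and is an assumption for illustration.

```python
from itertools import zip_longest  # named izip_longest in Python 2


def grouper(iterable, n, fillvalue=None):
    """Collect items into tuples of length n, padding the last one with fillvalue."""
    # The same iterator object is repeated n times, so each tuple
    # consumes n consecutive items from it.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Ten dummy lines: the last group is padded with '' for the two missing lines.
lines = ["line %d\n" % i for i in range(10)]
for line1, line2, line3, line4 in grouper(lines, 4, fillvalue=''):
    print(repr(line1), repr(line2), repr(line3), repr(line4))
```

Because the last group is padded rather than dropped, the loop body can test for an empty line4 if incomplete blocks should be skipped.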
poke
  • "or only ever use next() calls" <-- I'm sorry, what do you mean by this? I am currently trying out your solution, but I think I need to check out `itertools` first. – biohazard Dec 23 '15 at 11:36
  • By that I mean what Mike Müller showed in the second example; my point is that you shouldn’t have both a for loop and individual `next()` calls iterate the file. Instead, you should have only a single way to iterate the file, e.g. either using a for loop, or with `next()` calls. That way you always know exactly when and why the file is iterated. – poke Dec 23 '15 at 11:38
  • I really tried, but I wasn't able to use the `grouper()` function even after checking out the link you provided. Maybe I am too much of a novice, but Mike Müller's solution was the simplest and easiest for me, so after careful consideration, I will pick that one (Abu's one was similar, but there is no need to print error messages to output). Thanks again for your help! – biohazard Dec 23 '15 at 12:18
  • Sure, it’s totally up to you which solution you end up using :) As for the grouper function, the way you use it is by copying the function definition to your own code; in addition, you need to add the following import to your script: `from itertools import izip_longest`. – poke Dec 23 '15 at 12:20
  • (Btw. accepting an answer does not necessarily need to match which solution you end up using. The [general opinion](http://meta.stackexchange.com/a/5235/141542) is that you should accept the one that was most helpful to you.) – poke Dec 23 '15 at 12:21
  • Ah, ok! First, I tried `from itertools import grouper`, after that didn't work I copied the function to my script and got a `izip_longest` undefined error and gave up :D. Is this solution significantly faster than `next()` for big files? – biohazard Dec 23 '15 at 12:23
  • No, it’s just a helper function to make it more usable, there will be no real performance difference. It’s just so you don’t have to meddle with iterating the file by hand ;) – poke Dec 23 '15 at 12:26
  • Oh, ok. Thanks! I'll think about it when I need to iterate bigger blocks with more lines. :) – biohazard Dec 23 '15 at 12:27

You will get this if the number of lines in your file cannot be divided by 4 without remainder. You will then try to read a line that does not exist. Remember that empty lines count too.

One solution would be to stop processing the file if the number of lines is not enough for processing:

try:
    line2, line3, line4 = [next(input_file) for line in xrange(3)]
except StopIteration:
    break

This feels a bit cleaner:

while True:
    try:
        line1, line2, line3, line4 = [next(input_file) for line in xrange(4)]
    except StopIteration:
        break

because you advance the iterator in only one place rather than two.
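Put together with the filter from the question, the single-loop version could be sketched as follows. This is written for Python 3 and wrapped in a generator so it can be tried on a list; the function name `non_homogeneous_blocks` and the `read_length` parameter are assumptions for illustration, with 36 taken from the question.

```python
from collections import Counter


def non_homogeneous_blocks(lines, read_length=36):
    """Yield 4-line blocks whose 2nd line is not one character repeated read_length times."""
    iterator = iter(lines)
    while True:
        try:
            # The only place where the iterator is advanced.
            block = [next(iterator) for _ in range(4)]
        except StopIteration:
            return  # fewer than 4 lines left: stop quietly
        counts = Counter(block[1].rstrip('\n')).values()
        if max(counts, default=0) < read_length:
            yield block


# Two complete blocks plus an incomplete one, mirroring the sample input.
sample = [
    "@read1\n", "ACGT" * 9 + "\n", "+read1\n", "H" * 36 + "\n",
    "@read2\n", "A" * 36 + "\n", "+read2\n", "H" * 36 + "\n",
    "@read3\n", "CGGT" * 9 + "\n",
]
for block in non_homogeneous_blocks(sample):
    print("".join(block), end="")
```

Here the incomplete trailing block is silently dropped, which matches the break in the loop above.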

Mike Müller

You have 10 lines, so it can iterate 2 times and is then 2 lines short. This is where Python cannot read enough lines and raises StopIteration.

Check out this code; I updated it slightly:

import sys
import collections
from collections import Counter

filename1 = sys.argv[1] # file to process

with open(filename1,'r') as input_file:
    while True:
        try:
            line1, line2, line3, line4 = [next(input_file) for line in xrange(4)]
        except StopIteration:
            print "Not enough lines to read!"
            break

        c = Counter(line2).values() # count characters in line2
        c.sort(reverse=True) # sort values in descending order
        if c[0] < 36:
            print line1 + line2 + line3 + line4.rstrip()
        else:
            print "Skipping 4 lines since line 2 is a single repeated character"
masnun
  • @poke The reason I liked this reply is that it taught me something about how to use exceptions. @Abu Ashraf Masnun shouldn't it be `xrange(3)`? – biohazard Dec 23 '15 at 11:32
  • I suggest you mark @poke's answer since he detailed it more than me. I'm okay with my answer not being selected. – masnun Dec 23 '15 at 11:32
  • No, if you notice, I am using a `while` loop and moved the `line1` with the others. So I have to iterate once more. That is why it's `xrange(4)`. – masnun Dec 23 '15 at 11:33
  • The exception seems to add the error message to the end of the file even when the computing was successful (the last block of lines is logically followed by a StopIteration) – biohazard Dec 23 '15 at 12:07
  • You can remove the print message. – masnun Dec 23 '15 at 12:08