
I have an input file containing a list of strings.

I am iterating through every fourth line starting on line two.

From each of these lines I make a new string from the first and last six characters, and write it to an output file only if that new string is unique.

The code I wrote to do this works, but I am working with very large deep-sequencing files; it has been running for a day and has not made much progress. So I'm looking for any suggestions to make this much faster if possible. Thanks.

def method():
    target = open(output_file, 'w')

    with open(input_file, 'r') as f:
        lineCharsList = []

        for line in f:
            # Make a string from the first and last 6 characters of the line
            lineChars = line[0:6] + line[145:151]

            if lineChars not in lineCharsList:
                lineCharsList.append(lineChars)

                target.write(lineChars + '\n')  # If the string is unique, write it to the output file

            for skip in range(3):  # Skip three lines to step through the file four lines at a time
                try:
                    next(f)
                except StopIteration:  # No more lines left in the file
                    break
    target.close()
The Nightman
  • I'm assuming the problem is that once lineCharsList gets big, the script will get very slow. I don't have any suggestions, but that's likely where the problem is. – Loocid Jul 09 '15 at 02:12
  • That is what I'm thinking as well. RAM shouldn't be a problem as I'm working on a computing cluster with plenty to spare. But I'm not sure if there is a better way to do this than just store everything in a list like this. – The Nightman Jul 09 '15 at 02:14
  • As an aside, you can include the output file in the `with` statement: `with open(input_file, 'r') as f, open(output_file, 'w') as target:`. – wwii Jul 09 '15 at 02:25
  • What Python version are you using? – Veedrac Jul 09 '15 at 13:38

4 Answers


Try defining lineCharsList as a set instead of a list:

lineCharsList = set()
...
lineCharsList.add(lineChars)

That'll improve the performance of the in operator: a membership test on a set is O(1) on average, while on a list it is O(n), so the loop stops getting slower as the collection grows. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing many small write() operations.
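
Put together with the question's code, a minimal sketch of both suggestions might look like this (input_file, output_file, and the slice bounds are carried over from the question):

def method():
    seen = set()       # set membership tests are O(1) on average
    out_lines = []     # accumulate output, write it all at the end

    with open(input_file, 'r') as f:
        for line in f:
            lineChars = line[0:6] + line[145:151]
            if lineChars not in seen:
                seen.add(lineChars)
                out_lines.append(lineChars)
            for skip in range(3):  # step through four lines at a time
                try:
                    next(f)
                except StopIteration:
                    break

    with open(output_file, 'w') as target:
        target.writelines(s + '\n' for s in out_lines)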

Óscar López

You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):

import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        # islice does the line-skipping in C; use itertools.islice(inf, 1, None, 4)
        # instead if "starting on line two" means the file's second line
        for line in itertools.islice(inf, None, None, 4):
            line = line.rstrip('\n')  # strip the newline so line[-6:] is data, not '\n'
            s = line[:6] + line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
dting

Besides using a set as Oscar suggested, you can also use islice to skip lines rather than a nested for loop.

As stated in this post, islice advances the iterator in C, so it should be much faster than skipping lines with a plain vanilla Python for loop.
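
For illustration, here is the consume recipe from the itertools documentation adapted as a line-skipping helper (skip is a hypothetical name):

import itertools

def skip(iterator, n):
    # islice(iterator, n, n) advances the underlying iterator n steps in C
    # and yields nothing; the default argument to next() swallows the
    # StopIteration raised at the end of the file.
    next(itertools.islice(iterator, n, n), None)

With that helper, the inner skipping loop in the question becomes skip(f, 3).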

lightalchemist

Try replacing

lineChars = line[0:6]+line[145:151]

with

lineChars = ''.join([line[0:6], line[145:151]])

as it can be more efficient, depending on the circumstances; str.join is the usual way to build a string from multiple pieces, though with only two slices the difference will be small.
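
If in doubt, it is easy to measure; a rough micro-benchmark sketch (the 151-character line is a stand-in for one input line, and timings will vary by Python version and machine):

import timeit

line = "x" * 151  # stand-in for one fixed-width input line

print(timeit.timeit(lambda: line[0:6] + line[145:151]))
print(timeit.timeit(lambda: ''.join([line[0:6], line[145:151]])))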

Doug