3

I am a novice when it comes to programming. I've worked through the book Practical Computing for Biologists and am playing around with some slightly more advanced concepts.

I've written a Python (2.7) script which reads in a .fasta file and calculates GC-content. The code is provided below.

The file I'm working with is cumbersome (~3.9 GB), and I was wondering if there's a way to take advantage of multiple processors, or whether it would be worthwhile. I have a quad-core (hyperthreaded) Intel i7-2600K processor.

I ran the code and looked at the system resources (System Monitor screenshot attached) to see what the load on my CPU is. Is this process CPU-limited? Is it I/O-limited? These concepts are pretty new to me. I played around with the multiprocessing module and Pool(), to no avail (probably because my function returns a tuple).

Here's the code:

def GC_calc(InFile):
    Iteration = 0
    GC = 0
    Total = 0
    for Line in InFile:
        if Line[0] != ">":              # skip FASTA header lines
            GC = GC + Line.count('G') + Line.count('C')
            Total = Total + len(Line)
            Iteration = Iteration + 1
            print Iteration             # progress output, printed for every single line
    GCC = 100 * GC / Total
    return (GC, Total, GCC)


InFileName = "WS_Genome_v1.fasta"
InFile = open(InFileName, 'r')
results = GC_calc(InFile)
print results
canfiese
  • if you are reading from the same file, multiprocessing will not help you because you are almost guaranteed to be limited by the read speed of your hard drive. also, close your files... python's `with` keyword is great for making sure files get closed when you're done with them. – Aaron Jan 23 '17 at 22:13
  • If you want help with multiprocessing in general just to learn however, you'll have to post your code that actually involves the parallel processing for us to be able to help you. If you aren't actually having any errors you might find more luck on [Code Review](http://codereview.stackexchange.com/) – Aaron Jan 23 '17 at 22:18
  • @Aaron Ah, yes, the original code did have a close command - forgot it here! Thank you. I may try to generate some code with the pool() function (re-create it), but I did get an error message when I tried to split the function into 4 processes w/ pool.map. The problem was with how I tried to split the input (it didn't like that I used the pool.map function, along with my function, and tried to split my_file/4). I was following an example that generated random data and then split those data in that way (above), so my approach was clearly inappropriate. – canfiese Jan 23 '17 at 22:41

3 Answers

0

Currently, the major bottleneck of your code is print Iteration. Printing to stdout is really slow, and you do it once per line. I would expect a major performance boost if you remove this line, or, if you absolutely need progress output, move it to another thread. However, thread management is an advanced topic and I would advise against going into it right now.
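(If you do want some progress feedback, a cheaper compromise, just a sketch and not part of the original answer, is to print only occasionally inside the existing loop; the threshold of 100,000 is an arbitrary choice:)

    # Sketch: inside the original loop, print progress only every 100,000 lines
    # instead of on every line.
    Iteration = Iteration + 1
    if Iteration % 100000 == 0:
        print Iteration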

Another possible bottleneck is reading the data from a file. File I/O can be slow, especially if you have a single HDD in your machine. With a single HDD you won't benefit from multiprocessing at all, because the drive cannot feed data to the processor cores fast enough to keep them busy. Performance-oriented RAID arrays and SSDs can help here.
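(As an aside, not part of the original answer: a quick way to gauge the raw I/O cost is to time a pass that only reads the file and does nothing else. If this takes nearly as long as the full script, the job is I/O-bound.)

    # Sketch: time a read-only pass over the file to estimate the pure I/O cost.
    import time

    start = time.time()
    with open('WS_Genome_v1.fasta') as f:
        for line in f:
            pass                        # read and discard each line
    print "Read-only pass: %.1f seconds" % (time.time() - start)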

The final suggestion is to try grep and similar text-processing programs instead of Python. They have gone through decades of optimization and have a good chance of being considerably faster; there are a number of questions on SO where grep outperforms Python. At the very least you can filter out the FASTA headers before you pass the data to the script:

$ grep -v "^>" WS_Genome_v1.fasta | python gc_calc.py

(The -v flag makes grep print only the lines that do not match, i.e. it drops the header lines.) In this case you shouldn't open the file in the script, but rather read lines from sys.stdin, almost like you do now.
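For illustration, a minimal stdin-reading version could look like this (the name gc_calc.py and the exact output format are just assumptions):

    # gc_calc.py -- sketch: read sequence lines from stdin
    # (header lines have already been removed by grep -v)
    import sys

    gc = 0
    total = 0
    for line in sys.stdin:              # iterate line by line; avoids slurping 3.9 GB at once
        line = line.strip()
        gc += line.count('G') + line.count('C')
        total += len(line)

    print "GC: %d  Total: %d  GC%%: %.3f" % (gc, total, 100.0 * gc / total)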

u354356007
  • That's very interesting! I'll try that out and compare performance. I did use grep to distill an alternate version of the file (with ">" lines and duplicate sequences removed), but didn't think to pipe it directly to the script. And the print statement is so that I can track progress and speed without taking the whole thing to completion, but I could see how that might slow things down! – canfiese Jan 23 '17 at 22:33
  • @canfiese - this question can help you reduce the amount of work for python even more: http://unix.stackexchange.com/q/6979 That's not really interesting in terms of learning python lol, but sometimes things end up like this. As someone said in comments, code review stackexchange will help you with refactoring and writing idiomatic python. – u354356007 Jan 23 '17 at 22:44
  • Thanks again. Just reporting back on your earlier suggestion(s): removing the print Iteration statement reduced processing time from ~ 715 seconds to ~ 32 seconds. That was extremely unexpected, but a valuable lesson. – canfiese Jan 23 '17 at 23:40
  • On the other hand, it seems that piping the grep output into the python script resulted in a run-time of ~ 1500 seconds! It may be that there was a variant of sys.stdin I should have used instead (I used sys.stdin.read()). – canfiese Jan 24 '17 at 05:05
  • @canfiese I'd use `for line in sys.stdin` loop instead and see what happens. I'm afraid `sys.stdin.read()` will attempt to read all 3.9 Gbytes of data before continuing to process it. – u354356007 Jan 24 '17 at 10:47
  • Ah, very cool. It took 41.01 seconds (the original script, minus print Iteration, took 37.62 seconds). So they appear to be more or less equivalent! Thanks! – canfiese Jan 28 '17 at 22:43
0

Basically, you're counting the number of C and G characters in every line and calculating the length of each line. Only at the end do you compute the totals.

Such a process is easy to do in parallel, because the calculation for each line is independent of the others.

Assuming the calculations are done in CPython (the one from python.org), threading won't improve performance much because of the GIL.

These calculations could be done in parallel with multiprocessing.Pool. Processes don't share data like threads do. And we don't want to send parts of a 3.9 GB file to each worker process! So you want each worker process to open the file by itself. The operating system's cache should take care that pages from the same file aren't loaded into memory multiple times.

If you have N cores, I would create the worker function so as to process every N-th line, with an offset.

import os


def worker(arguments):
    # Each worker handles only the lines where count % n == offset.
    # Note: os.cpu_count() is Python 3.4+; on Python 2.7 use multiprocessing.cpu_count().
    n = os.cpu_count() + 1
    infile, offset = arguments
    with open(infile) as f:
        cg = 0
        totlen = 0
        count = 1
        for line in f:
            if (count % n) - offset == 0:
                if not line.startswith('>'):        # skip FASTA header lines
                    cg += line.count('C') + line.count('G')
                    totlen += len(line)
            count += 1
    return (cg, totlen)

You could run the pool like this:

import multiprocessing as mp
from os import cpu_count    # Python 3.4+; on Python 2.7 use mp.cpu_count() instead

pool = mp.Pool()
results = pool.map(worker, [('infile', n) for n in range(1, cpu_count() + 1)])

By default, a Pool creates as many worker processes as there are logical CPUs.

The result would be a list of (cg, totlen) tuples, which you can easily sum.
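For example (a sketch; the variable names are just illustrative), the per-worker tuples could be combined like this:

    # Sketch: combine the per-worker (cg, totlen) tuples into overall totals.
    total_cg = sum(cg for cg, _ in results)
    total_len = sum(totlen for _, totlen in results)
    gc_percent = 100.0 * total_cg / total_len    # float literal forces float division on Python 2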

Edit: updated to fix modulo zero error.

Roland Smith
  • Hi @Roland Smith! I'm getting the following error: `Traceback (most recent call last): File "/home/sean/scripts/fasta_GC_MC_2.py", line 22, in results = pool.map(GC_calc, [('ex.fasta', n) for n in range(mp.cpu_count())]) File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map return self.map_async(func, iterable, chunksize).get() File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get raise self._value ZeroDivisionError: integer division or modulo by zero` – canfiese Jan 24 '17 at 00:31
  • ^ Sorry for the mess...I'm still learning how to format these posts. :/ – canfiese Jan 24 '17 at 00:32
  • It must be because when it generates the n values, cpu_count returns the first integer as 0 (0 - 7). – canfiese Jan 24 '17 at 00:40
  • I get this: [(81, 140), (0, 0), (30, 38), (0, 0), (0, 32), (0, 0), (35, 38), (0, 0)] My processor has 4 cores, but it usually registers as 8 because of the hyperthreading. So every second "core" is unused. What's interesting is that 81 is the number of GC in my example fasta, and 140 is the total length of sequences. The next three cores are recording the GC and total count from the second, third, and fourth sequences (there are only four total). What might account for that interesting behavior? – canfiese Jan 24 '17 at 01:14
  • @canfiese I've updated the code so that both the range and the count start at 1. That should fix the divide by 0 error. Can you try again? Can you perhaps post your example FASTA file somewhere? – Roland Smith Jan 24 '17 at 20:52
  • @canfiese I've also changed the formula with the modulus. The original didn't work properly. – Roland Smith Jan 24 '17 at 21:29
  • This seems to work much better now! I need to take some time to understand how this works (so I can apply it to more complicated problems), but this time I definitely get a list of tuples which add up correctly! – canfiese Jan 28 '17 at 22:51
  • There's a behavior I hadn't noticed before. When I run this on fasta files with more than 4 sequences, the code starts to skip lines. I think this may have to do with the count/offset. When I process a file of 192 sequences, a total of 21 lines are skipped. I didn't catch it before because my example fasta file only had four sequences! – canfiese Jan 30 '17 at 01:17
0

I now have working code for parallelization (special thanks to @Roland Smith). I only had to make two small modifications, and there is a caveat with respect to how the .fasta file is structured. The final (working) code is below:

###ONLY WORKS WHEN THERE ARE NO BREAKS IN SEQUENCE LINES###

import time
import multiprocessing as mp


def GC_calc(arguments):
    n = mp.cpu_count()
    InFile, offset = arguments
    with open(InFile) as f:
        GC    = 0
        Total = 0
        count = 0
        for Line in f:
            if (count % n) - offset == 0:   # this worker handles the lines where count % n == offset
                if Line[0] != ">":          # skip FASTA header lines
                    Line = Line.strip('\n')
                    GC += Line.count('G') + Line.count('C')
                    Total += len(Line)
            count += 1
    return (GC, Total)

startTime = time.time()
pool = mp.Pool()
results = pool.map(GC_calc, [('WS_Genome_v2.fasta', n) for n in range(1, mp.cpu_count()+1)])
#Takes the tuples, parses them out, adds them
GC_List = []
Tot_List = []
# x = GC count, y = total count: results = [(x,y), (x,y), (x,y),...(x,y)]
for x,y in results:
    GC_List.append(x)
    Tot_List.append(y)

GC_Final = sum(GC_List)
Tot_Final = sum(Tot_List)

GCC = 100*float(GC_Final)/float(Tot_Final)

print results
print
print "Number GC = ", GC_Final
print "Total bp = ", Tot_Final
print "GC Content = %.3f%%" % (GCC)
print
endTime = time.time()
workTime = endTime - startTime
print "The job took %.5f seconds to complete" % (workTime)

The caveat is that the .fasta file cannot have line breaks within the sequences themselves (each sequence must sit on a single line). My original code didn't have an issue with that, but this version does not work properly when a sequence is broken across multiple lines. That was simple enough to fix via the command line.
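(For reference, here is a small Python sketch, not the command actually used above, that joins wrapped sequence lines so each record sits on a single line; the file names are illustrative.)

    # Sketch: rewrite a FASTA file so that every sequence occupies exactly one line.
    with open('WS_Genome_v1.fasta') as src, open('WS_Genome_v2.fasta', 'w') as dst:
        first = True
        for line in src:
            line = line.rstrip('\n')
            if line.startswith('>'):
                if not first:
                    dst.write('\n')          # finish the previous sequence line
                dst.write(line + '\n')       # header on its own line
                first = False
            else:
                dst.write(line)              # append sequence data without a newline
        dst.write('\n')                      # terminate the last sequence line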

I also had to modify the code in two spots:

n = mp.cpu_count()

and

count = 0

Originally, count was set to 1 and n was set to mp.cpu_count()+1. That resulted in inaccurate counts, even after the file correction. Those original settings did at least keep all 8 logical cores (hyperthreads) busy; the downside of the new version is that only 4 of them do useful work at any given time.

But it DID speed up the process from about 23 seconds to about 13 seconds! So I'd say it was a success (except for the amount of time it took to correct the original .fasta file).

canfiese