
I have written an algorithm that takes geospatial data and performs a number of steps. The input data are a shapefile of polygons and covariate rasters for a large raster study area (~150 million pixels). The steps are as follows:

  1. Sample points from within polygons of the shapefile
  2. For each sampling point, extract values from the covariate rasters
  3. Build a predictive model on the sampling points
  4. Extract covariates for target grid points
  5. Apply predictive model to target grid
  6. Write predictions to a set of output grids

The whole process needs to be iterated a number of times (say 100), but each iteration currently takes more than an hour when processed in series. For each iteration, the most time-consuming parts are steps 4 and 5. Because the target grid is so large, I've been processing it a block (say 1000 rows) at a time.

I have a 6-core CPU with 32 GB of RAM, so within each iteration I had a go at using Python's multiprocessing module with a Pool object to process a number of blocks simultaneously (steps 4 and 5) and then write the output (the predictions) to the common set of output grids using a callback function that calls a global output-writing function. This seems to work, but is no faster (actually, it's probably slower) than processing each block in series.
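
For reference, here's a stripped-down sketch of the pattern I've been using; the extract/predict/write functions are just placeholders standing in for my real code:

import multiprocessing as mp

N_BLOCKS = 10  # placeholder: number of 1000-row blocks in the target grid

def extract_covariates(block_index):
    # placeholder for step 4: extract covariate values for one block
    return [block_index + i * 0.001 for i in range(1000)]

def predict(covariates):
    # placeholder for step 5: apply the fitted model to one block
    return [v * 2.0 for v in covariates]

def process_block(block_index):
    # steps 4 and 5 for one block, run in a worker process
    return block_index, predict(extract_covariates(block_index))

def write_block(result):
    # callback, run back in the parent process: step 6, write one block
    block_index, predictions = result
    print('writing block', block_index, 'first value', predictions[0])

if __name__ == '__main__':
    pool = mp.Pool(processes=6)
    for block_index in range(N_BLOCKS):
        pool.apply_async(process_block, (block_index,), callback=write_block)
    pool.close()
    pool.join()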

So my question is, is there a more efficient way to do it? I'm interested in the multiprocessing module's Queue class, but I'm not really sure how it works. For example, I'm wondering if it's more efficient to have a queue that carries out steps 4 and 5 then passes the results to another queue that carries out step 6. Or is this even what Queue is for?
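
To make the question concrete, something like the following two-stage arrangement is what I have in mind; again, the per-block work and the output writing are placeholders, not my real code:

import multiprocessing as mp

SENTINEL = None  # signals "no more work"

def worker(task_queue, result_queue):
    # pulls block indices, does steps 4 and 5, pushes predictions
    for block_index in iter(task_queue.get, SENTINEL):
        covariates = [block_index * 0.1] * 1000          # placeholder for step 4
        predictions = [v * 2.0 for v in covariates]      # placeholder for step 5
        result_queue.put((block_index, predictions))
    result_queue.put(SENTINEL)

def writer(result_queue, n_workers):
    # a single process does step 6, so only one writer touches the output grids
    finished = 0
    while finished < n_workers:
        item = result_queue.get()
        if item is SENTINEL:
            finished += 1
            continue
        block_index, predictions = item
        print('writing block', block_index)              # placeholder for step 6

if __name__ == '__main__':
    n_workers = 5
    task_queue = mp.Queue()
    result_queue = mp.Queue()

    for block_index in range(20):                        # placeholder number of blocks
        task_queue.put(block_index)
    for _ in range(n_workers):
        task_queue.put(SENTINEL)

    workers = [mp.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(n_workers)]
    for p in workers:
        p.start()

    writer_proc = mp.Process(target=writer, args=(result_queue, n_workers))
    writer_proc.start()

    for p in workers:
        p.join()
    writer_proc.join()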

Any pointers would be appreciated.

hendra
  • What is the maximum RSS during the entire process? Perhaps it'd be easier to just run this thing six times simultaneously if it will all fit into memory... – sarnold Jun 01 '12 at 01:18
  • Are you really CPU bound? This looks like an I/O bound problem. – stark Jun 01 '12 at 03:12
  • @Sarnold: It won't all fit into memory, that's the problem... – hendra Jun 05 '12 at 00:41
  • @stark: Please elaborate. When I run the algorithm, only one core is being used and the others are pretty much idling. I figured I could use at least some of the other cores to get the job done a bit faster. – hendra Jun 05 '12 at 00:43
  • @npo: you could run `vmstat 1` tool while running your program to see if your `bo` and `bi` columns are near the limit of your disks or if they are lower than your disk bandwidth most of the time. – sarnold Jun 05 '12 at 00:50
  • 2
    @npo if it doesn't fit all into memory, you may be IO bound. sarnold's suggestion to run vmstat and check how badly (or not) are you swapping may be a good idea. I love Munin and would like to watch the machine during one run - it will graph things like what the CPU is doing (keep an eye on iowait - more is bad), how the memory is being used (keep an eye on caches - more is good) and how much IO the swap partition is being subjected to (any use is bad). Depending on what you find out you may want to pursue other alternatives, but I suspect more memory and a faster disk will help you a lot. – rbanffy Jun 06 '12 at 19:19
  • Also (didn't fit on the other message) check how much IO the data disk is doing. – rbanffy Jun 06 '12 at 19:28
  • to check the disk io on linux: iostat 5 – matiu Jun 14 '12 at 15:11
  • 2
    My suggestion would be to use http://numpy.scipy.org/ - that would help to reduce both memory usage and cpu usage. – matiu Jun 14 '12 at 15:12
  • Yeah, have you profiled the code first? That should be step 1. . . http://docs.python.org/library/profile.html – reptilicus Jun 15 '12 at 16:12

4 Answers


Python's current multiprocessing capabilities are not great for CPU-bound processing. I'm afraid there is no way to make it run faster using the multiprocessing module, nor is your use of multiprocessing the problem.

The real problem is that Python is still bound by the rules of the Global Interpreter Lock (GIL) (I highly recommend the slides). There have been some exciting theoretical and experimental advances on working around the GIL. Python 3.2 even contains a new GIL, which solves some of the issues but introduces others.

For now, it is faster to execute many Python processes, each with a single thread, than to attempt to run many threads within one process. This lets you avoid the contention of acquiring the GIL between threads (by effectively having more GILs). However, this is only beneficial if the IPC overhead between your Python processes doesn't eclipse the benefit of the processing.

Eli Bendersky wrote a decent overview article on his experiences with attempting to make a CPU bound process run faster with multiprocessing.

It is worth noting that PEP 371 aimed to 'side-step' the GIL with the introduction of the multiprocessing module (previously a non-standard package named pyProcessing). However, the GIL still seems to play too large a role in the Python interpreter to make it work well with CPU-bound algorithms. Many people have worked on removing or rewriting the GIL, but nothing has gained enough traction to make it into a Python release.

Andrew Martinez
  • I suspected that this would be the case. I'll definitely check out your links, thanks. On a slightly different tack: I am guessing then that the Parallel Python module would be subject to the same kinds of issues. But what if, say, PP was used to split the task among multiple computers? – hendra Jun 20 '12 at 23:59

Some of the multiprocessing examples at python.org are not very clear IMO, and it's easy to start off with a flawed design. Here's a simplistic example I made to get me started on a project:

import time, random, multiprocessing

def busyfunc(runseconds):
    # keep one CPU core busy for roughly `runseconds` seconds
    starttime = int(time.time())
    while 1:
        for randcount in range(0, 100):
            testnum = random.randint(1, 10000000)
            newnum = testnum / 3.256
        newtime = int(time.time())
        if newtime - starttime > runseconds:
            return

def main(arg):
    print('arg from init:', arg)
    print("I am " + multiprocessing.current_process().name)

    busyfunc(15)

if __name__ == '__main__':

    p = multiprocessing.Process(name="One", target=main, args=('passed_arg1',))
    p.start()

    p = multiprocessing.Process(name="Two", target=main, args=('passed_arg2',))
    p.start()

    p = multiprocessing.Process(name="Three", target=main, args=('passed_arg3',))
    p.start()

    time.sleep(5)

This should keep 3 cores busy for 15 seconds. It should be easy to modify for more. Maybe it will help you debug your current code and confirm that you are really generating multiple independent processes.

If you must share data due to RAM limitations, then I suggest this: http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
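
For example, a multiprocessing.Array lives in shared memory, so all worker processes can write into it without each holding a full copy. Here is a minimal sketch (the fill function is just a stand-in for real work):

import multiprocessing

def fill_slice(shared_array, start, stop):
    # each worker writes only to its own slice of the shared array
    for i in range(start, stop):
        shared_array[i] = i * 2.0

if __name__ == '__main__':
    n = 1000
    shared = multiprocessing.Array('d', n)   # 'd' = C double, zero-initialised, in shared memory

    jobs = []
    for start in range(0, n, 250):
        p = multiprocessing.Process(target=fill_slice, args=(shared, start, start + 250))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()

    print(shared[0], shared[999])   # 0.0 1998.0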

GravityWell

As Python is not really meant for intensive number-crunching, I typically convert the time-critical parts of a Python program to C/C++, which speeds things up a lot.

Also, Python's multithreading is not very good. Python uses a global lock (the GIL) for all kinds of things, so even when you use the threads that Python offers, things won't get faster. Threads are useful in applications where they spend most of their time waiting for things like IO.

When writing a C module, you can manually release the GIL while processing your data (and, of course, not touch any Python objects during that time).

It takes some practice to use the C API, but it's clearly structured and much easier to work with than, for example, the Java native API.

See 'Extending and Embedding' in the Python documentation.

This way you can write the time-critical parts in C/C++, and the rest, where development speed matters more than run time, in Python.

Achim

I recommend you first check which parts of your code are taking the most time, so you're going to have to profile it. I've used http://packages.python.org/line_profiler/#line-profiler with much success, though it does require Cython.
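
The standard library's cProfile is another option; here is a minimal sketch, with stand-in functions in place of real code:

import cProfile
import pstats

def extract_block(i):
    # stand-in for the expensive per-block work (e.g. steps 4 and 5)
    return sum(j * 0.5 for j in range(100000))

def run_all():
    for i in range(50):
        extract_block(i)

cProfile.run('run_all()', 'profile_stats')        # profile and dump raw stats to a file
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative').print_stats(10)    # show the 10 most expensive calls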

As for Queues, they're mostly used for sharing data and synchronizing threads, though I've rarely used them. I do use multiprocessing all the time.

I mostly follow the map-reduce philosophy, which is simple and clean but has some major overhead, since values have to be packed into dictionaries and copied across to each process when applying the map function...

You can try segmenting your file and applying your algorithm to different sets.
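
For example, with Pool.map each segment is handed to a worker process and the results come back in order; a minimal sketch with a stand-in workload:

import multiprocessing

def process_segment(segment):
    # stand-in for applying the real algorithm to one segment of the data
    return sum(x * x for x in segment)

if __name__ == '__main__':
    data = list(range(600000))
    n_procs = 6
    size = len(data) // n_procs
    segments = [data[i * size:(i + 1) * size] for i in range(n_procs)]

    pool = multiprocessing.Pool(processes=n_procs)
    results = pool.map(process_segment, segments)   # one result per segment, in order
    pool.close()
    pool.join()
    print(results)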

Samy Vilar