
Original Question

I am trying to use multiprocessing Pool in Python. This is my code:

import multiprocessing

def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered

    for x in xrange(1, 11):
        # bar() is defined elsewhere; it returns an igraph object
        res = list(mapper(f, bar(x)))

This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?

How can I resolve this problem?

Minimal, complete, verifiable example

To replicate my problem, I have created this example: it's a simple problem of generating ngrams from a string.

#!/usr/bin/python

import time
import multiprocessing
import random


def f(x):
    # The do-nothing "work" each pool worker performs
    return x

def ngrams(input_tmp, n):
    words = input_tmp.split()

    if n > len(words):
        n = len(words)

    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered

    num = 100000000  # try e.g. 10000 to see all CPUs used
    rand_list = random.sample(xrange(100000000), num)

    rand_str = ' '.join(str(i) for i in rand_list)

    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))


if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: ' + str(time.time() - start)

When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.

Caution: when num is too large, it may crash your system/VM.

Maggie
  • What does `bar` return? – Peter Wood May 07 '15 at 07:50
  • `bar` returns an `igraph` object and `x` determines the depth of the graph. Will that have any effect on this? – Maggie May 07 '15 at 07:54
  • If the `igraph` is big enough, it may be spending more time pickling the parameters and results and pushing them over the pipes than doing actual work (_especially_ if your actual work is just `return x`!), which would definitely serialize everything to one CPU. And it's certainly plausible that a graph of depth 5 wouldn't have this problem, but a graph of depth 9 would. – abarnert May 07 '15 at 07:56
  • Can you give us a [minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) with some good fake data? Then we might be able to analyze and/or test it rather than having to guess. – abarnert May 07 '15 at 07:59
  • Thanks @abarnert, I have updated the question. I am unable to provide the original problem, but I managed to create a dummy program that roughly replicates my original problem. – Maggie May 07 '15 at 08:34
  • OK, when I turn off hyperthreading and run this, core 0 goes to 100%, and the other cores stay under 15%, so it looks like a great repro case. – abarnert May 07 '15 at 08:41
  • PS, if your thetans are clogged with ngrams, isn't that what Scientology is for? :) – abarnert May 07 '15 at 08:58
  • OK, it definitely is what I thought. You're spending about 75% of your time fighting over the queues, 25% of your time on `multiprocessing` overhead, and 0% of your time doing actual work. In 2.7, the overhead is serialized, so everything happens on one core. In 3.4, the overhead is parallelized, so your other cores have a bit of work to do, but it's still just overhead (useless work). – abarnert May 07 '15 at 09:17

2 Answers


First, ngrams itself takes a lot of time. While that's happening, it's obviously only on one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.

If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.

So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.

Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).

From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.

This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
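
If you want a rough feel for that ratio, here's a standalone sketch (the payload is a made-up stand-in for one task's data, not code from the question) that times a pickle round-trip against the do-nothing work itself:

import pickle
import time

# Made-up stand-in for one task's payload (a list of slices, like ngrams() returns)
payload = [[str(i)] for i in xrange(100000)]

start = time.time()
blob = pickle.dumps(payload, pickle.HIGHEST_PROTOCOL)
pickle.loads(blob)
print 'pickle round-trip: %.4f seconds' % (time.time() - start)

start = time.time()
res = payload  # this is all f(x) actually does: return x
print 'the "work" itself: %.6f seconds' % (time.time() - start)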

If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happen with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.

But in CPython 3.4, the pickling and unpickling happen outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
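
Roughly, the difference looks like this (a paraphrased sketch, not the exact CPython source):

import pickle

# CPython 2.7: recv() reads *and* unpickles while the lock is held,
# so every worker's unpickling is serialized behind a single lock.
def get_27(self):
    self._rlock.acquire()
    try:
        return self._reader.recv()        # read + unpickle inside the lock
    finally:
        self._rlock.release()

# CPython 3.x: only the raw read happens under the lock; unpickling
# happens afterwards, so workers can at least unpickle in parallel.
def get_3x(self):
    with self._rlock:
        buf = self._reader.recv_bytes()   # read inside the lock
    return pickle.loads(buf)              # unpickle outside the lock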

Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead, which is why the cores only get up to 25%.

And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.

A few observations:

  • In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem (see the sketch after this list).
  • If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem (also shown in the sketch below).
  • If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
  • If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
  • Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
  • It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
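
For instance, here's one hypothetical way to combine the first two suggestions for this repro case: batch the ngram slices explicitly, and hand the big string to each worker once via the pool initializer (the helper names and batch size are made up for illustration):

import multiprocessing

_words = None

def init_worker(big_str):
    # Runs once per worker process: split the large, read-only string here
    # so it never has to be pickled into the individual tasks.
    global _words
    _words = big_str.split()

def ngrams_batch(args):
    # One task = (n, start, stop): three small ints cross the queue
    # instead of a huge pre-built list of ngrams.
    n, start, stop = args
    return [_words[i:i+n] for i in xrange(start, stop)]

def foo_batched(rand_str):
    p = multiprocessing.Pool(initializer=init_worker, initargs=(rand_str,))
    num_words = len(rand_str.split())
    batch = 10000  # made-up batch size; tune it for your data
    for n in xrange(1, 100):
        limit = num_words - n + 1
        tasks = [(n, i, min(i + batch, limit))
                 for i in xrange(0, limit, batch)]
        res = list(p.imap_unordered(ngrams_batch, tasks))

The returned ngrams still have to be pickled back to the main process, so this doesn't eliminate the overhead, but each task now carries three small integers in and does real work in between, instead of a pure round-trip of pre-built data.
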
abarnert

The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.

ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.

To make things clearer, consider that this piece of code...

for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))

...is equivalent to this:

for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))

Also, the following is a CPU-intensive operation that is performed in your main process:

num = 100000000
rand_list = random.sample(xrange(100000000), num)

You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
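
A minimal sketch of the first option (a hypothetical wrapper, reusing ngrams() from the question) might look like this:

def ngrams_job(args):
    # Do the heavy ngram construction in the worker, not the main process
    rand_str, n = args
    return ngrams(rand_str, n)

def foo_parallel(rand_str):
    p = multiprocessing.Pool()
    jobs = ((rand_str, n) for n in xrange(1, 100))
    for res in p.imap_unordered(ngrams_job, jobs):
        pass  # consume each result as it arrives

Note that this still pickles rand_str into every task, so on a string this large the transfer cost can still dominate; see the other answer for ways around that.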

Andrea Corbellini
  • But `ngrams` finishes before anything gets mapped at all, and even after it finishes, the whole thing is still on one core, so this isn't the explanation. – abarnert May 07 '15 at 08:47
  • @abarnert: starting a huge number of processes that do nothing but exit soon will never put a high load on the CPUs. – Andrea Corbellini May 07 '15 at 08:50
  • He's not starting a huge number of processes; he's starting 8 processes in a pool. And they don't do nothing—they pull a huge value off a shared queue, unpickle it, repickle it, and put it on the other queue. And that definitely puts a high load on the CPUs—but only on one core. – abarnert May 07 '15 at 08:53
  • The question is, why only on one core? And I'm pretty sure the answer is either (a) the queue contention itself swamps the pickling work, or (b) the pickling is happening with the queue locked; I'm still trying to rule out (b). – abarnert May 07 '15 at 08:59