python multiprocessing module: strange behaviour and processor load when using Pool

Question

I'm using Python's multiprocessing lib to speed up some code (least squares fitting with scipy).

It works fine on 3 different machines, but it shows strange behaviour on a 4th machine.

The code:

import numpy as np
from scipy.optimize import least_squares
import time
import parmap
from multiprocessing import Pool

p0 = [1., 1., 0.5]

def f(p, xx):
    return p[0]*np.exp(-xx ** 2 / p[1] ** 2) + p[2]

def errorfunc(p, xx, yy):
    return f(p, xx) - yy

def do_fit(yy, xx):
    return least_squares(errorfunc, p0[:], args=(xx, yy))

if __name__ == '__main__':
    # create data
    x = np.linspace(-10, 10, 1000)
    y = []
    np.random.seed(42)
    for i in range(1000):
        y.append(f([np.random.rand(1) * 10, np.random.rand(1), 0.], x) + np.random.rand(len(x)))

    # fit without multiprocessing
    t1 = time.time()
    for y_data in y:
        p1 = least_squares(errorfunc, p0[:], args=(x, y_data))
    t2 = time.time()
    print t2 - t1

    # fit with multiprocessing lib
    times = []
    for p in range(1,13):
        my_pool = Pool(p)
        t3 = time.time()
        results = parmap.map(do_fit, y, x, pool=my_pool)
        t4 = time.time()
        times.append(t4-t3)
        my_pool.close()
    print times

For the 3 machines where it works, it speeds up roughly in the expected way. E.g. on my i7 laptop it gives:

[4.92650294303894, 2.5883090496063232, 1.7945551872253418, 1.629533052444458, 
1.4896039962768555, 1.3550388813018799, 1.1796400547027588, 1.1852478981018066, 
1.1404039859771729, 1.2239141464233398, 1.1676840782165527, 1.1416618824005127]

I'm running Ubuntu 14.10, Python 2.7.6, numpy 1.11.0 and scipy 0.17.0. I also tested it on another Ubuntu machine, a Dell PowerEdge R210, with similar results, and on a MacBook Pro Retina (there with Python 2.7.11 and the same numpy and scipy versions).

The computer that causes issues is a PowerEdge R710 (two hexcores) running Ubuntu 15.10, Python 2.7.11 and the same numpy and scipy versions as above. On this machine I don't observe any speedup: times are around 6 seconds, no matter what poolsize I use. In fact, it is slightly better for a poolsize of 2 and gets worse with more processes.

htop shows that somehow more processes get spawned than I would expect.

E.g. on my laptop htop shows one entry per process (which matches the poolsize) and eventually each process shows 100% CPU load.

On the PowerEdge R710 I see about 8 python processes for a poolsize of 1, about 20 processes for a poolsize of 2, and so on, each of which shows 100% CPU load.

I checked BIOS settings of the R710 and I couldn't find anything unusual. What should I look for?

EDIT: In response to the comment, I tried another, simpler script. Surprisingly, this one seems to 'work' on all machines:

from multiprocessing import Pool
import time
import math
import numpy as np

def f_np(x):
    return x**np.sin(x)+np.fabs(np.cos(x))**np.arctan(x)

def f(x):
    return x**math.sin(x)+math.fabs(math.cos(x))**math.atan(x)

if __name__ == '__main__':
    print "#pool", ", numpy", ", pure python"
    for p in range(1,9):
        pool = Pool(processes=p)
        np.random.seed(42)
        a = np.random.rand(1000,1000)
        t1 = time.time()
        for i in range(5):
            pool.map(f_np, a)
        t2 = time.time()
        for i in range(5):
            pool.map(f, range(1000000))
        print p, t2-t1, time.time()-t2
        pool.close()

gives:

#pool , numpy , pure python
1 1.34186911583 5.87641906738
2 0.697530984879 3.16030216217
3 0.470160961151 2.20742988586
4 0.35701417923 1.73128080368
5 0.308979988098 1.47339701653
6 0.286448001862 1.37223601341
7 0.274246931076 1.27663207054
8 0.245123147964 1.24748778343

on the machine that caused the trouble. There are no more threads (or processes?) spawned than I would expect.

It looks like numpy is not the problem, but as soon as I use scipy.optimize.least_squares the issue arises.

Running strace from htop on the processes shows a lot of sched_yield() calls, which I don't see when I don't use scipy.optimize.least_squares and which I also don't see on my laptop, even when using least_squares.
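To check whether the extra htop entries are threads inside the worker processes rather than additional processes, the kernel's thread count can be read directly. This is a minimal sketch, assuming a Linux machine with /proc mounted; the helper name `count_os_threads` is hypothetical and not part of any library.

```python
import os


def count_os_threads(pid="self"):
    """Return the kernel thread count of a process, or None if it cannot be read.

    Assumption: Linux, where /proc/<pid>/status contains a "Threads:" line.
    """
    status_path = "/proc/{}/status".format(pid)
    if not os.path.exists(status_path):
        return None  # not Linux, /proc not mounted, or no such process
    with open(status_path) as status_file:
        for line in status_file:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return None


if __name__ == "__main__":
    # Thread count of the current interpreter.
    print(count_os_threads())
```

Calling it once before and once after `import numpy`, or from inside `do_fit`, should show whether a BLAS library linked into numpy/scipy starts its own thread pool in each worker. htop lists such threads as separate rows unless its thread display is turned off.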

Since you're using `htop`, try selecting each of the 8 Python processes on the bad machine and pressing `s` for `strace`. See what each process is doing with that 100% CPU time. Also, try a simpler test program which doesn't use Numpy at all. — John Zwinck, Apr 19 '16 at 03:07
I edited my original post. I used the more complex example in the first place because it is closer to my final problem. — Julian S., Apr 19 '16 at 23:00
Let's try to figure out where those processes are spawned, shall we? You can launch your program under a debugger, e.g. `gdb --args python yourscript.py`... then `break fork` to stop when a new process is created, then see what's doing the forking. It might be the underlying C or Fortran routines inside Scipy. See https://sourceware.org/gdb/onlinedocs/gdb/Forks.html — John Zwinck, Apr 20 '16 at 04:21

Julian S. · Accepted Answer · 2016-04-26T15:44:39.773


According to here, there is an issue when OpenBLAS is used together with joblib.

Similar issues occur when MKL is used (see here). The solution given here also worked for me: adding

import os
os.environ['MKL_NUM_THREADS'] = '1'

at the beginning of my python script solves the issue.
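Since the first report concerns OpenBLAS while the variable above is MKL's, a slightly broader sketch is to cap every common BLAS/OpenMP thread pool at once. This assumes it is not known which backend numpy and scipy are linked against; the helper `limit_blas_threads` is a hypothetical name, while `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` are the environment variables read by the respective runtimes.

```python
import os

# Environment variables read by MKL, OpenBLAS and OpenMP runtimes.
BLAS_THREAD_VARS = ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def limit_blas_threads(n_threads=1):
    """Hypothetical helper: ask each BLAS/OpenMP runtime for n_threads threads."""
    for name in BLAS_THREAD_VARS:
        os.environ[name] = str(n_threads)


# Call this before numpy/scipy are imported, so the libraries see the
# values when they are loaded.
limit_blas_threads(1)

# import numpy as np
# from scipy.optimize import least_squares
# from multiprocessing import Pool
```

With one BLAS thread per worker, the parallelism comes from the Pool alone, so the workers should no longer compete with each other for cores. `np.show_config()` prints the BLAS/LAPACK libraries numpy was built against, which should tell which of the variables actually matters on a given machine.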

edited Apr 26 '16 at 15:44

answered Apr 21 '16 at 16:21
