
I've come upon an unexpected error when multiprocessing with numpy arrays of different data types: I run the same multiprocessing job first with numpy arrays of type int64 and then again with arrays of type float64. The int64 case runs as expected, whereas the float64 case uses all available processors (more than I've allocated) and is slower than the single-core computation.

The following example reproduces the problem:

import numpy as np
from multiprocessing import Pool
from timeit import timeit


def array_multiplication(arr):
    # Repeatedly multiply the array with itself.
    new_arr = arr.copy()
    for nnn in range(3):
        new_arr = np.dot(new_arr, arr)
    return new_arr


if __name__ == '__main__':

    # Example integer arrays.
    test_arr_1 = np.random.randint(100, size=(100, 100))
    test_arr_2 = np.random.randint(100, size=(100, 100))
    test_arr_3 = np.random.randint(100, size=(100, 100))
    test_arr_4 = np.random.randint(100, size=(100, 100))

    # Parameter array.
    parameter_arr = [test_arr_1, test_arr_2, test_arr_3, test_arr_4]

    pool = Pool(processes=len(parameter_arr))

    print('Multiprocessing time:')
    print(timeit(lambda: pool.map(array_multiplication, parameter_arr),
                 number=1000))

    print('Series time:')
    print(timeit(lambda: list(map(array_multiplication, parameter_arr)),
                 number=1000))

Running this yields

Multiprocessing time:
4.1271785919998365
Series time:
8.102764352000122

which is an expected speed-up.

However, replacing test_arr_n with

test_arr_1 = np.random.normal(50, 30, size=(100, 100))
test_arr_2 = np.random.normal(50, 30, size=(100, 100))
test_arr_3 = np.random.normal(50, 30, size=(100, 100))
test_arr_4 = np.random.normal(50, 30, size=(100, 100))

results in

Multiprocessing time:
2.379720258999896
Series time:
0.40820308100001057

in addition to using all available processors, even though I've specified 4. Below are screen grabs of the processor usage when running the first case (int64) and the second case (float64).

Above is the int64 case, where four processors are given tasks followed by one processor computing the task in series.

However, in the float64 case, all processors are being used, even though the specified number of processes is the number of test arrays, that is, 4.

I have tried this for a range of array sizes and iteration counts in the for loop in array_multiplication, and the behaviour is the same. I'm running Ubuntu 16.04 LTS with 62.8 GB of memory and an i7-6800K 3.40 GHz CPU.

Why is this happening? Thanks in advance.

duncster94
  • Just a clarification: Is your question more than asking why floating point calculations take more computational effort than integer calculations? – koalo Sep 13 '17 at 14:35
  • @koalo Yes. My question is: Why does the multiprocessing take more time than the series computation for `float64`, but more importantly, why are all the processors being used when only 4 are specified? – duncster94 Sep 13 '17 at 15:30

1 Answer


That's expected behaviour.

Numpy uses BLAS internally for some functions. BLAS is highly optimized (caching, SIMD, and, depending on the implementation in use, multithreading; implementation candidates include ATLAS, OpenBLAS, and MKL) and is only slowed down by an outer layer of multiprocessing (which adds inter-process communication overhead and can hurt caching behaviour too)!
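If you want the Pool to scale here, one common workaround is to pin the BLAS thread pool to one thread per worker process before numpy is imported. A minimal sketch; which environment variable actually takes effect depends on the BLAS your numpy links against:

# Cap the BLAS thread pool BEFORE numpy is imported; setting all three
# covers the common backends (OpenBLAS, MKL, and a generic OpenMP fallback).
import os
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # must come after the environment is set

a = np.random.normal(50, 30, size=(2000, 2000))
b = np.dot(a, a)  # should now stay on a single core

With BLAS pinned to one thread per process, the float64 case should scale with the Pool much like the int64 case does.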

Modern Ubuntu-versions come with a multithreaded BLAS implementation by default (earlier ones were limited to 1 or 2 threads).
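You can check which BLAS your numpy was built against directly from Python:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against
# (e.g. openblas or mkl entries).
np.show_config()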

The classic example of a BLAS-based function in numpy is np.dot().

Most BLAS implementations (all I know of; I saw some discussion at Intel about adding limited support for discrete types to MKL) only support floating-point types. That is why your two snippets behave differently: one is highly optimized, the other is not; one is hurt by multiprocessing, the other is not.
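You can see that split directly: the same np.dot call dispatches to BLAS for float64 but falls back to numpy's generic (and much slower) routine for int64. A small demonstration; the exact timings will of course vary with your machine and BLAS:

import numpy as np
from timeit import timeit

a_int = np.random.randint(100, size=(500, 500))  # int64 on Linux
a_flt = a_int.astype(np.float64)                 # same values as float64

# float64 goes through the optimized BLAS path, int64 does not.
print('int64:  ', timeit(lambda: np.dot(a_int, a_int), number=10))
print('float64:', timeit(lambda: np.dot(a_flt, a_flt), number=10))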

Technically I would not call it an error; what you describe is expected behaviour!

Related question.

sascha
  • This is interesting. Would you say that, in effect, there is 'compounded' multiprocessing going on, as in, `Pool` is allocating the tasks to 4 processors which in turn are allocating the `np.dot()` task to `n` more processors, creating `4n` processes (or something like that)? When I specify 2 processes in `Pool` there is actually a speedup, whereas 1 results in the same time as the series computation and 3 or more are slower. – duncster94 Sep 13 '17 at 15:57
  • This probably depends on the scheduler of your OS, but that's what I would expect. One has to be careful about the specifics though, e.g. processor vs. core vs. thread. I also don't think your example is a good one (where 2 processes might be a good thing), probably due to the outer copy and loop. Classic vectorization would be faster in most cases for operations like this (skipping MT completely). You can also try some big matrix-multiplication without any other code to observe the number of threads in use with your implementation (see the sketch below). – sascha Sep 13 '17 at 16:03
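For reference, the bare matrix-multiplication test suggested in the last comment could look like the following sketch (matrix size chosen arbitrarily); watch top or htop while it runs and count the threads your BLAS spins up:

import numpy as np

# A single large float64 matrix multiplication, nothing else: with a
# multithreaded BLAS you should see several cores light up for this call.
a = np.random.normal(0, 1, size=(5000, 5000))
b = np.dot(a, a)
print(b.shape)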