
I'm learning to use Pool with multiprocessing, and I wrote this script as an exercise.

Can anyone tell me why using a normal for loop took less time than using a Pool?

P.S.: My CPU has 2 cores.

Thank you very much.

from multiprocessing import Pool
from functools import reduce
import time

def one(n):
    a = n*n
    return a

if __name__ == '__main__':
    l = list(range(1000))

    p = Pool()
    t = time.time()
    pol = p.map(one, l)  # square every element using the pool's worker processes
    result = reduce(lambda x,y: x+y, pol)  # sum the squares
    print("Using Pool the result is: ", result, "Time: ", time.time() - t)
    p.close()
    p.join()

    def two(n):
        t = time.time()
        p_result = []

        for i in n:  # square every element in the main process
            a = i*i
            p_result.append(a)

        result = reduce(lambda x,y: x+y, p_result)
        print("Not using Pool the result is: ", result, "Time: ", time.time() - t)

    two(l)

Using Pool the result is: 332833500 Time: 0.14810872077941895

Not using Pool the result is: 332833500 Time: 0.0005018711090087891

max fraguas

3 Answers


I think there are several reasons at play here, but I would guess it largely comes down to the overhead of running multiple processes, which is mostly synchronization and communication, plus the fact that your non-parallelized code is written a bit more efficiently.

As a basis, here is how your unmodified code runs on my computer:

('Using Pool the result is: ', 332833500, 'Time: ', 0.0009129047393798828)
('Not using Pool the result is: ', 332833500, 'Time: ', 0.000598907470703125)

First of all, I would like to level the playing field by making the two() function nearly identical to the parallelized code. Here is the modified two() function:

def two(l):
    t = time.time()

    p_result = map(one, l)

    result = reduce(lambda x,y: x+y, p_result)
    print("Not using Pool the result is: ", result, "Time: ", time.time() - t)

Now, this does not actually make much difference in this case, but it will matter in a moment, since it means both cases are doing exactly the same thing. Here is a sample output with this change:

('Using Pool the result is: ', 332833500, 'Time: ', 0.0009338855743408203)
('Not using Pool the result is: ', 332833500, 'Time: ', 0.0006031990051269531)

What I would like to illustrate now is that because the one() function is so computationally cheap, the overhead of inter-process communication outweighs the benefit of running it in parallel. I will modify the one() function as follows to force it to do a bunch of extra computation. Note that because of the changes to the two() function, this change affects both the parallel and the single-threaded code.

def one(n):
    for i in range(100000):
        a = n*n
    return a

The reason for the for loop is to give each process a reason for existence. With your original code, each process simply does a few multiplications, sends the list of results back to the parent process, and waits to be given a new chunk. Sending and waiting take much longer than completing a single chunk. Adding these extra cycles forces each chunk to take longer, without changing the time needed for inter-process communication, so the parallelism begins to pay off. Here are my results when I run the code with this change to the one() function:

('Using Pool the result is: ', 332833500, 'Time: ', 1.861448049545288)
('Not using Pool the result is: ', 332833500, 'Time: ', 3.444211959838867)

So there you have it. All you need to do is give your child processes a bit more work, and they will be well worth your while.
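A related knob worth experimenting with is the chunksize argument to Pool.map(), which controls how many items are shipped to a worker per message. This is a minimal sketch of my own (not from the original post) that times the cheap one() function at a few chunk sizes; larger chunks mean fewer inter-process round trips, so the timings should shrink as chunksize grows, though the exact numbers will vary by machine:

from multiprocessing import Pool
import time

def one(n):
    return n*n

if __name__ == '__main__':
    l = list(range(1000))
    p = Pool()
    for chunksize in (1, 50, 500):
        t = time.time()
        p.map(one, l, chunksize=chunksize)  # fewer, larger messages as chunksize grows
        print("chunksize:", chunksize, "Time:", time.time() - t)
    p.close()
    p.join()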

TallChuck

When using Pool, Python uses a global interpreter lock to synchronize multiple threads among multiple processes. That is, when one thread is running, all the other threads are stopped and waiting. What you experience is therefore sequential execution, not parallel execution. In your example, even though you distribute the work among multiple threads in the pool, they run sequentially due to the global interpreter lock, and this adds a lot of scheduling overhead as well.

From the Python docs on the global interpreter lock:

The mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines.

Therefore, what you achieve is not true parallelism. If you need real multiprocessing capabilities in Python, you need to use Processes, which means using Queues to exchange data between the processes.
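For what it's worth, here is a minimal sketch of the Process-and-Queue approach described above, splitting the asker's list between two worker processes (my own illustration, so the function and variable names are made up):

from multiprocessing import Process, Queue

def partial_sum_of_squares(chunk, q):
    # each worker computes the sum of squares for its chunk
    # and sends the partial result back through the queue
    q.put(sum(n*n for n in chunk))

if __name__ == '__main__':
    l = list(range(1000))
    q = Queue()
    # one process per core on the asker's 2-core CPU
    workers = [Process(target=partial_sum_of_squares, args=(l[:500], q)),
               Process(target=partial_sum_of_squares, args=(l[500:], q))]
    for w in workers:
        w.start()
    result = q.get() + q.get()  # drain the queue before joining
    for w in workers:
        w.join()
    print(result)  # 332833500, matching the outputs above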

Imesha Sudasingha
  • the Global Interpreter Lock is only operative within a single process. If you run a `ps` while you have a `Pool` going, you will see that it is in fact using multiple processes – TallChuck Apr 20 '18 at 04:40
  • Yes, your argument is correct. But in my personal experience, the global interpreter lock matters when using threads rather than processes. What do you think? – Imesha Sudasingha Apr 20 '18 at 04:52
  • Well sure, but the code presented does not use threads, except insofar as the `threading` library is used internally by the `multiprocessing` library to manage the processes – TallChuck Apr 20 '18 at 05:03
  • I see. Thanks for the details. – Imesha Sudasingha May 08 '18 at 11:03

The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.

Found in the chapter "Process-based 'threading' interface" of the Python 2.7.16 documentation.
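As a quick sketch of my own to back this up, having each task report its process ID shows that a Pool really does run work in separate OS processes rather than in threads sharing one interpreter:

from multiprocessing import Pool
import os

def report(_):
    # each pool worker is a separate OS process with its own PID
    return os.getpid()

if __name__ == '__main__':
    p = Pool()
    worker_pids = set(p.map(report, range(100)))
    p.close()
    p.join()
    print("parent:", os.getpid(), "workers:", worker_pids)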

Jerry
  • It looks like you copied that from https://docs.python.org/2/library/multiprocessing.html. If you do so, please state the source and make 100% clear that the text is not yours. Also, prefer not to write answers that are nothing more than a copy of text from an off-site resource; use such quotes as backup, not as the core of your answer. – rene Jun 02 '19 at 15:44