
I started learning about multiprocessing in Python and noticed that the same code executes much faster on the main process than in a process created with the multiprocessing module.

Here is a simplified example of my code, where I first execute the code on the main process and print the time for the first 10 calculations as well as the total time. Then the same code is executed in a new process (a long-running process to which I can send a new_pattern at any time).

import multiprocessing
import random
import time


old_patterns = [[random.uniform(-1, 1) for _ in range(0, 10)] for _ in range(0, 2000)]
new_patterns = [[random.uniform(-1, 1) for _ in range(0, 10)] for _ in range(0, 100)]


new_pattern_for_processing = multiprocessing.Array('d', 10)
there_is_new_pattern = multiprocessing.Value('i', 0)
queue = multiprocessing.Queue()


def iterate_and_add(old_patterns, new_pattern):
    # Element-wise sum of new_pattern with each stored pattern; the result
    # is discarded, the loop only exists to generate measurable work.
    for each_pattern in old_patterns:
        sum = 0
        for count in range(0, 10):
            sum += each_pattern[count] + new_pattern[count]


print_count_main_process = 0
def patt_recognition_main_process(new_pattern):
    global print_count_main_process
    # START of same code on main process
    start_main_process_one_patt = time.time()
    iterate_and_add(old_patterns, new_pattern)
    if print_count_main_process < 10:
        print_count_main_process += 1
        print("Time on main process one pattern:", time.time() - start_main_process_one_patt)
    # END of same code on main process


def patt_recognition_new_process(old_patterns, new_pattern_on_new_proc, there_is_new_pattern, queue):
    print_count = 0
    while True:
        if there_is_new_pattern.value:
            #START of same code on new process
            start_new_process_one_patt = time.time()
            iterate_and_add(old_patterns, new_pattern_on_new_proc)
            if print_count < 10:
                print_count += 1
                print("Time on new process one pattern:", time.time() - start_new_process_one_patt)
            #END of same code on new process
            queue.put("DONE")
            there_is_new_pattern.value = 0


if __name__ == "__main__":
    start_main_process = time.time()
    for new_pattern in new_patterns:
        patt_recognition_main_process(new_pattern)
    print(".\n.\n.")
    print("Total Time on main process:", time.time() - start_main_process)

    print("\n###########################################################################\n")

    start_new_process = time.time()
    p1 = multiprocessing.Process(target=patt_recognition_new_process, args=(old_patterns, new_pattern_for_processing, there_is_new_pattern, queue))
    p1.start()
    for new_pattern in new_patterns:
        for idx, n in enumerate(new_pattern):
            new_pattern_for_processing[idx] = n
        there_is_new_pattern.value = 1
        while True:
            msg = queue.get()
            if msg == "DONE":
                break
    print(".\n.\n.")
    print("Total Time on new process:", time.time()-start_new_process)

And here is my result:

Time on main process one pattern: 0.0025289058685302734
Time on main process one pattern: 0.0020127296447753906
Time on main process one pattern: 0.002008199691772461
Time on main process one pattern: 0.002511262893676758
Time on main process one pattern: 0.0020067691802978516
Time on main process one pattern: 0.0020036697387695312
Time on main process one pattern: 0.0020072460174560547
Time on main process one pattern: 0.0019974708557128906
Time on main process one pattern: 0.001997232437133789
Time on main process one pattern: 0.0030074119567871094
.
.
.
Total Time on main process: 0.22810864448547363

###########################################################################

Time on new process one pattern: 0.03462791442871094
Time on new process one pattern: 0.03308463096618652
Time on new process one pattern: 0.034590721130371094
Time on new process one pattern: 0.033623456954956055
Time on new process one pattern: 0.03407788276672363
Time on new process one pattern: 0.03308820724487305
Time on new process one pattern: 0.03408670425415039
Time on new process one pattern: 0.0345921516418457
Time on new process one pattern: 0.03710794448852539
Time on new process one pattern: 0.03358912467956543
.
.
.
Total Time on new process: 4.0528037548065186

Why is there such a big difference in execution time?

Domen Jakofčič

2 Answers


It's a bit subtle, but the problem is with

new_pattern_for_processing = multiprocessing.Array('d', 10)

That doesn't hold Python float objects; it holds raw bytes, in this case enough for ten 8-byte machine-level doubles. Every time you read from or write to this array, Python must convert between float and double. That isn't a big deal if you read or write once, but your code does it many times in a loop, and those conversions dominate.
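
To see the conversion cost in isolation, here is a rough micro-benchmark (mine, not part of the original answer; timings are machine-dependent). Note that each indexed read on a multiprocessing.Array also acquires its lock, which adds to the per-access cost:

    import multiprocessing
    import time

    if __name__ == "__main__":
        shared = multiprocessing.Array('d', 10)   # raw C doubles plus a lock
        plain = shared[:]                         # one-time copy to Python floats

        start = time.time()
        for _ in range(100000):
            for i in range(10):
                x = shared[i]   # lock + C double -> Python float on every access
        print("Array reads:", time.time() - start)

        start = time.time()
        for _ in range(100000):
            for i in range(10):
                x = plain[i]    # plain list indexing, no conversion
        print("list reads: ", time.time() - start)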

To confirm, I copied the machine-level array to a list of Python floats once and had the process work on that. Now its speed is the same as the parent's. My changes were only in one function:

def patt_recognition_new_process(old_patterns, new_pattern_on_new_proc, there_is_new_pattern, queue):
    print_count = 0
    while True:
        if there_is_new_pattern.value:
            local_pattern = new_pattern_on_new_proc[:]
            #START of same code on new process
            start_new_process_one_patt = time.time()
            #iterate_and_add(old_patterns, new_pattern_on_new_proc)
            iterate_and_add(old_patterns, local_pattern)
            if print_count < 10:
                print_count += 1
                print("Time on new process one pattern:", time.time() - start_new_process_one_patt)
            #END of same code on new process
            there_is_new_pattern.value = 0
            queue.put("DONE")
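
The key line is the slice copy: new_pattern_on_new_proc[:] acquires the Array's lock once and converts all ten doubles to Python floats in a single call, so iterate_and_add then works on an ordinary list, exactly as the parent process does.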
tdelaney

In this particular case, you are executing the work sequentially in another process rather than parallelising your algorithm, which only adds overhead.

Process creation takes time in its own right, but that is not all: you are also transmitting data through queues (and, in the general case, Manager proxies, which are in practice two queues plus another process). Queues are very slow compared to working on in-memory copies of the data, because every item is pickled on one side and unpickled on the other.
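
To put a rough number on the queue cost, here is a minimal sketch (an illustration, not part of the original answer) that measures the round-trip latency through a pair of multiprocessing.Queue objects; each round trip involves pickling, a pipe write, and unpickling on the other side:

    import multiprocessing
    import time

    def echo(q_in, q_out):
        # Echo every item back until the None sentinel arrives.
        for item in iter(q_in.get, None):
            q_out.put(item)

    if __name__ == "__main__":
        q_in, q_out = multiprocessing.Queue(), multiprocessing.Queue()
        p = multiprocessing.Process(target=echo, args=(q_in, q_out))
        p.start()
        start = time.time()
        for i in range(1000):
            q_in.put(i)
            q_out.get()
        print("Avg queue round trip:", (time.time() - start) / 1000)
        q_in.put(None)
        p.join()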

If you take your code, execute it in another process, and use queues to move data in and out, it will always be slower, which makes it pointless from a performance point of view. There can still be other reasons to do it, for example if your main program needs to do something else in the meantime, such as waiting on I/O.

If you want a performance boost, you should instead create several processes and split your algorithm so that different processes handle different parts of the range, working in parallel. You could also consider multiprocessing.Pool if you want a group of worker processes ready and waiting for more work; this reduces process-creation overhead because you pay it only once (see the sketch below). In Python 3 you can also use concurrent.futures.ProcessPoolExecutor.
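
Here is a minimal sketch of that restructuring, using the data shapes from the question; the worker count and the chunking scheme are illustrative choices, not something prescribed by the original code:

    import multiprocessing
    import random

    def init_worker(patterns):
        # Send old_patterns to each worker once, at pool start-up, so only
        # the small new pattern travels through a queue per task.
        global worker_patterns
        worker_patterns = patterns

    def add_range(args):
        # Same inner loop as iterate_and_add, restricted to one slice.
        lo, hi, new_pattern = args
        for each_pattern in worker_patterns[lo:hi]:
            total = 0
            for count in range(10):
                total += each_pattern[count] + new_pattern[count]

    if __name__ == "__main__":
        old_patterns = [[random.uniform(-1, 1) for _ in range(10)] for _ in range(2000)]
        new_patterns = [[random.uniform(-1, 1) for _ in range(10)] for _ in range(100)]
        n_workers = 4  # illustrative choice
        step = len(old_patterns) // n_workers
        bounds = [(i * step, len(old_patterns) if i == n_workers - 1 else (i + 1) * step)
                  for i in range(n_workers)]
        with multiprocessing.Pool(n_workers, initializer=init_worker,
                                  initargs=(old_patterns,)) as pool:
            for new_pattern in new_patterns:
                pool.map(add_range, [(lo, hi, new_pattern) for lo, hi in bounds])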

Parallel processing is useful, but it is no silver bullet that solves all your problems with little effort. To benefit the most from it, you need to redesign your program to maximise the parallel work and minimise the data transmitted through queues.

Hannu