
I am trying to improve the speed of some code with multiprocessing, and I noticed that the speed does not increase as expected. I know there is overhead for spawning child processes and overhead for transferring data between the parent process and the child processes. However, even after I minimized these overheads, the performance with multiprocessing is still not what I expected. So I wrote a simple test:

import multiprocessing
import numpy as np
import time

def test_function():
    start_time = time.time()
    n = 1000
    x = np.random.rand(n,n)
    p = np.random.rand(n,n)
    y = 0
    for i in range(n):
        for j in range(n):
            y += np.power(x[i][j], p[i][j])

    print ("= Running time:",time.time()-start_time)
    return 

def main():
    procs = [1,2,3,4,5,6]
    for proc in procs:
        print("Number of process:", proc)
        pool = multiprocessing.Pool(processes=proc)
        para = [(),] * proc                 # one empty argument tuple per worker
        pool.starmap(test_function, para)   # run the same no-argument task once in each worker
        pool.close()
        pool.join()

if __name__ == '__main__':
    main()

You can see that the test function only has two loops and some mathematical computations. There is no data transfer between the main process and the child processes, and the time is measured inside each child process, so no overhead should be included. Here is the output:

Number of process: 1
= Running time: 4.253360033035278
Number of process: 2
= Running time: 4.404280185699463
= Running time: 4.411274671554565
Number of process: 3
= Running time: 4.580170154571533
= Running time: 4.59316349029541
= Running time: 4.610152959823608
Number of process: 4
= Running time: 4.908967733383179
= Running time: 4.926954030990601
= Running time: 4.997913122177124
= Running time: 5.09885048866272
Number of process: 5
= Running time: 5.406658172607422
= Running time: 5.441636562347412
= Running time: 5.4576287269592285
= Running time: 5.473618030548096
= Running time: 5.621527671813965
Number of process: 6
= Running time: 6.195171594619751
= Running time: 6.225149869918823
= Running time: 6.256133079528809
= Running time: 6.290108919143677
= Running time: 6.339082717895508
= Running time: 6.3710620403289795

The code was executed under Windows 10 on an i7 CPU with 4 cores and 8 logical processors. Clearly, the running time of each process increases as the number of processes increases. Is this caused by the operating system, by a limitation of the CPU itself, or by other hardware?

Update: here is the output in a Linux environment. It is interesting to see that with 5 processes, the times of 2 of them jump significantly, and with 6 processes, the times of 4 of them jump. It seems to be related to the logical processors: do the physical cores need to switch/share resources between their logical processors?

Number of process: 1
= Running time: 4.039047479629517
Number of process: 2
= Running time: 4.150756597518921
= Running time: 4.159530878067017
Number of process: 3
= Running time: 4.228744745254517
= Running time: 4.261997938156128
= Running time: 4.324823379516602
Number of process: 4
= Running time: 4.342475891113281
= Running time: 4.347326755523682
= Running time: 4.350982427597046
= Running time: 4.370999574661255
Number of process: 5
= Running time: 4.369337797164917
= Running time: 4.391499757766724
= Running time: 4.43767237663269
= Running time: 6.300408124923706
= Running time: 6.31215763092041
Number of process: 6
= Running time: 4.366948366165161
= Running time: 4.38712739944458
= Running time: 6.366809844970703
= Running time: 6.370593786239624
= Running time: 6.422687530517578
= Running time: 6.433435916900635
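
One way to test this idea would be to pin each worker to a chosen logical CPU and compare a run on 4 separate physical cores with a run where two workers share each core. Below is a rough sketch of such an experiment (Linux only, using os.sched_setaffinity; the CPU-id groupings are only assumptions, and the actual core/thread numbering should be checked with lscpu -e):

import multiprocessing as mp
import os
import time
import numpy as np

def worker(cpu_id):
    os.sched_setaffinity(0, {cpu_id})    # restrict this process to one logical CPU
    start_time = time.time()
    n = 1000
    x = np.random.rand(n, n)
    p = np.random.rand(n, n)
    y = 0
    for i in range(n):
        for j in range(n):
            y += np.power(x[i][j], p[i][j])
    print("CPU", cpu_id, "running time:", time.time() - start_time)

def main():
    # assumed layouts: [0, 2, 4, 6] = four different physical cores,
    # [0, 1, 2, 3] = two physical cores with both hyperthreads busy
    for cpu_ids in ([0, 2, 4, 6], [0, 1, 2, 3]):
        print("Pinned to logical CPUs:", cpu_ids)
        procs = [mp.Process(target=worker, args=(c,)) for c in cpu_ids]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()

if __name__ == '__main__':
    main()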
Y. Zhang

3 Answers


Short answer: the increases at low process counts are likely caused by the OS, but you haven't provided the data needed for a real analysis.

Long answer: would entail an introduction to operating systems.

In your post, you claim

the time is calculated inside the child process, so no overhead will be included.

This is false. You are measuring elapsed time, a.k.a. "wall-clock time". Any OS overhead is included in that time: garbage collection, context switching, etc.

To understand this properly, you need to profile the whole system, not merely one application. What else is running on your system while these processes execute? Since this is Windows, it's virtually guaranteed that your four cores have things to do other than the Python RTE (run-time environment). To see what happens in your multi-process application, run it under a dynamic profiler and watch which other processes are active while the Python processes run. Graph the activity by process or job; I expect you'll see several system services working as well.

For a simpler, less accurate metric of process activity, look up how to extract from Windows the CPU consumption for each process.
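
As an illustration (a sketch of the idea, not your original test), the worker can record both wall-clock time and its own CPU time with time.process_time(); time the process spends not running on a CPU then shows up as the gap between the two numbers:

import multiprocessing
import time
import numpy as np

def test_function():
    wall_start = time.time()
    cpu_start = time.process_time()      # CPU time consumed by this process only
    n = 1000
    x = np.random.rand(n, n)
    p = np.random.rand(n, n)
    y = 0
    for i in range(n):
        for j in range(n):
            y += np.power(x[i][j], p[i][j])
    # the difference between the two deltas is time this process spent off the CPU
    print("wall:", time.time() - wall_start, "cpu:", time.process_time() - cpu_start)

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=test_function) for _ in range(6)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()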

Prune
  • When I talk about "overheads" I mean the time for the system to spawn child processes and to close them after the multiprocessing is done. And I don't think garbage collection or context switching has anything to do with this slowdown, since the timer is started and stopped inside the child process. Of course I don't believe the CPU runs slower with more processes; I just don't know where the extra time is spent by the CPU. The time I care about is the wall-clock time, because that's the time I am waiting for the program to finish. – Y. Zhang Mar 27 '21 at 23:59
  • I know there are many background processes and services, so I made sure no big jobs requesting heavy I/O or CPU time were running in the background during the test, and the Task Manager showed ~2% CPU load. I intentionally set the maximum number of processes to 6 so there are still 2 logical processors left to take care of other things. Even so, the wall time increases steadily as the number of processes increases. With 6 processes, the time is 50% higher than with 1 process. If that were caused by system services, it would mean more than 15-20% CPU load in the background all the time, which is obviously not the case. – Y. Zhang Mar 28 '21 at 00:05
  • Okay, you're thinking about the right things. However, without hard data posted here, *we* are just guessing. That makes this a poor fit for Stack Overflow. – Prune Mar 28 '21 at 01:19

Have you figured out the problem? I had the same issue with multiprocessing. I found that if you add a certain delay (not too small) between starting the different processes, the time consumption of each process drops back down to the same value as with a single process. However, we then gain nothing from multiprocessing because of the delay. It's really confusing.

import multiprocessing as mp
import numpy as np
import time

def test_function():
    start_time = time.time()
    n = 1000
    x = np.random.rand(n,n)
    p = np.random.rand(n,n)
    y = 0
    for i in range(n):
        for j in range(n):
            y += np.power(x[i][j], p[i][j])

    print ("= Running time:",time.time()-start_time)
    return 

def main():
    N = 6
    procs = []
    for ii in range(N):
        procs.append(mp.Process(target=test_function))
    
    for p in procs:
        p.start()
        time.sleep(2)   # stagger the start of each process by 2 seconds
    
    for p in procs:
        p.join()

if __name__ == '__main__':
    main()
Xiaowei
  • I think your problem is different from mine. Here is my guess about my problem: modern CPUs have multiple cores, and each core can be split into hardware threads, typically 2 threads per core. My i7 CPU has 4 cores and 8 threads. The cores are physically independent of each other, while the 2 threads within one core are only logically independent and have to share/compete for resources. Therefore a task running in a thread on a dedicated core is faster than the same task running in a thread that shares its core with another busy thread. – Y. Zhang Dec 16 '22 at 16:50
  • Therefore, if I launch 4 processes, the OS will assign them to threads in 4 different cores, and all 4 can run at top speed. If I launch more than 4 processes, some cores will have both of their threads fully occupied, and the processes on those threads will slow down. – Y. Zhang Dec 16 '22 at 16:52
  • In your problem, the delay is caused by the OS allocating resources for the processes. Every time a new process is required, there is some overhead for the OS to create it; in my experience it can take seconds or even longer. If your task only needs less than 20 seconds of CPU time, it does not seem worth using 6 processes because of that overhead. But if your task needs a very long time to finish, e.g. 60 minutes of computation, then 6 processes should significantly reduce the running time of the program. – Y. Zhang Dec 16 '22 at 17:16
  • If your task consists of many small jobs, e.g. 40000 jobs with only 40 CPUs available, don't create 40000 Process objects; instead use multiprocessing.Pool, which saves a lot of time (see the sketch after these comments). Here is a nice article about Process vs. Pool: https://superfastpython.com/multiprocessing-pool-vs-process/ – Y. Zhang Dec 16 '22 at 17:20
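
For illustration, a minimal Pool-based sketch of that pattern (the tiny job and the 40000-job / 4-worker counts are made-up example numbers, not measurements):

import multiprocessing
import numpy as np

def small_job(seed):
    # one small, independent piece of work
    rng = np.random.default_rng(seed)
    x = rng.random((50, 50))
    p = rng.random((50, 50))
    return float(np.power(x, p).sum())

if __name__ == '__main__':
    # 40000 small jobs served by a fixed pool of 4 worker processes,
    # instead of creating 40000 Process objects
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(small_job, range(40000), chunksize=100)
    print(len(results), "jobs done")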

I think we should not focus on the execution time of each individual process. Instead, it is more meaningful to look at the total time consumption for a fixed amount of work. Please check out the following code:

import multiprocessing as mp
import numpy as np
import time

def test_function(x,p,N_loop):
    start_time = time.time()
    y = 0
    for i in range(N_loop):
        y += np.power(x, p)
    print ("= Running time:",time.time()-start_time)
    return 

def main():
    N_total = 6000              # total loops
    N_core = 6                  # number of processes
    Ni = int(N_total/N_core)    # loops for each process

    # data
    n = 200
    x = np.random.rand(n,n)
    p = np.random.rand(n,n)

    procs = []
    for ii in range(N_core):                        # one process per worker
        procs.append(mp.Process(target=test_function, args=(x, p, Ni)))

    st = time.time()
    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()

    print(f'total time: {time.time()-st}')

if __name__ == '__main__':
    main()

The above code computes the sum of pow(x, p) 6000 times in total. The total time consumption t6 for N_core = 6 is significantly less than t1 for N_core = 1, although t6 > (t1 / 6). So using N processes does not divide the running time by N. The reason may be that the CPU cores always work together or share some common resources, through a mechanism defined by the OS, even when only one process exists.
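
To put t1 and t6 side by side in a single run, main() can be changed to loop over both settings; a sketch of that variant (it reuses test_function and the imports from the code above) is:

def main():
    N_total = 6000               # total loops, kept the same for both runs
    n = 200
    x = np.random.rand(n, n)
    p = np.random.rand(n, n)

    for N_core in (1, 6):        # same total workload, different numbers of processes
        Ni = int(N_total / N_core)
        procs = [mp.Process(target=test_function, args=(x, p, Ni))
                 for _ in range(N_core)]
        st = time.time()
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()
        print(f'N_core = {N_core}, total time: {time.time() - st}')

if __name__ == '__main__':
    main()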

Xiaowei