I am running a complicated Python method (a single run takes about 3-5 minutes) over multiple cases using multiprocess.
I noticed that the running speed is normal when I first kick off the program: I can get about 20-30 outputs in 5 minutes using 30 cores. However, as time goes by the performance gradually degrades; at some point I only get 5 outputs in 30 minutes. If I kill the program and re-run it, it follows the same pattern. What might be the reason? Is it overhead? What can I do? Below is a sample of the parallel code; each worker loads a pickled class instance:
import multiprocess as mp
import os
import pickle

def run_all_cases(input_folder):
    pickle_files = [x for x in os.listdir(input_folder)]
    jobs = [(file, input_folder) for file in pickle_files]
    num_process = max(mp.cpu_count() - 1, 1)
    with mp.Pool(processes=num_process) as pool:
        pool.starmap(run_single_case, jobs)

def run_single_case(file_name, input_folder):
    print(f"started {file_name} using {os.getpid()}")
    with open(os.path.join(input_folder, file_name), "rb") as f:
        data = pickle.load(f)
    # a complicated method in a class
    data.run_some_method()
    with open(f"{file_name.split('_')[0]}_output.pkl", "wb") as f:
        pickle.dump(data, f)
    print(f"finished {file_name} using {os.getpid()}")
Also, when I print out the process ID, it changes over time. Is this expected (file_8 starts under a brand-new process ID)? The output looks something like this (when using 5 cores):
started file_1 using 8001
started file_2 using 8002
started file_3 using 8003
started file_4 using 8004
started file_5 using 8005
finished file_1 using 8001
started file_6 using 8001
finished file_2 using 8002
started file_7 using 8002
started file_8 using 8006 # <-- a new process ID appears instead of reusing an existing one; is this expected?
finished file_3 using 8003
...
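If it helps, this is a minimal toy pool I could use to check whether worker PIDs get reused; dummy_task and the task count are made up for the test and are not part of my real code:

    import multiprocess as mp
    import os
    import time

    def dummy_task(i):
        # trivial task, just report which worker process handled it
        time.sleep(0.1)
        return (i, os.getpid())

    if __name__ == "__main__":
        with mp.Pool(processes=5) as pool:
            for i, pid in pool.imap_unordered(dummy_task, range(20)):
                print(f"task {i} ran in process {pid}")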
=================================
UPDATES: I dug deeper into some of the instances whose worker process died after handling them. When I ran one of those instances on its own, after some time the terminal printed:
zsh: killed python3 test.py
I suspect this is the issue: the worker process is killed, presumably because of memory pressure (or possibly some other reason), and the parent process does not start a new process to continue the remaining jobs, so the number of live worker processes decreases over time, and that is why I see the overall performance drop.
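If that hypothesis is right, one thing I am considering is recycling workers and catching failures explicitly, something like the sketch below. The maxtasksperchild=1 setting, the wrapper function, and the renamed driver are my additions, not part of the original code, and a try/except obviously cannot catch an OS-level kill; the idea is only to bound per-worker memory growth between tasks and to make Python-level failures visible:

    def run_single_case_safe(file_name, input_folder):
        # wrap the real worker so a single bad case doesn't silently disappear
        try:
            run_single_case(file_name, input_folder)
            return (file_name, "ok")
        except Exception as exc:
            return (file_name, f"failed: {exc}")

    def run_all_cases_recycled(input_folder):
        pickle_files = os.listdir(input_folder)
        jobs = [(file, input_folder) for file in pickle_files]
        num_process = max(mp.cpu_count() - 1, 1)
        # maxtasksperchild=1 makes the pool replace each worker after one task,
        # which limits how much memory a single worker can accumulate
        with mp.Pool(processes=num_process, maxtasksperchild=1) as pool:
            results = pool.starmap(run_single_case_safe, jobs)
        for name, status in results:
            print(name, status)

Of course, if a single case by itself exceeds available memory, recycling workers will not save it, but at least the pool would keep replacing workers between tasks rather than slowly losing capacity.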