
I'm using multiprocessing on a cluster, on a single node with 20 cores. Although I reserve only 10 CPUs (-n 1 and -c 10 in Slurm) and the multiprocessing Pool is started with 8 workers, I can see in the cluster's monitor (Ganglia) that the load far exceeds the number of CPUs reserved. With this setting I get around 30 processes running on the node.
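For reference, the submission script is roughly this sketch (the script name and its arguments are placeholders; only -n 1 and -c 10 are the actual settings):

#!/bin/bash
#SBATCH -n 1    # one task
#SBATCH -c 10   # ten CPUs for that task

python run_jobs_script.py ...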

[screenshot of the Ganglia monitor showing the node load]

I don't understand why I'm getting more processes than the number of workers I instantiate. The problem is worse if I reserve 20 CPUs and let Pool set the number of workers automatically: the number of processes jumps to about 100. The real problem is that the code cannot run under these conditions, because the admins cancel (after a few hours) any job that spawns more processes than the number of CPUs in its node.
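(For completeness: mp.Pool() with no process count defaults to os.cpu_count(), which sees all 20 cores of the node regardless of the Slurm reservation. A sketch of sizing the pool from the allocation instead, via the SLURM_CPUS_PER_TASK variable that Slurm sets when -c is given:)

import os
import multiprocessing as mp

# Size the pool from the Slurm allocation rather than from the node;
# mp.Pool() with no argument would use os.cpu_count() (20 here).
n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
pool = mp.Pool(n_workers)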

My code basically solves a large linear algebra problem that can be split into independent blocks, and its structure is like this:

import pandas as pd
import numpy as np
import multiprocessing as mp

class storer:
    res = pd.DataFrame(columns=['A','B',...])


def job(manuf, week):
    # Some CPU-intensive work using the global data and np.linalg
    return res

def child_initialize(_data):
    global data
    data = _data

def err_handle(err):
    raise err

def join_results(job_res):
    # Callbacks run one at a time in the pool's result-handler thread.
    # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
    storer.res = pd.concat([storer.res, job_res], ignore_index=True)

def run_jobs(data, grid, output_file):
    pool = mp.Pool(8, initializer=child_initialize,
                   initargs=(data, ))
    for idx, row in grid.iterrows():
        pool.apply_async(job,
                         args=(row[0], row[1]),
                         callback=join_results,
                         error_callback=err_handle)
    pool.close()
    pool.join()
    storer.res.to_csv(output_file)
    return True

if __name__ == "__main__":
    #get data, grid, and output_file from sys.argv and from some csv
    run_jobs(data, grid, output_file)
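One thing that may matter when reproducing this: each np.linalg call can fan out into BLAS threads (often one per core) inside every worker. If that turns out to be the cause, a common mitigation is to cap the BLAS thread pools, for example (a sketch; it assumes an OpenBLAS- or MKL-backed NumPy, and the variables must be set before NumPy is first imported):

import os

# Cap the BLAS/OpenMP thread pools to one thread per worker process.
# These must be set before the first `import numpy`.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # imported after the caps on purpose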
