
I have some Python code which takes approximately 12 hours to run on my laptop (macOS, 16GB 2133 MHz LPDDR3). The code loops over a few thousand iterations, doing some intensive processing at each step, so it makes sense to parallelise the problem with MPI. I have access to a Slurm cluster, where I have built mpi4py (for Python 2.7) against their OpenMPI installation with mpicc. I then submit the following submission script with `sbatch --exclusive mysub.sbatch`:

#!/bin/bash
#SBATCH -p par-multi
#SBATCH -n 50
#SBATCH --mem-per-cpu=8000
#SBATCH -t 48:00:00
#SBATCH -o %j.log
#SBATCH -e %j.err

module add eb/OpenMPI/gcc/3.1.1

mpirun python ./myscript.py

which should split the tasks across 50 processors, each with an 8 GB memory allocation. My code does something like the following:

import numpy as np
import pickle
from mpi4py import MPI  # import the MPI module itself so MPI.COMM_WORLD is defined

COMM = MPI.COMM_WORLD

def split(container, count):
    return [container[_i::count] for _i in range(count)]
    
def read():
    #function which reads a series of pickle files from my home directory
    return data
    
def function1():
    #some process 1
    return f1

def function2():
    #some process 2
    return f2

def main_function(inputs):
    #some process which also calls function1 and function2
    f1 = function1(inputs)
    f2 = function2(f1)
    result = #some more processing
    return result
    
### define global variables and read data ###
data = read()
N = 5000
#etc...

selected_variables = range(N)

if COMM.rank == 0:
    splitted_jobs = split(selected_variables, COMM.size)
else:
    splitted_jobs = None

scattered_jobs = COMM.scatter(splitted_jobs, root=0)

results = []
for index in scattered_jobs:
    outputs = main_function(data[index])
    results.append(outputs)
results = COMM.gather(results, root=0)
        
if COMM.rank == 0:
    all_results = []
    for r in results:
        all_results.append(r)
        
    f = open('result.pkl','wb')
    pickle.dump(np.array(all_results),f,protocol=2)
    f.close()

The maximum run time I can allocate for my job is 48 hours, and even then the job has not finished running. Could anyone tell me if there is something in either my submission script or my code that is likely causing it to be this slow?

Thanks

  • Did you run the same (single-process) job on the cluster? First make sure running 50 MPI tasks (on the same node?) does not cause memory swapping. Then start by having each rank print `COMM.rank` and `COMM.size`. If this is what you expect, then have each rank print its `scattered_jobs` to make sure `split` and `scatter` did what you expect (a minimal sketch of these checks follows this thread). Keep in mind that if your job is memory bound, running many tasks on the **same** node will scale poorly. – Gilles Gouaillardet Oct 18 '20 at 10:48
  • Thanks for your reply Gilles. I have tried to run the same single process job on the cluster and it is also suffering from very long run times. I'm not sure how to check for memory swap, but I did do some investigation which showed that I don't actually need 8GB per cpu, so I tried running it again with lower memory (no success). For 50 cpus the job is assigned by default to run across 3 nodes. I also tried increasing this to 5 nodes but still no luck. Printing `COMM.rank`, `COMM.size` and `scattered_jobs` returns what I am expecting so I'm not sure where to go from here. – Will Gregory Oct 20 '20 at 08:33
  • Did you run on the cluster with a single MPI task? – Gilles Gouaillardet Oct 20 '20 at 09:06
  • Ah, no, it was run as a serial job. I will run with a single MPI task and report back. – Will Gregory Oct 20 '20 at 09:36
  • If the very same serial job runs much slower on the cluster, then you can rule out MPI. Is your job I/O intensive? Are you alone on a given node of the cluster? What about CPU generations, clocks and turbo mode if applicable? – Gilles Gouaillardet Oct 20 '20 at 10:43
  • Ok, makes sense. The job only reads data once at the beginning and writes once at the end. In between I am doing several thousand iterations, where at each step I perform a numerical optimisation with `scipy.optimize.minimize`, and the minimiser calls a function `f` which contains a step with O(n^3) run-time complexity. I doubt I am alone on the given node of the cluster, although when I run the MPI job I do specify the `--exclusive` flag. I have sent an email to the Helpdesk asking about your other comments as I am unsure myself. – Will Gregory Oct 20 '20 at 12:00
  • an update on this: I managed to get the job to finish in a reasonable time (< 1 hour) by splitting all of the processors across different nodes, i.e. in my submission script I have `-n 50` and `-N 50`. I'm not totally sure why the serial job was taking so long but at least it is working with MPI. – Will Gregory Oct 24 '20 at 13:49
  • that typically occurs if your application is memory bound, or performs massive I/O on the local filesystem (e.g. `/tmp`) – Gilles Gouaillardet Oct 24 '20 at 13:59
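
Following the debugging suggestion in the first comment, a minimal sketch of those checks might look like the snippet below. It reuses the round-robin `split` helper and `N = 5000` from the question; the file name `check_mpi.py` and the exact print format are only for illustration. It would be launched the same way as the main script, e.g. `mpirun python ./check_mpi.py` inside the sbatch script.

from mpi4py import MPI

COMM = MPI.COMM_WORLD

def split(container, count):
    # same round-robin split as in the question
    return [container[_i::count] for _i in range(count)]

# every rank reports itself, confirming mpirun launched as many ranks as requested
print("rank %d of %d" % (COMM.rank, COMM.size))

N = 5000
if COMM.rank == 0:
    splitted_jobs = split(range(N), COMM.size)
else:
    splitted_jobs = None

scattered_jobs = COMM.scatter(splitted_jobs, root=0)

# each rank reports how many indices it was handed, so an empty or
# badly balanced split is easy to spot in the job's .log file
print("rank %d received %d indices, first few: %s"
      % (COMM.rank, len(scattered_jobs), list(scattered_jobs[:3])))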
