
I want to run a Fortran code called orbits_01 on SLURM, and I want to run multiple jobs simultaneously (i.e. parallelize over multiple cores). Once started, each orbits_01 program calls another executable called optimizer, and the optimizer repeatedly calls a Python script called relax.py. When I submitted the jobs to SLURM with `sbatch python main1.py`, the jobs failed to even call the optimizer. However, the whole scheme works fine when I run it locally. The local process status is shown below:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
shuha    39395  0.0  0.0 161540  3064 ?        S    Oct22   0:19 sshd: shuha@pts/72
shuha    39396  0.0  0.0 118252  5020 pts/72   Ss   Oct22   0:11  \_ -bash
shuha    32351  0.3  0.0 318648 27840 pts/72   S    02:08   0:00      \_ python3 main1.py
shuha    32968  0.0  0.0 149404  1920 pts/72   R+   02:10   0:00      \_ ps uxf
shuha    32446  0.0  0.0  10636  1392 pts/72   S    02:08   0:00 ../orbits_01.x
shuha    32951  0.0  0.0 113472  1472 pts/72   S    02:10   0:00  \_ sh -c ./optimizer >& log
shuha    32954  0.0  0.0 1716076 1376 pts/72   S    02:10   0:00      \_ ./optimizer
shuha    32955  0.0  0.0 113472  1472 pts/72   S    02:10   0:00          \_ sh -c python relax.py > relax.out
shuha    32956 99.6  0.0 749900 101944 pts/72  R    02:10   0:02              \_ python relax.py
shuha    32410  0.0  0.0  10636  1388 pts/72   S    02:08   0:00 ../orbits_01.x
shuha    32963  0.0  0.0 113472  1472 pts/72   S    02:10   0:00  \_ sh -c ./optimizer >& log
shuha    32964  0.0  0.0 1716076 1376 pts/72   S    02:10   0:00      \_ ./optimizer
shuha    32965  0.0  0.0 113472  1472 pts/72   S    02:10   0:00          \_ sh -c python relax.py > relax.out
shuha    32966  149  0.0 760316 111992 pts/72  R    02:10   0:01              \_ python relax.py
shuha    32372  0.0  0.0  10636  1388 pts/72   S    02:08   0:00 ../orbits_01.x
shuha    32949  0.0  0.0 113472  1472 pts/72   S    02:10   0:00  \_ sh -c ./optimizer >& log
shuha    32950  0.0  0.0 1716076 1376 pts/72   S    02:10   0:00      \_ ./optimizer
shuha    32952  0.0  0.0 113472  1472 pts/72   S    02:10   0:00          \_ sh -c python relax.py > relax.out
shuha    32953  100  0.0 749892 101936 pts/72  R    02:10   0:03              \_ python relax.py

I have a main Python script called main1.py, which loops over job directories and launches multiple orbits_01 jobs at the same time, then waits for all of them to finish. In the process tree above, 3 parent orbits_01 jobs are running in parallel, and each parent job spawns its own chain of child processes. The heavy computation is done by the Python code relax.py, so each job chain should only need a single core. What is the best way to submit and parallelize multiple parent jobs, each with its own child jobs, across all cores of one node on SLURM?
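For reference, main1.py essentially does the following (a simplified sketch: the directory names are placeholders, and the real script launches each job with `os.system('./orbits_01.x < input.dat &')`; `subprocess.Popen` is used here only so the parent can wait on its children):

```python
import os
import subprocess

# Simplified sketch of main1.py: create one directory per job, launch
# orbits_01.x in each of them, then wait for all jobs to finish.
njobs = 3
procs = []
for i in range(njobs):
    jobdir = f'job_{i:02d}'              # placeholder directory name
    os.makedirs(jobdir, exist_ok=True)
    # input.dat is assumed to already exist in each job directory
    p = subprocess.Popen('../orbits_01.x < input.dat', shell=True, cwd=jobdir)
    procs.append(p)

# Block until every orbits_01.x (and thus its optimizer/relax.py children) exits.
for p in procs:
    p.wait()
```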

Shaun Han
  • How are jobs submitted? With a submission script or with a Python package? – damienfrancois Oct 23 '20 at 11:14
  • The jobs are submitted by simply submitting a Python script ```main1.py```, which creates multiple directories and calls ```os.system('./orbits_01.x < input.dat &')``` in each of them. – Shaun Han Oct 23 '20 at 11:24

0 Answers