I want to run a Fortran code called orbits_01 on SLURM, with multiple jobs running simultaneously (i.e., parallelized over multiple cores). Each orbits_01 program then calls another executable called optimizer, and the optimizer repeatedly calls a Python script called relax.py. When I submitted the jobs to SLURM with sbatch python main1.py, the jobs failed to even call the optimizer (a minimal batch-script sketch follows the process listing below). However, the whole scheme works fine when I run it locally. The local process status is shown below:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
shuha 39395 0.0 0.0 161540 3064 ? S Oct22 0:19 sshd: shuha@pts/72
shuha 39396 0.0 0.0 118252 5020 pts/72 Ss Oct22 0:11 \_ -bash
shuha 32351 0.3 0.0 318648 27840 pts/72 S 02:08 0:00 \_ python3 main1.py
shuha 32968 0.0 0.0 149404 1920 pts/72 R+ 02:10 0:00 \_ ps uxf
shuha 32446 0.0 0.0 10636 1392 pts/72 S 02:08 0:00 ../orbits_01.x
shuha 32951 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c ./optimizer >& log
shuha 32954 0.0 0.0 1716076 1376 pts/72 S 02:10 0:00 \_ ./optimizer
shuha 32955 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c python relax.py > relax.out
shuha 32956 99.6 0.0 749900 101944 pts/72 R 02:10 0:02 \_ python relax.py
shuha 32410 0.0 0.0 10636 1388 pts/72 S 02:08 0:00 ../orbits_01.x
shuha 32963 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c ./optimizer >& log
shuha 32964 0.0 0.0 1716076 1376 pts/72 S 02:10 0:00 \_ ./optimizer
shuha 32965 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c python relax.py > relax.out
shuha 32966 149 0.0 760316 111992 pts/72 R 02:10 0:01 \_ python relax.py
shuha 32372 0.0 0.0 10636 1388 pts/72 S 02:08 0:00 ../orbits_01.x
shuha 32949 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c ./optimizer >& log
shuha 32950 0.0 0.0 1716076 1376 pts/72 S 02:10 0:00 \_ ./optimizer
shuha 32952 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c python relax.py > relax.out
shuha 32953 100 0.0 749892 101936 pts/72 R 02:10 0:03 \_ python relax.py
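For completeness, a minimal batch script for this kind of submission would presumably look something like the sketch below; the #SBATCH directives (job name, CPU count) are my assumptions rather than settings I have verified:

#!/bin/bash
#SBATCH --job-name=orbits        # hypothetical job name
#SBATCH --nodes=1                # everything runs on a single node
#SBATCH --ntasks=1               # main1.py itself is one process...
#SBATCH --cpus-per-task=3        # ...which spawns 3 parallel orbits_01 chains

python3 main1.py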
I have a main Python script called main1.py, which runs a for loop to launch multiple orbits_01 jobs at the same time and then waits for all of them to finish (a simplified sketch of main1.py is given after this paragraph). Here three parent orbits_01 jobs are running in parallel, and each parent job has multiple child processes. The heavy computation is done by the Python code relax.py, so each job chain should be able to run using only one core. What is the best way to submit and parallelize multiple parent jobs, each with multiple child jobs, across all cores of one node on SLURM?
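For context, main1.py is essentially the following; the sketch is simplified, and the per-job working directories (job_0, job_1, ...) are illustrative:

import subprocess

NJOBS = 3  # number of parallel orbits_01 parent jobs

procs = []
for i in range(NJOBS):
    # launch each parent job in its own working directory
    p = subprocess.Popen(["../orbits_01.x"], cwd=f"job_{i}")
    procs.append(p)

# wait for every parent job (and its optimizer/relax.py children) to finish
for p in procs:
    p.wait()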