I am running a job on a multinode cluster with Slurm, OpenMPI, and Python (Anaconda with MKL). When I submit the job, everything seems to work as expected. However, if I log in to one of the nodes running the job and use htop to inspect the running processes, I see the processes I started, and for each of them ten more "clone" processes that occupy the same memory but show 0% CPU load (only the PID and the CPU% differ; everything else is identical).

Can anyone explain this behavior?

Thanks!

P.S. Here is the batch script I use to submit the jobs:

#!/bin/zsh
#SBATCH --job-name="DSC on Natims"
#SBATCH -n 16
#SBATCH -N 8
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=20G
#SBATCH --output="log_dsc%j.out"
#SBATCH --error="log_dsc%j.err"
mpiexec -iface bond0 python dsc_run.py

1 Answer


These are threads started by the program, so they are part of the same process. Toggle the display of process threads by pressing uppercase "H" in htop to see the difference. You can also press F2 to open the Setup menu; under Display options there is a setting to show threads in a different color.
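Since the question mentions Anaconda with MKL, those worker threads are most likely spawned by MKL's (or OpenMP's) threading layer. As a quick command-line check that the extra htop entries are threads of a single process rather than separate copies, you can list them with ps; this is a sketch that assumes a placeholder PID of 12345 for one of your python ranks:

# list every thread (LWP) of one python rank; 12345 is a placeholder PID
ps -T -p 12345 -o pid,spid,pcpu,pmem,comm

# or count the threads belonging to your dsc_run.py processes
ps -eLf | grep '[d]sc_run.py' | wc -l

If the SPID column shows many entries sharing the same PID, they are threads of one process, which matches what htop displays before you hide threads with "H".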
