I have a cluster of many nodes, each with many cores, and I simply want to run thousands of jobs on them, each requiring a single CPU, preferably with sbatch. After going through the documentation for several hours I still run into problems. My current setup is:
#!/bin/bash
#SBATCH --nodes=4               # 4 nodes x 25 tasks = 100 tasks total
#SBATCH --ntasks-per-node=25
#SBATCH --distribution=block
srun ./my_experiment            # launches one instance per task
I start several of these with sbatch and they seem to queue up nicely.
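For completeness, I submit them with a simple loop, roughly like this (the count and the script name experiment_batch.sh are just examples, not my exact setup):

for i in $(seq 1 10); do        # submit several copies of the batch script
    sbatch experiment_batch.sh
done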
This script starts 100 instances of my_experiment, which is intended. Unfortunately, the job seems to hold on to all 100 CPUs even when 99 of the experiments have already finished. How can I avoid this?
Secondly, the jobs don't seem to share nodes with each other, even though the nodes have 40+ cores each.
Is it even possible to sbatch a bunch of tasks and have them release their resources individually?
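To illustrate what I mean (this is only a sketch of the behavior I am after, not something I have working), I imagine something like a job array, where each experiment is its own single-CPU job that frees its CPU as soon as it exits:

#!/bin/bash
#SBATCH --array=1-100           # 100 independent single-CPU array tasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
srun ./my_experiment "$SLURM_ARRAY_TASK_ID"   # each task could pick its input by index (just an example)

Would something like this give each experiment its own allocation, so finished experiments stop occupying CPUs?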