
I'm encountering a Slurm error.

I logged into the Slurm controller to verify that Slurm is working properly:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      8   idle ip-192-168-73-[129,179],.....

I then checked whether the nodes are properly recognized:

$ scontrol show nodes

I'm unsure what is meant by "invalid feature specification".

The corresponding salloc command is:

N2=$(($N*2))
salloc -N $N2 '--constraint=[worker*"$N"&server*"$N"] test' \
           $CLUSTER_SHARED_FOLDER/scripts/$CLUSTER_CONTROLLER_SH \
           2>&1 | tee $CLUSTER_OUTPUT_LOG
Chaitanya Bapat

1 Answer


The error occurred because the cluster consisted of only 8 nodes, whereas the salloc command was:

N2=$(($N*2))
salloc -N $N2 '--constraint=[worker*"$N"&server*"$N"] test' \
           $CLUSTER_SHARED_FOLDER/scripts/$CLUSTER_CONTROLLER_SH \
           2>&1 | tee $CLUSTER_OUTPUT_LOG

That is, the command requested N worker nodes plus N server nodes (2N nodes in total), which was more than the number of nodes in the cluster. This mismatch caused salloc to fail with the "invalid feature specification" error.
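A request of this kind can be sanity-checked before calling salloc. The following is only a minimal sketch: the helper name `check_node_request` is hypothetical, and the example value N=8 is an assumption, since the value of N used in the question is not stated.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the requested node count fits the cluster.
#   $1 = N (nodes requested per role)
#   $2 = number of roles (2 for worker+server, 1 for worker only)
#   $3 = number of nodes available in the partition
check_node_request() {
    requested=$(( $1 * $2 ))
    if [ "$requested" -gt "$3" ]; then
        echo "requested $requested nodes, only $3 available" >&2
        return 1
    fi
    # Print the total so it can be passed to salloc -N.
    echo "$requested"
}

# Example for the 8-node cluster; N=8 is an assumed value.
check_node_request 8 2 8 || echo "request too large"
check_node_request 8 1 8
```

With these assumed numbers, the worker+server request asks for 16 nodes and is rejected, while the worker-only request asks for 8 and passes.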

The fix was to make the number of workers requested equal to the number of nodes available:

N2=$(($N))
salloc -N $N2 '--constraint=[worker*"$N"] test' \
           $CLUSTER_SHARED_FOLDER/scripts/$CLUSTER_CONTROLLER_SH \
           2>&1 | tee $CLUSTER_OUTPUT_LOG
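To choose a value of N that the cluster can actually satisfy, one option is to count how many nodes advertise a given feature. The sketch below assumes the node list arrives as lines of "nodename feature1,feature2,...", which is the shape printed by `sinfo -h -N -o "%N %f"`. The helper name `count_feature_nodes` and the sample node names are made up for illustration, and a node that belongs to several partitions may be listed more than once in real sinfo output.

```shell
#!/bin/sh
# Hypothetical helper: count nodes that advertise the feature given as $1.
# Reads "<nodename> <feat1,feat2,...>" lines on stdin.
count_feature_nodes() {
    awk -v want="$1" '
        {
            n = split($2, feats, ",")
            for (i = 1; i <= n; i++)
                if (feats[i] == want) { count++; break }
        }
        END { print count + 0 }
    '
}

# Made-up sample listing; on a real cluster, pipe sinfo output instead:
#   sinfo -h -N -o "%N %f" | count_feature_nodes worker
sample='node-a worker
node-b worker,gpu
node-c server'
printf '%s\n' "$sample" | count_feature_nodes worker
```

The resulting count gives an upper bound for the `worker*"$N"` part of the constraint.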
Chaitanya Bapat