
The problem is not related to the number of CPUs assigned to the job. Before this problem, I had an error with the Nvidia driver configuration such that the GPUs could not be detected by 'nvidia-smi'. After solving that error by running 'NVIDIA-Linux-x86_64-410.79.run --no-drm', I ran into this new error. I would very much appreciate any help!

PS: Before the first problem, I could run similar jobs without any issues.

command: sbatch md.s
sbatch: error: Batch job submission failed: Requested node configuration is not available
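
This error generally means Slurm cannot find any node whose configuration (CPUs, memory, GRES) satisfies the request. A quick way to compare the request against what Slurm thinks each node offers, using standard sinfo format options (the exact columns here are just one possible choice):

command: sinfo -N -o "%N %c %m %G %T"

This prints, per node, the CPU count, memory, configured GRES (e.g. gpu:Titan) and the node state.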


command: 'sinfo -o "%g %.10R %.20l %.10c"'
GROUPS  PARTITION            TIMELIMIT       CPUS
all gpucompute             infinite         32


command: 'sinfo -Nl'
Thu Sep 24 21:06:35 2020
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
fwb-lab-tesla1      1 gpucompute*       down*   32   32:1:1  64000        0      1   (null) Not responding     
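
Since the node shows down*, it can also help to ask Slurm for the recorded reason and to dump the full node record (standard Slurm commands, with the node name taken from the listing above):

command: sinfo -R
command: scontrol show node fwb-lab-tesla1

sinfo -R lists the reason, user and timestamp for every down/drained node, and scontrol show node prints the node's State and Reason fields in full.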


md.s
#!/bin/bash

#SBATCH --job-name=Seq1_md1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=3GB
#SBATCH --mem-per-cpu=1gb
#SBATCH --gres=gpu:Titan
#SBATCH --mail-user=shirin.jamshidi@kcl.ac.uk
#SBATCH --mail-type=ALL

module purge
module load amber/openmpi/intel/16.06

# Navigate to where the data is
cd /home/SCRATCH/Seq1

mpirun -np 1 pmemd.cuda.MPI -O -i md1.in -o Seq1_md1.out -p Seq1.prmtop -c Seq1_min2.rst -r Seq1_md1.rst -x Seq1_md1.mdcrd -e Seq1_md1.mden -ref Seq1_min2.rst > md1.log
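
A side note on the header (probably unrelated to the submission error): the sbatch documentation treats --mem and --mem-per-cpu as mutually exclusive, and the name after gpu: in --gres has to match what the cluster's gres.conf defines (gpu:Titan is assumed to be correct for this cluster). A minimal header sketch keeping only one of the two memory directives would be:

#!/bin/bash
#SBATCH --job-name=Seq1_md1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1gb
#SBATCH --gres=gpu:Titan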
Charlt

1 Answer


Your sinfo command reports the node as down*, which means it is marked as down by Slurm and its slurmd daemon is not reachable. So there is definitely something wrong with the node, and it is not something you can solve from the user side.
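
For reference, getting the node back usually requires an administrator: check that slurmd is running on the node (and that it still sees the GPUs after the driver reinstall), then return the node to service. Roughly, and assuming a systemd-managed slurmd (a sketch of the usual admin-side steps, not something a regular user can run):

command: systemctl status slurmd                                  # on fwb-lab-tesla1 itself
command: scontrol update NodeName=fwb-lab-tesla1 State=RESUME     # once slurmd responds again

Once the node is back in an idle or allocated state in the gpucompute partition, resubmitting the same md.s should work again.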

Marcus Boden