
Context:

We have a computing cluster of 7 servers, all running Debian 11:

  • a storage server (HDD NAS, ~500 TB, RAID5, LVM)
  • a front-end server ("frontal") running SLURM and nfs-common
  • 5 compute nodes, on which the storage is mounted over NFS (an illustrative mount entry is sketched right after this list).
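For reference, each node mounts the share with an ordinary NFS entry along these lines (the server name, export path and options below are illustrative placeholders, not our exact configuration):

```
# /etc/fstab on each compute node (illustrative values)
# With a "hard" mount (the NFS default), processes block in D state while the
# server or network is unresponsive, instead of getting I/O errors back.
nas:/export/storage  /mnt/storage  nfs  rw,hard,vers=4.2,rsize=1048576,wsize=1048576  0  0
```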

When business users run SLURM jobs on frontal, Python threads are distributed to the nodes, which read and write data on their shared NFS mount.

Everything was working fine until last week, when we lost control of "frontal": we couldn't interact with it via SSH or the local console. We decided to reboot it, and took the opportunity to upgrade its kernel from 5.10.140 to 5.10.162.

Since then, SLURM jobs spend most of their time in uninterruptible sleep ("D" state), and mostly fail.
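To narrow down whether the D state comes from the NFS mount rather than from SLURM or the local disks, a quick check on a node is to look at the kernel wait channel and kernel stack of the stuck processes; a minimal sketch (with `<PID>` standing in for one of the stuck job PIDs):

```
# List processes in uninterruptible sleep together with the kernel function
# they are waiting in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Dump the kernel stack of one stuck process (needs root); frames such as
# rpc_wait_bit_killable or nfs_wait_on_request typically point at a hung
# NFS mount
cat /proc/<PID>/stack
```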

We have rolled back the kernel to version 5.10.140, but the problem remains.

Do you have any ideas?

  • Can you share the output of `iostat -x -k 1`, `nfsiostat 1` and `nfsstat -s` taken on the NFS server when you have jobs in `D` state? – shodanshok Apr 05 '23 at 14:24
  • Thank you for your answer! We are currently running a RAID check on the storage to make sure the disks are not to blame, but I plan to collect that output as soon as the RAID array is available again. – Grégory Hare Apr 06 '23 at 12:54

0 Answers