
Context:

We have a computing cluster of 7 servers, all running Debian 11:

  • a storage server (HDD NAS, ~500 TB, RAID5, LVM)
  • a front-end server ("frontal") running SLURM and nfs-common
  • 5 compute nodes, on which the storage is mounted over NFS (an illustrative mount entry is sketched right after this list).
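For reference, each node mounts the share with an ordinary NFS entry along these lines (the server name, export path and options below are illustrative placeholders, not our exact configuration):

```
# /etc/fstab on each compute node (illustrative values)
# With a "hard" mount (the NFS default), processes block in D state while the
# server or network is unresponsive, instead of getting I/O errors back.
nas:/export/storage  /mnt/storage  nfs  rw,hard,vers=4.2,rsize=1048576,wsize=1048576  0  0
```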

When business users run SLURM jobs on frontal, Python threads are distributed to the nodes, which read and write data on their shared NFS mount.

Everything was working fine until last week, when we lost control of "frontal": we couldn't interact with it via SSH or the local console. We decided to reboot it, and took the opportunity to upgrade its kernel from 5.10.140 to 5.10.162.

Since then, SLURM jobs spend most of their time in uninterruptible sleep ("D" state), and mostly fail.
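To narrow down whether the D state comes from the NFS mount rather than from SLURM or the local disks, a quick check on a node is to look at the kernel wait channel and kernel stack of the stuck processes; a minimal sketch (with `<PID>` standing in for one of the stuck job PIDs):

```
# List processes in uninterruptible sleep together with the kernel function
# they are waiting in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Dump the kernel stack of one stuck process (needs root); frames such as
# rpc_wait_bit_killable or nfs_wait_on_request typically point at a hung
# NFS mount
cat /proc/<PID>/stack
```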

We have rolled back the kernel to version 5.10.140, but the problem remains.

Do you have any ideas?

  • Can you share the output of `iostat -x -k 1`, `nfsiostat 1` and `nfsstat -s` taken on the NFS server when you have jobs in `D` state? – shodanshok Apr 05 '23 at 14:24
  • Thank you for your answer! We are currently running a RAID check on the storage to make sure the disks are not to blame, but I plan to collect that output as soon as the RAID array is available again. – Grégory Hare Apr 06 '23 at 12:54

0 Answers