I am trying to start slurmd.service with the commands below, but it does not stay running. I would be grateful if you could help me resolve this issue!
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
This is the content of the slurmd.service unit file:
cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
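If I understand Type=forking correctly, systemd tracks the daemon through the PIDFile declared above, which matches the "Can't open PID file" message in the slurmd status further down. As a quick sanity check (paths taken from my setup), the unit file and slurm.conf agree on where that PID file should be written:
$ grep PIDFile /usr/lib/systemd/system/slurmd.service
PIDFile=/var/run/slurmd.pid
$ grep -i SlurmdPidFile /etc/slurm/slurm.conf
SlurmdPidFile=/var/run/slurmd.pid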
and this is the status of the node:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpucompute* up infinite 1 drain fwb-lab-tesla1
$ sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory root 2020-09-28T16:46:28 fwb-lab-tesla1
$ sinfo -Nl
Thu Oct 1 14:00:10 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
fwb-lab-tesla1 1 gpucompute* drained 32 32:1:1 64000 0 1 (null) Low RealMemory
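The drain reason is Low RealMemory, which as far as I know means slurmd reported less memory than the RealMemory configured for the node in slurm.conf (64000, see below). To compare, the configuration slurmd itself detects on the node can be printed with (output not included here):
$ slurmd -C
which shows a NodeName line with the detected CPUs and RealMemory.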
Here are the contents of slurm.conf:
$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
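For comparison with RealMemory=64000 on the NodeName line above, the memory the OS actually sees on the node can be checked with standard tools, e.g. (output not included here):
$ free -m
$ grep MemTotal /proc/meminfo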
There is no slurmd.pid in /var/run. It appears there briefly after the system starts, but it is gone again after a few minutes.
$ ls /var/run/
abrt cryptsetup gdm lvm openvpn-server slurmctld.pid tuned
alsactl.pid cups gssproxy.pid lvmetad.pid plymouth sm-notify.pid udev
atd.pid dbus gssproxy.sock mariadb ppp spice-vdagentd user
auditd.pid dhclient-eno2.pid httpd mdadm rpcbind sshd.pid utmp
avahi-daemon dhclient.pid initramfs media rpcbind.sock sudo vpnc
certmonger dmeventd-client ipmievd.pid mount samba svnserve xl2tpd
chrony dmeventd-server lightdm munge screen sysconfig xrdp
console ebtables.lock lock netreport sepermit syslogd.pid xtables.lock
crond.pid faillock log NetworkManager setrans systemd
cron.reboot firewalld lsm openvpn-client setroubleshoot tmpfiles.d
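To check whether a slurmd process is still alive even when the PID file is missing, plain pgrep works (nothing Slurm-specific; output not shown):
$ pgrep -a slurmd
Interestingly, the slurmd status further down still lists a /usr/sbin/slurmd process (PID 2986) in the service's cgroup although the unit is marked failed.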
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
Main PID: 1492 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─1492 /usr/sbin/slurmctld
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.
I try to start slurmd.service, but it goes back to the failed state after a few minutes:
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
CGroup: /system.slice/slurmd.service
└─2986 /usr/sbin/slurmd
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
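To see those messages without the ellipses I can use journalctl for the unit, or the slurmd log file that slurm.conf points at (SlurmdLogFile=/var/log/slurm/slurmd.log):
$ journalctl -u slurmd -l --no-pager
$ tail -n 50 /var/log/slurm/slurmd.log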
Log output of starting slurmd:
[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use
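The last two lines suggest something is already bound to the slurmd port 6818, which would also explain why the newly started slurmd exits before systemd sees its PID file. A way to check which process is holding the port (assuming ss and fuser are available on this machine; I have not pasted the output):
$ ss -tlnp | grep 6818
$ fuser -v 6818/tcp
If that turns out to be a leftover slurmd (for example PID 2986 from the status above), I assume it has to be stopped before systemctl start slurmd can succeed.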