Questions tagged [slurm]

Slurm Workload Manager, or Slurm (formerly known as Simple Linux Utility for Resource Management, or SLURM), is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

45 questions
0 votes
1 answer

What does "CPU Minutes" mean exactly?

I'm trying to report cluster utilization in Slurm, but I don't understand the metric CPU Minutes. [root@XXXX]# sreport cluster Utilization Start=2018-12-01…
m4hmud
0 votes
0 answers

SLURM, SSH, and NOHUP Behaviour

I am the administrator of a cluster running on CentOS and using SLURM to send jobs from a login node to compute nodes. Recently, a user complained about some unexpected behaviour with their jobs. If a user starts a job with srun and then logs out,…
0 votes
1 answer

Why does the login node connect to external networks but allocated compute node fail in Slurm-GCP?

I've noticed that connecting to the internet from the allocated compute node via Slurm-GCP keeps failing. For example, using wget from the login node works successfully: [me@gcp-login0 ~]$ wget…
0 votes
0 answers

SLURM / NFS based computing cluster with disk uninterruptible sleep issues (state: D)

Context: We have a computing cluster based on 7 servers running Debian 11: a storage server (HDD NAS, ~500 TB, RAID5, LVM); a front-end server running SLURM and nfs-common; and 5 nodes on which the storage is mounted through NFS. When business users run SLURM…
0 votes
1 answer

Setting up Slurm with 2 different nodes and 2 different partitions on 1 physical server

I have a requirement to set up Slurm on one physical server, with 2 different partitions and 2 main nodes, so I need: partition1, which needs to have node1 and to be used by group1 users; partition2, which needs to have node2 and needs to…
biplab
0 votes
1 answer

Docker Compose + Ubuntu:22.04 - Unable To Create cgroup... Read-only file system

I'm playing around with Docker Desktop (4.16.3) and Slurm. When I run slurmd, I get the following error: common_cgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/freezer/slurm' : Read-only file system unable to build…
Black Dynamite
0 votes
0 answers

Use more CPU cores using code in PyTorch that also uses GPU

I'm trying to successfully run code in PyTorch that uses DataLoader. It is possible to configure the DataLoader to load data using several processes (which speeds up data loading a lot), via the use of the num_workers argument, configuring it with a…
Marco
0 votes
0 answers

Troubleshooting slurm e-mail settings

I am trying to set up a Slurm installation and have reached the e-mail stage. So far I do not receive any mail. I have a working setup using msmtp-mta and msmtp. When I batch a script, the slurmctld log shows email msg to **@**: Slurm…
hfhc2
0 votes
0 answers

Slurm jobs undesirably get access to all threads

I have one Ryzen R9 5950x CPU (16 cores/32 threads), one Xeon Phi 7120p card, and a partition/node in slurm.conf defined as: NodeName=mic0 RealMemory=15000 Sockets=1 CoresPerSocket=61 ThreadsPerCore=4 State=UNKNOWN PartitionName=compute Nodes=mic0…
Igor Popov
0 votes
0 answers

slurm_load_partitions: Unable to contact slurm controller

I had this problem using the slurm command: Unable to contact slurm controller. This is part of my slurm.conf: ######################################## # YHPC…
zhen
0 votes
0 answers

Setting up Slurm on a cluster

My IT admin has set up a cluster with 3 nodes, which is administered via Windows Server. VMs are hosted via Hyper-V, including an Ubuntu VM to which a substantial portion of the cluster's resources has been allocated. Does anyone have any…
0 votes
0 answers

Some processes are in unkillable sleeping state while I/O is low

I am the system administrator of an Arch Linux-based workstation. Our workstation uses Slurm as the workload manager and consists of one master machine and 4 other computation nodes. Over the past few months, we have observed that processes on some nodes are…
0 votes
1 answer

Update SLURM node state prior/after playbook execution

I would like to automatically set the state of a node in a SLURM cluster before and after running my Ansible playbook (from idle to drained, then back to idle once the playbook has been applied). The scontrol command that is required for this is only available…
Patrick
0 votes
1 answer

Slurm cluster in Google Cloud - how do I attach a Filestore instance

If one clones the slurm-gcp project and deploys the stock cluster defined in there, things work well. What I would like to do is use a GCP Filestore instance to provide (more) persistent storage to the cluster. Part of the cluster deployment is…
bolind
-1 votes
1 answer

Using HPC managers like Slurm on multiple servers in LAN

I have access to a group of servers connected with a 1Gb LAN, and each of them has 40+ cores and Ubuntu OS. They all have a common NAS. I installed SLURM on a few of them and configured it so that each server is both a control and a compute node,…