
I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management, I just want to test some simple examples. I cannot install SLURM in the usual way, and I am wondering if there are other options. Other things I have tried:

  • A Docker image. Unfortunately, `docker pull agaveapi/slurm; docker run agaveapi/slurm` gives me errors:

    /usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
      'Supervisord is running as root and it is searching '
    2017-10-29 15:27:45,436 CRIT Supervisor running as root (no user in config file)
    2017-10-29 15:27:45,437 INFO supervisord started with pid 1
    2017-10-29 15:27:46,439 INFO spawned: 'slurmd' with pid 9
    2017-10-29 15:27:46,441 INFO spawned: 'sshd' with pid 10
    2017-10-29 15:27:46,443 INFO spawned: 'munge' with pid 11
    2017-10-29 15:27:46,443 INFO spawned: 'slurmctld' with pid 12
    2017-10-29 15:27:46,452 INFO exited: munge (exit status 0; not expected)
    2017-10-29 15:27:46,452 CRIT reaped unknown pid 13)
    2017-10-29 15:27:46,530 INFO gave up: munge entered FATAL state, too many start retries too quickly
    2017-10-29 15:27:46,531 INFO exited: slurmd (exit status 1; not expected)
    2017-10-29 15:27:46,535 INFO gave up: slurmd entered FATAL state, too many start retries too quickly
    2017-10-29 15:27:46,536 INFO exited: slurmctld (exit status 0; not expected)
    2017-10-29 15:27:47,537 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2017-10-29 15:27:47,537 INFO gave up: slurmctld entered FATAL state, too many start retries too quickly

  • This guide to start a SLURM VM via Vagrant. I tried, but copying over my munge key timed out.

    sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
    ssh: connect to host server port 22: Connection timed out
    lost connection
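
A default Vagrant box is usually reachable only through a forwarded SSH port on localhost, not by the guest's hostname, which would explain the timeout. A rough sketch of the alternative, assuming a single-machine Vagrantfile (the real port and key path come from the `vagrant ssh-config` output):

vagrant ssh-config                        # prints the HostName, Port and IdentityFile Vagrant actually uses
scp -P 2222 -i .vagrant/machines/default/virtualbox/private_key \
    /etc/munge/munge.key vagrant@127.0.0.1:/home/vagrant/   # substitute the values reported above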

landau
  • Will: I always used it plain up, and liked that. I haven't looked at your other question -- worst case I sometimes locally rebuilt the slurm packages. I would recommend leaning on the Debian / Ubuntu resources. I may be able to help you off-line, but I am currently traveling. – Dirk Eddelbuettel Oct 29 '17 at 16:52
  • Thanks, Dirk. It would certainly be best to use SLURM natively if it will install. Do you know of any guides to set up a `slurm.conf` that lets the host machine also be a worker node? – landau Oct 29 '17 at 17:39
  • I'll copy and paste tomorrow when back at work. It is pretty straightforward as I recall, but it has been a while. The deb package has a helper script too... – Dirk Eddelbuettel Oct 29 '17 at 17:56
  • Check this post: https://stackoverflow.com/questions/40695348/running-multiple-worker-daemons-slurm. Hope this helps. – Bub Espinja Oct 30 '17 at 07:52

2 Answers


I would still prefer to run SLURM natively, but I caved and spun up a Debian 9.2 VM. See here for my efforts to troubleshoot a native installation. The directions here worked smoothly, but I needed to make the following changes to slurm.conf. Below, Debian64 is the hostname, and wlandau is my user name.

  • ControlMachine=Debian64
  • SlurmUser=wlandau
  • NodeName=Debian64

Here is the complete slurm.conf. An analogous slurm.conf did not work on my native Ubuntu 16.04.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=Debian64
#ControlAddr=
#BackupController=
#BackupAddr=
# 
AuthType=auth/munge
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge
#DisableRootJobs=NO 
#EnforcePartLimits=NO 
#Epilog=
#EpilogSlurmctld= 
#FirstJobId=1 
#MaxJobId=999999 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
#LaunchType=launch/slurm 
#Licenses=foo*4,bar 
#MailProg=/usr/bin/mail 
#MaxJobCount=5000 
#MaxStepCount=40000 
#MaxTasksPerNode=128 
MpiDefault=none
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags= 
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
#RebootProgram= 
ReturnToService=1
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=wlandau
#SlurmdUser=root 
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree 
#TmpFS=/tmp 
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=0 
# 
# 
# TIMERS 
#BatchStartTimeout=10 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=0 
#HealthCheckProgram= 
InactiveLimit=0
KillWait=30
#MessageTimeout=10 
#ResvOverRun=0 
MinJobAge=300
#OverTimeLimit=0 
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0
# 
# 
# SCHEDULING 
#DefMemPerCPU=0 
FastSchedule=1
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
# 
# 
# JOB PRIORITY 
#PriorityFlags= 
#PriorityType=priority/basic 
#PriorityDecayHalfLife= 
#PriorityCalcPeriod= 
#PriorityFavorSmall= 
#PriorityMaxAge= 
#PriorityUsageResetPeriod= 
#PriorityWeightAge= 
#PriorityWeightFairshare= 
#PriorityWeightJobSize= 
#PriorityWeightPartition= 
#PriorityWeightQOS= 
# 
# 
# LOGGING AND ACCOUNTING 
#AccountingStorageEnforce=0 
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags= 
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none 
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile= 
#SlurmSchedLogLevel= 
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
NodeName=Debian64 CPUs=1 RealMemory=744 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN 
PartitionName=debug Nodes=Debian64 Default=YES MaxTime=INFINITE State=UP
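
With this file in place as /etc/slurm-llnl/slurm.conf on the VM, a quick smoke test looks roughly like this (a sketch, assuming the Debian slurm-llnl packages with their systemd units and a running munge daemon):

sudo systemctl restart slurmctld slurmd   # restart the controller and the node daemon after editing slurm.conf
sinfo                                     # the debug partition should show Debian64 as idle
srun -N1 hostname                         # run a trivial single-node job
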
landau
  • I do not see, or cannot tell from what you posted, how this differs from the default. As such, this is not all that helpful. We understand what a username and hostname are. What else, if anything, needed changing? What, if anything, would constitute a bug in the package or its documentation, and what have you done to communicate that to the package maintainer? – Dirk Eddelbuettel Nov 01 '17 at 11:57
  • Hmm... maybe I should have emphasized the virtualization instead (VirtualBox and Debian 9.2). Like you said before, there are known SLURM issues on Ubuntu where Debian is unaffected, and that is what I think I experienced. Also, I had a surprising amount of trouble finding explicit setup instructions that allow the host to be the same as a node. In any case, for the Ubuntu issue, I emailed the slurm-dev mailing list earlier in the week, but have not heard back. – landau Nov 01 '17 at 17:12
  • There is no issue with host==client. I had that in all my use cases (mostly on Ubuntu). Make sure /etc/hosts has your name so that your machine is known by names other than localhost. – Dirk Eddelbuettel Nov 01 '17 at 17:23

So ... we have an existing cluster here but it runs an older Ubuntu version which does not mesh well with my workstation running 17.04.

So on my workstation, I just made sure slurmctld (the backend) and slurmd were installed, and then set up a trivial slurm.conf with

ControlMachine=mybox
# ...
NodeName=DEFAULT CPUs=4 RealMemory=4000 TmpDisk=50000 State=UNKNOWN
NodeName=mybox CPUs=4 RealMemory=16000

after which I restarted slurmctld and then slurmd. Now all is fine:

root@mybox:/etc/slurm-llnl$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
demo         up   infinite      1   idle mybox
root@mybox:/etc/slurm-llnl$ 

This is a degenerate setup; our real one has a mix of dev and prod machines and appropriate partitions. But this should answer your "can backend really be client" question. Also, my machine is not really called mybox, but that is not really pertinent to the question in any case.

This is Ubuntu 17.04, all stock, with munge for authentication (which is the default anyway).
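
Since munge trouble showed up in both of the question's attempts, it is worth a quick sanity check that the daemon is up and the key round-trips (a sketch, using the test utilities that ship with the stock munge package):

systemctl status munge    # munged must be running before slurmctld and slurmd start
munge -n | unmunge        # encode and decode a credential locally; it should report Success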

Edit: To wit:

me@mybox:~$ COLUMNS=90 dpkg -l '*slurm*' | grep ^ii
ii  slurm-client     16.05.9-1ubun amd64         SLURM client side commands
ii  slurm-wlm-basic- 16.05.9-1ubun amd64         SLURM basic plugins
ii  slurmctld        16.05.9-1ubun amd64         SLURM central management daemon
ii  slurmd           16.05.9-1ubun amd64         SLURM compute node daemon
me@mybox:~$
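
For completeness: on a stock system those packages come straight from the distribution archive, so something like the following should pull them in (a sketch; slurm-wlm is the metapackage in the Debian/Ubuntu archives, and the version you get depends on the release, 16.05.9 on 17.04 as shown above):

sudo apt-get install slurm-wlm    # metapackage that pulls in slurmctld, slurmd and slurm-client
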
Dirk Eddelbuettel
  • So it really should work in Ubuntu. This makes me want to try again. Did you install SLURM using `apt-get` or build it yourself? For me, `apt-get` installs 15.08.7, but 16.05.11, 17.02.9 and 17.11.0rc2 can be downloaded. – landau Nov 01 '17 at 17:50
  • I used to locally build (for that aforementioned reason of the directory permissions; this may now be accommodated upstream); what I show above is stock 17.04. See edited answer. – Dirk Eddelbuettel Nov 01 '17 at 17:55
  • I just reinstalled my OS (for unrelated reasons), and SLURM actually works this time. I ended up going with Damien Francois' answer [here](https://stackoverflow.com/questions/46966876/installing-emulating-slurm-on-an-ubuntu-16-04-desktop-slurmd-fails-to-start). – landau Dec 18 '17 at 22:40