
I administer a Slurm cluster with many users. The cluster currently behaves normally for every user except one. For this one user, every attempt to run a command through Slurm gets killed after 20-25 seconds.
The following minimal example reproduces the error:

$ sudo -u <the_user> srun --pty sleep 25
srun: job 110962 queued and waiting for resources
srun: job 110962 has been allocated resources
srun: Force Terminated job 110962
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 110962.0 ON <node> CANCELLED AT 2021-04-09T16:33:35 ***
srun: error: <node>: task 0: Terminated

When this happens, I find this line in the slurmctld log:

_slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid <the_users_uid>
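
For reference, this is roughly how I dug that entry out of the log and cross-checked it against accounting (the log path is our local one from above; the job ID is the anonymized one from the example):

# find kill requests in the controller log and see which uid issued them
$ sudo grep 'REQUEST_KILL_JOB' /var/log/slurm/slurmctld.log | grep '<the_users_uid>'
# confirm how the job ended according to accounting
$ sacct -j 110962 --format=JobID,State,ExitCode,Elapsed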

It only happens for '<the_user>' and not for any other user that I know of. This very similar but shorter-running example works fine:

$ sudo -u <the_user> srun --pty sleep 20
srun: job 110963 queued and waiting for resources
srun: job 110963 has been allocated resources

Note that when I run 'srun --pty sleep 20' as myself, srun does not print the two 'srun: job ...' lines at all. This seems to me a further indication that srun is subject to different settings for '<the_user>'.
All settings that I have been able to inspect appear identical for '<the_user>' and for other users. In particular, 'MaxWall' is not set for this user, nor for any other user. Other users belonging to the same Slurm account do not experience this problem.
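
These are roughly the kinds of checks I ran to compare settings ('<other_user>' stands for any unaffected user; exact format fields may vary with Slurm version):

# compare association limits (QOS, MaxWall, MaxJobs) between the affected user and another user
$ sacctmgr show assoc where user=<the_user> format=User,Account,Partition,QOS,MaxWall,MaxJobs
$ sacctmgr show assoc where user=<other_user> format=User,Account,Partition,QOS,MaxWall,MaxJobs
# check whether any QOS imposes a wall-time limit
$ sacctmgr show qos format=Name,MaxWall
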
This question sounds related, but the explanation there does not seem to apply here.

What could be causing this?

Update - the plot thickens

When this unfortunate user's jobs get allocated, I see this message in '/var/log/slurm/slurmctld.log':

sched: _slurm_rpc_allocate_resources JobId=111855 NodeList=<node>

and shortly after, I see this message:

select/cons_tres: common_job_test: no job_resources info for JobId=110722_* rc=0

Job 110722_* is an array job by another user that is pending due to 'QOSMaxGRESPerUser'. One task of this array job (110722_57) eventually ends up taking over job 111855's CPU cores when 111855 gets killed, which leads me to believe that 110722_57 causes 111855 to be killed. However, 110722_57 remains pending afterwards.
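
For reference, this is roughly how I inspected the pending array job (job IDs are the anonymized ones above):

# show state and pending reason for the array job (%r prints the reason, e.g. QOSMaxGRESPerUser)
$ squeue -j 110722 --format="%i %u %T %r"
# detailed view of the specific pending array task
$ scontrol show job 110722_57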

Some of the things I fail to understand here are:

  • Why would a pending job kill another job, yet remain pending afterwards?
  • How would a pending job even have privileges to kill another job in the first place?
  • Why would this only affect '<the_user>'s jobs but not those of other users?

None of this is intended behavior. I am guessing it must be caused by some setting specific to '<the_user>', but I cannot figure out what it is. If we admins somehow caused such a setting, it was unintended.

Update 2

The problem has magically disappeared and can no longer be reproduced.

NB: some details have been anonymized as <something> above.

Thomas Arildsen
    This problem has magically disappeared and I can no longer reproduce it. The user tried running a Slurm job with the `-p` option (specifying our only partition 'batch'). We usually don't use `-p` since we have only one partition. Since then we have been unable to reproduce the described problem. I have no idea why. – Thomas Arildsen Apr 26 '21 at 21:04

1 Answer


I had the same problem for days and tried various things; with the -p option, the random killing problem is magically gone.
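
In case it helps anyone else, the workaround simply amounts to naming the partition explicitly at submission (here 'batch', the only partition mentioned in the question; substitute your own):

$ srun -p batch --pty sleep 25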

Thank you, Thomas Arildsen, for sharing your solution in the comment.

Seok