
I have a complex Fortran MPI application running under a Torque/Maui system. When I run the application it produces a single huge output file (~20 GB). To avoid that, I wrote a RunJob script that breaks the run into 5 pieces, each producing a smaller output that is much easier to handle.
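Schematically, the end of each piece's job script does something like this (simplified; PIECE, the script name, and the file names here are placeholders, not my real ones):

    # tail of RunJob.pbs (simplified)
    cd $PBS_O_WORKDIR
    mpirun ./my_mpi_app input_piece_${PIECE} > output_piece_${PIECE}.dat

    # resubmit the next piece -- this is the qsub call that fails
    if [ "${PIECE}" -lt 5 ]; then
        qsub -v PIECE=$((PIECE + 1)) RunJob.pbs
    fi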

At the moment my RunJob script stops correctly at the end of the first piece and produces the correct output. However, when it tries to resubmit itself for the next piece, I get the following error message:

qsub: Bad UID for job execution MSG=ruserok failed validating username/username from compute-0-0.local

I know that this problem comes from the fact that, by default, Torque/Maui does not allow a compute node to submit a job.

In fact, when I type this:

qmgr -c 'l s' | grep allow_node_submit

I get:

allow_node_submit = False

I do not have an administrator account, just a regular user one.

My questions are:

  1. Is it possible to set allow_node_submit = true in qmgr as a regular user? If so, how? (I guess not.)
  2. If not, is there another way to work around this? How?

All the best.

Quim

1 Answer


No, an unprivileged user can't change the settings of the queuing system. The usual reason for not allowing resubmission from the compute nodes is a good one: it protects the cluster and all of its users from someone accidentally (or otherwise) submitting a script which fails quickly and resubmits itself once - or, much worse, more than once - quickly flooding the scheduler and queue and generating the batch-queue equivalent of a fork bomb. Even with such restrictions, we've had people accidentally submit tens of thousands of jobs at once due to scripting errors.

The usual workaround is to ssh to one of the queue submission nodes and submit the script from there, e.g. at the end of your submission script:

ssh queue-head-node qsub /path/to/new/submission/script

This is how we suggest our users handle it, e.g. here. That will only work if you have password/passphrase-less ssh enabled within the cluster, which is a common (but not universal) practice.
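Applied to a piece-wise run like yours, the tail of the job script might become something like the sketch below; the head-node name, the full path to qsub, and the PIECE counter are assumptions you'd adapt to your site:

    # resubmit the next piece from the head node instead of locally;
    # "queue-head-node" and the qsub path are site-specific assumptions
    if [ "${PIECE}" -lt 5 ]; then
        ssh queue-head-node \
            "cd ${PBS_O_WORKDIR} && /opt/torque/bin/qsub -v PIECE=$((PIECE + 1)) RunJob.pbs"
    fi

Note that the shell ssh starts on the remote side is typically non-interactive and may not have qsub on its PATH, so giving the full path to qsub (as above) is the safe choice.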

Alternatively, if this is the common case of just automatically submitting a series of jobs which continue a run, you can look into how job dependencies are handled at your site and submit a convoy of jobs, each dependent on the successful completion of the previous one, which will then run in order.
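With Torque's qsub, for instance, a dependency is expressed with -W depend=afterok:<jobid>, so a convoy of five pieces could be submitted up front from the head node along these lines (a sketch only; RunJob.pbs and the PIECE variable are placeholders):

    # qsub prints the id of each submitted job; feed it to the next
    # piece's dependency so the pieces run strictly in order
    JOBID=$(qsub -v PIECE=1 RunJob.pbs)
    for PIECE in 2 3 4 5; do
        JOBID=$(qsub -v PIECE=${PIECE} -W depend=afterok:${JOBID} RunJob.pbs)
    done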

Jonathan Dursi
  • Hi @Jonathan ... It worked perfectly! Thanks a lot, really. I had to give the complete path to qsub, even though it is on my PATH. The final command is: ssh username@headnodename /opt/torque/bin/qsub path/to/my/application. Thanks again. I'm interested in your last sentence: "you can look to see how job dependencies are handled at your site". Could you give more information about that? Where should I look? ... all the best. – Quim Aug 29 '14 at 20:31
  • It depends on your version of Torque and how things are set up locally, so it's best to ask your cluster admin, but there's some documentation [here](http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#dependencies). – Jonathan Dursi Aug 29 '14 at 21:17