The problem of submitting jobs on SGE so that they run on complete nodes has been addressed before on this forum. Several solutions have been suggested: one is to configure SGE to allow the option `-l excl=TRUE`; another is to request hard memory or load limits from SGE.

I'm using my university's cluster for my master's thesis. The parallel environment openmpi is configured with the fill-up strategy. The nodes of the cluster typically have 16 or 20 cores each. The problem is that some users, instead of launching computations with a number of cores that is a multiple of 16 (or 20), launch their jobs with an arbitrary number of cores. As a result, when I launch a job with `-pe openmpi 16`, SGE sometimes reserves the processors on 3 nodes (e.g. 6 + 1 + 10), which makes the computations very slow.

I asked the administrator to configure the cluster to allow `-l excl=TRUE`, but he refused to change the configuration before running tests (I don't know how long that will take).

Now I have a new idea that may give me a result similar to `-l excl=TRUE` without changing the cluster configuration:

  1. Write a script that scans the queue and estimates the number of cores to request from SGE so that it fills all the partially used nodes and leaves only completely free nodes.
  2. Launch a fake job with the computed number of cores that waits for a certain amount of time.
  3. Launch my real job (e.g. `-pe openmpi 32`, i.e. 2 × 16).
  4. Delete the fake job so that other users can use its cores.
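
Step 1 boils down to a small piece of arithmetic, which could be sketched as follows. This is only an illustration under stated assumptions: the function name `filler_slots` and the node snapshot are hypothetical, and the per-node (used, total) slot counts are assumed to have been collected already, for example from `qstat -f`.

```python
from typing import Iterable, Tuple


def filler_slots(nodes: Iterable[Tuple[int, int]]) -> int:
    """Count the free slots on partially used nodes.

    Each item is (used_slots, total_slots) for one node. Nodes that are
    completely free or completely full contribute nothing.
    """
    return sum(total - used for used, total in nodes if 0 < used < total)


# Hypothetical snapshot: three partially used nodes, one free, one full.
snapshot = [(6, 16), (1, 16), (10, 20), (0, 16), (16, 16)]

# 10 + 15 + 10 free slots sit on the partially used nodes.
print(filler_slots(snapshot))  # 35
```

The fake job would then be submitted with that slot count. Whether the scheduler really places it on the partially used nodes depends on how it orders hosts, which this sketch does not control, and the queue can change between the scan and the submission.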

Can someone provide an example of such code?

FineUser
  • If you intend scanning free/utilized nodes before submitting, you also could try to request host-queues which also should result in what you want to achieve. `qsub -pe mpi 32 -q all.q@node01,all.q@node02`, provided that both nodes have 16 cores. – Thomas Mar 15 '18 at 17:30
  • @Thomas, the server is full now, I will try your suggestion later. However, do you know if there is a command for displaying hosts with no running jobs (I can't find anything useful in the _qhost_ manual)? – FineUser Mar 15 '18 at 18:06
  • `qstat -f` should report the single node-queues including allocated/free slots. – Thomas Mar 15 '18 at 18:08
  • @Thomas, thanks, I will see how to pick out the free nodes from the displayed table, maybe using awk or another command. I'm actually trying to automate the process: list free nodes -> select e.g. 2 nodes from the list -> submit the job with your command above. – FineUser Mar 15 '18 at 18:13
  • @Thomas `-q all.q@node01,all.q@node02` is not working, the job is automatically queued – FineUser Mar 20 '18 at 20:54
  • @Thomas I found a solution myself: launching jobs with `qsub -l cpu=0` or `cpu=0.1` works perfectly for me. – FineUser Mar 21 '18 at 14:02
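
The workflow discussed in the comments (list free nodes with `qstat -f`, pick two, pass them to `-q`) might be automated roughly as below. This is a sketch, not a tested recipe: it assumes the `qstat -f` rows look like `all.q@node01  BIP  0/0/16  0.01  lx-amd64` (with either `resv/used/tot.` or the older `used/tot.` slot column), the sample table is made up, and `job.sh` and both function names are hypothetical. As noted above, the job stayed queued when this `-q` form was tried on the cluster in question, so it may not work everywhere.

```python
import re
from typing import List

# One queue-instance row: name, qtype, slot column, load_avg, arch, optional states.
ROW = re.compile(
    r"^(\S+@\S+)\s+\S+\s+(?:(\d+)/)?(\d+)/(\d+)\s+\S+\s+\S+(?:\s+(\S+))?\s*$"
)


def free_queue_instances(qstat_f_text: str, slots_needed: int) -> List[str]:
    """Return queue instances with no used or reserved slots and no state flags."""
    free = []
    for line in qstat_f_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or indented job line
        name, resv, used, total, states = match.groups()
        if states:
            continue  # skip instances flagged as disabled, in alarm, etc.
        if int(resv or 0) == 0 and int(used) == 0 and int(total) >= slots_needed:
            free.append(name)
    return free


def host_queue_arg(instances: List[str], count: int) -> str:
    """Join the first `count` instances into a value for `qsub -q`."""
    if len(instances) < count:
        raise ValueError("not enough completely free nodes")
    return ",".join(instances[:count])


# Made-up qstat -f output, used here instead of calling qstat.
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node01                   BIP   0/0/16         0.01     lx-amd64
---------------------------------------------------------------------------------
all.q@node02                   BIP   0/6/16         5.80     lx-amd64
    101 0.55500 solver     alice        r     03/15/2018 17:30:00     6
---------------------------------------------------------------------------------
all.q@node03                   BIP   0/0/16         0.02     lx-amd64
---------------------------------------------------------------------------------
all.q@node04                   BIP   0/0/16         0.01     lx-amd64      d
"""

free = free_queue_instances(SAMPLE, slots_needed=16)
print("qsub -pe openmpi 32 -q " + host_queue_arg(free, 2) + " job.sh")
```

On a real cluster the text would come from running `qstat -f` (for instance through `subprocess.run`), and the column layout should be checked against the installed Grid Engine version before trusting the regular expression.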

1 Answer

Launching jobs with `qsub -l cpu=0` (or `cpu=0.1`) works perfectly for me.

Thomas
FineUser