I am running many repeats of the same job using NumPy on a cluster that uses Sun Grid Engine (SGE) to distribute jobs (StarCluster). Each of my nodes has 2 cores (c3.large on AWS). So say I have 5 nodes, each with 2 cores.

NumPy's matrix operations can use more than one core at a time. What I'm finding is that SGE sends out 10 jobs to run at once, one per core. This causes longer runtimes for the jobs. Looking at htop, it looks like the two jobs on each node are fighting over resources.

How can I tell qsub to distribute 1 job per node, so that when I submit my jobs only 5 run at once, not 10?

bill_e

1 Answer

Step 1: Add a complex value to your cluster. Run

qconf -mc

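Add a line like the following (the columns are name, shortcut, type, relop, requestable, consumable, default, urgency):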

exclusive        excl      INT         <=    YES         YES        0        0

Step 2: For each of your nodes, set a value for that complex.

qconf -rattr exechost complex_values exclusive=1 <nodename>
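With several nodes, the same command can be generated in a loop. This is a minimal sketch, assuming the execution hosts are named node001 through node005 (placeholder names that are not from the original post; `qconf -sel` lists the real ones). The function only prints the commands so they can be reviewed before anything is changed:

```shell
#!/bin/sh
# Dry run: print one qconf command per execution host.
# The host names passed in below are hypothetical placeholders.
print_exclusive_cmds() {
  for node in "$@"; do
    echo "qconf -rattr exechost complex_values exclusive=1 $node"
  done
}

print_exclusive_cmds node001 node002 node003 node004 node005
```

Once the printed commands look right, they can be piped to `sh` to apply them.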

Here we set exclusive to 1. Then, when you launch jobs, request 1 of that resource, e.g.:

qrsh -l exclusive=1 <myjob>
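For detached batch jobs, the same `-l exclusive=1` request can be passed to qsub. The sketch below assumes a job script called job.sh and a repeat count of 3, both hypothetical. It prints the submission commands rather than running them, so it works on a machine without SGE installed:

```shell
#!/bin/sh
# Dry run: print a qsub command for each repeat of a job script.
# "job.sh" and the repeat count are hypothetical placeholders.
print_qsub_cmds() {
  script="$1"
  repeats="$2"
  i=1
  while [ "$i" -le "$repeats" ]; do
    echo "qsub -l exclusive=1 $script"
    i=$((i + 1))
  done
}

print_qsub_cmds job.sh 3
```

With exclusive set to 1 on every node, at most one of these jobs should be scheduled per node at a time, and the rest wait in the queue.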

If you were willing to have 2 jobs per node, you could set that value to 2 in step 2.

EDIT: This is how to configure it per node. You could instead have done it for the entire cluster in step 1 by setting the "default" column to 1.

Finch_Powers
  • "EDIT: This is how to configure it per node. You could have done it for the entire cluster in step 1 by setting the value into the "default" column to 1." This is exactly what I want to do. I tried that, but it still kicked off 10 jobs at once, instead of 5. Do I need to run the "excl" command? – bill_e Feb 26 '16 at 17:32
  • Then when launching jobs, do: "qrsh -l exclusive=1 qsub -e ..." like that..? – bill_e Feb 26 '16 at 17:38
  • No, qsub and qrsh are both tools to launch jobs. qsub detaches, qrsh is interactive. So you can just replace qrsh by qsub in the example. – Finch_Powers Feb 26 '16 at 17:42
  • Maybe the default is not working. Try to set it per node then with the qconf -rattr command. – Finch_Powers Feb 26 '16 at 17:43