
Regarding IPython Parallel: from the documentation and several posts I found on the Internet, I know I can start a controller on one machine and the engines on another through SSH. However, I'd like to use IPython Parallel on an SGE cluster while starting the controller on the local machine and launching the engines through the queue system. (The reason for launching the controller on the local machine is to be able to use local nodes as well.)

On the local machine, I have added c.HubFactory.ip = '*' to ipcontroller_config.py in profile_x. I can start the controller successfully with

ipcontroller --profile=x

and I can also start an engine locally with ipengine and connect to the controller with

from IPython.parallel import Client
c = Client(url_file='/path/to/profile_x/security/ipcontroller-client.json')

Now the question is: how can I launch engines on the cluster such that they are managed by the queue system and connect to the controller on my local machine? So far, I have done the following: I created a new profile, profile_y, on the cluster, copied ipcontroller-engine.json from profile_x to the security folder in profile_y, and modified the configuration files in profile_y as follows:

In ipengine_config.py:

  • c.EngineFactory.ip = '*'
  • c.EngineFactory.sshserver = 'mylocalmachineserver'

In ipcluster_config.py:

  • c.IPClusterEngines.engine_launcher_class = 'SGE'

But when running

ipengine --profile=y

a new engine is created on the node where I am, not through the queue system. I would like to be able to start n engines through the SGE system. I guess I will need to specify a keyfile with the password to connect to my local machine as well. I would be glad if you could help with that.

Moreover, is it possible to "dynamically" connect to engines as they are launched in case not all of them can be created at once due to lack of free slots on the cluster?

Thanks for your help.

Rosa

1 Answer


controller on the same LAN as the engines

The simple case is when the controller is on the same network as the engines, e.g. on a login node or another work node, such that the engines can connect to it. In this case, you will want the following config:

in ipcontroller_config.py, tell the controller to listen on all IPs (see the Caveats section for exceptions to this):

c.HubFactory.ip = '*' # see caveat for cases where '*' may not work

in ipcluster_config.py, tell ipcluster to use SGE to launch engines:

c.IPClusterEngines.engine_launcher_class = 'SGE'
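If your site requires jobs to go to a particular queue or to use a custom submission script, the batch launchers expose config for that as well. A sketch (the queue name and template file here are placeholders; check the comments in your generated ipcluster_config.py for the exact options in your version):

c.SGEEngineSetLauncher.queue = 'myqueue'
c.SGEEngineSetLauncher.batch_template_file = 'sge.engine.template'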

That's about all you should need. Then you can start up with:

ipcluster start

or run the controller manually with

ipcontroller

and bring up engines after the fact, with

ipcluster engines -n 32
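Because engines register with the controller as they come up, a client doesn't have to wait for all of them before connecting; you can watch registrations arrive, which also covers the case where the queue only grants some slots at first. A minimal sketch using the same client API as in the question:

from IPython.parallel import Client
import time

c = Client(url_file='/path/to/profile_x/security/ipcontroller-client.json')
# c.ids lists the engines registered right now; it grows as
# SGE frees slots and more engines connect to the controller
while len(c.ids) < 32:
    time.sleep(5)
print('engines registered:', c.ids)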

controller outside the cluster, with ssh

More complicated is starting the controller outside the cluster's network (e.g. on your laptop) while starting the engines on the cluster. One reason this is harder is that the SGELauncher needs qsub to be a local command, which it probably isn't on your laptop. For this, you need two sets of config: one telling ipcluster to ssh to the cluster and start engines there, and one on the cluster telling it to use SGE.

For this bit, I'm going to assume that the controller machine is ssh-able from the engines.

controller

On the controller, you will want to set the engine SSH server in ipcontroller_config.py:

c.IPControllerApp.engine_ssh_server = 'mylocalmachineserver'

And in ipcluster_config.py, tell local calls to ipcluster to actually invoke ipcluster on the cluster via ssh:

c.IPClusterEngines.engine_launcher_class = 'SSHProxy'
c.SSHProxyEngineSetLauncher.hostname = 'cluster-login-host'

cluster

On the cluster, you will have to create a profile whose ipcluster_config.py contains:

c.IPClusterEngines.engine_launcher_class = 'SGE'

And that should be it.
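With both sides configured, everything can be driven from the local machine; e.g. (engine count arbitrary, add --profile=... if you're not using the default profile):

ipcluster start -n 32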

Starting the cluster

Now, here is what happens when you start the cluster with ipcluster start on mylocalmachineserver:

  • it starts a local ipcontroller listening on localhost, writing the ssh server into the engine connection file
  • it sends the connection files to cluster-login-host
  • it sshes to cluster-login-host and runs ipcluster engines
  • on cluster-login-host, ipcluster picks up the local config and spawns engines with SGE
  • the engines see the engine ssh server and tunnel from localhost on the cluster to localhost on mylocalmachineserver
  • hopefully everything works!

Caveats

On clusters, it's common to have lots of network interfaces, and sometimes only one of them will actually work for engines to connect. If this is the case, it's often easier to specify a specific IP rather than '*', which forces IPython to do some guessing when it tries to make connections. For instance, if you know that eth1 is the network interface on which your nodes can see each other, then using the IP of eth1 may be best. netifaces is a useful library for getting this sort of information:

# in ipcontroller_config.py: bind the Hub to the interface the nodes share
import netifaces

# look up the IPv4 address of that interface (eth1, per the example above)
eth1 = netifaces.ifaddresses('eth1')
c.HubFactory.ip = eth1[netifaces.AF_INET][0]['addr']

Answers to sub-questions below:

c.EngineFactory.ip = '*'

This config is rarely, if ever, necessary, and should never be '*'. It is used to tell ipengine how to connect to the controller when the connection file doesn't provide the right information. Typically, the best solution is to get the connection file right in the first place (via ipcontroller config), rather than to set a value in the engine config.
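For context, that information lives in ipcontroller-engine.json; the relevant fields look roughly like this (an illustrative sketch only, keys vary between IPython versions):

{
  "interface": "tcp://*",
  "location": "mylocalmachineserver",
  "ssh": "mylocalmachineserver",
  ...
}

Here location tells engines which host to connect to, and ssh (if set) tells them to tunnel through that server.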

a new engine [started with ipengine] is created on the node where I am, not through the queue system.

IPClusterEngines config only takes effect when you start engines with ipcluster. If you want to launch one engine with SGE using this config, you would do:

ipcluster engines -n 1

I guess I will need to specify a keyfile with the password to connect to my localmachine as well.

If you need to specify ssh config, you can do it in your ~/.ssh/config. IPython uses the command-line ssh to set up tunnels, so any ssh aliases, keys, etc. will work.
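For example, a typical ~/.ssh/config entry would look like this (the hostname, user, and key path are placeholders):

Host mylocalmachineserver
    HostName mylocalmachineserver.example.org
    User rosa
    IdentityFile ~/.ssh/cluster_key

Since IPython shells out to ssh, the tunnels will pick up this alias and key automatically.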

If your controller machine is on the same network as the engines, you probably don't need to use SSH at all. Typically, one sets c.HubFactory.ip = '*' or one uses an ssh tunnel. The only time to use both is when the Hub is not on the same network as the engines at all, so the engines have to ssh to a machine that is on the controller's network, and that ssh server then connects to the controller on a LAN IP.

minrk