
I am working on parallelizing an sklearn grid search, sweeping three parameters, but I am having trouble refactoring the project to work with IPython.parallel. My current approach has been to create a simple function which:

  • Accepts a ridge parameter
  • Downloads the data set used to train the model
  • Trains the model, saving a score and the resulting model to S3

Does this make sense as an approach to parallelizing the grid search?
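
Concretely, I am imagining a worker function along these lines. This is only a rough sketch: the bucket name, key names, and data layout are placeholders, and it assumes boto, numpy, and scikit-learn are installed on each engine:

    import pickle
    import tempfile

    import boto
    import numpy as np
    from boto.s3.key import Key
    from sklearn.linear_model import Ridge


    def train_ridge(alpha):
        """Train a ridge model for one alpha, score it, and push both to S3.

        The bucket and key names below are placeholders.
        """
        conn = boto.connect_s3()              # uses the AWS credentials on the engine
        bucket = conn.get_bucket('my-bucket')

        # Download the training data onto the engine's local disk.
        with tempfile.NamedTemporaryFile(suffix='.npz') as tmp:
            bucket.get_key('data/train.npz').get_contents_to_filename(tmp.name)
            data = np.load(tmp.name)
            X, y = data['X'], data['y']
            X_val, y_val = data['X_val'], data['y_val']

        # Fit and score the model for this ridge parameter.
        model = Ridge(alpha=alpha).fit(X, y)
        score = model.score(X_val, y_val)

        # Persist the fitted model and its score back to S3.
        with tempfile.NamedTemporaryFile(suffix='.pkl') as tmp:
            pickle.dump({'alpha': alpha, 'score': score, 'model': model}, tmp)
            tmp.flush()
            key = Key(bucket)
            key.key = 'models/ridge_alpha_%s.pkl' % alpha
            key.set_contents_from_filename(tmp.name)

        return alpha, score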

If so, is there any way to share code between my local machine and the remote engines?

For example, I have a source tree containing a number of different modules:

/exploration
    /log_regression/
    /log_regression/experiments.py
    /log_regression/make_model.py
    /linear_regression/
    /linear_regression/experiments.py
    /linear_regression/make_model.py
    /linear_regression/parallel.py

Using StarCluster, I have deployed a cluster on EC2 and want to parallelize running an sklearn grid search over ridge parameters. However, I have found no easy way to share all of my local modules with the remote engines. Is there a pattern for doing this, or how should I restructure my thinking?
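
What I would like to be able to do from my local machine is roughly the following, where /home/sgeadmin/exploration stands in for wherever the source tree ends up on the nodes, train_ridge is the hypothetical worker sketched above (assumed to live in linear_regression/make_model.py), and the connection-file and key paths are placeholders:

    from IPython.parallel import Client

    # Connect to the StarCluster-managed controller; the connection file and
    # SSH key paths are placeholders for whatever StarCluster generated.
    rc = Client('/path/to/ipcontroller-client.json', sshkey='/path/to/mykey.rsa')
    dv = rc[:]  # a DirectView over every engine

    # Make the synced source tree importable on each engine, then import it there.
    dv.execute("import sys; sys.path.insert(0, '/home/sgeadmin/exploration')",
               block=True)
    dv.execute("from linear_regression import make_model", block=True)

    # Import the same module locally (assumes this script runs from the top of
    # the local source tree) so its functions can be handed to map().
    from linear_regression import make_model

    # Fan the ridge sweep out over the engines.
    lview = rc.load_balanced_view()
    alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
    results = lview.map_sync(make_model.train_ridge, alphas)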

Cory Dolphin
    Have you tried sharing your drive with your cluster nodes? StarCluster supports NFS out of the box. – Finch_Powers Mar 03 '14 at 22:41
  • I do not see an easy way to mount a local folder as NFS, but using this insight, I just rsync'd the directory onto the sgeadmin home directory, which is the default ipcluster working directory. – Cory Dolphin Mar 04 '14 at 17:13
  • That said, it seems like the code doesn't really synchronize very well. When running things using IPython.parallel's client, they seem to import older versions of the modules, which may be due to the way imports occur. I will attempt to force a reimport. – Cory Dolphin Mar 05 '14 at 20:02
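
Following up on that last comment, one way to force the engines to pick up re-synced code is to reload the modules via a DirectView. A minimal sketch, assuming Python 2 engines (where reload is a builtin) and the same Client arguments as in the sketch above:

    from IPython.parallel import Client

    rc = Client()   # connect with the same arguments as above
    dv = rc[:]

    # After re-syncing the source tree, re-import the module on every engine so
    # it picks up the new code (Python 2's builtin reload; use importlib.reload
    # on Python 3).
    dv.execute("import linear_regression.make_model as m; reload(m)", block=True)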

1 Answer


If it's a matter of deploying the code onto several nodes, rather than of designing your code for parallel processing, then you might consider making your code accessible through a source code management server (git or mercurial) local to your network, and then scripting the deployment: a utility that connects to all nodes prior to launching your processing and prepares the working environment, which of course involves checking out the most recent version of your code along with the necessary dependencies. Assuming you're using a unix-like OS, there are some Python utilities to help with this:

  • virtualenv, for self-contained Python environments with access to many Python libraries through pip
  • paramiko, for scripting SSH connections and shell interaction (see the sketch after this list)
  • vcstools, a library that abstracts common source-control operations across several SCMs (svn, git, mercurial, ...)
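
As an illustration, here is a rough paramiko sketch of such a deployment step; the host names, user, key path, repository location, and virtualenv path are all placeholders to adapt to your own setup:

    import paramiko

    # Placeholder values: replace with your node hostnames, user, key and paths.
    NODES = ["node001", "node002", "node003"]
    USER = "sgeadmin"
    KEY_FILE = "/path/to/mykey.rsa"
    DEPLOY_CMD = (
        "cd /home/sgeadmin/exploration && "
        "git pull && "
        "~/venv/bin/pip install -r requirements.txt"
    )

    for host in NODES:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=USER, key_filename=KEY_FILE)
        try:
            # Update the checkout and its dependencies on this node.
            stdin, stdout, stderr = client.exec_command(DEPLOY_CMD)
            exit_status = stdout.channel.recv_exit_status()  # wait for completion
            if exit_status != 0:
                print("deploy failed on %s: %s" % (host, stderr.read()))
        finally:
            client.close()

You would run something like this (or its equivalent built on vcstools) before launching the grid search, so every engine imports the same revision of the code.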

Also, if you don't want to go through the hassle of setting up a source code management server from scratch, you can host your code on GitHub; if you don't want your code to be publicly available, you can go with Bitbucket (which offers free private repositories and a choice between git and mercurial).

b2Wc0EKKOvLPn