I am working on parallelizing an sklearn grid search, sweeping three parameters, but I am having trouble refactoring the project to work with IPython.parallel. My current approach is to create a simple function (sketched below) which:
- Accepts a ridge parameter
- Downloads the training data set
- Trains the model, then saves the score and the fitted model to S3
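
Roughly, I'm picturing something like this. The bucket and key names are placeholders, and I've assumed boto3 for S3 access, a `.npz` training file, and 5-fold cross-validation for the score; the imports live inside the function so the engines only need the libraries installed, not my local modules:

```python
def train_one_alpha(alpha, bucket="my-training-bucket", data_key="train.npz"):
    """Fit Ridge(alpha) on the data set stored in S3 and push the result back."""
    # Imports live inside the function so a shipped task only needs the
    # libraries installed on the engines, not my local modules.
    import pickle
    import tempfile

    import boto3
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    s3 = boto3.client("s3")

    # Pull the training set down to the engine's local disk.
    with tempfile.NamedTemporaryFile(suffix=".npz") as tmp:
        s3.download_file(bucket, data_key, tmp.name)
        data = np.load(tmp.name)
        X, y = data["X"], data["y"]

    # Score and fit this single point of the ridge grid.
    model = Ridge(alpha=alpha)
    score = cross_val_score(model, X, y, cv=5).mean()
    model.fit(X, y)

    # Persist the fitted model so the result survives the engine going away.
    with tempfile.NamedTemporaryFile(suffix=".pkl") as tmp:
        pickle.dump(model, tmp)
        tmp.flush()
        s3.upload_file(tmp.name, bucket, "models/ridge_alpha_%s.pkl" % alpha)

    return alpha, score
```
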
Does this make sense as an approach to parallelizing the grid search?
If so, is there any way to share code between my local machine and the remote engines?
For example, I have a source tree containing a number of different modules:
```
/exploration/
/log_regression/
/log_regression/experiments.py
/log_regression/make_model.py
/linear_regression/
/linear_regression/experiments.py
/linear_regression/make_model.py
/linear_regression/parallel.py
```
Using StarCluster, I have deployed a cluster on EC2 and want to parallelize the sklearn grid search over ridge parameters. However, I have found no easy way to share all of my local modules with the remote engines. Is there a pattern for doing this, or how should I restructure my thinking?
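
The best I can come up with is the sketch below: as far as I understand, StarCluster NFS-shares /home across the nodes by default, so I could rsync the source tree to a path under the shared home directory and prepend it to `sys.path` on every engine before mapping the task. The profile name, the shared path, and the alpha grid are all placeholders:

```python
import sys

from IPython.parallel import Client

rc = Client(profile="mycluster")        # placeholder profile name
dview = rc[:]
lview = rc.load_balanced_view()

# Make the shared source tree importable on every engine (and locally).
shared_src = "/home/sgeadmin/project"   # placeholder path on the NFS share
dview.execute("import sys; sys.path.insert(0, %r)" % shared_src)
sys.path.insert(0, shared_src)

from linear_regression.parallel import train_one_alpha

# Sweep the ridge parameters across the engines with a load-balanced view.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
results = lview.map(train_one_alpha, alphas)
for alpha, score in results.get():     # get() blocks until all tasks finish
    print(alpha, score)
```

Because `train_one_alpha` does its imports inside the function body, the engines only need the libraries installed plus the path on the NFS share; nothing from my local machine has to be installed on them. Is that roughly the right idea?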