6

I have a large set of sklearn pipelines that I'd like to build in parallel with Dask. Here's a simple but naive sequential approach:

from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

pipe_nb = Pipeline([('clf', MultinomialNB())])
pipe_lr = Pipeline([('clf', LogisticRegression())])
pipe_rf = Pipeline([('clf', RandomForestClassifier())])

pipelines = [pipe_nb, pipe_lr, pipe_rf]  # In reality, this would include many more different types of models with varying but specific parameters

for pl in pipelines:
    pl.fit(X_train, Y_train)

Note that this is not a GridSearchCV or RandomizedSearchCV problem.

In the case of RandomizedSearchCV, I know how to parallelize it with Dask:

import joblib
import scipy.stats
from dask.distributed import Client
from sklearn.model_selection import RandomizedSearchCV

dask_client = Client('tcp://some.host.com:8786')

clf_rf = RandomForestClassifier()
param_dist = {'n_estimators': scipy.stats.randint(100, 500)}
search_rf = RandomizedSearchCV(
                clf_rf,
                param_distributions=param_dist,
                n_iter=100,
                scoring='f1',
                cv=10,
                error_score=0,
                verbose=3,
               )

with joblib.parallel_backend('dask'):
    search_rf.fit(X_train, Y_train)

However, I'm not interested in hyperparameter tuning, and it isn't clear how to modify this code to fit a set of different models, each with its own specific parameters, in parallel with Dask.

slaw

1 Answer

7

dask.delayed is probably the easiest solution here.

from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

pipe_nb = Pipeline([('clf', MultinomialNB())])
pipe_lr = Pipeline([('clf', LogisticRegression())])
pipe_rf = Pipeline([('clf', RandomForestClassifier())])

pipelines = [pipe_nb, pipe_lr, pipe_rf]  # In reality, this would include many more different types of models with varying but specific parameters

# Use dask.delayed instead of a for loop.
import dask.delayed

# dask.delayed(pl) wraps each pipeline in a lazy proxy, so calling .fit
# builds a task instead of running immediately; dask.compute then fits
# all of the pipelines in parallel and returns the fitted results.
pipelines_ = [dask.delayed(pl).fit(X_train, Y_train) for pl in pipelines]
fit_pipelines = dask.compute(*pipelines_)
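
The computed results are ordinary fitted pipelines, so downstream use is unchanged. For example, a quick sanity check against the held-out split (the 'nb'/'lr'/'rf' labels here are just illustrative names, not part of the original code):

# Each element of fit_pipelines is a fitted sklearn Pipeline.
for name, pl in zip(['nb', 'lr', 'rf'], fit_pipelines):
    print(name, pl.score(X_test, Y_test))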
TomAugspurger
  • Will this automatically detect/use the dask.distributed client, assuming that the `dask_client = Client('tcp://some.host.com:8786')` line is executed? Or does it still need to get wrapped in `joblib.parallel_backend('dask')`? – slaw Jan 25 '19 at 17:44
  • This is just using Dask, not joblib, so you don't need to use the `joblib.parallel_backend` context manager (though it won't hurt either). Dask will pick up the most recently created Client and use it as the default (see the sketch after these comments). – MRocklin Jan 25 '19 at 19:29
  • Doing the `dask.compute(*pipelines_)` inside a `joblib.parallel_backend` context would potentially parallelize the training of individual *models* as well, if you specify `n_jobs`. I'm not sure how Dask would handle things if you're trying to parallelize both individual models and across models, but things may work out well. – TomAugspurger Jan 25 '19 at 20:26
  • Thank you for the insights! I also learned something new regarding `dask.delayed(pl).fit(X_train, Y_train)`. It doesn't look like the classic `inc` or `add` examples from the documentation and so I probably would've had some trouble without your help. – slaw Jan 26 '19 at 05:16
  • @slaw, I too have a similar problem at hand and came across this answer which cleared things up a bit for me. Just wondering if you've tried distributed training across a cluster using the same approach and whether dask is apt for the task. – Subrat Sahu May 08 '20 at 16:16
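
A minimal sketch of what MRocklin's comment describes, reusing the placeholder scheduler address from the question: once a distributed Client exists, dask.compute uses it by default, with no joblib context manager required.

import dask
from dask.distributed import Client

# Creating a Client registers it as Dask's default scheduler.
client = Client('tcp://some.host.com:8786')  # placeholder address from the question

pipelines_ = [dask.delayed(pl).fit(X_train, Y_train) for pl in pipelines]
fit_pipelines = dask.compute(*pipelines_)  # tasks run on the cluster workers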