I'm computing learning curves for random forests using sklearn. I need to do this for a lot of different RFs, so I want to use a cluster and Dask to reduce the fitting time of the RFs.

Currently, I have implemented the following:

import joblib  # sklearn.externals.joblib is deprecated; use joblib directly
from dask.distributed import Client, LocalCluster
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve

worker_kwargs = dict(memory_limit="2GB")  # threads are already set via threads_per_worker
cluster = LocalCluster(n_workers=4, threads_per_worker=2, **worker_kwargs)  # processes=False?
client = Client(cluster)

X, Y = ..., ...
estimator = RandomForestRegressor(n_jobs=-1, **rf_params)
cv = ShuffleSplit(n_splits=5, test_size=0.2)
train_sizes = [...]  # 20 different values

with joblib.parallel_backend('dask', scatter=[X, Y]):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)

There are two levels of parallelism here (see the sketch after this list):

  • one for fitting a single RF (n_jobs=-1 on RandomForestRegressor)
  • one for looping over all the training set sizes (n_jobs=-1 on learning_curve)
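To make the two knobs explicit, here is a minimal self-contained sketch; the make_regression data and all parameter values are illustrative only, not my actual setup:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic data, purely for illustration.
X, Y = make_regression(n_samples=2000, n_features=20, random_state=0)

# Inner level: n_jobs on the estimator parallelizes tree building within a single fit.
estimator = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)

# Outer level: n_jobs on learning_curve parallelizes over (CV split, train size) pairs.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    estimator, X, Y, cv=cv, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 20))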

My problem is the following: with the loky backend, it takes around 23s.

[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   22.8s finished

Now, with the dask backend, it takes longer:

[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   30.3s finished

I know that Dask introduces overhead, but I wouldn't expect that to explain all of the difference in running time.
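
For what it's worth, a simple way to time the two backends on exactly the same workload would be something like the following sketch; it assumes the client, estimator, cv, X, Y and train_sizes from the snippet above are still in scope:

import time
import joblib
from sklearn.model_selection import learning_curve

# Run the identical learning_curve workload under each joblib backend and time it.
for backend in ("loky", "dask"):
    start = time.perf_counter()
    with joblib.parallel_backend(backend):
        learning_curve(estimator, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)
    print(f"{backend}: {time.perf_counter() - start:.1f}s")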

Dask is being developed quickly, and I find many different ways of doing the same thing, without knowing which one is up to date.
