
I have Dask distributed implemented with workers on Docker. I start 10 workers with a Docker compose file like so:

 docker-compose up -d --scale worker=10

To run a machine learning training of two models I do the following:

y1 = data1[label1]
X1 = data1[features1] 

y2 = data2[label2]
X2 = data2[features2] 

with joblib.parallel_backend('dask'):
    try:
        model1.fit(X1, y1)
        model2.fit(X2, y2)
    except Exception as e:
        logging.error("There's an error: " + str(e))

Now, I want to run in parallel the two trainings. I could use worker 1 to 5 for training 1 and worker 6 to 10 for training 2. But how to tell Dask distributed to use some workers for one task and other workers for a different task?


1 Answer

This question is relatively high-level, but I'll offer some suggestions that might be of use.

To begin with, your code as written runs most things locally. To execute ML training in parallel you'll want to:

  1. Work on a cluster (local or remote).
  2. Store data in Dask arrays or dataframes.
  3. Use dask.delayed tasks.

OR

  1. Use the client.submit() API.

1. Create a (Local) Cluster

From your code it's not clear whether you've instantiated a client, so perhaps just double-check that you're following the dask-ml docs instructions here:

from dask.distributed import Client
import joblib

client = Client(processes=False)        # create local cluster
# import coiled                         # or connect to remote cluster
# client = Client(coiled.Cluster())     

with joblib.parallel_backend('dask'):
    ...  # your scikit-learn code goes here, e.g. model.fit(X, y)

However, note that the Dask joblib backend to scikit-learn is useful for scaling out CPU-bound workloads. For RAM-bound workloads (larger-than-memory datasets), you'll want to consider one of the dask-ml parallel estimators, such as the one suggested below.

2. Storing Data in Dask Arrays

The minimal code example below sets up two dummy datasets as Dask arrays and instantiates a K-Means clustering algorithm.

import dask_ml.datasets
import dask_ml.cluster

# create dummy datasets
X, y = dask_ml.datasets.make_blobs(n_samples=10000000,
                                   chunks=1000000,
                                   random_state=0,
                                   centers=3)

X2, y2 = dask_ml.datasets.make_blobs(n_samples=10000000,
                                     chunks=1000000,
                                     random_state=3,
                                     centers=3)

# persist predictor sets to cluster memory
X = X.persist()
X2 = X2.persist()

# instantiate KM model
km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)

3. Training in Parallel with Dask.Delayed

The code below runs the training in parallel using the dask.delayed API. It follows the best practices outlined in the Dask docs.

from dask import delayed
import dask

X = delayed(X)
X2 = delayed(X2)

@delayed
def train(model, X):
    return model.fit(X)

# define task graphs (lazy evaluation, no computation triggered)
km1 = train(km, X)
km2 = train(km, X2)

# trigger computation and yield fitted models in parallel
km1, km2 = dask.compute(km1, km2)

4. Training in Parallel with Futures and client.submit

Alternatively, you can train in parallel using the client.submit() API. This immediately returns a future that points to the ongoing computation, and eventually to the stored result. Read more in the docs here.
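
As a rough sketch (not a drop-in implementation), assuming the client from step 1 and the model1/model2, X1/y1, X2/y2 objects from your question, the two fits could be submitted as concurrent futures:

# sketch only: assumes `client`, `model1`, `model2`, X1/y1, X2/y2 already exist
def train(model, X, y):
    model.fit(X, y)
    return model

# scatter the training data to the cluster once, so it isn't re-serialized per task
X1f, y1f, X2f, y2f = client.scatter([X1, y1, X2, y2])

fut1 = client.submit(train, model1, X1f, y1f)  # returns a Future immediately
fut2 = client.submit(train, model2, X2f, y2f)  # both tasks run concurrently

model1, model2 = client.gather([fut1, fut2])   # block until both fits finish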

Based on your question formulation, I've assumed that your main priority here is to have the training run in parallel. This doesn't require manually assigning tasks to specific workers; Dask takes care of the scheduling and optimal distribution across workers for you. In case you are actually interested in manually assigning specific tasks to specific workers, I'd suggest taking a look at this SO answer.
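
If you do later want manual placement, here is a hedged sketch (reusing the names from the snippet above; the worker addresses are placeholders) that passes the workers= argument to client.submit():

# placeholder addresses; list your actual workers with client.scheduler_info()['workers']
workers_a = ['tcp://10.0.0.1:40000', 'tcp://10.0.0.2:40000']
workers_b = ['tcp://10.0.0.3:40000', 'tcp://10.0.0.4:40000']

# restrict each training task to its own subset of workers
fut1 = client.submit(train, model1, X1f, y1f, workers=workers_a)
fut2 = client.submit(train, model2, X2f, y2f, workers=workers_b)
model1, model2 = client.gather([fut1, fut2])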

  • Yes, the main objective is to have the training run in parallel and I'm fine with Dask assigning the tasks to the workers. I have a big Pandas dataframe to train, should I convert it to Dask dataframe before training, so it is distributed through the workers, or it doesn't matter? – ps0604 Oct 12 '21 at 13:11
  • @ps0604 yes, you'll absolutely want to convert your pandas dataframe to a dask dataframe with dask.dataframe.from_pandas(pandas_df). see here https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html?highlight=from_pandas – rrpelgrim Oct 12 '21 at 14:50
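
A minimal sketch of the conversion mentioned in the last comment (npartitions=10 is an arbitrary illustrative choice; tune it to your data size and worker count):

import dask.dataframe as dd

# convert the in-memory pandas dataframe to a distributed dask dataframe
ddf = dd.from_pandas(pandas_df, npartitions=10)
ddf = ddf.persist()   # optionally keep the partitions in cluster memory, as with the arrays above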