This question is relatively high-level, but I'll offer some suggestions that might be of use.
To begin with, your code as written runs most things locally. To execute ML training in parallel you'll want to:

- work on a cluster (local or remote),
- store your data in Dask arrays or DataFrames, and
- use `dask.delayed` tasks *or* the `client.submit()` API.
1. Create a (Local) Cluster
From your code it's not clear whether you've instantiated a client, so perhaps just double-check that you're following the dask-ml docs instructions here:
```python
from dask.distributed import Client
import joblib

client = Client(processes=False)  # create local cluster
# import coiled                   # or connect to a remote cluster
# client = Client(coiled.Cluster())

with joblib.parallel_backend('dask'):
    ...  # your scikit-learn code
```
However, note that the Dask joblib backend for scikit-learn is useful for scaling out CPU-bound workloads. To scale out to RAM-bound workloads (larger-than-memory datasets) you'll want to consider using one of the dask-ml parallel estimators, such as the one suggested below.
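For completeness, here's a minimal sketch of the joblib-backend pattern with a concrete estimator; the `RandomForestClassifier` and the toy dataset are just placeholders for your own scikit-learn code:

```python
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import joblib

client = Client(processes=False)

# small in-memory dataset; the joblib backend helps with CPU-bound fits
X, y = make_classification(n_samples=10000, random_state=0)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

with joblib.parallel_backend('dask'):
    clf.fit(X, y)  # the individual trees are fitted across the Dask workers
```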
2. Storing Data in Dask Arrays
The minimal code example below sets up two dummy datasets as Dask arrays and instantiates a K-Means clustering algorithm.
```python
import dask_ml.datasets
import dask_ml.cluster

# create two dummy datasets as Dask arrays (10 chunks of 1M samples each)
X, y = dask_ml.datasets.make_blobs(n_samples=10000000,
                                   chunks=1000000,
                                   random_state=0,
                                   centers=3)
X2, y2 = dask_ml.datasets.make_blobs(n_samples=10000000,
                                     chunks=1000000,
                                     random_state=3,
                                     centers=3)

# persist predictor sets to cluster memory
X = X.persist()
X2 = X2.persist()

# instantiate the K-Means model
km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
```
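As an optional sanity check, you can inspect the chunk layout and approximate in-memory size of the persisted arrays (assuming the `make_blobs` default of 2 features):

```python
# each array holds 10 chunks of 1,000,000 rows across 2 feature columns
print(X.chunks)        # ((1000000, ..., 1000000), (2,))
print(X.nbytes / 1e9)  # rough in-memory size in GB (~0.16 here)
```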
3. Training in Parallel with Dask.Delayed
The code below runs the training in parallel using the `dask.delayed` API. It follows the best practices outlined in the Dask docs.
```python
import dask
from dask import delayed

# wrap the persisted arrays so they are passed into the task graph lazily
X = delayed(X)
X2 = delayed(X2)

@delayed
def train(model, X):
    return model.fit(X)

# define task graphs (lazy evaluation, no computation triggered)
km1 = train(km, X)
km2 = train(km, X2)

# trigger computation and yield fitted models in parallel
km1, km2 = dask.compute(km1, km2)
```
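Once `dask.compute` returns, `km1` and `km2` are ordinary fitted estimators; assuming they follow the scikit-learn K-Means API (as dask-ml's `KMeans` does), you can inspect them directly:

```python
# the fitted models expose the usual scikit-learn KMeans attributes
print(km1.cluster_centers_)  # coordinates of the 3 centers found in X
print(km2.cluster_centers_)  # coordinates of the 3 centers found in X2
```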
4. Training in Parallel with Futures and client.submit
Alternatively, you can train in parallel using the `client.submit()` API. This immediately returns a future that points to the ongoing computation, and eventually to the stored result. Read more in the docs here.
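Here's a minimal sketch of that pattern, assuming `km`, `X`, and `X2` are the estimator and persisted Dask arrays from step 2 (not the `delayed`-wrapped versions from step 3):

```python
from dask.distributed import Client

client = Client(processes=False)

def train(model, X):
    return model.fit(X)

# submit() schedules the work immediately and returns Futures
future1 = client.submit(train, km, X)
future2 = client.submit(train, km, X2)

# gather() blocks until both fits finish and returns the fitted models
km1, km2 = client.gather([future1, future2])
```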
Based on how you've formulated your question, I've assumed that your main priority here is to have the training run in parallel. This doesn't require manually assigning tasks to specific workers; Dask takes care of the scheduling and optimal distribution across workers for you. In case you are actually interested in manually assigning specific tasks to specific workers, I'd suggest taking a look at this SO answer.
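For reference, `client.submit()` does accept a `workers=` keyword that pins a task to specific workers; the address below is just a placeholder:

```python
# pin a task to a particular worker (hypothetical worker address)
future = client.submit(train, km, X, workers=['tcp://127.0.0.1:45678'])
```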