
I've installed Dask. My main aim is to cluster a large dataset, but before starting work on it I want to run a few tests. However, whenever I run a piece of Dask code, it takes a very long time and ends with a memory error. I tried their Spectral Clustering example and the short code below.

What do you think the problem is?


from dask.distributed import Client
from joblib import parallel_backend  # sklearn.externals.joblib is deprecated; import joblib directly
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

import datetime

# 150,000 two-dimensional points drawn from 3 Gaussian blobs
X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)
client = Client()  # start a local Dask cluster

now = datetime.datetime.now()
model = DBSCAN(eps=0.5, min_samples=30)
with parallel_backend('dask'):  # route joblib-parallel work to the Dask workers
    model.fit(X)
print(datetime.datetime.now() - now)

emily.mi

1 Answer


The Scikit-Learn algorithms are not designed to train on larger-than-memory datasets; they operate on data that fits in memory. This is described here: https://ml.dask.org/#parallelize-scikit-learn-directly
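
For contrast, what the Dask joblib backend is good for is farming out many independent in-memory fits, such as a cross-validated grid search. A minimal sketch (the estimator and parameter grid here are illustrative choices, not from your question):

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # local Dask cluster; importing distributed registers the 'dask' joblib backend

# small, in-memory dataset: every fit still sees all of it
X, y = make_blobs(n_samples=1000, centers=3)

search = GridSearchCV(SVC(), {'C': [0.1, 1.0, 10.0]}, cv=3)
with parallel_backend('dask'):  # the independent CV fits run on Dask workers
    search.fit(X, y)
print(search.best_params_)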

Projects like Dask ML do have other algorithms that look like Scikit-Learn but are implemented differently, so that they support larger datasets. If you're looking for clustering, you might check this page to see what is currently supported: https://ml.dask.org/clustering.html
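
As an example of the pattern, here is a minimal sketch (assuming dask-ml is installed; KMeans stands in as one of the supported algorithms, and the array and chunk sizes are illustrative assumptions):

import dask.array as da
from dask_ml.cluster import KMeans

# a 150,000 x 2 array split into 10,000-row chunks, so no single step
# needs the full dataset in memory at once
X = da.random.random((150000, 2), chunks=(10000, 2))

model = KMeans(n_clusters=3)
model.fit(X)
print(model.labels_[:10].compute())  # labels_ is a lazy Dask array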

MRocklin
    Thank you for your answer. The clustering algorithms Dask currently supports don't produce the output I expect, and I have to use a density-based clustering method (DBSCAN). Which technology do you think I should use? I also looked into Apache Spark for big-data clustering, but it has the same issue as Dask: DBSCAN is not in its clustering list. DBSCAN could be implemented on top of Apache Spark without their library, but I don't think it would be efficient. I'd like to hear what you think. – emily.mi May 20 '19 at 10:11