
I am trying to use sklearn MiniBatchKMeans to cluster a fairly large dataset (150k samples and 150k features). I thought I could make things much faster using Incremental from dask_ml to fit my data in chunks. Here is a snippet of my code on a dummy dataset:

    from dask_ml.wrappers import Incremental
    from sklearn.cluster import MiniBatchKMeans
    import dask.array as da

    dataset = da.random.random((150000, 150000), chunks=(1000, 1000))
    kmeans = MiniBatchKMeans(n_clusters=3)
    inc = Incremental(kmeans).fit(dataset)
    predicted_labels = inc.predict(dataset).compute()
    print(predicted_labels)

The process gets killed at the compute() step. I didn't think running compute() on 150k points would be so intensive. It fails with this strange error:

    ValueError: X has 150000 features, but MiniBatchKMeans is expecting 1000 features as input.

I don't understand what the number of features MiniBatchKMeans expects has to do with calling compute() on the labels.

EDIT: After the first answer, I would like to clarify that I call compute() on the labels (not the dataset!) because I need them for some plotting operations. These values need to be in RAM in order to be used by matplotlib functions.

An array of shape (150k,) should fit comfortably in RAM; I am not sure why it fails!
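
For context, here is a rough sketch of the kind of plotting I mean (the 2-D coordinates below are just placeholders, not my real projection):

    import numpy as np
    import matplotlib.pyplot as plt

    labels = predicted_labels                      # must be a concrete NumPy array in RAM
    coords = np.random.random((len(labels), 2))    # placeholder 2-D coordinates for illustration
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=1)
    plt.show()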

coolbeans

1 Answer


The process breaks down because the compute() method will bring the entire dataset to local RAM. See the documentation:

You can turn any dask collection into a concrete value by calling the .compute() method or dask.compute(...) function.

And further:

However, this approach often breaks down if you try to bring the entire dataset back to local RAM.

    >>> df.compute()  # MemoryError(...)
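
To make that concrete, here is a tiny toy example (not your actual data) of what .compute() does: it materializes the whole dask collection as an in-memory NumPy object.

    import dask.array as da

    small = da.ones((4, 4), chunks=(2, 2))
    result = small.compute()             # the task graph is executed
    print(type(result), result.shape)    # <class 'numpy.ndarray'> (4, 4) -- now fully in local RAM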

This is also why you get a ValueError about mismatched sizes: you pass the entire dataset to the predict() method at once instead of passing the smaller chunks one by one. Remove the compute() call and it will work fine:

    predicted_labels = inc.predict(dataset)
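
Without compute(), predicted_labels stays a lazy dask array; only metadata (shape, dtype, chunks) is available until you explicitly trigger a computation:

    predicted_labels = inc.predict(dataset)   # lazy dask array, nothing is evaluated yet
    print(predicted_labels.shape)             # (150000,) -- metadata only, no computation triggered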

Since your goal seemed to be to "make things much faster", also note the following:

Each block of a Dask Array is fed to the underlying estimator’s partial_fit method. The training is entirely sequential, so you won’t notice massive training time speedups from parallelism. In a distributed environment, you should notice some speedup from avoiding extra IO, and the fact that models are typically much smaller than data, and so faster to move between machines.

(from here)
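
To illustrate what that quote means, here is a rough sketch (not dask_ml's actual code) of the sequential partial_fit loop that Incremental effectively performs over the blocks, using plain NumPy chunks and toy sizes:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    kmeans = MiniBatchKMeans(n_clusters=3)

    # each block is fed to partial_fit strictly one after another, so training is sequential
    for _ in range(10):                          # pretend there are 10 blocks
        block = np.random.random((1000, 100))    # one block of samples
        kmeans.partial_fit(block)

    print(kmeans.cluster_centers_.shape)         # (3, 100)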

afsharov
  • Thank you for the answer. Of course it works when I do not call compute(). The idea was to get these labels into RAM because I need to do some plotting with them. I have no problems with execution time; however, not being able to access the predicted labels is the issue. – coolbeans Jun 14 '21 at 14:43
  • I also tried passing only a part of the dataset, something like inc.predict(dataset[:10, :]), but this also fails with the same error message. – coolbeans Jun 14 '21 at 14:46