I am trying to use sklearn's MiniBatchKMeans to cluster a fairly large dataset (150k samples and 150k features). I thought I could speed things up considerably by using Incremental from dask_ml to fit the data in chunks. Here is a snippet of my code on a dummy dataset:
from dask_ml.wrappers import Incremental
from sklearn.cluster import MiniBatchKMeans
import dask.array as da

dataset = da.random.random((150000, 150000), chunks=(1000, 1000))
kmeans = MiniBatchKMeans(n_clusters=3)
inc = Incremental(kmeans).fit(dataset)
predicted_labels = inc.predict(dataset).compute()
print(predicted_labels)
The process gets killed at the compute() step. I didn't think calling compute() on 150k labels would be so intensive. It fails with this strange error:
ValueError: X has 150000 features, but MiniBatchKMeans is expecting 1000
features as input.
I don't understand what the number of features MiniBatchKMeans expects has to do with calling compute() on the labels.
EDIT: After the first answer, I would like to clarify that I call compute() on the labels (not the dataset!) because I need them for some plotting operations. These values need to be in RAM so that the matplotlib functions can use them.
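For context, the plotting I have in mind is roughly the sketch below; it is simplified and the exact plot does not matter, only the fact that matplotlib needs the labels as an in-memory NumPy array (the histogram call is just an illustrative placeholder):

import matplotlib.pyplot as plt

# predicted_labels is expected to be a plain NumPy array of shape (150000,)
# after compute(); matplotlib cannot consume a lazy dask array directly.
plt.hist(predicted_labels, bins=3)   # e.g. how many points fall in each cluster
plt.xlabel("cluster label")
plt.ylabel("number of samples")
plt.show()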
An array of shape (150000,) should fit comfortably in RAM, so I am not sure why it fails!
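Just to sanity-check that claim, here is a quick back-of-the-envelope computation (assuming the labels come back as 64-bit integers, which is the largest dtype I would expect):

import numpy as np

# 150000 labels stored as int64: 150000 * 8 bytes ≈ 1.2 MB, which is tiny.
labels = np.zeros(150000, dtype=np.int64)
print(labels.nbytes / 1e6, "MB")   # prints 1.2 MB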