I am using scikit-learn's MiniBatchKMeans to cluster large data and I get a MemoryError.
Here is the configuration of my laptop, on which it works fine:
- Core i5 64 bit
- Python 3.6.2
- 8 GB RAM
I stored the TfidfVectorizer output X in an .npz file (426 MB). I then run clustering on X several times with different numbers of clusters.
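The matrix was produced and saved roughly like this (a minimal sketch; the placeholder corpus and the default vectorizer settings are illustrative, not my exact script):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; the real one has ~850,900 documents
documents = ["first example document", "second example document"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # sparse CSR matrix of TF-IDF weights

# Save the sparse matrix so clustering can be re-run without re-vectorizing
sparse.save_npz(r"D:\clustering_final\sp-k2.npz", X)
```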
X = sparse.load_npz("D:\clustering_final\sp-k2.npz")
n_samples: 850900, n_features: 1728098
Clustering the sparse matrix with MiniBatchKMeans:
batch_size=1000, n_clusters=500, compute_labels=True, init='k-means++', n_init=100
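For context, the clustering step looks roughly like this (a minimal sketch reproducing the parameters above; it may differ slightly from my actual script):

```python
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans

# Load the precomputed TF-IDF matrix (sparse, ~426 MB on disk)
X = sparse.load_npz(r"D:\clustering_final\sp-k2.npz")
print("n_samples: %d, n_features: %d" % X.shape)

# MiniBatchKMeans with the parameters listed above
km = MiniBatchKMeans(
    n_clusters=500,
    init='k-means++',
    n_init=100,
    batch_size=1000,
    compute_labels=True,
)
km.fit(X)  # this is the call that raises MemoryError on the other laptop
```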
My Python script works fine on this laptop, but when I use the same Python installation (I copied the python36 folder as-is) on another laptop, it raises a MemoryError, even though the other laptop has a higher configuration:
- Core i5 64 bit
- Python 3.6.2
- 16 GB RAM
    km.fit(X)
  File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 1418, in fit
    init_size=init_size)
  File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 684, in _init_centroids
    x_squared_norms=x_squared_norms)
  File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 79, in _k_init
    centers = np.empty((n_clusters, n_features), dtype=X.dtype)
MemoryError
I checked all of the required libraries and other dependencies; everything runs perfectly on the lower-configuration laptop. Why doesn't it run on the higher-configuration laptop?
I know this sounds strange, but it's true.