
I am using scikit-learn's MiniBatchKMeans to cluster a large dataset, and I get a memory error.

Here is the configuration of the laptop on which it works fine:

  1. Core i5 64 bit
  2. Python 3.6.2
  3. 8 GB RAM

I stored the TfidfVectorizer output X in an .npz file (426 MB). I then cluster X several times with different numbers of clusters.

    from scipy import sparse
    X = sparse.load_npz(r"D:\clustering_final\sp-k2.npz")  # raw string so the backslashes are not read as escapes

n_samples: 850900, n_features: 1728098

Clustering sparse matrix data with MiniBatchKMeans

batch_size=1000, n_clusters=500, compute_labels=True, init='k-means++', n_init=100

The script works fine on this laptop, but when I run it with the same Python installation (the python36 folder copied as-is) on another laptop, it raises a MemoryError, even though the other laptop has the higher configuration:

  1. Core i5 64 bit
  2. Python 3.6.2
  3. 16 GB RAM

        km.fit(X)
      File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 1418, in fit
        init_size=init_size)
      File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 684, in _init_centroids
        x_squared_norms=x_squared_norms)
      File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 79, in _k_init
        centers = np.empty((n_clusters, n_features), dtype=X.dtype)
    MemoryError
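For reference, the failing line allocates a dense array of shape (n_clusters, n_features). A rough estimate of its size, using the numbers above and assuming X.dtype is float64 (the TfidfVectorizer default; I have not confirmed the dtype stored in the .npz file), can be sketched as:

```python
# Rough size of the dense array requested by
# np.empty((n_clusters, n_features), dtype=X.dtype) in the traceback.
n_clusters = 500          # from the MiniBatchKMeans parameters above
n_features = 1728098      # from the reported matrix shape
bytes_per_value = 8       # assumption: X.dtype is float64

size_bytes = n_clusters * n_features * bytes_per_value
size_gib = size_bytes / 2**30

print(f"{size_bytes} bytes, about {size_gib:.2f} GiB")
```

Under that assumption, the allocation needs a single contiguous block of roughly 6.4 GiB, on top of whatever the process already holds.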

I checked all of the required libraries and other dependencies, and the script runs perfectly on the lower-spec laptop. Why doesn't it run on the higher-spec one?

I know this sounds strange, but it's true.
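To compare the two environments, a small check like the following could be run on both laptops. It only reports the interpreter version and whether the Python build is 32-bit or 64-bit; it is a sketch using the standard library, and `interpreter_bits` is a hypothetical helper name, not something from scikit-learn.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


print("Python version:", platform.python_version())
print("Interpreter bits:", interpreter_bits())
# sys.maxsize is larger than 2**32 only on a 64-bit build.
print("64-bit build:", sys.maxsize > 2**32)
```

If both machines print the same values, the interpreter build itself is probably not the difference between them.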

Mihir