I am using scikit-learn's MiniBatchKMeans to cluster large data and I get a MemoryError.
Here is the configuration of my laptop, on which it works fine:
- Core i5 64 bit
- Python 3.6.2
- 8 GB RAM
I stored the TfidfVectorizer output X in an .npz file (426 MB). I then run clustering on X several times with different numbers of clusters.
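The matrix was produced and saved roughly like this (a minimal sketch; the placeholder corpus and the default vectorizer settings are illustrative, not my exact script):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; the real one has ~850,900 documents
documents = ["first example document", "second example document"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # sparse CSR matrix of TF-IDF weights

# Save the sparse matrix so clustering can be re-run without re-vectorizing
sparse.save_npz(r"D:\clustering_final\sp-k2.npz", X)
```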
X = sparse.load_npz("D:\clustering_final\sp-k2.npz")
n_samples: 850900, n_features: 1728098
Clustering the sparse matrix with MiniBatchKMeans:
batch_size=1000, n_clusters=500, compute_labels=True, init='k-means++', n_init=100
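For context, the clustering step looks roughly like this (a minimal sketch reproducing the parameters above; it may differ slightly from my actual script):

```python
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans

# Load the precomputed TF-IDF matrix (sparse, ~426 MB on disk)
X = sparse.load_npz(r"D:\clustering_final\sp-k2.npz")
print("n_samples: %d, n_features: %d" % X.shape)

# MiniBatchKMeans with the parameters listed above
km = MiniBatchKMeans(
    n_clusters=500,
    init='k-means++',
    n_init=100,
    batch_size=1000,
    compute_labels=True,
)
km.fit(X)  # this is the call that raises MemoryError on the other laptop
```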
My Python script works fine on this laptop, but when I use the same Python installation (I copied the python36 folder as-is) on another laptop, it raises a MemoryError, even though the other laptop has a higher configuration:
- Core i5 64 bit
- Python 3.6.2
- 16 GB RAM
    km.fit(X)
  File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 1418, in fit
    init_size=init_size)
  File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 684, in _init_centroids
    x_squared_norms=x_squared_norms)
  File "C:\python36\lib\site-packages\sklearn\cluster\k_means_.py", line 79, in _k_init
    centers = np.empty((n_clusters, n_features), dtype=X.dtype)
MemoryError
I checked all of the required libraries and other dependencies; everything runs perfectly on the lower-configuration laptop. Why doesn't it run on the higher-configuration laptop?
I know this sounds strange, but it's true.