My goal is to cluster texts from a dataset of millions of rows, where each row is a string of words that doesn't correspond to a proper document but rather to a list of "keywords". The idea is that each row represents a Twitter user, with the keywords taken from his/her tweets. Here is an example of a row:
"remove United States District Attorney Carmen Ortiz office overreach case Aaron Swartz"
Here is my code:
from __future__ import print_function
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans
from time import time
import csv
# LOAD CSV
print("Loading Dataset from a CSV...")
csvinputfile = '...'
t = time()
dataset = open(csvinputfile, 'r')
print("done in %0.3fs" % (time() - t))
print("")
# TERM OCCURRENCES
print("Calculating Term Occurrences...")
t = time()
vectorizer = HashingVectorizer(n_features=300000, stop_words=None, alternate_sign=False, norm='l2', binary=False)
x = vectorizer.fit_transform(dataset)
print("done in %0.3fs" % (time() - t))
print("")
# CLUSTERING
print("MiniBatchKMeans Clustering...")
t = time()
km = MiniBatchKMeans(n_clusters=10000, init='k-means++', n_init=1, init_size=10000, batch_size=10000, verbose=False)
clusters = km.fit(x)
print("done in %0.3fs" % (time() - t))
print("")
My problem is that when it comes to clustering I get a MemoryError:
MiniBatchKMeans Clustering...
Traceback (most recent call last):
File ".../cluster-users.py", line 32, in <module> clusters = km.fit(x)
File ".../python2.7/site-packages/sklearn/cluster/k_means_.py", line 1418, in fit init_size=init_size)
File ".../python2.7/site-packages/sklearn/cluster/k_means_.py", line 684, in _init_centroids x_squared_norms=x_squared_norms)
File ".../python2.7/site-packages/sklearn/cluster/k_means_.py", line 79, in _k_init centers = np.empty((n_clusters, n_features), dtype=X.dtype)
MemoryError
[Finished in 22.923s]
I am quite new to Python and scikit-learn, so I don't really understand what is happening, but I assume that, since I am dealing with a large dataset, the clustering phase is trying to load the huge matrix of n_samples x n_features into memory.
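If I read the last frame of the traceback correctly, the allocation that fails is np.empty((n_clusters, n_features)), i.e. the dense array of cluster centres, so a rough back-of-the-envelope estimate (my own calculation, assuming 8-byte float64 entries) gives:
# Rough estimate of the centres array from the last traceback frame,
# assuming 8-byte float64 entries (the actual dtype may differ).
n_clusters = 10000
n_features = 300000
print("centres array: %.1f GB" % (n_clusters * n_features * 8 / 1e9))  # ~24 GB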
Apart from that error, which I don't understand since I thought that MiniBatchKMeans and HashingVectorizer were exactly what should help against memory limits, I also don't really know what the best parameters to use are (I followed the scikit-learn tutorial on KMeans and MiniBatchKMeans for clustering texts as a base; you can find it here: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py).
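Would something more conservative like the following make more sense? (the numbers below are just guesses on my part, not values taken from the tutorial):
# Tentative variant with a smaller hashing space and fewer clusters;
# all of these values are guesses, not numbers from the tutorial.
dataset = open(csvinputfile, 'r')          # re-open, the file iterator above is exhausted
vectorizer = HashingVectorizer(n_features=2**16, stop_words=None,
                               alternate_sign=False, norm='l2')
x = vectorizer.fit_transform(dataset)      # each line is treated as one "document"
km = MiniBatchKMeans(n_clusters=100, init='k-means++', n_init=1,
                     init_size=1000, batch_size=1000, verbose=False)
km.fit(x)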
Things to remember:
- I cannot use any Machine Learning, NLP, or MapReduce-like techniques
- The clusters need to somehow represent users with similar interests, and therefore similar keyword usage
So my question is: how can I fix the MemoryError? And if someone has hints on how to set up the clustering properly, or thinks my approach is wrong, that would also be nice.
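I was also wondering whether feeding the data to MiniBatchKMeans in chunks with partial_fit would help, something along these lines (again, just a rough sketch of mine, with guessed chunk size and cluster count):
# Rough sketch: vectorize and feed the rows in chunks via partial_fit,
# so that only one chunk of lines is held in memory at a time.
# The chunk size and n_clusters here are guesses.
import itertools

km = MiniBatchKMeans(n_clusters=100, init='k-means++', n_init=1)
with open(csvinputfile, 'r') as f:
    while True:
        chunk = list(itertools.islice(f, 10000))
        if not chunk:
            break
        km.partial_fit(vectorizer.transform(chunk))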