My goal is to cluster texts from a dataset of millions of rows, where each row is a string of words that does not correspond to a proper document but rather to a list of "keywords". The idea is that each row represents a Twitter user, with the keywords taken from his/her tweets. Here is an example of a row:

"remove United States District Attorney Carmen Ortiz office overreach case Aaron Swartz"

Here is my code:

from __future__ import print_function
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans
from time import time
import csv

# LOAD CSV
print("Loading Dataset from a CSV...")
csvinputfile = '...'
t = time()
dataset = open(csvinputfile, 'r')
print("done in %0.3fs" % (time() - t))
print("")

# TERM OCCURRENCES
print("Calculating Term Occurrences...")
t = time()
vectorizer = HashingVectorizer(n_features=300000, stop_words=None, alternate_sign=False, norm='l2', binary=False)
x = vectorizer.fit_transform(dataset)
print("done in %0.3fs" % (time() - t))
print("")

# CLUSTERING
print("MiniBatchKMeans Clustering...")
t = time()
km = MiniBatchKMeans(n_clusters=10000, init='k-means++', n_init=1, init_size=10000, batch_size=10000, verbose=False)
clusters = km.fit(x)
print("done in %0.3fs" % (time() - t))
print("")

My problem is that when it comes to the clustering step I get a MemoryError:

MiniBatchKMeans Clustering...
Traceback (most recent call last):
File ".../cluster-users.py", line 32, in <module> clusters = km.fit(x)
File ".../python2.7/site-packages/sklearn/cluster/k_means_.py", line 1418, in fit init_size=init_size)
File ".../python2.7/site-packages/sklearn/cluster/k_means_.py", line 684, in _init_centroids x_squared_norms=x_squared_norms)
File ".../python2.7/site-packages/sklearn/cluster/k_means_.py", line 79, in _k_init centers = np.empty((n_clusters, n_features), dtype=X.dtype)
MemoryError
[Finished in 22.923s]

I am quite new to Python and scikit-learn, so I don't really understand what is happening, but I assume that, since I am dealing with a large dataset, the clustering phase is trying to load a huge matrix of n_samples by n_features into memory.
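
Looking at the failing line in the traceback, it seems to be allocating a dense array of shape (n_clusters, n_features) for the cluster centers, so a quick back-of-the-envelope check (assuming float64 entries, which is what HashingVectorizer produces by default) suggests why it blows up:

n_clusters = 10000
n_features = 300000
bytes_per_value = 8  # float64
print("centers array: %.1f GB" % (n_clusters * n_features * bytes_per_value / 1e9))  # ~24.0 GB

So the k-means++ initialisation alone would need roughly 24 GB just for the dense centers array, before a single sample is clustered; lowering n_clusters and/or n_features is what shrinks this allocation.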

Apart from that error, which I don't understand since I thought MiniBatchKMeans and HashingVectorizer were exactly what should help against memory limits, I also don't really know what the best parameters are (as a base I followed the scikit-learn tutorial on KMeans and MiniBatchKMeans for clustering texts, which you can find here: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py).

Things to remember:

  • I cannot use any Machine Learning, NLP, or MapReduce-like techniques
  • The clusters need to somehow represent users with similar interests, and therefore similar keyword usage

So my question is: how can I fix the memory error? And if someone has hints on how to properly set up the clustering, or thinks my approach is wrong, that would also be nice.

  • Don't load all the data into the program at once. You are using MiniBatchKMeans, so make use of its `partial_fit()` method. Load a small chunk of data, transform it with HashingVectorizer and pass that to `partial_fit()` (see the sketch after these comments). – Vivek Kumar Sep 04 '18 at 07:03
  • Will I get different results if I do the hashing vectorization on all the data and then pass the resulting matrix to MiniBatchKMeans in chunks? I am asking because I tried it, and in the end `km.labels_` only contained the last chunk of data that I partial-fitted. –  Sep 04 '18 at 10:19
  • @VivekKumar Oh, another thing: if I want to add a dimensionality-reduction step, is it OK to add it right after the HashingVectorizer phase, or is that conceptually/logically wrong? So if I understand correctly I have to: for each chunk of raw data -> do hashing vectorization -> do dimensionality reduction -> do MiniBatchKMeans -> save the results of that chunk -> repeat for another chunk until the whole dataset is processed. –  Sep 04 '18 at 12:33
  • If you want to reduce dimensions, why not start with lower n_features in HashingVectorizer in the first place? – Vivek Kumar Sep 04 '18 at 13:17
  • Since my data are short texts, shouldn't the number of features equal the number of possible unique words? Otherwise I will lose information, at least that is what I understood, and then to lower the computational cost I perform dimensionality reduction. –  Sep 04 '18 at 13:36
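
A minimal sketch of the chunked approach suggested in the first comment, assuming the CSV has one user (one keyword string) per line; the chunk size, n_features and n_clusters below are placeholders to be tuned. Because HashingVectorizer is stateless, transforming chunk by chunk gives the same features as transforming everything at once, and a second pass with `predict()` yields a label for every user rather than only for the last chunk:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# Placeholder sizes: the dense centers array costs roughly
# n_clusters * n_features * 8 bytes, so keep their product reasonable.
CHUNK_SIZE = 10000
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False, norm='l2')
km = MiniBatchKMeans(n_clusters=1000, init='k-means++', batch_size=10000)

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    # Yield lists of chunk_size lines without holding the whole file in memory.
    chunk = []
    with open(path, 'r') as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# First pass: incrementally fit the model, one chunk at a time.
for chunk in iter_chunks('...'):                 # path elided, as in the question
    km.partial_fit(vectorizer.transform(chunk))  # hashing is stateless, no fit needed

# Second pass: assign a cluster label to every user.
labels = []
for chunk in iter_chunks('...'):
    labels.extend(km.predict(vectorizer.transform(chunk)))

If a dimensionality-reduction step is still wanted, the cheapest lever here is simply a smaller n_features in HashingVectorizer, as suggested in the comments, since the hashing step never needs to see the full vocabulary.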

1 Answer


A row containing text like this, "remove United States District Attorney Carmen Ortiz office overreach case Aaron Swartz", is indeed dirty.

To fix the memory error, make sure the following points hold:

  • Are all the keywords in a row relevant? If not, try to reduce them by removing stop words, punctuation marks, etc.

  • Focus on aggregating only the relevant keywords from the text. You can build a list of such keywords.

  • Look into regular expressions (Python's re module); they can help you with data cleaning.

  • To determine the best parameters, look at measures such as the within-cluster sum of squares, the average silhouette, or the gap statistic (a short sketch follows this list).
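
As an illustration of the last two points, here is a minimal sketch; the stop-word list is just an example, and the evaluation part assumes the hashed matrix x and a fitted km from the question's code, with an arbitrary subsample size (a full silhouette over millions of rows would be far too expensive):

import re
import numpy as np
from sklearn.metrics import silhouette_score

# Illustrative stop-word list; in practice use a fuller one
# (e.g. sklearn's ENGLISH_STOP_WORDS).
STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'on', 'and', 'or', 'to', 'for'}

def clean_row(text):
    # Keep letters only, lowercase everything, and drop stop words.
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)
    # e.g. cleaned_chunk = [clean_row(row) for row in chunk]

# Compare parameter settings by the silhouette on a random subsample.
sample_idx = np.random.choice(x.shape[0], size=5000, replace=False)
x_sample = x[sample_idx]
print("silhouette: %0.3f" % silhouette_score(x_sample, km.predict(x_sample)))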

Clustering is not some kind of dark magic that automatically yields meaningful results. If you put garbage in, you will get garbage out.

P.S. Please do not create new questions for the same problem. You have already asked a similar question recently. Unless the two questions are radically different, do not create a new one; if they are, state clearly in your post how this question differs from the previous one.

maverick
  • The strings are already cleaned of those words. Sorry for the second question, but with this one I mainly wanted to address the memory error. –  Sep 02 '18 at 07:47