How can i cluster document using k-means (Flann with python)?

Question

I want to cluster documents based on similarity.

I haved tried ssdeep (similarity hashing), very fast but i was told that k-means is faster and flann is fastest of all implementations, and more accurate so i am trying flann with python bindings but i can't find any example how to do it on text (it only support array of numbers).

I am very very new to this field (k-means, natural language processing). What i need is speed and accuracy.

My questions are:

Can we do document similarity grouping / Clustering using KMeans (Flann do not allow any text input it seems )
Is Flann the right choice? If not please suggest me High performance library that support text/docs clustering, that have python wrapper/API.
Is k-means the right algorithm?

This is pretty much the same question as http://stackoverflow.com/questions/8057442/document-clustering-basics. I suggest you look at scikit-learn, which has most of the functionality you need built-in, and is pretty scalable: http://scikit-learn.org/stable/auto_examples/document_clustering.html. Alternatively, NLTK also has k-means: http://nltk.org/_modules/nltk/cluster/kmeans.html. — Fred Foo, Sep 19 '12 at 14:59
Thanks a lot. How is performance scikit vs nltk , can you comment on them? — Phyo Arkar Lwin, Sep 19 '12 at 15:35
I've never tried NLTK's clustering, but it's a safe bet to assume scikit-learn will be orders of magnitude faster for largish datasets. NLTK may be easier to use, though. — Fred Foo, Sep 19 '12 at 15:54
If you need to do approximate k-NN queries, then FLANN is state of the art (there is no approximate k-NN query module in scikit-learn nor NLTK as far as I know). But k-NN queries and K-Means clustering are not solving the same problem. — ogrisel, Oct 10 '12 at 09:35

dhg · Accepted Answer · 2012-09-19T15:00:48.240

20

You need to represent your document as an array of numbers (aka, a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent is as a vector of word counts.

So here's what you do:

Count up the number of times each word appears in the document.
Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.
Make a vector for each document based on the counts of the feature words.

Here's an example.

If your "documents" are single sentences, and they look like (one doc per line):

there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog

If my set of feature words are [dog, cat, street, pizza, lunch], then I can convert each document into a vector:

[1, 1, 0, 0, 0]  // dog 1 time, cat 1 time
[0, 0, 0, 1, 1]  // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0]  // dog 2 times, cat 1 time, street 1 time

You can use these vectors in your k-means algorithm and it will hopefully group the first and third sentence together because they are similar, and make the second sentence a separate cluster since it is very different.

edited Sep 19 '12 at 15:00

answered Sep 19 '12 at 14:55

dhg

52,383
8
123
144

Very interesting , i read somewhere a few days ago that scikit.learn have such feature to vectorize any text file or strings. I am wondering the data structure that it gives out will be suitable for Flann? – Phyo Arkar Lwin Sep 19 '12 at 14:58
I would just add that you could use some stemming algorithm to ensure you consider small variations of the same word as the same keyword. This will decrease the number of variables and should make the overall process more accurate. See this link for more info [link](http://packages.python.org/Whoosh/stemming.html) – jpsfer Sep 19 '12 at 16:08
yes that would be nice , i can use NLTK to produce / tokenize words. – Phyo Arkar Lwin Sep 21 '12 at 13:00
If I don't have any feature words and my bag of words is just "any word that's not a stopword" will each vector be the length of all possible words, with each indice representing the occurences of the word assigned to that indice? – Carpetfizz Nov 01 '16 at 20:39
@Carpetfizz. That's true. – CKM Mar 06 '17 at 11:22

score 14 · Answer 2 · answered Sep 19 '12 at 18:04

There is one big problem here:

K-means is designed for Euclidean distance.

The key problem is the mean function. The mean will reduce variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping at a maximum number of iterations).

Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking the problem is that the mean of a large number of documents will no longer look like a real document, and this way become dissimilar to any real document, and more similar to other mean vectors. So the results to some extend degenerate.

For text vectors, you probably will want to use a different distance function such as cosine similarity.

And of course you first need to compute number vectors. For example by using relative term frequencies, normalizing them via TF-IDF.

There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.

Thank you so much for pointing that out. any implementations of K-medoids that you recommend? — Phyo Arkar Lwin, Sep 21 '12 at 13:17

How can i cluster document using k-means (Flann with python)?

2 Answers2