
Cosine similarity is often used when comparing two documents against each other. It measures the angle between the two vectors: if the value is 0, the angle between them is 90 degrees and they share no terms; if the value is 1, the two vectors are identical except for magnitude. Cosine is appropriate when the data is sparse and asymmetric, i.e. when a shared lack of characteristics should not count as similarity.
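
In symbols, Cos(v1, v2) = (v1 · v2) / (|v1| |v2|). For concreteness, a minimal sketch of this formula in Python (the function name is just illustrative):

    import math

    def cosine(v1, v2):
        # Cosine of the angle between two equal-length term-frequency vectors.
        dot = sum(a * b for a, b in zip(v1, v2))
        len1 = math.sqrt(sum(a * a for a in v1))
        len2 = math.sqrt(sum(b * b for b in v2))
        return dot / (len1 * len2)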

Say I compute the cosine for two vectors (documents) with the term frequencies in the following table:

Term     Doc1 (TF)   Doc2 (TF)
London   5           3
Is       2           2
Nice     10          3
City     0           1

Then I normalize by the vector lengths. The dot product is 5·3 + 2·2 + 10·3 + 0·1 = 49 and the lengths are √129 and √23, so Cos(v1,v2) = 49 / (√129 · √23) ≈ 0.90, i.e. about 90%.
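
A quick check of that arithmetic with NumPy (the numbers come straight from the table above):

    import numpy as np

    v1 = np.array([5, 2, 10, 0])   # Doc1: London, Is, Nice, City
    v2 = np.array([3, 2, 3, 1])    # Doc2

    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(cos)                     # 0.8996..., i.e. about 90%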

But if I have 10 documents, that means I have to compute:

Cos(v1,v2) = ?
Cos(v1,v3) = ?
Cos(v1,v4) = ?
Cos(v1,v5) = ?
Cos(v1,v6) = ?
Cos(v1,v7) = ?
Cos(v1,v8) = ?
Cos(v1,v9) = ?
Cos(v1,v10) = ?
Cos(v2,v3) = ?
Cos(v2,v4) = ?
Cos(v2,v5) = ?

And so on, until

Cos(v9,v10) = ?

Then I have to compare the results.

Is there any fast method? How can I get the cosine for 10 or more documents?

I know how to get the cosine for two documents, but how do I get it for more documents? I want the mathematical method.
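
For concreteness, a brute-force sketch of exactly this enumeration (the toy vectors are made up for illustration); it spells out the n(n-1)/2 = 45 pairs for 10 documents:

    import math
    from itertools import combinations

    def cosine(v1, v2):
        dot = sum(a * b for a, b in zip(v1, v2))
        return dot / (math.sqrt(sum(a * a for a in v1)) *
                      math.sqrt(sum(b * b for b in v2)))

    # Ten toy documents as term-frequency vectors over the same four-term vocabulary.
    docs = [[i + 1, 2, (i % 3) + 1, i % 2] for i in range(10)]

    # All n*(n-1)/2 = 45 unordered pairs for n = 10.
    for (i, vi), (j, vj) in combinations(enumerate(docs, start=1), 2):
        print(f"Cos(v{i},v{j}) = {cosine(vi, vj):.3f}")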

Dheya Majid
  • There is no faster way when you want conclusive distance results, although I sure hope someone here will whip up some algorithm that proves me wrong. I've dealt with this type of issue before, except I was comparing distances between various paragraphs in two or more documents. – Sten Petrov Dec 19 '12 at 15:55
  • There is no faster way to get all pairwise distances than to compute them (n * (n-1) / 2 pairs). However, your question mentions document clustering, and most clustering methods will require only distances from each cluster centroid to each point, not all pairwise distances (or equivalently the likelihood of each cluster for each point if you're clustering probabilistically). – Ben Allison Dec 20 '12 at 10:44
  • You are describing a distance measure (cosine similarity). The question is where you would like to apply it (in some classification task - probably not; in some clustering - it looks more likely). If I'm right, you need to choose some clustering algorithm and use cosine similarity as a distance measure inside it ... – xhudik Dec 21 '12 at 11:45
  • Maybe this post would help: http://stats.stackexchange.com/questions/28406/is-cosine-similarity-a-classification-or-a-clustering-technique – xhudik Dec 21 '12 at 11:54
  • Clustering is NP-hard because the number of possible clusterings is exponential. Clustering algorithms like k-means can provide an approximate solution in polynomial time, but there's no guarantee that such a solution will be optimal. – Diego Dec 22 '12 at 00:06
  • @Dheya: Did you like my answer? If yes, please upvote it – Programmer Dec 29 '12 at 09:26

2 Answers


There is a pretty tricky optimization that is easy to overlook.

Most of the time, your vectors will be sparse. If you look at the formula for cosine similarity, note that the vector lengths never change, so you can precompute them once.

For the dot product, note that if each vector is non-zero in 10% of the dimensions, both will be non-zero in only about 1% of the dimensions (assuming the non-zero positions are roughly independent). So you only need to compute products in the dimensions where both vectors are non-zero; in your example, you can skip the word City.

If you then transpose the data into a column-based layout and drop the zeros there, you can compute this in a distributed manner quite quickly. For example, the document v1 would be missing in the City column. Then you compute the pairwise products in each column, and output them to the corresponding document pair. In the end, you normalize these sums with the total vector lengths. Mahout should be doing it this way already, so you should find details on this approach in a good book on Mahout (sorry, I don't have good pointers).
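
A rough sketch of that column-oriented idea, assuming documents arrive as sparse term-to-frequency maps (all names here are illustrative; this is not Mahout's actual API):

    import math
    from collections import defaultdict
    from itertools import combinations

    # Documents as sparse term -> frequency maps; zero entries are simply absent.
    docs = {
        "d1": {"london": 5, "is": 2, "nice": 10},
        "d2": {"london": 3, "is": 2, "nice": 3, "city": 1},
    }

    # Precompute the vector lengths once; they never change.
    lengths = {d: math.sqrt(sum(f * f for f in tf.values()))
               for d, tf in docs.items()}

    # "Transpose" into columns: term -> list of (doc, frequency), zeros dropped.
    columns = defaultdict(list)
    for d, tf in docs.items():
        for term, f in tf.items():
            columns[term].append((d, f))

    # Accumulate pairwise dot products column by column; document pairs that
    # share no term are never touched at all.
    dots = defaultdict(float)
    for postings in columns.values():
        for (a, fa), (b, fb) in combinations(postings, 2):
            dots[(a, b)] += fa * fb

    # Finally normalize each accumulated sum by the precomputed lengths.
    for (a, b), dot in dots.items():
        print(a, b, dot / (lengths[a] * lengths[b]))   # d1 d2 0.8996...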

The key idea for processing large quantities of text is to exploit sparsity. Try to avoid any computation whose value will be 0 anyway.

Erich Schubert

Here are some optimizations. Since the matrix of pairwise distances is symmetric about the diagonal, you only need to compute the upper triangle. Moreover, once you have the matrix of pairwise distances, the actual clustering can be done in n-1 iterations. A fast way to compute the matrix of pairwise distances is parallel programming (say, on the GPU); results have shown that computing pairwise distances on the GPU can be 64 times faster than on the CPU. However, for a clustering algorithm like single-link hierarchical clustering, the actual clustering has to be done on the CPU.
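
A minimal sketch of the upper-triangle point on the CPU (toy data; this matrix-product formulation is one common way to batch the cosines, not necessarily what the GPU results above used):

    import numpy as np

    X = np.random.rand(10, 50)                          # 10 documents, 50 terms (dummy data)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows

    # The similarity matrix is symmetric about the diagonal,
    # so only the n*(n-1)/2 entries with i < j are kept.
    S = Xn @ Xn.T
    iu, ju = np.triu_indices(len(Xn), k=1)
    pairwise = S[iu, ju]
    print(pairwise.shape)                               # (45,) for n = 10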

Programmer