How to efficiently compute the cosine similarity between millions of strings

Question

I need to compute the cosine similarity between strings in a list. For example, I have a list of over 10 million strings, each string has to determine similarity between itself and every other string in the list. What is the best algorithm I can use to efficiently and quickly do such task? Is the divide and conquer algorithm applicable?

EDIT

I want to determine which strings are most similar to a given string and be able to have a measure/score associated with the similarity. I think what I want to do falls in line with clustering where the number of clusters are not initially known.

By definition of your problem, you will have a complexity of O(n²) executions of the cosine similarity computation. — Xion345, Feb 23 '13 at 14:46
@Xion345 Yes, is this acceptable for such a large data? I do not think it is — Kennedy, Feb 23 '13 at 14:48
You have to employ dynamic programming for that. See ***[this](http://en.wikipedia.org/wiki/Approximate_string_matching)*** link — Extreme Coders, Feb 23 '13 at 14:49
10 million times 10 million = 1 thousand billions. If each application of the cosine similarity requires 50 microseconds then you get a minimum time of about 50 million seconds, which is about one and a half years. If the cosine similarity takes just 1 nanosecond you'd still need 1000 seconds. So even with dynamic programming you can't do much. You should find a way to avoid all these computations of the cosine similarity. — Bakuriu, Feb 23 '13 at 15:02
The question is: why would you need to know all of the cosine similarities in the first place? If you have large scale text matching tasks then the standard book that I would recommend is Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology by Dan Gusfield. Don't be fooled by its name. Computational Biology is mostly about (DNA) string matching. The queries in this field are much more refined than the famous google search. — Udo Klein, Feb 23 '13 at 14:52
@Xion345 well, you can however save computing 0*something, if you are smart. If you exploit sparsity, it's only quadratic on the non-sparse parts. Which is much better. — Has QUIT--Anony-Mousse, Feb 23 '13 at 15:48

score 0 · Answer 1 · answered Feb 23 '13 at 15:50

Work with the transposed matrix. That is what Mahout does on Hadoop to do this kind of task fast (or just use Mahout).

Essentially, computing cosine similarity the naive way is bad. Because you end up computing a lot of 0 * something. Instead, you better work in columns, and leave away all 0s there.

score 0 · Answer 2 · answered Feb 23 '13 at 19:14

You could try SimString.

It is a C++ library (with Python or Ruby bindings) for approximate string matching.

It claims to find strings with high cosine similarity in under 1 millisecond for a database of 13 million strings.

The algorithm used is described here based on pruning of inverted lists.

How to efficiently compute the cosine similarity between millions of strings

2 Answers2