Euclidean vs Cosine for text data

Question

If I use a tf-idf feature representation (or just document length normalization), are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks I have read, and other forums and discussions, say cosine similarity works better for text...

I wrote some basic code to test this and found that they are indeed comparable: not exactly the same floating-point values, but one looks like a scaled version of the other. Below are the results of both measures on simple demo text data. Text no. 2 is a long line of about 50 words; the rest are short 10-word lines.

Cosine similarity: 0.0, 0.2967, 0.203, 0.2058

Euclidean distance: 0.0, 0.285, 0.2407, 0.2421

Note: If this question is more suitable for Cross Validated or Data Science, please let me know.

score 2 · Accepted Answer · answered Apr 27 '15 at 16:37

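If your data is normalized to unit length, then it is very easy to prove that

Euclidean(A,B)^2 = 2 - 2 Cos(A,B)

because ||A - B||^2 = ||A||^2 + ||B||^2 - 2 A·B, and A·B = Cos(A,B) when ||A|| = ||B|| = 1. So on unit-length vectors, Euclidean distance and cosine similarity rank neighbors identically.

This holds only if ||A|| = ||B|| = 1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. For example, if you first normalize your document to unit length and then perform IDF weighting, it will not hold...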
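A quick numerical check of the point about normalization order, as a minimal sketch in plain Python. The term-frequency vectors and IDF weights below are made up for illustration, and the helper names (`l2_normalize`, `weight`, and so on) are hypothetical, not taken from any library. It compares squared Euclidean distance with 2 - 2 Cos(A,B) when unit-length normalization is the last step versus when IDF weighting is applied after it.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term counts and IDF weights, invented for this example.
tf_a = [3.0, 0.0, 1.0, 2.0]
tf_b = [1.0, 4.0, 0.0, 1.0]
idf = [0.2, 1.5, 2.0, 0.7]

def weight(v):
    """Apply the per-term IDF weights."""
    return [x * w for x, w in zip(v, idf)]

# Order 1: IDF weighting first, unit-length normalization last.
a1, b1 = l2_normalize(weight(tf_a)), l2_normalize(weight(tf_b))
gap_normalized_last = abs(euclidean(a1, b1) ** 2 - (2 - 2 * cosine(a1, b1)))

# Order 2: normalization first, then IDF weighting (no longer unit length).
a2, b2 = weight(l2_normalize(tf_a)), weight(l2_normalize(tf_b))
gap_normalized_first = abs(euclidean(a2, b2) ** 2 - (2 - 2 * cosine(a2, b2)))

print("gap, normalization last: ", gap_normalized_last)
print("gap, normalization first:", gap_normalized_first)
```

With normalization as the last step the gap should vanish up to floating-point rounding; with IDF applied afterwards the vectors are no longer unit length and the two quantities drift apart.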

Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.

Has QUIT--Anony-Mousse


So is there any specific advantage to the cosine distance metric for text? – Soumyajit Apr 27 '15 at 16:42

It can be computed more efficiently *if* you have a good sparse vector implementation. It's what text search engines like Lucene exploit - you can skip over all 0 values; for Euclidean distance you can only skip attributes that are identical (and thus have a difference of 0). – Has QUIT--Anony-Mousse Apr 27 '15 at 16:44
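To illustrate the sparsity argument, here is a rough sketch that stores each document as a dict from term to weight. This is only an assumption about representation chosen for brevity; it is not how Lucene works internally, and the function names and example documents are hypothetical. The dot product only touches terms present in both vectors, while the direct Euclidean computation has to visit every term that is nonzero in either one.

```python
import math

def sparse_dot(a, b):
    """Dot product that iterates over the smaller vector only."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

def sparse_cosine(a, b):
    """Cosine similarity; in practice the norms would be precomputed once."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return sparse_dot(a, b) / (norm_a * norm_b)

def sparse_euclidean(a, b):
    """Direct computation: visits the union of nonzero terms."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical unit-length documents sharing a single term.
doc_a = {"cosine": 0.6, "text": 0.8}
doc_b = {"text": 0.6, "euclidean": 0.8}

cos_ab = sparse_cosine(doc_a, doc_b)      # only "text" contributes
dist_ab = sparse_euclidean(doc_a, doc_b)  # all three terms are visited

print(cos_ab, dist_ab)
```

When the vectors are unit length, the Euclidean distance can also be recovered from the cheap dot product, since the squared distance equals 2 - 2 times the cosine similarity in that case.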
