
I have a corpus of classified text. From it I create one vector per document, whose components are the TF-IDF weights of the words in that document. Next I build a model in which every class is represented by a single vector: the model has as many vectors as there are classes in the corpus, and each component of a class vector is the mean of that component over all document vectors in the class. For an unclassified vector I measure its similarity to a class vector by computing the cosine of the angle between the two vectors.
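
For concreteness, here is a minimal sketch of that pipeline, assuming scikit-learn is used for the TF-IDF step (the toy corpus and all names are illustrative only):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy labeled corpus: two classes with two documents each (illustrative).
docs = ["cats purr softly", "cats chase mice",
        "dogs bark loudly", "dogs fetch sticks"]
labels = np.array([0, 0, 1, 1])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()  # one TF-IDF row vector per document

# Model: one vector per class; each component is the mean of that
# component over all document vectors in the class.
centroids = np.vstack([X[labels == c].mean(axis=0) for c in np.unique(labels)])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Classify an unseen document by cosine similarity to each class vector.
query = vectorizer.transform(["mice chase cats"]).toarray()[0]
print(int(np.argmax([cosine(query, c) for c in centroids])))  # -> 0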

Questions:

1) Can I use Euclidean distance between an unclassified vector and a model vector to compute their similarity?

2) Why can't Euclidean distance be used as a similarity measure instead of the cosine of the angle between two vectors, and vice versa?

Thanks!

Anton Ashanin
    This question appears to be off-topic because it is about statistics, not programming. Try http://stats.stackexchange.com/. – rob mayoff Oct 16 '13 at 20:51

4 Answers


One informal but rather intuitive way to think about this is to consider the two components of a vector: direction and magnitude.

Direction is the "preference" / "style" / "sentiment" / "latent variable" of the vector, while magnitude is how strongly it points in that direction.

When classifying documents we'd like to categorize them by their overall sentiment, so we use the angular distance.

Euclidean distance is susceptible to clustering documents by their L2 norm (their magnitude) rather than by their direction, i.e. vectors with quite different directions can end up clustered together simply because their distances from the origin are similar.
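
A small numeric sketch of this effect (the vectors are made up to exaggerate it): b points in exactly the same direction as a but is ten times longer, while c points in a different direction with a small norm.

import numpy as np

a = np.array([1.0, 1.0])    # a short document
b = np.array([10.0, 10.0])  # same direction as a, ten times the magnitude
c = np.array([1.0, 0.0])    # different direction, small magnitude

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b), np.linalg.norm(a - b))  # 1.0 and ~12.73
print(cosine(a, c), np.linalg.norm(a - c))  # ~0.71 and 1.0

By Euclidean distance, c is the nearer neighbour of a, even though b is just a scaled-up copy of a; cosine similarity ranks them the other way around.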

kizzx2
    "[with Euclidean distance] vectors with quite different directions would be clustered because their distances from origin are similar" -> How is this true? In the extreme case, consider two diametrically opposite vectors with the same magnitude: these will have a large Euclidean distance between them even though their distance from the origin is identical. – xenocyon Dec 18 '15 at 22:34
    @xenocyon Consider the case when their distances from the origin are small – kizzx2 Dec 20 '15 at 08:34
    If you have three documents with sentiments -1, 1, 100, which two are closer: the first two or the second two? I think it's only possible to answer when you know the specific problem you're working on. – BallpointBen Jul 20 '17 at 13:19
    I think the concern is rather that 2 vectors that *should* be clustered together (because they "point in the same direction") wouldn't be with Euclidean distance if their norms were significantly different, e.g. if one document contains many occurrences of the key tokens when the other does not. – wesholler Jun 15 '18 at 15:01

I'll answer the questions in reverse order. For your second question, cosine similarity and Euclidean distance are two different ways to measure vector similarity: the former measures only the angle between the two vectors (their direction, ignoring magnitude), while the latter measures the straight-line distance between the points the vectors define (so it depends on both direction and magnitude). You can use either in isolation, combine them and use both, or look at one of many other ways to determine similarity. See these slides from a Michael Collins lecture for more info.
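
The two measures are also directly related once vectors are L2-normalized: for unit vectors u and v, ||u - v||^2 = 2 * (1 - cos(u, v)), so ranking neighbours by Euclidean distance on normalized vectors gives the same order as ranking by cosine similarity. A quick numeric check (a sketch with random vectors):

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.random(100), rng.random(100)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # L2-normalize both

cos = np.dot(u, v)                    # cosine similarity of unit vectors
d2 = np.linalg.norm(u - v) ** 2       # squared Euclidean distance
print(np.isclose(d2, 2 * (1 - cos)))  # -> True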

Your first question isn't very clear, but you should be able to use either measure to find a distance between two vectors, regardless of whether you're comparing documents or your "models" (which would more traditionally be described as cluster centroids, the model being the set of all of them).

Tyson

Computation-time-wise (in Python):

import time
import numpy as np

# Time four ways of comparing two random 100-dimensional vectors,
# 10,000 pairs per measurement, repeated over 10 trials.
for trial in range(10):
    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print('Cosine similarity took', time.time() - start)

    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        # For unit-normalized vectors, squared Euclidean distance
        # equals 2 * (1 - cosine similarity).
        2 * (1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print('Euclidean from 2*(1 - cosine_similarity) took', time.time() - start)

    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.linalg.norm(a - b)
    print('Euclidean Distance using np.linalg.norm() took', time.time() - start)

    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.sqrt(np.sum((a - b)**2))
    print('Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took', time.time() - start)
    print('--------------------------------------------------------')

[out]:

Cosine similarity took 0.15826010704
Euclidean from 2*(1 - cosine_similarity) took 0.179041862488
Euclidean Distance using np.linalg.norm() took 0.10684299469
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.113723039627
--------------------------------------------------------
Cosine similarity took 0.161732912064
Euclidean from 2*(1 - cosine_similarity) took 0.178358793259
Euclidean Distance using np.linalg.norm() took 0.107393980026
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111194849014
--------------------------------------------------------
Cosine similarity took 0.16274189949
Euclidean from 2*(1 - cosine_similarity) took 0.178978919983
Euclidean Distance using np.linalg.norm() took 0.106336116791
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111373186111
--------------------------------------------------------
Cosine similarity took 0.161939144135
Euclidean from 2*(1 - cosine_similarity) took 0.177414178848
Euclidean Distance using np.linalg.norm() took 0.106301784515
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.11181807518
--------------------------------------------------------
Cosine similarity took 0.162333965302
Euclidean from 2*(1 - cosine_similarity) took 0.177582979202
Euclidean Distance using np.linalg.norm() took 0.105742931366
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111120939255
--------------------------------------------------------
Cosine similarity took 0.16153883934
Euclidean from 2*(1 - cosine_similarity) took 0.176836967468
Euclidean Distance using np.linalg.norm() took 0.106392860413
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110891103745
--------------------------------------------------------
Cosine similarity took 0.16018986702
Euclidean from 2*(1 - cosine_similarity) took 0.177738189697
Euclidean Distance using np.linalg.norm() took 0.105060100555
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110497951508
--------------------------------------------------------
Cosine similarity took 0.159607887268
Euclidean from 2*(1 - cosine_similarity) took 0.178565979004
Euclidean Distance using np.linalg.norm() took 0.106383085251
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.11084485054
--------------------------------------------------------
Cosine similarity took 0.161075115204
Euclidean from 2*(1 - cosine_similarity) took 0.177822828293
Euclidean Distance using np.linalg.norm() took 0.106630086899
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110257148743
--------------------------------------------------------
Cosine similarity took 0.161051988602
Euclidean from 2*(1 - cosine_similarity) took 0.181928873062
Euclidean Distance using np.linalg.norm() took 0.106360197067
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111301898956
--------------------------------------------------------
alvas
    So computation-time-wise, are you saying Euclidean is better? – aerin Jun 13 '17 at 19:02
    From these results, there seems to be no significant difference between the computation times. Hence, computation time cannot guide the choice of method. – Gathide Nov 27 '17 at 04:09

I suggest that the only sure way to determine which distance measure is better in a given application is to try both and see which one gives you more satisfactory results. I'd guess that in most cases the difference in effectiveness would not be great, but that might not be true in your particular application.
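
As a sketch of "try both": the following compares the two measures on the centroid classifier from the question, using synthetic data in place of real TF-IDF vectors (the data, names, and split are all made up for illustration).

import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for TF-IDF data: each class has a base direction,
# and each document is that direction plus noise, at a random magnitude.
def make_class(base, n=100):
    scales = rng.uniform(0.5, 5.0, size=(n, 1))
    return scales * (base + 0.2 * rng.random((n, base.size)))

base0, base1 = rng.random(50), rng.random(50)
X = np.vstack([make_class(base0), make_class(base1)])
y = np.array([0] * 100 + [1] * 100)

idx = rng.permutation(200)            # random train/test split
train, test = idx[:150], idx[150:]

# Class centroids (mean vector per class), as in the question.
centroids = np.vstack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

def predict(x, measure):
    if measure == 'cosine':           # largest cosine similarity wins
        scores = centroids @ x / (np.linalg.norm(centroids, axis=1)
                                  * np.linalg.norm(x))
        return int(np.argmax(scores))
    # otherwise: smallest Euclidean distance wins
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

for measure in ('cosine', 'euclidean'):
    acc = np.mean([predict(X[i], measure) == y[i] for i in test])
    print(measure, 'accuracy:', acc)

On data like this, where documents within a class share a direction but vary widely in length, the cosine variant usually scores higher; on your real data the outcome may differ, which is exactly why trying both is worthwhile.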