Yesterday I learnt that the cosine similarity, defined as

sim(A,B) = (A * B) / (||A||2 * ||B||2)

can effectively measure how similar two vectors are.

The definition here uses the L2-norm to normalize the dot product of A and B. What I am interested in is: why not use the L1-norm of A and B in the denominator?

My teacher told me that if I use the L1-norm in the denominator, then the cosine similarity would not be 1 when A=B. I then asked him: if I modify the cosine similarity definition as follows, what are the advantages and disadvantages of the modified model compared with the original one?

sim(A,B) = (A * B) / (||A||1 * ||B||1) if A!=B

sim(A,B) = 1 if A==B

I would appreciate it if someone could explain this in more detail.

Has QUIT--Anony-Mousse
John Smith

1 Answer

If you use the L1-norm, you are not computing the cosine anymore.

Cosine is a geometrical concept, not a random definition. There is a whole string of mathematics attached to it. If you use the L1-norm, you are not measuring angles anymore.

See also: Wikipedia: Trigonometric functions - Cosine

Note that on L2-normalized vectors, cosine similarity is a monotone function of Euclidean distance: the higher the cosine, the smaller the distance.

Euclidean(x,y)^2 = sum( (x-y)^2 ) = sum(x^2) + sum(y^2) - 2 sum(x*y)

If x and y are L2-normalized, then sum(x^2) = sum(y^2) = 1, and so

Euclidean(x_norm,y_norm)^2 = 2 * (1 - sum(x_norm*y_norm)) = 2 * (1 - cossim(x,y))

So using cosine similarity essentially means standardizing your data to unit length. There are also computational benefits associated with this: sum(x*y) only involves the coordinates where both vectors are non-zero, so it is cheaper to compute for sparse data.
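To illustrate the sparse-data remark, here is a minimal sketch. It assumes sparse vectors are stored as `{index: value}` dicts; the names `sparse_dot`, `sparse_sq_euclidean`, `doc_a` and `doc_b` are made up for this example. The dot product only touches indices present in both vectors, while a direct Euclidean computation has to visit every index present in either one.

```python
def sparse_dot(x, y):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the vector with fewer non-zeros
    # Only indices that are non-zero in BOTH vectors contribute.
    return sum(v * y[i] for i, v in x.items() if i in y)


def sparse_sq_euclidean(x, y):
    """Squared Euclidean distance computed directly.

    Has to visit every index that is non-zero in EITHER vector.
    """
    return sum((x.get(i, 0.0) - y.get(i, 0.0)) ** 2 for i in x.keys() | y.keys())


# Hypothetical sparse vectors (e.g. term counts of two short documents).
doc_a = {0: 1.0, 7: 2.0, 42: 1.0}
doc_b = {7: 3.0, 99: 1.0}

print(sparse_dot(doc_a, doc_b))           # 6.0 (only index 7 overlaps)
print(sparse_sq_euclidean(doc_a, doc_b))  # 4.0 (four indices visited)
```

The two results agree with the expansion above: sum(x^2) + sum(y^2) - 2 sum(x*y) = 6 + 10 - 12 = 4.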

Taking square roots: if your data is L2-normalized, then

Euclidean(x_norm, y_norm) = sqrt(2) * sqrt(1-cossim(x,y))
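The identity is easy to check numerically. The following is only a sketch in plain Python; the vectors `x` and `y` are arbitrary values chosen for illustration, and the helper names are mine.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def l2_normalize(v):
    n = l2_norm(v)
    return [a / n for a in v]


def cossim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (l2_norm(x) * l2_norm(y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Arbitrary example vectors (an assumption, not from the question).
x = [3.0, 1.0, 0.0, 2.0]
y = [1.0, 0.0, 4.0, 2.0]

lhs = euclidean(l2_normalize(x), l2_normalize(y))
rhs = math.sqrt(2) * math.sqrt(1 - cossim(x, y))

# Both sides agree up to floating-point rounding.
print(lhs, rhs)
```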

For the second part of your question: fixing the L1 norm isn't that easy. Consider the vectors (1,1) and (2,2). These two vectors point in the same direction (the angle between them is 0), and thus should have cosine similarity 1.

Using your equation, they would have similarity (2+2)/(2*4) = 0.5.

Now look at the vectors (0,1) and (0,2). Most people would agree that this pair should get the same similarity as the example above, and cosine indeed gives 1 for both. Your equation, however, yields (0+2)/(1*2) = 1. So two pairs of parallel vectors get different scores, 0.5 and 1, depending only on how many coordinates are non-zero. Your similarity does not match any intuition, does it?
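To make the comparison concrete, here is a small sketch that evaluates the formula exactly as written in the question (dot product divided by the product of the L1 norms, with the special case for A == B) next to ordinary cosine similarity. The function names `l1_sim` and `cossim` are mine.

```python
import math


def l1_sim(a, b):
    """Modified similarity from the question: L1 norms in the denominator."""
    if a == b:
        return 1.0  # special case added in the question
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (sum(abs(p) for p in a) * sum(abs(q) for q in b))


def cossim(a, b):
    """Ordinary cosine similarity: L2 norms in the denominator."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


# Both pairs are parallel, so the angle between them is 0 in each case.
pairs = [((1, 1), (2, 2)), ((0, 1), (0, 2))]
for a, b in pairs:
    print(a, b, "l1_sim =", l1_sim(a, b), "cossim =", cossim(a, b))

# l1_sim gives 0.5 for the first pair and 1.0 for the second;
# cossim gives 1 (up to rounding) for both.
```

The modified score depends on how the mass of each vector is spread over its coordinates, not only on the direction, which is why the two parallel pairs are scored differently.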

Yuan JI
Has QUIT--Anony-Mousse
  • Thanks for your interpretation. My reason for thinking the L1 norm could be used comes from the following example: Doc1 has (I, love, you) and Doc2 has (you). The word (you) is the only word shared between Doc1 and Doc2. The probability of Doc1 choosing (you) is 1/3, so the similarity between Doc1 and Doc2 seems to be 1/3 to me, but with cosine similarity it would be 1/sqrt(3). Could you tell me why 1/sqrt(3) is better than 1/3 in my example? Thanks. – John Smith Aug 22 '14 at 16:24
  • Well, that is how the angle between (0,0,1) and (1,1,1) is defined... if you want a different distance metric - for example Jaccard - that is fine; but it's not an angle anymore, but e.g. **set intersection size**; or if you are interested in probabilistic distances ("probability of Doc1 choosing ...") then look at divergence measures and \chi^2 distance. These exist, but they have a different intuition and name. – Has QUIT--Anony-Mousse Aug 25 '14 at 09:10
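
As a rough illustration of the Doc1/Doc2 example from the comments, the sketch below treats each document as a set of words and compares the Jaccard coefficient with cosine similarity on the corresponding 0/1 vectors. The helper names are mine, and the set-based cosine formula is valid only under the assumption of binary (presence/absence) vectors.

```python
import math

doc1 = {"I", "love", "you"}
doc2 = {"you"}


def jaccard(a, b):
    """Set intersection size divided by set union size."""
    return len(a & b) / len(a | b)


def binary_cosine(a, b):
    """Cosine similarity of the 0/1 indicator vectors of two sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


print(jaccard(doc1, doc2))        # 1/3
print(binary_cosine(doc1, doc2))  # 1/sqrt(3)
```

For this example the Jaccard coefficient is 1/3, the value the asker expected, while cosine gives 1/sqrt(3). They are different measures with different interpretations, not a better and a worse version of the same one.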