8

I have 2 vectors with 11 dimensions.

a <- c(-0.012813841, -0.024518383, -0.002765056,  0.079496744,  0.063928973,
        0.476156960,  0.122111977,  0.322930189,  0.400701256,  0.454048860,
        0.525526219)

b <- c(0.64175768,  0.54625694,  0.40728261,  0.24819750,  0.09406221, 
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

library(lsa)  # provides cosine()
cosine_sim <- cosine(a, b)

which returns:

-0.05397935

I used cosine() from the lsa package.

For some values I am getting a negative cosine_sim, like the one above. I am not sure how the similarity can be negative; I expected it to be between 0 and 1.

Can anyone explain what is going on here?

smci
Robin
  • Take a look at the Wikipedia article on cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity). It clearly states that the values lie between -1 and 1, with -1 indicating complete dissimilarity and 1 indicating complete similarity. – Ramnath Jul 06 '11 at 14:38
  • The clue is in the name. The trigonometric cosine function can take values from -1 to 1, so you would expect this one to as well. – Richie Cotton Jul 06 '11 at 15:15
  • Same question on CrossValidated: [Is it ok to get negative Cosine Similarity using LSA?](http://stats.stackexchange.com/questions/145663/is-it-ok-to-get-negative-cosine-similarity-using-lsa) – smci Mar 30 '17 at 20:52

4 Answers

14

The nice thing about R is that you can often dig into the functions and see for yourself what is going on. If you type cosine (without any parentheses, arguments, etc.) then R prints out the body of the function. Poking through it (which takes some practice), you can see that there is a bunch of machinery for computing the pairwise similarities of the columns of a matrix (i.e., the bit wrapped in the if (is.matrix(x) && is.null(y)) condition), but the key line of the function is

crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))

Let's pull this out and apply it to your example:

> crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
            [,1]
[1,] -0.05397935
> crossprod(a)
     [,1]
[1,]    1
> crossprod(b)
     [,1]
[1,]    1

So your vectors are already normalized (each has unit length), which means the denominator is 1 and only crossprod(a, b) matters. In your case this is equivalent to

> sum(a*b)
[1] -0.05397935

(for real matrix operations, crossprod is much more efficient than constructing the equivalent operation by hand).
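
To check the result without the lsa package, the same quantity can be computed in base R. This is a minimal sketch: cosine_by_hand is a hypothetical helper name introduced here for illustration, not a function from lsa.

```r
# Hypothetical helper (not part of lsa): cosine of the angle
# between two numeric vectors of equal length.
cosine_by_hand <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Vectors from the question
a <- c(-0.012813841, -0.024518383, -0.002765056,  0.079496744,  0.063928973,
        0.476156960,  0.122111977,  0.322930189,  0.400701256,  0.454048860,
        0.525526219)
b <- c(0.64175768,  0.54625694,  0.40728261,  0.24819750,  0.09406221,
       0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
       -0.07215885)

cos_ab   <- cosine_by_hand(a, b)   # about -0.054, matching cosine(a, b) above
cos_same <- cosine_by_hand(a, a)   # 1: identical direction
cos_opp  <- cosine_by_hand(a, -a)  # -1: opposite direction
```

The last two calls show the two ends of the range: the measure only reaches its extremes when the vectors point in exactly the same or exactly opposite directions.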

As @Jack Maney's answer says, the dot product of two vectors (which is |a| * |b| * cos(a,b), where |a| is the Euclidean length of a, not R's length(a)) can be negative ...
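
Written out, the relationship between the dot product and the angle looks like this (here \theta is the angle between the two vectors and \lVert \cdot \rVert is the Euclidean norm; the symbols are notation chosen for this note, not taken from the lsa documentation):

```latex
\[
  a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta
  \quad\Longrightarrow\quad
  \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \in [-1, 1]
\]
```

The norms are never negative, so the sign of the cosine is the sign of the dot product: it is negative whenever the angle between the vectors is larger than 90 degrees.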

For what it's worth, I suspect that the cosine function in lsa might be more easily/efficiently implemented for matrix arguments as as.dist(crossprod(x)) ...
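
As a rough illustration of that idea (a sketch under my own assumptions, not a description of how lsa actually implements it): if the columns of a matrix are first scaled to unit length, crossprod gives all pairwise column cosines in one call. The matrix m below is made-up example data.

```r
# Made-up 3 x 3 example matrix; each column is one vector.
m <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 0))

# Scale every column to unit Euclidean length.
m_unit <- sweep(m, 2, sqrt(colSums(m^2)), "/")

# Entry [i, j] is the cosine between column i and column j.
sim <- crossprod(m_unit)

# Compact lower-triangle view of the pairwise values.
sim_lower <- as.dist(sim)
```

Every pair of columns in this example shares exactly one nonzero position and each column has norm sqrt(2), so all off-diagonal entries are 0.5 and the diagonal is 1.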

edit: in comments on a now-deleted answer below, I suggested that the square of the cosine-distance measure might be appropriate if one wants a similarity measure on [0,1] -- this would be analogous to using the coefficient of determination (r^2) rather than the correlation coefficient (r) -- but that it might also be worth going back and thinking more carefully about the purpose/meaning of the similarity measures to be used ...
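
A small sketch of the squared-cosine idea, using made-up two-dimensional vectors x and y. Note what is given up: squaring discards the sign, so vectors pointing in nearly opposite directions score as highly as vectors pointing in nearly the same direction.

```r
# Made-up example vectors with an angle of 135 degrees between them.
x <- c(1, 0)
y <- c(-1, 1)

cos_xy <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))  # -1/sqrt(2), about -0.707
cos_sq <- cos_xy^2                                # 0.5, always within [0, 1]

# Flipping one vector changes the sign of the cosine but not its square.
cos_flipped    <- sum(x * -y) / sqrt(sum(x^2) * sum(y^2))
cos_sq_flipped <- cos_flipped^2
```

Whether losing the sign is acceptable depends on what the similarity is used for, which is the point of the caveat above.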

Ben Bolker
2

The cosine function returns

crossprod(a, b)/sqrt(crossprod(a) * crossprod(b))

In this case, both the terms in the denominator are 1, but crossprod(a, b) is -0.05.

Richie Cotton
1

The cosine function can take on negative values.

0

While the cosine of two vectors can take any value between -1 and +1, cosine similarity in document retrieval usually takes values in the [0,1] interval. The reason is simple: in the Word x Document matrix there are no negative values, so the maximum angle between two vectors is 90 degrees, for which the cosine is 0.
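
A quick sketch of that bound with made-up term-count vectors (doc1, doc2 and doc3 are hypothetical documents, each counted over the same four terms). With no negative entries, every product in the dot product is non-negative, so the cosine cannot drop below 0.

```r
# Made-up term counts for three documents over the same four terms.
doc1 <- c(2, 0, 1, 3)
doc2 <- c(0, 4, 1, 0)
doc3 <- c(0, 5, 0, 0)

cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

cos_12 <- cos_sim(doc1, doc2)  # 1 / sqrt(238), about 0.065: one shared term
cos_13 <- cos_sim(doc1, doc3)  # 0: no shared terms, vectors are orthogonal
```

The vectors in the question contain negative entries, so this lower bound of 0 does not apply to them.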

Surjan