I have three text documents represented as vectors in a vector space model (VSM), where each entry is the tf-idf weight of a term. The vectors have different lengths.

Q1: How does k-means use cosine similarity, and how are the clusters then constructed?

Q2: My TF-IDF calculation produces negative values. Is there a problem with my calculation?

Please use the following document vectors in VSM (tf-idf), which all have different lengths, for explanation purposes.

Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)

I would be grateful to anyone who can explain this.

  • I'm voting to close this question as off-topic because this question appears rooted in mathematics rather than programming. This question *might* be on topic on some other math related SE sites such as MathOverflow or [Mathematics](http://math.stackexchange.com/help/on-topic), though do your own research for topicality before posting there. – HPierce Feb 07 '17 at 17:55

1 Answer

Cosine similarity means you take the dot product of each vector with the k-means centre, rather than the Euclidean distance between them.

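The dot product is `a.x*b.x + a.y*b.y + ... + a.zz*b.zz`, summed over all the dimensions. You generally normalize the vectors first, so that the dot product is the cosine of the angle between them. Then call acos() on the result to get the angle itself.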
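As a minimal sketch of that calculation in plain Python (the function names `dot`, `cosine_similarity` and `angle_between` are my own, not from any library), assuming both vectors have the same number of dimensions:

```python
import math

def dot(a, b):
    # Sum of the products of matching components.
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dividing by both lengths is the same as normalizing the vectors first.
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

def angle_between(a, b):
    # Clamp to [-1, 1] so rounding error cannot push acos() out of its domain.
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(c)
```

Note that the dot product only makes sense when both vectors index the same terms in the same order, which is why the sketch rejects vectors of different lengths.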

Essentially you're dividing the results into angular sectors around the origin rather than into randomly-clumped clusters.
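To make the "sectors" idea concrete, here is a rough sketch of k-means with cosine similarity, a variant sometimes called spherical k-means. It is an illustration under my own assumptions, not a reference implementation: the names `normalize`, `assign`, `update` and `cosine_kmeans` are hypothetical, all vectors are assumed to be non-zero and of equal dimension, and the starting centres are supplied by the caller.

```python
import math

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]

def assign(points, centres):
    # For unit vectors the dot product is the cosine similarity,
    # so each point goes to the centre with the largest dot product.
    labels = []
    for p in points:
        sims = [sum(x * y for x, y in zip(p, c)) for c in centres]
        labels.append(sims.index(max(sims)))
    return labels

def update(points, labels, centres):
    # New centre = mean direction of the cluster's members, re-normalized.
    new_centres = []
    for k, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centres.append(old)  # keep an empty cluster's centre
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        if all(x == 0 for x in mean):
            new_centres.append(old)  # members cancelled out exactly
        else:
            new_centres.append(normalize(mean))
    return new_centres

def cosine_kmeans(points, centres, iterations=10):
    points = [normalize(p) for p in points]
    centres = [normalize(c) for c in centres]
    labels = assign(points, centres)
    for _ in range(iterations):
        centres = update(points, labels, centres)
        new_labels = assign(points, centres)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centres
```

Because every vector is scaled to unit length, only its direction matters, so each cluster ends up covering a range of directions around its centre.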

Malcolm McLean