-1

How do I calculate the distance between two documents? In k-means for numeric data you have to calculate the distance between two points. I know that I can use the cosine function. I want to cluster RSS documents. I have done stemming and removed the stop words from the documents, and I have counted the frequency of each word in each document. Now I want to implement the k-means algorithm.

Felix Kling
Mihai

3 Answers

1

There are various distance functions. One is the Euclidean distance.
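
Since the question mentions the cosine function, here is a minimal sketch of a cosine-based distance as one alternative to Euclidean distance. The name `cosine_distance` is hypothetical, and treating an all-zero vector as maximally distant is an assumption made only to avoid division by zero.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty document is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Unlike Euclidean distance, this ignores vector length, so a short document and a long one with the same word proportions come out close together.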

Felix Kling
1

I'm assuming that your difficulty is in creating the feature vector? Create a feature vector for each document by

  1. Collecting all distinct words from all documents to form one giant vocabulary vector
  2. Setting each element of a document's vector to the count of the corresponding term in that document.

For example, if you have

Document 1 = the quick brown fox jumped over the brown dog
Document 2 = the brown cows eat hippo meat

Then the total set of words is [the,quick,brown,fox,jumped,over,dog,cows,eat,hippo,meat] and the document vectors are

Document 1 = [2,1,2,1,1,1,1,0,0,0,0]
Document 2 = [1,0,1,0,0,0,0,1,1,1,1]
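
A rough sketch of how such count vectors could be built, with each distinct word listed once in order of first appearance. The helper names `build_vocabulary` and `count_vector` are hypothetical, and splitting on whitespace assumes the text has already been stemmed and had its stop words removed.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return each distinct word once, in order of first appearance."""
    vocabulary = []
    seen = set()
    for text in documents:
        for word in text.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

documents = [
    "the quick brown fox jumped over the brown dog",
    "the brown cows eat hippo meat",
]
vocabulary = build_vocabulary(documents)
vectors = [count_vector(text, vocabulary) for text in documents]
```

Every vector has the same length as the vocabulary, so any two documents can be compared element by element.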

Now you have one feature vector per document, all of the same length, and you can run k-means clustering on them. As others have said, Euclidean distance can be used to calculate the distance between documents.
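
A minimal k-means sketch over such vectors, assuming Python 3.8+ for `math.dist` (Euclidean distance). The names `kmeans`, `mean_vector`, and `max_iterations` are hypothetical, and seeding the centroids with the first k vectors is a naive assumption chosen to keep the example deterministic. Each pass compares every document only to the k centroids, not to every other document.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, max_iterations=100):
    # Assumption: use the first k vectors as the initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each document goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids
```

For example, `kmeans([[0, 0], [10, 10], [0, 1], [10, 11]], 2)` groups the first and third vectors together and the second and fourth together.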

Jeff Foster
  • How do you run these document vectors through k-means? Do you have to iteratively compute the distance between each document and each other document? – alex.pilon Mar 07 '13 at 22:50
0

You can use the Euclidean distance formula for an n-dimensional space.

sqrt((x1-x2)^2 + (y1-y2)^2 + (z1 - z2)^2 ... )
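
The formula above could be written out for vectors of any length roughly as follows. The name `euclidean_distance` is hypothetical, and the length check is an added assumption that both documents were vectorized over the same vocabulary.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences, one term per dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

On Python 3.8 or later, the standard library's `math.dist(a, b)` computes the same value.
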
mrK