How do I calculate the distance between two documents? In k-means for numeric data, you have to calculate the distance between two points. I know that I can use the cosine function. I want to cluster RSS documents. I have done stemming and removed the stop words from the documents, and I have counted the frequency of each word in each document. Now I want to implement the k-means algorithm.
3 Answers
1
There are various distance functions. One is the Euclidean distance.
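As an illustration of an alternative to Euclidean distance, here is a minimal sketch of the cosine distance that the question mentions. The function name `cosine_distance` and the handling of all-zero vectors are assumptions made for this example, not part of any particular library.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a document with no terms is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Because it depends only on the angle between the vectors, this distance treats a short document and a long document with the same word proportions as identical.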

Felix Kling
1
I'm assuming that your difficulty is in creating the feature vector? Create a feature vector for each document by
- Collecting all the distinct words from all documents to form one giant vocabulary vector
- Setting each element of a document's vector to the count of the corresponding term in that document.
For example, if you have
Document 1 = the quick brown fox jumped over the brown dog
Document 2 = the brown cows eat hippo meat
Then the total set of distinct words is [the,quick,brown,fox,jumped,over,dog,cows,eat,hippo,meat] and the document vectors are
Document 1 = [2,1,2,1,1,1,1,0,0,0,0]
Document 2 = [1,0,1,0,0,0,0,1,1,1,1]
And now you just have two giant feature vectors that represent the documents, and you can run k-means clustering on them. As others have said, Euclidean distance can be used to calculate the distance between documents.
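The construction described above can be sketched in a few lines of Python. This is only an illustration: the helper name `build_vectors` is hypothetical, and splitting on whitespace stands in for whatever stemming and stop-word removal you have already done.

```python
from collections import Counter

def build_vectors(documents):
    """Turn whitespace-tokenised documents into term-count vectors."""
    tokenised = [doc.split() for doc in documents]
    # Vocabulary: each distinct word once, in order of first appearance.
    vocabulary = list(dict.fromkeys(w for tokens in tokenised for w in tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors

docs = [
    "the quick brown fox jumped over the brown dog",
    "the brown cows eat hippo meat",
]
vocabulary, vectors = build_vectors(docs)
```

Every vector has one position per distinct word, so all documents end up with vectors of the same length and can be compared directly.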

Jeff Foster
How do you run these document vectors through k-means? Do you have to iteratively compute the distance between each document and each other document? – alex.pilon Mar 07 '13 at 22:50
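On the question in this comment: in standard k-means you do not compare every document with every other document. Each iteration compares each document only with the k current centroids and then recomputes each centroid as the mean of its members. A rough sketch follows, assuming Euclidean distance, a fixed number of iterations, and initial centroids picked at random from the documents. The names `k_means`, `iterations` and `seed` are made up for this example.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(vectors, k, iterations=10, seed=0):
    """Return (assignments, centroids) after a fixed number of iterations."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct documents chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

A fuller implementation would stop as soon as the assignments no longer change instead of always running a fixed number of iterations.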
0
You can use the Euclidean distance formula for an n-dimensional system.
sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 + ...)
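For vectors of any length, that formula might be written as follows. This is a minimal sketch: the function name `euclidean_distance` is just for illustration, and it assumes both vectors were built over the same vocabulary.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences, coordinate by coordinate."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

On Python 3.8 or later, the standard library's `math.dist(a, b)` computes the same quantity.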

mrK