I have a set of documents that I am searching for a keyword. I have calculated the tf-idf value of the keyword for each document. Suppose I store these tf-idf values in an array, one entry per document: how do I use them to calculate the cosine similarity? Any help with the code would be appreciated!
I'll surely work. Any help with this? – Aravind Chinta Apr 23 '12 at 12:41
1 Answer
You can view the array as a collection of vectors, one per document, each with as many elements as there are terms. To determine the similarity of two documents, calculate the scalar product of the corresponding vectors in the usual manner (the sum of the products of corresponding components) and divide it by the product of the norms of the two vectors.
It is practical to normalize the vectors to unit length before calculating the similarities. In that case, the similarity is simply the scalar product of the document vectors, since both norms are one.
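As a rough illustration of the two approaches above, here is a minimal sketch in plain Python. It assumes each document is already a list of tf-idf weights, with the same term order in every list; the function names and the sample vectors are made up for the example.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    """Scalar product of u and v divided by the product of their norms."""
    denom = norm(u) * norm(v)
    if denom == 0.0:
        # A vector with no non-zero weights has no direction; treat as dissimilar.
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / denom


def normalize(v):
    """Return v scaled to unit length (or an unchanged copy if v is all zeros)."""
    n = norm(v)
    return [x / n for x in v] if n else list(v)


# Hypothetical tf-idf vectors over the same three terms.
doc_a = [0.0, 1.2, 0.5]
doc_b = [0.3, 0.8, 0.0]

# Direct computation.
sim = cosine_similarity(doc_a, doc_b)

# Same value from pre-normalized vectors: only the scalar product is needed.
unit_a, unit_b = normalize(doc_a), normalize(doc_b)
sim_from_units = sum(a * b for a, b in zip(unit_a, unit_b))
```

Normalizing once up front pays off when every document is compared against many others, because the norms are not recomputed for each pair.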

Michael J. Barber
Do I have to calculate the tf*idf for all the terms in the documents? I am just calculating the tf*idf value for my keyword and the document. – Aravind Chinta Apr 23 '12 at 12:41
You can calculate the score for whatever vectors you like. If you want to compare to a keyword, you can view it as a fictional document containing just a single term. – Michael J. Barber Apr 23 '12 at 12:59
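To make the "fictional document" idea concrete, here is a small sketch in plain Python. It assumes the keyword is one of the terms indexed in the document vector; `keyword_similarity` and `term_index` are hypothetical names. Because the keyword vector has a single non-zero component, its own weight cancels out, and the cosine similarity reduces to the document's weight for that term divided by the document's norm.

```python
import math


def keyword_similarity(doc_vector, term_index):
    """Cosine similarity between a document and a one-term query.

    The query is treated as a vector that is zero everywhere except at
    term_index, so the scalar product picks out a single document weight
    and the query's norm cancels.
    """
    doc_norm = math.sqrt(sum(w * w for w in doc_vector))
    if doc_norm == 0.0:
        return 0.0
    return doc_vector[term_index] / doc_norm


# Hypothetical tf-idf weights for one document over two terms;
# the keyword is assumed to be the term at index 0.
score = keyword_similarity([3.0, 4.0], 0)
```

Note that the document's norm still depends on the weights of all its terms, so the single tf-idf value for the keyword is not enough on its own to get a cosine score.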