I'm writing a Python class that calculates the TF-IDF weight of each word in a document. My dataset has 50 documents. Many words appear in more than one document, so the same word shows up as a feature multiple times, each time with a different TF-IDF weight. How do I combine all of those weights into a single weight?
Are you asking for the formula or the method? – Drewness Mar 03 '14 at 23:05
The method: how to sum the weights of multiple features for the same word into one. – gncvnvcnc Mar 03 '14 at 23:06
1 Answer
First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is a term at a particular location in a particular document. There can be multiple tokens that use the same term. For example, in my answer, there are many tokens that use the term "the". But there is only one term for "the".
I think you are a little bit confused. TF-IDF-style weighting functions specify how to compute a per-term score for each term in a document from two quantities: the term's token frequency in that document and the term's document frequency in the corpus (the number of documents that contain it). TF-IDF converts a document into a mapping of terms to weights. So more tokens sharing the same term in a document will increase the corresponding weight for the term, but there will only be one weight per term. There is no separate score for tokens sharing a term inside the doc.
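To make that concrete, here is a minimal sketch in plain Python. It assumes one common variant (raw token count for tf, `log(N / df)` for idf); your class may use a different formula. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document

    weights = []
    for tokens in docs:
        # Counter collapses repeated tokens into a single count per term.
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a dog barked".split(),
]
weights = tfidf(docs)
```

In the first document "the" occurs as two tokens, yet `weights[0]` holds a single entry for it: the two tokens raise the tf count rather than producing two separate scores.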

Yep, I forgot that df meant the documents in the whole corpus not the one that the word is in. Thanks – gncvnvcnc Mar 03 '14 at 23:13