(Text Classification) Handling same words but from different documents [TFIDF]

Question

I'm writing a Python class that calculates the TF-IDF weight of each word in a document. My dataset has 50 documents, and many words appear in more than one of them, so the same word shows up as a feature multiple times, each time with a different TF-IDF weight. How do I combine all of those weights into one single weight?

The method to sum up the weights of multiple identical word features into one — gncvnvcnc, Mar 03 '14 at 23:06

Rob Neuhaus · Accepted Answer · 2014-03-04T13:36:36.097

First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is a term at a particular location in a particular document. There can be multiple tokens that use the same term. For example, in my answer, there are many tokens that use the term "the". But there is only one term for "the".
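To make the distinction concrete, here is a minimal sketch in Python. It assumes plain whitespace tokenization, and the sample sentence is made up for illustration; a real tokenizer would also handle punctuation and case.

```python
from collections import Counter

# Hypothetical sample text, split on whitespace for simplicity.
text = "the cat sat on the mat near the door"

tokens = text.split()          # every occurrence, in order: these are tokens
term_counts = Counter(tokens)  # one entry per distinct term

print(len(tokens))             # how many tokens the text contains
print(len(term_counts))        # how many distinct terms it contains
print(term_counts["the"])      # how many tokens share the term "the"
```

The `Counter` has exactly one key for "the", no matter how many tokens use it.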

I think you are a little bit confused. TF-IDF style weighting functions specify how to compute a per-term score from two quantities: the frequency of the term's tokens in a document (TF), and the term's background document frequency in the corpus (DF), that is, the number of documents in the corpus that contain the term. This is done for each term in a document, so TF-IDF converts a document into a mapping of terms to weights. More tokens sharing the same term in a document will increase the corresponding weight for that term, but there will only be one weight per term per document. There is no separate score for each token sharing a term inside the document, so there is nothing to sum up.
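A rough sketch of that mapping in Python follows. It assumes one common variant of the weighting (raw term count multiplied by log(N / df)) and whitespace tokenization; the function name `tfidf` and the three sample documents are hypothetical, and other TF-IDF variants use different smoothing or normalization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} mapping per document.

    Assumed variant: weight = tf * log(N / df), where tf is the raw count
    of the term in the document, N is the number of documents, and df is
    the number of documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term at most once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # token counts collapse into one entry per term
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights

# Hypothetical three-document corpus.
docs = ["the cat sat on the mat", "the dog chased the cat", "a bird sang"]
weights = tfidf(docs)

for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

Each document gets its own mapping with a single entry per term. The same term can have different weights in different documents because its count differs from document to document, while its document frequency is shared across the whole corpus. Those per-document weights are kept as separate features of separate documents rather than being merged.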

Yep, I forgot that df meant the documents in the whole corpus not the one that the word is in. Thanks — gncvnvcnc, Mar 03 '14 at 23:13
