I am working on text classification. Is there any situation in which TF-IDF performs worse than plain term-frequency vectors? How can that be explained? Thanks.
1 Answer
Both metrics ... discriminate along two dimensions: informativeness (IDF) and aboutness (TF).
Documents that contain hundreds of occurrences of some high-IDF term are going to result in poor, noisy matches ... for example, spam documents.
A good read: Beyond Bags of Words (Donald A. Metzler Jr., 2007)
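To make the point concrete, here is a minimal sketch in plain Python, not a reference implementation. The toy corpus, the function names (`tf_vectors`, `idf_weights`, `tfidf_vectors`) and the choice of a smoothed IDF, log((1 + N) / (1 + df)) + 1, are all assumptions made for illustration. It shows that a term repeated hundreds of times in a single document already dominates the TF vector, and that IDF amplifies it further because the term is rare in the collection.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Raw term counts, one Counter per tokenized document."""
    return [Counter(doc) for doc in docs]


def idf_weights(docs):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant, assumed here)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vectors(docs):
    """Multiply each raw count by the IDF weight of its term."""
    idf_map = idf_weights(docs)
    return [{term: count * idf_map[term] for term, count in vec.items()}
            for vec in tf_vectors(docs)]


# Hypothetical toy corpus; the third document imitates keyword-stuffed spam.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    ["the"] + ["pills"] * 200,
]

tf = tf_vectors(docs)
idf = idf_weights(docs)
tfidf = tfidf_vectors(docs)

# "the" occurs in every document, so it gets the lowest IDF;
# "pills" occurs in only one document, so it gets the highest.
print("idf:", {t: round(idf[t], 3) for t in ("the", "cat", "pills")})

# The repeated rare term carries an even larger weight after IDF scaling.
print("tf    weight of 'pills':", tf[2]["pills"])
print("tfidf weight of 'pills':", round(tfidf[2]["pills"], 1))
```

If this kind of document is common in your data, sublinear TF scaling (for example 1 + log(tf)) or length normalization are the usual ways to dampen the effect; whether TF or TF-IDF classifies better still has to be checked on your own data.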

Ion Cojocaru
- Sorry, I don't understand what it means to discriminate along informativeness (IDF) and aboutness (TF). How would you explain it? Thanks – Meng Zhang Apr 04 '13 at 15:46
- If the frequency of a term in a document is very high, one can say that the document is, to a certain degree, about that term (TF). Common terms that occur in many documents (e.g. "the", "this", ...) are considered noise: they bring little or no new information to the document (IDF). Take some time to read the linked article; it will give you a better view of the matter. In most cases the combination TF-IDF is better than TF alone. Both are term weighting schemes that can be applied to term vectors. Cheers – Ion Cojocaru Apr 05 '13 at 12:45
- @IonCojocaru I have the opposite question: is there any case in which IDF alone is better than TF-IDF? As far as I understand, TF gives a word a weight within a document so that the document can be matched against a predefined query. If I just want to rank the importance of words in a collection of documents, without any specific IR purpose, why should I use the TF term? – gabboshow Mar 05 '15 at 15:43