3

I am doing text classification. Is there any situation where TF-IDF performs worse than plain term-frequency vectors? If so, how can it be explained? Thanks.

smci
Meng Zhang

1 Answer

0

Both metrics ... discriminate along two dimensions: informativeness (IDF) and aboutness (TF).

Documents that contain hundreds of occurrences of some high-IDF term are going to result in poor, noisy matches ... e.g. spam documents.

A good read: Beyond Bags of Words (Donald A. Metzler Jr., 2007)
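To make the two dimensions concrete, here is a minimal sketch on a toy corpus. It assumes the plain textbook definitions (raw counts for TF, `log(N / df)` for IDF); real libraries often use smoothed or normalized variants, and the function names and documents below are made up for illustration.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Raw term counts per document ("aboutness")."""
    return [Counter(doc.lower().split()) for doc in docs]


def idf_weights(docs):
    """Unsmoothed IDF, log(N / df), per term ("informativeness")."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document it appears in.
        df.update(set(doc.lower().split()))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vectors(docs):
    """Multiply each raw count by the term's IDF."""
    idf = idf_weights(docs)
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tf_vectors(docs)]


# Hypothetical toy corpus; the third document imitates keyword stuffing.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the offer offer offer offer is free",
]

tf = tf_vectors(docs)
tfidf = tfidf_vectors(docs)

for i, doc in enumerate(docs):
    print(doc)
    print("  TF    :", dict(tf[i]))
    print("  TF-IDF:", {t: round(w, 2) for t, w in tfidf[i].items()})
```

In this sketch "the" has the highest raw count in the first document but gets a TF-IDF weight of zero because it occurs in every document. In the third document the repeated, rare term "offer" ends up with by far the largest weight, which is the noisy-match effect described above. One common way to dampen it is sublinear TF scaling such as `1 + log(tf)`.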

Community
Ion Cojocaru
  • Sorry, I don't understand "discriminate along informativeness (IDF) and aboutness (TF)". How would you explain it? Thanks – Meng Zhang Apr 04 '13 at 15:46
  • 1
    If the frequency of a term in a document is very high, one can say that the document is, to a certain degree, about that term (TF). Common terms that occur in many documents (e.g. "the", "this") are considered noise; they bring little or no new information to the document (IDF). Take some time to read the linked article; it will give you a better view of the matter. In most cases the TF-IDF combination is better than TF alone. Both are term-weighting schemes that can be applied to term vectors. Cheers – Ion Cojocaru Apr 05 '13 at 12:45
  • 1
    @IonCojocaru I have the opposite question: is there any case where IDF alone is better than TF-IDF? As far as I understand, TF is important for weighting a word within a document so that the document can be matched against a predefined query. If I just want to rank the importance of words in a collection of documents, without any specific IR purpose, why should I use the TF term? – gabboshow Mar 05 '15 at 15:43