Is there any situation where TF-IDF is worse than using term-frequency vectors?

Question

I am doing text classification. Is there any situation where TF-IDF performs worse than plain term-frequency vectors? If so, how can that be explained? Thanks

score 0 · Answer 1 · edited Jun 20 '20 at 09:12


Both metrics ...discriminate along two dimensions – informativeness (IDF) and aboutness (TF)

Documents that contain hundreds of occurrences of some high IDF term are going to result in poor, noisy matches ... for example, spam documents.
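To make the spam case concrete, here is a minimal sketch in plain Python. The toy corpus, the term "lottery", and the helper names `idf` and `tfidf` are all made up for illustration, and it assumes the simplest raw-count TF with unsmoothed IDF, which is not necessarily the variant any particular library uses.

```python
import math

# Hypothetical toy corpus: the third document repeats one rare term 200 times.
docs = [
    "cheap flights to paris and rome".split(),
    "the weather in paris is mild".split(),
    ("lottery " * 200).split() + "buy now".split(),
]

def idf(term, corpus):
    # Unsmoothed IDF: log(N / df). Assumes the term occurs in at least one document.
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    # Raw term count times IDF, with no length normalization and no TF damping.
    return doc.count(term) * idf(term, corpus)

spam_score = tfidf("lottery", docs[2], docs)    # rare term, repeated 200 times
normal_score = tfidf("flights", docs[0], docs)  # equally rare term, used once

print(spam_score / normal_score)
```

Both terms occur in exactly one document, so they share the same IDF. The whole gap between the two scores comes from the raw count, which is why a document stuffed with a rare term can dominate matches under this weighting.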

A good read: Beyond Bags of Words (Donald A. Metzler Jr., 2007)


answered Apr 04 '13 at 12:23

Ion Cojocaru


Sorry, I don't understand "discriminate along informativeness (IDF) and aboutness (TF)". Could you explain it? Thanks – Meng Zhang Apr 04 '13 at 15:46

If the frequency of a term is very high in a document, one can say that the document is about that term to a certain degree (TF). Common terms that appear in a lot of documents are considered noise (e.g. the, this, ...); they bring little or no new information to the document (IDF). Take some time to read the linked article; it will give you a better view on the matter. In most cases the combination TF-IDF is better than TF alone. Both are term weighting schemes that can be applied to term vectors. Cheers – Ion Cojocaru Apr 05 '13 at 12:45
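As a rough illustration of the two weights described in the comment above, the sketch below computes TF, IDF, and their product on a tiny made-up corpus. The documents and the helper names `tf` and `idf` are hypothetical, and it assumes length-normalized TF with unsmoothed IDF = log(N / df).

```python
import math
from collections import Counter

# Hypothetical three-document corpus; "the" appears in every document.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
]

def tf(doc):
    # Term frequency normalized by document length ("aboutness").
    counts = Counter(doc)
    return {t: c / len(doc) for t, c in counts.items()}

def idf(corpus):
    # Unsmoothed inverse document frequency ("informativeness").
    n = len(corpus)
    vocab = {t for d in corpus for t in d}
    return {t: math.log(n / sum(1 for d in corpus if t in d)) for t in vocab}

idf_w = idf(docs)
tf_w = tf(docs[0])
tfidf_w = {t: tf_w[t] * idf_w[t] for t in tf_w}

# Rank the terms of the first document under each weighting.
print(sorted(tf_w, key=tf_w.get, reverse=True))
print(sorted(tfidf_w, key=tfidf_w.get, reverse=True))
```

Under TF alone, "the" is the top term of the first document. Its IDF is zero because it occurs in every document, so under TF-IDF it drops to the bottom and the rarer terms move up.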

@IonCojocaru I have the opposite question... is there any case where IDF alone is better than TF-IDF? As far as I understand, TF gives a weight to a word within a document so that the document can be matched against a predefined query. If I just want to rank the importance of words in a collection of documents, without any specific IR purpose, why should I use the TF term? – gabboshow Mar 05 '15 at 15:43
