Term Frequency and IDF - Clarification

Question

Based on https://en.wikipedia.org/wiki/Tf%E2%80%93idf, IDF is used to down-weight words that occur in many documents (like "the", "of", etc.).

If I apply stop word removal before extracting features, should IDF still be applied? I feel Term Frequency alone would be sufficient, since the frequent, unimportant words have already been filtered out.

Please advise.

Is your question about how to implement it in Spark? If yes, give more details on how your data is formatted. If what interests you is the theoretical discussion, you should ask this question on http://stats.stackexchange.com/ — Wilmerton, Oct 11 '16 at 09:48
Depends what your goal is. `IDF` rewards words that are rare, and hence if two documents share a rare word, that is more significant than if they share a common one. — mtoto, Oct 11 '16 at 09:53
I have already implemented this in Spark. My concern is this: if the IDF transformation is done to decrease the weight of frequent words (e.g. "the", "of", etc.), then I may not need it, since my text is already filtered using stop word removal. — lives, Oct 11 '16 at 09:56

score 1 · Accepted Answer · answered Oct 12 '16 at 13:01

Even if you use stop word removal, IDF will still be useful in most cases.
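To illustrate why, here is a minimal sketch in plain Python (not Spark code). The corpus is hypothetical and assumed to be already tokenized and stop-word filtered, and the smoothed formula `log((N + 1) / (df + 1))` is an assumption chosen for the example. Even with stop words gone, the remaining terms get quite different weights:

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed IDF per term: log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

# Hypothetical corpus: tokenized, with stop words already removed.
docs = [
    ["spark", "cluster", "data"],
    ["spark", "data", "pipeline"],
    ["spark", "data", "tokenizer"],
    ["spark", "streaming", "kafka"],
]

weights = idf_weights(docs)
# Print terms from least to most discriminative.
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus "spark" is not a stop word, yet it appears in every document and so carries no discriminative weight, while the terms that occur in a single document get the highest weight.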

I personally try to avoid stop word removal: it is language-dependent, the content of the list is arbitrary, and you may remove useful words. Stop word removal is like using IDF with a hard cutoff: everything above the cutoff point is good, everything below is useless, with no "in between" zone. That obviously cannot reflect the real nature of language.

But the best way to answer your question is to experiment with both approaches: if you use TF-IDF in the context of text classification or information retrieval, why not test with and without IDF and see which one yields the better accuracy?
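Such a comparison could be sketched as follows. This is a toy harness in plain Python, not Spark code: the documents, the labels, the 1-nearest-neighbour classifier and the names `vectorize`, `cosine` and `loo_accuracy` are all assumptions made for illustration. For simplicity the IDF statistics are computed over the whole corpus, including the held-out document; a real experiment would fit them on training data only and use a proper train/test split.

```python
import math
from collections import Counter

def vectorize(docs, use_idf):
    """Turn token lists into sparse TF or TF-IDF vectors (dicts)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if use_idf:
            # Smoothed IDF, an assumption for this sketch: log((N + 1) / (df + 1)).
            vectors.append({t: c * math.log((n_docs + 1) / (df[t] + 1))
                            for t, c in tf.items()})
        else:
            vectors.append(dict(tf))
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def loo_accuracy(docs, labels, use_idf):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    vecs = vectorize(docs, use_idf)
    hits = 0
    for i, vec in enumerate(vecs):
        others = [j for j in range(len(vecs)) if j != i]
        nearest = max(others, key=lambda j: cosine(vec, vecs[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(docs)

# Hypothetical labelled corpus: tokenized, stop words already removed.
docs = [
    ["match", "goal", "team", "win"],
    ["team", "score", "goal", "season"],
    ["election", "vote", "party", "win"],
    ["party", "vote", "policy", "season"],
]
labels = ["sport", "sport", "politics", "politics"]

for use_idf in (False, True):
    name = "TF-IDF" if use_idf else "TF only"
    print(name, loo_accuracy(docs, labels, use_idf))
```

On real data the two settings can differ in either direction, so the numbers from a toy corpus like this one say nothing about which is better for your task.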

Pascal Soucy

Yes, that's exactly what I did. I got better accuracy after skipping IDF. I am doing only stop word removal and Term Frequency. – lives Oct 13 '16 at 12:43
It can happen in some cases, particularly with text categorization. See my answer on this question if you want a potential explanation: http://stackoverflow.com/questions/39152229/in-general-when-does-tf-idf-reduce-accuracy/39413780#39413780 – Pascal Soucy Oct 13 '16 at 13:52
