
I have a dataset of 300,000 lines, where each line is an article title, and I want to extract features such as tf or tf-idf from it. I can already count word frequencies (tf) in this dataset, for example:
WORD FREQUENCY
must 10000
amazing 9999

or word percentages:
must 0.2
amazing 0.19

But how do I calculate IDF? I mean, I need features that discriminate this dataset from the others. And how is tf-idf used in text classification?

user1337896
    You might be interested in seeing this answer https://stackoverflow.com/a/54177835/4317058 which gives a simple step-by-step tutorial on how to use `tf-idf` in python and sklearn – Sergey Bushmanov Jan 24 '19 at 10:51

1 Answer


In your case a document is a single article title, so the inverse document frequency is IDF(t) = log(300000 / num(t)), where num(t) is the number of documents (article titles) that contain the term t.
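As a minimal sketch of that formula in plain Python (assuming `titles` is your list of article-title strings, with simple whitespace tokenization and the natural logarithm):

```python
import math
from collections import Counter

# Assumed: titles holds your 300000 article-title strings (toy data here)
titles = [
    "must see amazing places",
    "amazing deals you must not miss",
    "how to calculate idf",
]

N = len(titles)

# num(t): number of titles (documents) that contain the term t
doc_freq = Counter()
for title in titles:
    for term in set(title.lower().split()):
        doc_freq[term] += 1

idf = {term: math.log(N / df) for term, df in doc_freq.items()}
print(idf["amazing"])  # log(3 / 2) for the toy titles above
```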

See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2
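If you would rather not compute this by hand, the sklearn route mentioned in the comment works too. Here is a minimal sketch with `TfidfVectorizer` (note that sklearn applies a smoothed IDF by default, so the values differ slightly from the plain log(300000/num(t)) formula; the toy `titles` list is just a stand-in for your data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for your 300000 article titles
titles = [
    "must see amazing places",
    "amazing deals you must not miss",
    "how to calculate idf",
]

vectorizer = TfidfVectorizer()        # smoothed IDF, L2-normalized rows by default
X = vectorizer.fit_transform(titles)  # sparse matrix: one row per title, one column per term

print(vectorizer.get_feature_names_out())
print(X.shape)  # (number of titles, vocabulary size)
```

The rows of `X` can then be used as feature vectors for a text classifier (e.g. logistic regression or Naive Bayes), which is the usual way tf-idf feeds into text classification.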

kaikuchn