
I have a dataset of 300,000 lines, where each line is an article title, and I want to extract features such as tf or tf-idf from it. I can already count word frequencies (tf) in this dataset, for example:
WORD FREQUENCY
must 10000
amazing 9999

or word percentages:
must 0.2
amazing 0.19

But how do I calculate IDF? I mean, I need features that discriminate this dataset from the others. And how is tf-idf used in text classification?

user1337896
    You might be interested in seeing this answer https://stackoverflow.com/a/54177835/4317058 which gives a simple step-by-step tutorial on how to use `tf-idf` in python and sklearn – Sergey Bushmanov Jan 24 '19 at 10:51

1 Answer


In your case a document is a single article title, so the inverse document frequency is IDF(t) = log(300000 / num(t)), where num(t) is the number of documents (article titles) that contain the term t.
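As a minimal sketch of that formula in plain Python (assuming `titles` is your list of article-title strings, with simple whitespace tokenization and the natural logarithm):

```python
import math
from collections import Counter

# Assumed: titles holds your 300000 article-title strings (toy data here)
titles = [
    "must see amazing places",
    "amazing deals you must not miss",
    "how to calculate idf",
]

N = len(titles)

# num(t): number of titles (documents) that contain the term t
doc_freq = Counter()
for title in titles:
    for term in set(title.lower().split()):
        doc_freq[term] += 1

idf = {term: math.log(N / df) for term, df in doc_freq.items()}
print(idf["amazing"])  # log(3 / 2) for the toy titles above
```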

See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2
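If you would rather not compute this by hand, the sklearn route mentioned in the comment works too. Here is a minimal sketch with `TfidfVectorizer` (note that sklearn applies a smoothed IDF by default, so the values differ slightly from the plain log(300000/num(t)) formula; the toy `titles` list is just a stand-in for your data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for your 300000 article titles
titles = [
    "must see amazing places",
    "amazing deals you must not miss",
    "how to calculate idf",
]

vectorizer = TfidfVectorizer()        # smoothed IDF, L2-normalized rows by default
X = vectorizer.fit_transform(titles)  # sparse matrix: one row per title, one column per term

print(vectorizer.get_feature_names_out())
print(X.shape)  # (number of titles, vocabulary size)
```

The rows of `X` can then be used as feature vectors for a text classifier (e.g. logistic regression or Naive Bayes), which is the usual way tf-idf feeds into text classification.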

kaikuchn