
I'm trying to build a dictionary of words using tf-idf. However, intuitively it doesn't make sense.

If the inverse document frequency (idf) part of tf-idf measures the relevance of a term with respect to the entire corpus, then some important words might end up with a low relevance.

If we look at a corpus of legal documents, a term like "license" or "legal" might occur in every document. Due to idf, the score for these terms will be very low. However, intuitively speaking, these terms should have a higher score since these are clearly legal terms.

Is tf-idf a bad approach for building a dictionary of terms?

nbro
jCoder

1 Answer


Yes, those are legal terms. However, tf-idf doesn't try to evaluate whether a term is relevant to a specific domain; it measures how well a term distinguishes documents within the corpus. If a term like "legal" occurs in every document, it won't help a classifier tell those documents apart. However, if you mix your legal documents with a random set of other documents, you would discover that such terms suddenly become highly relevant, precisely because they allow you to tell the legal documents apart from the others.
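As a rough illustration of that effect, here is a minimal sketch that assumes the plain `log(N / df)` form of idf (libraries often use smoothed variants) and a naive whitespace tokenizer. The toy documents and the `idf` helper are made up for this example.

```python
import math

def idf(term, docs):
    """Plain log(N / df) idf; returns 0.0 for unseen terms (a convention chosen here)."""
    df = sum(1 for doc in docs if term in doc.lower().split())
    return math.log(len(docs) / df) if df else 0.0

# Hypothetical toy corpora, only for illustration.
legal_docs = [
    "the license grants legal rights",
    "legal notice about the license terms",
    "the court upheld the legal license",
]
other_docs = [
    "the cat sat on the mat",
    "the recipe needs two eggs",
    "the train leaves at noon",
]

# "legal" occurs in every legal document, so its idf there is log(3/3) = 0.
idf_legal_only = idf("legal", legal_docs)

# In the mixed corpus it occurs in 3 of 6 documents, so its idf rises to log(2).
idf_legal_mixed = idf("legal", legal_docs + other_docs)

# "the" occurs in every document of the mixed corpus, so its idf stays at 0.
idf_the_mixed = idf("the", legal_docs + other_docs)

print(idf_legal_only, idf_legal_mixed, idf_the_mixed)
```

With these toy documents, "legal" only gains weight once non-legal documents are part of the corpus, while "the" stays at zero in both cases.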

In practice, tf-idf scores are more typically used to filter out "kind-of" stop words. For example, "the" occurs in every document and doesn't carry any meaning.

Whether tf-idf is good for building a dictionary depends very much on what you want to do afterward with this dictionary.

nbro
CAFEBABE
  • I was thinking more along the lines of creating a dictionary of all legal terms using a corpus of documents as a training set. But you are right, it's more helpful if I already have those terms and then separate the legal docs from the non-legal ones. – jCoder Feb 17 '16 at 20:42
  • One way TFxIDF could be useful is to *isolate* the legal terms. Build a separate base of non-legal documents (Wikipedia top articles, vetted to remove legal topics?) and create your IDF values from that. Now apply that in a TFxIDF calculation of your collection of legal documents. Exclusively legal terms will have a high IDF and thus stand out, while common words which are common across the board will have a low IDF, and tend to sink to the bottom, even if the TF is high. – tripleee Feb 18 '16 at 05:19
  • Nit pick: It's TF/DF or TFxIDF where IDF is defined as 1/DF. – tripleee Feb 18 '16 at 05:19
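
The background-corpus idea from the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a smoothed `log((N + 1) / (df + 1))` idf so that terms missing from the background corpus get a finite, maximal weight (a choice made here, not the `1/DF` definition mentioned above), and the corpora, `tokenize`, `background_idf` and `rank_terms` are hypothetical names invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would need something better.
    return text.lower().split()

def background_idf(background_docs):
    """Build an idf function from a (non-legal) background corpus."""
    n = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    # Smoothed idf (assumption): unseen terms have df = 0 and get log(n + 1).
    return lambda term: math.log((n + 1) / (df[term] + 1))

def rank_terms(target_docs, idf):
    """Rank terms of the target corpus by total term frequency times background idf."""
    tf = Counter()
    for doc in target_docs:
        tf.update(tokenize(doc))
    scores = {term: count * idf(term) for term, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy corpora, only for illustration.
background_docs = [
    "the cat sat on the mat",
    "the recipe needs two eggs",
    "the train leaves at noon",
]
legal_docs = [
    "the license grants legal rights",
    "legal notice about the license terms",
    "the court upheld the legal license",
]

ranked = rank_terms(legal_docs, background_idf(background_docs))
print(ranked)
```

In this toy setup the frequent legal terms rise to the top because they never occur in the background documents, while "the" drops to the bottom despite having the highest term frequency.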