
I am using the code given in the most up-voted answer to this question (Similarity between two text documents) to calculate TF-IDF between documents. However, when I run the code without specifying a custom value of min_df (it is set to 1 in that code) and the two documents are completely different (they have no word in common), instead of receiving a TF-IDF value of 0 I get the following error:

ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

Can somebody tell me how I can get rid of this error?

Muhammad Waqar
  • I think that instead of "a TF-IDF value of 0" you mean "a cosine similarity of 0". TF-IDF values are vectors of size `n_features == len(vectorizer.vocabulary_)`, one vector for each document in the pair. – ogrisel May 22 '13 at 08:13

1 Answer


By default (in sklearn <= 0.13), min_df is set to 2, which means that each word must occur in at least 2 different documents of the corpus to be included in the vectorizer's vocabulary. While this is a reasonable choice for large corpora, it is too restrictive for a toy dataset with just a couple of sentences: no word passes the filter, hence the error message you get, which I find pretty explicit. Passing min_df=1 explicitly, as the code in the linked answer does, avoids the error. The min_df=2 default has been changed to min_df=1 in the development branch of scikit-learn to be less confusing to new users who try the library with default parameter values on toy datasets.
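To make the effect of the document-frequency threshold concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: `build_vocabulary` and `cosine_similarity` are hypothetical helpers written for illustration, tokenization is a plain whitespace split, and the similarity uses raw term counts without IDF weighting.

```python
from collections import Counter
import math


def build_vocabulary(docs, min_df=1):
    """Keep only the words that occur in at least `min_df` different documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that a word counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(word for word, df in doc_freq.items() if df >= min_df)
    if not vocab:
        # The same situation that the ValueError in the question reports
        raise ValueError("empty vocabulary")
    return vocab


def cosine_similarity(doc_a, doc_b, vocab):
    """Cosine similarity of raw term-count vectors over `vocab` (no IDF weighting)."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in vocab)
    norm_a = math.sqrt(sum(counts_a[w] ** 2 for w in vocab))
    norm_b = math.sqrt(sum(counts_b[w] ** 2 for w in vocab))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["the quick brown fox", "a lazy sleeping dog"]

# min_df=2: no word appears in both documents, so nothing survives the filter.
try:
    build_vocabulary(docs, min_df=2)
except ValueError as exc:
    print("min_df=2 ->", exc)

# min_df=1: every word is kept, and the two disjoint documents score 0.
vocab = build_vocabulary(docs, min_df=1)
print("min_df=1 ->", cosine_similarity(docs[0], docs[1], vocab))
```

With scikit-learn itself, the corresponding change is to construct the vectorizer with the threshold set explicitly, for example `TfidfVectorizer(min_df=1)`.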

ogrisel