Python Scikit-learn: Empty Vocabulary in TF-IDF

Question

I am using the code given in the most up-voted answer to this question (Similarity between two text documents) to calculate TF-IDF between documents. However, when I run the code WITHOUT specifying a custom value of min_df (1, in the code) and the two documents are completely different (they have no word in common), I get the following error instead of a TF-IDF value of 0:

ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

Can somebody tell me how I can get rid of this error?

I think that instead of "a TF-IDF value of 0" you mean "a cosine similarity of 0". TF-IDF values are vectors of size `n_features == len(vectorizer.vocabulary_)`, one vector for each document in the pair. - ogrisel, May 22 '13 at 08:13
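
To illustrate the point in the comment, here is a minimal sketch (the two sentences are made up for illustration, and `min_df=1` is passed explicitly so the result does not depend on the installed version's default). It shows that the TF-IDF output is one row vector per document, with as many columns as there are vocabulary entries:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the question.
docs = ["the cat sat", "the dog barked"]

vectorizer = TfidfVectorizer(min_df=1)
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

n_features = len(vectorizer.vocabulary_)
print(tfidf.shape)   # (number of documents, n_features)
print(n_features)
```

A single "TF-IDF value" for a pair of documents therefore does not exist as such; a scalar only appears once the two rows are compared, for example with cosine similarity.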

score 3 · Accepted Answer · answered May 22 '13 at 08:20

By default (in sklearn <= 0.13), min_df is set to 2, which means that each word must occur in at least 2 different documents of the corpus to be included in the vectorizer's vocabulary. While this is a reasonable choice for large corpora, it is too restrictive for a toy dataset with just a couple of sentences: no word makes it into the vocabulary, hence the error message you get, which I find pretty explicit. On those versions, passing min_df=1 explicitly avoids the error. The default has been changed from min_df=2 to min_df=1 in the development branch of scikit-learn, to be less confusing to new users who try the library with default parameter values on toy datasets.
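
As a rough sketch of the behaviour described above (the two sentences are invented for illustration, and the exact wording of the error message differs between scikit-learn versions), forcing `min_df=2` on two documents that share no words raises a `ValueError`, while `min_df=1` yields a cosine similarity of 0:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: the two documents have no word in common.
docs = ["apples grow on trees", "submarines dive deep"]

# With min_df=2, a term must appear in both documents to be kept.
# None does, so the vocabulary ends up empty and fitting fails.
try:
    TfidfVectorizer(min_df=2).fit_transform(docs)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With min_df=1, every term is kept and the vectors can be compared.
tfidf = TfidfVectorizer(min_df=1).fit_transform(docs)
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(similarity)
```

Because the two rows have no non-zero column in common, their dot product, and hence the cosine similarity, is zero.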
