I'm trying to get words that are distinctive of certain documents using the TfidfVectorizer class in scikit-learn. It builds a tf-idf matrix with scores for every word in every document, but it seems to give high scores to common words as well. This is some of the code I'm running:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(contents)  # contents: list of document strings
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
s = pd.Series(df.loc['Adam'])  # df.loc['Adam'] is already a Series; the wrapper is redundant
s[s > 0].sort_values(ascending=False)[:10]
I expected this to return a list of words that are distinctive of the 'Adam' document, but instead it returns a list of common words:
and 0.497077
to 0.387147
the 0.316648
of 0.298724
in 0.186404
with 0.144583
his 0.140998
I might not understand it perfectly, but as I understand it, tf-idf is supposed to find words that are distinctive of one document in a corpus: words that appear frequently in that document but not in the other documents. Here, 'and' appears frequently in the other documents as well, so I don't know why it's getting a high value here.
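To make my expectation concrete: the textbook formula I have in mind is idf(t) = ln(N / df(t)), which is exactly 0 for a word that appears in every document, so such a word should drop out entirely. I did notice that scikit-learn's documentation describes a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, which never reaches 0, though I'm not sure that alone explains scores this high. A quick sketch comparing the two, with made-up corpus numbers:

import math

# Hypothetical numbers, just to compare the two idf definitions:
n_docs = 10   # corpus size
df_and = 10   # 'and' appears in every document

classic_idf = math.log(n_docs / df_and)                  # ln(1)     = 0.0
sklearn_idf = math.log((1 + n_docs) / (1 + df_and)) + 1  # ln(1) + 1 = 1.0

print(classic_idf)  # 0.0 -> 'and' would vanish under the textbook formula
print(sklearn_idf)  # 1.0 -> under the smoothed formula its high tf still counts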
The complete code I'm using to generate this is in this Jupyter notebook.
When I compute tf-idf scores semi-manually, using NLTK and computing a score for each word, I get the appropriate results. For the 'Adam' document:
fresh 0.000813
prime 0.000813
bone 0.000677
relate 0.000677
blame 0.000677
enough 0.000677
That looks about right, since these are words that appear in the 'Adam' document but not as often in the other documents in the corpus. The complete code used to generate this is in this Jupyter notebook.
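For reference, here is a stripped-down sketch of the semi-manual computation (the names are mine and simplified from the notebook). It uses the unsmoothed idf = ln(N / df), so a word that appears in every document scores exactly 0:

import math
from collections import Counter

def manual_tfidf(docs, name):
    # docs: dict mapping document name -> list of tokens (e.g. from nltk.word_tokenize)
    n = len(docs)
    tokens = docs[name]
    counts = Counter(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for toks in docs.values() if word in toks)
        scores[word] = tf * math.log(n / df)  # ln(n/df) == 0 when df == n
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)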
Am I doing something wrong with the scikit-learn code? Is there another way to initialize this class so that it returns the right results? Of course, I can ignore stopwords by passing stop_words='english', but that doesn't really solve the problem, since common words of any sort shouldn't get high scores here.
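One workaround I'm considering (a sketch, not something I've verified fixes the underlying scoring) is the max_df parameter, which drops any term that appears in more than a given fraction of the documents, so it catches corpus-specific common words without relying on a fixed stoplist:

from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.5: ignore any term that appears in more than half the documents;
# contents is the same list of document strings as above.
vectorizer = TfidfVectorizer(max_df=0.5)
tfidf_matrix = vectorizer.fit_transform(contents)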