I'm using scikit-learn's NMF to extract trending words and phrases from some blog posts. For example, I get "game thrones" (which is good, although "of" is dropped as a stop word), but I also get "game" and "thrones" on their own. Likewise I get "marcus hutchins" (good) but also "marcus" and "hutchins", which is bad. How can I prevent these duplicates, i.e. stop unigrams that are just pieces of a longer n-gram from showing up as separate terms? Here is what I have (the variable documents is a list containing the post texts from the blogs):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# build the TF-IDF matrix over unigrams, bigrams and trigrams
tfidf_vectorizer = TfidfVectorizer(max_features=no_features,
                                   stop_words='english', ngram_range=(1, 3),
                                   min_df=3, max_df=0.95)
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# number of topics to extract
no_topics = 5

# run NMF on the TF-IDF matrix
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5,
          init='nndsvd').fit(tfidf)

# number of top words to display for each topic
no_top_words = 10
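
For reference, I print the top terms per topic with a loop roughly like the following (a minimal sketch, the helper name display_topics is just for illustration): it sorts each topic's weights in nmf.components_ and prints the no_top_words highest-weighted feature names, and this is where "game thrones", "game" and "thrones" all show up together.

# sketch of the top-words printout, not the exact code from my script
def display_topics(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        # indices of the n_top_words largest weights for this topic
        top_features = [feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]
        print("Topic %d: %s" % (topic_idx, ", ".join(top_features)))

display_topics(nmf, tfidf_feature_names, no_top_words)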