
I'm using scikit-learn's NMF algorithm to extract trending words from some blogs. For example, I get "game thrones" (which is good, although "of" is dropped as a stop word), but I also get "game" and "thrones". I get "marcus hutchins" (good), but I also get "marcus" and "hutchins", which is bad. How can I prevent these duplicates? Here is what I have (the variable "documents" is a list that contains the post texts from the blogs):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    # no_features is set earlier (value not shown); "documents" is the list of blog post texts
    tfidf_vectorizer = TfidfVectorizer(max_features=no_features, stop_words='english',
                                       ngram_range=(1, 3), min_df=3, max_df=0.95)
    tfidf = tfidf_vectorizer.fit_transform(documents)
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()

    # no of topics to display
    no_topics = 5

    # Run NMF
    nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5,
              init='nndsvd').fit(tfidf)

    # no of words to display for each topic
    no_top_words = 10
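
The display step isn't included above. For reference, a minimal sketch of the usual way to read the top terms of each topic off nmf.components_ (this is an assumption about how the trending words are printed, not the asker's actual code) would be:

    # Sketch: map the largest weights in each topic row back to tf-idf feature names
    def display_topics(model, feature_names, n_top_words):
        for topic_idx, topic in enumerate(model.components_):
            top_terms = [feature_names[i]
                         for i in topic.argsort()[:-n_top_words - 1:-1]]
            print("Topic %d: %s" % (topic_idx, ", ".join(top_terms)))

    display_topics(nmf, tfidf_feature_names, no_top_words)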
  • I think the only solution is to apply this logic in your code. I don't think you can solve this through sklearn. – Stergios Aug 09 '17 at 09:20
  • Thanks for the answer. I was thinking about writing the trending-word results to a list and removing duplicates there, but it is not really a good solution and it costs a lot. – jack jack Aug 09 '17 at 10:05
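
Following the suggestion in the comments, a rough sketch of that post-processing step (a hypothetical drop_subterms helper written for this question, not anything provided by sklearn) could drop every term whose words are all contained in a longer kept term:

    # Keep longer n-grams and drop shorter terms fully contained in them,
    # e.g. keep "game thrones" and drop "game" and "thrones".
    def drop_subterms(terms):
        kept = []
        for term in sorted(terms, key=lambda t: -len(t.split())):
            words = set(term.split())
            if not any(words <= set(k.split()) for k in kept):
                kept.append(term)
        return kept

    print(drop_subterms(["game thrones", "game", "thrones", "marcus hutchins"]))
    # ['game thrones', 'marcus hutchins']

Note that this sketch orders the result by descending term length, so the original ranking by topic weight is not preserved.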
