
I'm using scikit-learn's NMF algorithm to extract trending words from some blogs. For example, I get "game thrones" (which is good, although "of" is dropped as a stop word), but I also get "game" and "thrones". I get "marcus hutchins" (good), but I also get "marcus" and "hutchins", which is bad. How can I prevent these duplicates? Here is what I have (the variable "documents" is a list that contains the post texts from the blogs):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    # no_features is set earlier (value not shown); "documents" is the list of blog post texts
    tfidf_vectorizer = TfidfVectorizer(max_features=no_features, stop_words='english',
                                       ngram_range=(1, 3), min_df=3, max_df=0.95)
    tfidf = tfidf_vectorizer.fit_transform(documents)
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()

    # no of topics to display
    no_topics = 5

    # Run NMF
    nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5,
              init='nndsvd').fit(tfidf)

    # no of words to display for each topic
    no_top_words = 10
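
The display step isn't included above. For reference, a minimal sketch of the usual way to read the top terms of each topic off nmf.components_ (this is an assumption about how the trending words are printed, not the asker's actual code) would be:

    # Sketch: map the largest weights in each topic row back to tf-idf feature names
    def display_topics(model, feature_names, n_top_words):
        for topic_idx, topic in enumerate(model.components_):
            top_terms = [feature_names[i]
                         for i in topic.argsort()[:-n_top_words - 1:-1]]
            print("Topic %d: %s" % (topic_idx, ", ".join(top_terms)))

    display_topics(nmf, tfidf_feature_names, no_top_words)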
  • I think the only solution is to apply this logic in your code. I don't think you can solve this through sklearn. – Stergios Aug 09 '17 at 09:20
  • Thanks for the answer. I was thinking about writing the trending-word results to a list and removing duplicates there, but it is not really a good solution and it costs a lot. – jack jack Aug 09 '17 at 10:05
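
Following the suggestion in the comments, a rough sketch of that post-processing step (a hypothetical drop_subterms helper written for this question, not anything provided by sklearn) could drop every term whose words are all contained in a longer kept term:

    # Keep longer n-grams and drop shorter terms fully contained in them,
    # e.g. keep "game thrones" and drop "game" and "thrones".
    def drop_subterms(terms):
        kept = []
        for term in sorted(terms, key=lambda t: -len(t.split())):
            words = set(term.split())
            if not any(words <= set(k.split()) for k in kept):
                kept.append(term)
        return kept

    print(drop_subterms(["game thrones", "game", "thrones", "marcus hutchins"]))
    # ['game thrones', 'marcus hutchins']

Note that this sketch orders the result by descending term length, so the original ranking by topic weight is not preserved.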
