Is there a way to select the top 100 or 1000 words based on TfidfVectorizer output in scikit-learn?

Question

I am trying to find the top 100 or 1000 words based on the output of `TfidfVectorizer` from Python's scikit-learn library. Is there a way to do this using a function from scikit-learn?

Thanks for the help.

I mean the top 100/1000 words ranked by the tf-idf values given by the `TfidfVectorizer`. I tried to sum the values in every column, but indexing is not allowed on the sparse representation. – Harshit, Oct 28 '13 at 07:22

ogrisel · Answer 1 · 2013-10-26T21:30:32.780

0

What do you mean by top 100/1000 words? The most frequent words in a dataset? If so, you can use the `Counter` class from the Python standard library (`collections.Counter`) to do that. No need for scikit-learn.
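As a rough illustration of the `Counter` suggestion, here is a minimal sketch. The sample documents and the whitespace tokenization are assumptions made for the example; they are not from the question.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Count raw word occurrences across all documents.
# Splitting on whitespace is a simplification; real text usually
# needs proper tokenization and punctuation handling.
counts = Counter(word for doc in docs for word in doc.lower().split())

# most_common(n) returns the n most frequent (word, count) pairs.
top_words = counts.most_common(2)
print(top_words)
```

Note that this ranks words by raw frequency, not by tf-idf weight, so it only answers the question if "top" means "most frequent".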

edited Oct 26 '13 at 21:30

answered Oct 26 '13 at 13:23

ogrisel



I mean the top 100/1000 words ranked by the tf-idf values given by the `TfidfVectorizer`. I tried to sum the values in every column, but indexing is not allowed on the sparse representation. – Harshit Oct 27 '13 at 05:57

@user595169 Do you mean `X.sum(0)`? – Fred Foo Oct 27 '13 at 12:11
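One possible way to follow up on the `X.sum(0)` hint is sketched below. It assumes that "top" means the highest tf-idf weight summed over all documents. The toy corpus and the names `scores`, `terms`, and `top_n` are illustrative only, not something given in the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# Summing over axis 0 works directly on the sparse matrix and gives
# one total per term; convert the result to a flat 1-D array.
scores = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index; invert it so that
# terms[i] is the term stored in column i.
vocab = vectorizer.vocabulary_
terms = np.array(sorted(vocab, key=vocab.get))

top_n = 3  # use 100 or 1000 on a real corpus
top_idx = np.argsort(scores)[::-1][:top_n]  # indices of the largest sums
top_terms = [(terms[i], float(scores[i])) for i in top_idx]
print(top_terms)
```

Summing is only one choice of aggregate; taking the per-term maximum or mean with `X.max(axis=0)` or `X.mean(axis=0)` would rank the terms differently.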
