List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

Question

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?

Fred Foo · Accepted Answer · 2013-04-18T09:07:06.943

23

If cv is your CountVectorizer and X is the vectorized corpus, then

zip(cv.get_feature_names(),
    np.asarray(X.sum(axis=0)).ravel())

returns (term, frequency) pairs for each distinct term in the corpus that the CountVectorizer extracted. (In Python 3, zip returns an iterator, so wrap the call in list() if you need an actual list.)

(The little asarray + ravel dance is needed to work around some quirks in scipy.sparse.)

edited Apr 18 '13 at 09:07

answered Apr 18 '13 at 09:01


Thanks! They are not ordered, but I managed to sort them: `for tuple in sorted(occ_list, key=lambda idx: idx[1]): print tuple[0] + ' ' + str(tuple[1])`. The problem is that the characters åäö are not printed out. I have set the coding to utf8. – user1506145 Apr 18 '13 at 09:47
Also, are you sure that `get_feature_names()` will have the terms ordered according to their index in the term-frequency matrix? I have found that `cv.get_feature_names()` and `cv.vocabulary_.keys()` do not have the same order. – user1506145 Apr 18 '13 at 10:43
@user1506145: `dict.keys` doesn't guarantee any order; that's exactly why `get_feature_names` exists. – Fred Foo Apr 18 '13 at 11:25
Sorry to dredge this topic up, but how would you make a _vectorized corpus_, `X`, from a simple string like "This is the example that we will make an example of." – user1717828 May 05 '17 at 20:58

score 5 · Answer 2 · answered Jan 28 '18 at 18:45

There is no built-in. I have found a faster way to do it based on Ando Saabas's answer:

from sklearn.feature_extraction.text import CountVectorizer
texts = ["Hello world", "Python makes a better world"]
vec = CountVectorizer().fit(texts)
bag_of_words = vec.transform(texts)  # sparse document-term matrix
sum_words = bag_of_words.sum(axis=0)  # corpus-wide count for each column
# vocabulary_ maps each term to its column index in the matrix
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
# most frequent terms first
sorted(words_freq, key=lambda x: x[1], reverse=True)

output

[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
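
Since the goal in the question is to pick stop words, one possible follow-up is to feed the most frequent terms back into a second vectorizer. This is only a sketch: the cutoff top_n and the names totals, stop_words and filtered are hypothetical, and choosing stop words purely by frequency is an assumption that may not suit every corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]
vec = CountVectorizer().fit(texts)

# Total count per column, flattened to a 1-D array
totals = np.asarray(vec.transform(texts).sum(axis=0)).ravel()
words_freq = sorted(
    ((word, int(totals[idx])) for word, idx in vec.vocabulary_.items()),
    key=lambda x: x[1], reverse=True)

top_n = 1  # hypothetical cutoff: treat the single most frequent term as a stop word
stop_words = [word for word, _ in words_freq[:top_n]]

# Refit with those terms excluded from the vocabulary
filtered = CountVectorizer(stop_words=stop_words).fit(texts)
print(sorted(filtered.vocabulary_))
```

In practice you would inspect the sorted list by hand before settling on a cutoff.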
