
I have used scikit-learn's CountVectorizer to convert a collection of documents into a matrix of token counts. I have also used its max_features parameter, which keeps only the top max_features tokens ordered by term frequency across the corpus.

Now I want to analyse my selected corpus; in particular, I want to know the frequency of each token in the selected vocabulary. I am unable to find an easy way to do this, so any help would be appreciated.

Shweta

2 Answers


When you call fit_transform(), a sparse matrix is returned.

To display it, simply call its toarray() method.

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
sparse_mat = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])

# you can inspect the matrix in the interpreter by doing
sparse_mat.toarray()
Bernard Jesop

With the help of @Bernard's post, I was able to get the complete result, as follows:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
doc_term_matrix = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
doc_term_matrix = doc_term_matrix.toarray()
# sum over the document axis to get each term's total frequency
term_freq_matrix = doc_term_matrix.sum(0)
min_freq = np.amin(term_freq_matrix)
# use get_feature_names_out() on scikit-learn >= 1.0
indices_name_mapping = vec.get_feature_names()
feature_names = [indices_name_mapping[i] for i, x in enumerate(term_freq_matrix) if x == min_freq]
Shweta