
I have used scikit-learn's CountVectorizer to convert a collection of documents into a matrix of token counts. I have also used its max_features parameter, which keeps only the top max_features tokens ordered by term frequency across the corpus.

Now I want to analyse my selected corpus; in particular, I want to know the frequency of each token in the selected vocabulary. I am unable to find an easy way to do this, so any help would be appreciated.

Shweta

2 Answers


When you call fit_transform(), a sparse matrix is returned.

To display it, simply call its toarray() method.

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
sparse_mat = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])

# you can inspect the matrix in the interpreter by doing
sparse_mat.toarray()
Bernard Jesop

With the help of @Bernard's post, I was able to get the complete result, as follows:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
doc_term_matrix = vec.fit_transform(['toto titi', 'toto toto', 'titi tata'])
doc_term_matrix = doc_term_matrix.toarray()
# sum over the document axis to get each term's total frequency
term_freq_matrix = doc_term_matrix.sum(0)
min_freq = np.amin(term_freq_matrix)
# use get_feature_names_out() on scikit-learn >= 1.0
indices_name_mapping = vec.get_feature_names()
feature_names = [indices_name_mapping[i] for i, x in enumerate(term_freq_matrix) if x == min_freq]
Shweta