Find the Most common term in Scikit-learn classifier

Question

I'm following the example in the scikit-learn docs where CountVectorizer is used on some dataset.

Question: count_vect.vocabulary_.viewitems() lists all the terms and their frequencies. How do you sort them by the number of occurrences?

sorted( count_vect.vocabulary_.viewitems() ) does not seem to work.

Hi! Maybe you would like to see my answer: https://stackoverflow.com/a/48490046/1093674 — Cristhian Boujon, Jan 28 '18 at 18:49

score 16 · Answer 1 · edited Feb 28 '18 at 03:18

vocabulary_.viewitems() does not in fact list the terms and their frequencies; instead, it is a mapping from terms to their column indexes. The per-document frequencies are returned by the fit_transform method, which returns a SciPy sparse matrix whose rows are documents and whose columns are words (column indexes are mapped to words via vocabulary_). You can get the total frequencies, for example, with:

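matrix = count_vect.fit_transform(doc_list)
# matrix.sum(axis=0) is a 1 x n_terms matrix, so flatten it to a plain list
totals = matrix.sum(axis=0).tolist()[0]
# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
freqs = zip(count_vect.get_feature_names_out(), totals)
# sort from largest to smallest
print(sorted(freqs, key=lambda x: -x[1]))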
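
For a self-contained check, here is a minimal sketch on a toy corpus. The three sentences in doc_list are made up for illustration and are not from the question. It recovers the terms from vocabulary_ itself (ordered by column index) rather than from a feature-name method, which should make it less dependent on the scikit-learn version, and it flattens the column sums with NumPy:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
doc_list = [
    "the cat sat on the mat",
    "the dog sat",
    "the cat ran",
]

count_vect = CountVectorizer()
matrix = count_vect.fit_transform(doc_list)

# vocabulary_ maps each term to its column index, not to a count.
vocab = count_vect.vocabulary_
terms = sorted(vocab, key=vocab.get)  # terms ordered by column index

# Sum each column over all documents and flatten to a 1-D array.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# Pair terms with totals; sort by count descending, ties alphabetically.
freqs = sorted(zip(terms, totals.tolist()), key=lambda x: (-x[1], x[0]))
print(freqs[:3])
```

Here "the" appears four times across the three documents, so it comes first, followed by "cat" and "sat" with two occurrences each.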

edited Feb 28 '18 at 03:18

maxymoo

answered Apr 29 '13 at 22:27

Ando Saabas

4

You need to replace `matrix.sum(axis=0)` with `matrix.sum(axis=0).tolist()[0]`, since matrix.sum() returns a matrix. – Zouzias Sep 28 '18 at 08:01
