I'm trying to get the most frequent words (and, in the future, bigrams, trigrams, etc.) in my corpus. I found this question, but its answer didn't work for me, and I want to avoid using zip because I'd like a more efficient approach.
So far I have this code:
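import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_words = CountVectorizer(input=u'content',
                                   analyzer=u'word',
                                   lowercase=True,
                                   stop_words=cached_stopwords,
                                   strip_accents=u'unicode',
                                   ngram_range=(1, 1), binary=False)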
vectors = vectorizer_words.fit_transform(X, y)
N, V = vectors.shape
count_words = np.array(np.sum(vectors, axis=0))
count_words = np.squeeze(count_words)
assert count_words.shape == (V,), "count_words.shape = {}".format(count_words.shape)
words = np.array(vectorizer_words.get_feature_names())
assert words.shape[0] == V
a = count_words.argsort()[::-1]
print(words[a][:10])
print(count_words[a][:10])
plt.bar(words[a][:10], count_words[a][:10])
plt.title('title')
plt.show()
I expected the bars in my graph to be in descending order, but they are not, and I can't understand why. Am I doing something wrong (and if so, what?), or am I misunderstanding the output?
EDIT
The problem seems to be in plt.bar. Looking more carefully at the output of the following lines:
print(words[a][:10])
print(count_words[a][:10])
# Output:
['atencion' 'bien' 'mas' 'banco' 'buena' 'siempre' 'problemas' 'problema' 'tarjeta' 'rapido']
[10442 7594 6322 6121 5382 4953 4316 4202 4041 3097]
So count_words[a] is sorted as expected, but the plot is in alphabetical order (as suggested in the comments, thanks!), so the problem is probably in the plot itself rather than in the sorting.
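
One possible workaround, sketched here under the assumption that the installed matplotlib version reorders string x-values alphabetically: draw the bars at integer positions and attach the words as tick labels afterwards, so the sorted order is kept regardless of how strings are handled. The names plot_top_words, top_words and top_counts are made up for illustration, and the data is copied from the output above. I have not confirmed that this is the root cause.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Sample data copied from the printed output above (already sorted by count).
top_words = np.array(['atencion', 'bien', 'mas', 'banco', 'buena',
                      'siempre', 'problemas', 'problema', 'tarjeta', 'rapido'])
top_counts = np.array([10442, 7594, 6322, 6121, 5382,
                       4953, 4316, 4202, 4041, 3097])


def plot_top_words(labels, counts):
    """Draw bars at integer x positions so the given order is preserved."""
    positions = np.arange(len(labels))
    fig, ax = plt.subplots()
    ax.bar(positions, counts)        # numeric x-values: no categorical reordering
    ax.set_xticks(positions)         # one tick per bar
    ax.set_xticklabels(labels, rotation=45, ha='right')
    ax.set_title('title')
    fig.tight_layout()
    return fig, ax


fig, ax = plot_top_words(top_words, top_counts)
# Call fig.savefig('top_words.png') or plt.show() to view the result.
```

In the original code this would amount to passing range(10) as the first argument of plt.bar and then calling plt.xticks(range(10), words[a][:10]).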