I'm trying to get the most frequent words (and, in the future, bigrams, trigrams, etc.) in my corpus. I found this question, but its solution didn't work for me, and I want to avoid using zip because I'd like something more efficient.

So far I have this code:

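import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# cached_stopwords, X and y are defined elsewhere in my script
vectorizer_words = CountVectorizer(input=u'content',
                         analyzer=u'word',
                         lowercase=True,
                         stop_words=cached_stopwords,
                         strip_accents=u'unicode',
                         ngram_range=(1, 1), binary=False)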

vectors = vectorizer_words.fit_transform(X, y)


N, V = vectors.shape

count_words = np.array(np.sum(vectors, axis=0))
count_words = np.squeeze(count_words)

assert count_words.shape == (V,), "count_words.shape = {}".format(count_words.shape)
words = np.array(vectorizer_words.get_feature_names())
assert words.shape[0] == V

a = count_words.argsort()[::-1]

print(words[a][:10])
print(count_words[a][:10])

plt.bar(words[a][:10], count_words[a][:10])
plt.title('title')
plt.show()

[Image: bar chart produced by the code above. Caption: shouldn't the bars be descending? (image updated)]

I was expecting the bars in my graph to be in descending order, but they are not, and I can't understand why. Am I doing something wrong (and if so, what?), or am I misunderstanding the output?

EDIT: The problem seems to be in plt.bar. Looking more carefully at the output of the following lines:

print(words[a][:10])
print(count_words[a][:10])
# Output:
['atencion' 'bien' 'mas' 'banco' 'buena' 'siempre' 'problemas' 'problema' 'tarjeta' 'rapido']
[10442  7594  6322  6121  5382  4953  4316  4202  4041  3097]

So count_words[a] is sorted as expected, but the plot is in alphabetical order (as suggested in the comments, thanks!), so the problem is probably in the plotting step.
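One possible workaround, sketched here rather than confirmed against the original environment: if the installed matplotlib treats string x values as categories and orders them alphabetically (an assumption based on the comment and on the behaviour of some matplotlib releases), the bars can be placed at explicit integer positions and the words attached as tick labels. The names `top_words`, `top_counts` and `positions` are introduced only for this illustration; the data is copied from the printed output above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

# Data copied from the printed output in the question (already sorted by count).
top_words = np.array(['atencion', 'bien', 'mas', 'banco', 'buena',
                      'siempre', 'problemas', 'problema', 'tarjeta', 'rapido'])
top_counts = np.array([10442, 7594, 6322, 6121, 5382, 4953, 4316, 4202, 4041, 3097])

# Integer positions make the left-to-right order explicit, so matplotlib
# has no string categories to reorder.
positions = np.arange(len(top_words))

fig, ax = plt.subplots()
bars = ax.bar(positions, top_counts)

# Attach the words as labels for those positions.
ax.set_xticks(positions)
ax.set_xticklabels(top_words, rotation=45, ha='right')
ax.set_title('title')
fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to view the chart.
```

With the real data, `top_words` and `top_counts` would be `words[a][:10]` and `count_words[a][:10]` from the code above.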

Rodrigo Laguna
  • Looks like the words in the x-axis are sorted alphabetically. Are you sure you are sorting by the counts and not the words themselves? – BradMcDanel Apr 19 '18 at 05:21
  • Yes, since I'm using `np.argsort` in `count_words` and `count_words` is the result of summing in vectors (line `count_words = np.array(np.sum(vectors, axis=0))` ) so `count_words` is a vector of occurrences of words – Rodrigo Laguna Apr 19 '18 at 13:12