Remove single occurrences of words in CountVectorizer

Question

I am using CountVectorizer() to create a term-frequency matrix. I want to delete from the vocabulary all terms that have a frequency of two or less. Then I use TfidfTransformer() to create a tf-idf matrix:

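import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # docs: list of document strings

# Vocabulary terms and their total counts across all documents
# (in scikit-learn >= 1.0, use get_feature_names_out() instead)
matrix_terms = np.array(vectorizer.get_feature_names())
matrix_freq = np.asarray(X.sum(axis=0)).ravel()

tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(X)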

Then I want to use the LSA algorithm for dimensionality reduction and k-means for clustering. But I want to build the clusters without the terms that have a frequency of two or less. Can someone help me, please?

We're going to need your code, and probably some data. See: [mcve]. — AMC, Dec 01 '19 at 01:09

Nicolas Gervais · Answer 1 · 2019-12-01T22:14:45.733

-1

You just have to keep all columns where the maximum value is at least two:

import numpy as np

count_vec = np.random.randint(0, 3, (5, 10))
print(count_vec)

[[1 1 2 0 2 2 2 0 0 2]
 [0 1 0 2 1 1 0 1 0 0]
 [0 1 0 1 0 1 1 2 2 2]
 [0 0 2 1 1 1 0 0 0 2]
 [1 0 0 2 2 2 1 1 2 2]]

Keep only the columns whose highest value is at least 2:

# axis=0 takes the maximum of each column (one value per term)
count_vec = count_vec[:, count_vec.max(axis=0) >= 2]
print(count_vec)

[[2 0 2 2 2 0 0 2]
 [0 2 1 1 0 1 0 0]
 [0 1 0 1 1 2 2 2]
 [2 1 1 1 0 0 0 2]
 [0 2 2 2 1 1 2 2]]
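
For the filter the question describes (dropping terms whose total count across all documents is two or less), one option is to sum each column instead of taking its maximum. The following is only a minimal sketch under that reading of the question; the helper name drop_rare_terms, the min_total parameter, and the toy data are hypothetical and not part of scikit-learn.

```python
import numpy as np

def drop_rare_terms(X, terms, min_total=3):
    """Keep only the columns (terms) whose total count over all documents is >= min_total."""
    # For a scipy sparse matrix, X.sum(axis=0) is a 1 x n_terms matrix, so flatten it.
    totals = np.asarray(X.sum(axis=0)).ravel()
    # Integer indices of the columns to keep.
    keep = np.flatnonzero(totals >= min_total)
    return X[:, keep], np.asarray(terms)[keep]

# Toy document-term counts: 3 documents, 4 terms (made-up data).
X = np.array([[10, 1, 0, 1],
              [0,  1, 0, 2],
              [0,  0, 2, 1]])
terms = ["apple", "pear", "plum", "fig"]

# "apple" occurs in one document only (DF = 1) but 10 times in total, so it is kept.
X_kept, terms_kept = drop_rare_terms(X, terms, min_total=3)
print(terms_kept)
print(X_kept)
```

The same function should also accept the sparse matrix returned by CountVectorizer.fit_transform together with the vocabulary array, since it relies only on X.sum(axis=0) and column indexing. The filtered matrix can then be passed to TfidfTransformer in place of X.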

edited Dec 01 '19 at 22:14

answered Dec 01 '19 at 16:58

Nicolas Gervais


Hi, thanks for answering. I think max_df is about the document frequency (DF) of a term: it tells you how many documents in the collection contain term X. I need to eliminate terms from the vocabulary based on their term frequency (TF). For example, I want to delete a term from the vocabulary if it has a term frequency of ten or less. Note that a term may appear 10 times in a single document; then its DF is one, but its TF is 10. – rootware Dec 01 '19 at 21:58
OK I get it. So you want to remove the rows or columns? Like, remove all columns where all values are lower than 1? I know how to do it if you just describe it more clearly. – Nicolas Gervais Dec 01 '19 at 22:05
