I am using CountVectorizer() to create a term-frequency matrix. I want to remove from the vocabulary all terms that have a frequency of two or less.
Then I use TfidfTransformer() to create a tf-idf matrix:
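import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# docs is my list of raw text documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
matrix_terms = np.array(vectorizer.get_feature_names_out())
# total count of each term across all documents
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(X)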
Then I want to use the LSA algorithm for dimensionality reduction and k-means for clustering. However, I want to build the clusters without the terms that have a frequency of two or less. Can someone help me, please?