Does it make sense to use both countvectorizer and tfidfvectorizer as feature vectors for text clustering with KMeans?

Question

I am trying to build feature vectors from a CSV file that contains about 1000 comments. One of my feature vectors is tf-idf, built with scikit-learn's TfidfVectorizer. Does it make sense to also use raw counts as a feature vector, or is there a better feature vector that I should use?

And if I do end up using both CountVectorizer and TfidfVectorizer for my features, how should I fit them both into my KMeans model (specifically the km.fit() part)? For now I am only able to fit the tf-idf feature vectors into the model.

Here is my code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

#count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized = count_vectorizer.fit_transform(sentence_list)

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)

elyase · Accepted Answer · 2015-01-26T05:01:44.323

12

Essentially, what you are doing is finding a numeric representation of your text documents (feature engineering). In some problems raw counts work better, and in others the tf-idf representation is the best choice, so you should really try both. The two representations are very similar and therefore carry approximately the same information, but you may still get better results by using the full set of features (tf-idf + counts). Searching in this larger feature space may get you closer to the true model.
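One rough way to "try them both" is to cluster each representation separately and compare an internal quality measure. The sketch below uses the silhouette score for that; the toy `sentence_list`, the helper name `silhouette_for`, and the choice of two clusters are assumptions for illustration, not part of the original question. Scores computed in different feature spaces are not strictly comparable, so treat the numbers as a heuristic and also inspect the clusters by hand.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the comments loaded from the CSV file.
sentence_list = [
    "the battery life of this phone is great",
    "battery drains fast on this phone",
    "the phone battery lasts all day",
    "delivery was late and the box was damaged",
    "the package arrived late",
    "late delivery and a damaged package",
]


def silhouette_for(vectorizer, texts, n_clusters=2, seed=0):
    """Vectorize the texts, cluster them with KMeans, and return the silhouette score."""
    X = vectorizer.fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)


scores = {
    "counts": silhouette_for(CountVectorizer(stop_words="english"), sentence_list),
    "tfidf": silhouette_for(TfidfVectorizer(stop_words="english"), sentence_list),
}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```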

This is how you can horizontally stack your features:

import scipy.sparse

X = scipy.sparse.hstack([vectorized, count_vectorized])

Then you can just do:

model.fit(X, y)  # y is optional in some models
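Put together for the KMeans case from the question, the stacking step might look like the sketch below. The toy `sentence_list` and `num_clusters = 2` are assumptions for illustration. KMeans is unsupervised, so `fit` is called with `X` only. Passing `format="csr"` to `hstack` is a choice made here so the result is a CSR matrix rather than the default COO matrix.

```python
import scipy.sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in for the comments loaded from the CSV file.
sentence_list = [
    "the battery life of this phone is great",
    "battery drains fast on this phone",
    "the phone battery lasts all day",
    "delivery was late and the box was damaged",
    "the package arrived late",
    "late delivery and a damaged package",
]
num_clusters = 2  # assumed value for this toy data

# Build both sparse representations from the same documents.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentence_list)
counts = CountVectorizer(stop_words="english").fit_transform(sentence_list)

# Stack column-wise: each row (document) gets the tf-idf columns followed by the count columns.
X = scipy.sparse.hstack([tfidf, counts], format="csr")

# No y argument: KMeans only needs the feature matrix.
km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10, random_state=0)
km.fit(X)

print(X.shape)
print(km.labels_)
```

Note that raw counts are not normalized while TfidfVectorizer L2-normalizes each row by default, so the count columns can dominate the Euclidean distances that KMeans uses. Rescaling one of the two blocks before stacking may be worth trying.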

edited Jan 26 '15 at 05:01

answered Dec 16 '14 at 01:47

elyase


Actually, I get the error `'y' is not defined`. When I try `X,y = scipy.sparse.hstack([vectorized, count_vectorized])` I get the error: `TypeError: 'coo_matrix' object has no attribute '__getitem__'` – jxn Dec 16 '14 at 22:20
The result of the hstack should be assigned to X not to X, y – elyase Dec 17 '14 at 13:21
You don't need y if model is KMeans. Sorry that my example code was misleading. – elyase Dec 17 '14 at 17:38
should it be `X = scipy.sparse.hstack([vectorized, count_vectorized])` instead? #missing a square bracket? – jxn Jan 26 '15 at 04:58
