What you're trying to do is known as dimensionality reduction. It comes in several variants; in the broadest sense it is divided into supervised and unsupervised methods. Either flavor can be implemented with the scikit-learn API as shown below:
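from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.manifold import TSNE

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Turn the documents into a sparse TF-IDF matrix (one column per term).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print('Original Features:', X.shape[1])

# Unsupervised: t-SNE embeds the rows in 2 dimensions without using labels.
# perplexity must be smaller than the number of samples (4 here).
X_unsupervised = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=3).fit_transform(X)
print('Features after Unsupervised Dimension Reduction:', X_unsupervised.shape[1])

# Supervised: keep the 2 terms with the highest chi-squared score against y.
y = [1, 0, 0, 1]
X_supervised = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Features after Supervised Dimension Reduction:', X_supervised.shape[1])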
Output:
Original Features: 9
Features after Unsupervised Dimension Reduction: 2
Features after Supervised Dimension Reduction: 2