Is there a way to reduce the size of the sklearn 20newsgroups dataset?

Question

I am in the process of learning the basics of NLP and I am trying to code a kNN classifier.

In the data preparation stage, I am trying to reduce the dataset down to a certain dimension, but I am not sure how to do that.

Can anyone help me out?

I have written the code below to get the training dataset:

from sklearn.datasets import fetch_20newsgroups
trainingData = fetch_20newsgroups(subset="train", categories=allCategories)

"Thanks. Rahul" why is this mentioned here. Is it related to your question, doest it helps while asking or answering question? — imxitiz, Aug 31 '22 at 07:25

score 0 · Answer 1 · answered Aug 31 '22 at 07:15

What you're trying to do is known as dimensionality reduction, which comes in several variants; in the broadest sense, it is divided into supervised and unsupervised methods. Either flavor can be implemented with the sklearn API as below:

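from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.manifold import TSNE

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Turn the documents into a sparse TF-IDF matrix (one column per term)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print('Original Features: ', X.shape[1])
# Unsupervised: embed the documents in 2 dimensions without using labels
X_unsupervised = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=3).fit_transform(X)
print('Features after Unsupervised Dimension Reduction: ', X_unsupervised.shape[1])
# Supervised: keep the k features that score highest against the labels y
y = [1, 0, 0, 1]
X_supervised = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Features after Supervised Dimension Reduction: ', X_supervised.shape[1])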

output:

Original Features:  9
Features after Unsupervised Dimension Reduction:  2
Features after Supervised Dimension Reduction:  2
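
Since the question is about preparing data for kNN, here is a minimal sketch of how a reduction step might be chained in front of the classifier. It swaps in TruncatedSVD (another unsupervised reducer in sklearn that accepts sparse TF-IDF input and can transform unseen documents) and uses a tiny made-up corpus in place of 20newsgroups so it runs without a download. The names docs, labels and n_components, and the choice of 2 components and 3 neighbors, are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the 20newsgroups texts and labels.
docs = [
    "the rocket launch reached orbit",
    "nasa plans a new space mission",
    "the satellite orbit was adjusted",
    "the hockey team won the game",
    "the goalie saved the final shot",
    "a late goal decided the hockey match",
]
labels = [0, 0, 0, 1, 1, 1]

n_components = 2  # target number of dimensions (assumed value)

# TF-IDF -> reduce to n_components dimensions -> kNN on the reduced vectors
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=n_components, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(docs, labels)

# Slice off the classifier to inspect the reduced training matrix
reduced = model[:-1].transform(docs)
print("Reduced shape:", reduced.shape)

# New documents pass through the same fitted vectorizer and reducer
predictions = model.predict(["the shuttle reached orbit", "the hockey goalie"])
print("Predictions:", predictions)
```

With the real dataset, docs and labels would be replaced by trainingData.data and trainingData.target, and n_components would typically be much larger than 2.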
