
When I fetch the vectorized 20 newsgroups data with

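import pandas as pd
from sklearn.datasets import fetch_20newsgroups_vectorized

newsgroups = fetch_20newsgroups_vectorized(subset='all')
labels = newsgroups.target_names
target = newsgroups.target
# Map each integer target to its newsgroup name
target = pd.DataFrame([labels[i] for i in target], columns=['label'])
data = newsgroups.data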

data is a <class 'scipy.sparse.csr.csr_matrix'> with shape (18846, 130107).

How can I subset the data by target names (for example, extract only 'rec.sport.baseball') and use vector operations on those sparse row vectors (for example, calculate the mean vector or the distances)?

Venkatachalam
Rosa

1 Answer


Unfortunately, fetch_20newsgroups_vectorized has no option to subset the data by target name. fetch_20newsgroups does (the categories parameter), but you then have to vectorize the data yourself.

Here is how you can do it.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Load only the posts from the category of interest
newsgroups = fetch_20newsgroups(subset='all',
                                categories=['rec.sport.baseball'])
# Turn the raw text into a sparse TF-IDF matrix
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups.data)
print(vectors.shape)
# (994, 13986)
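
If you would rather keep the already vectorized matrix, another option is to select rows with a boolean mask built from the target array, and then do the vector operations on the result. Below is a minimal sketch of that idea, not a definitive implementation. It uses a tiny made-up matrix in place of the real data, target and target_names, and rows_for_label is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances


def rows_for_label(data, target, target_names, name):
    """Hypothetical helper: return the rows of `data` whose label is `name`."""
    label_id = list(target_names).index(name)
    mask = np.asarray(target) == label_id
    return data[mask]  # boolean row indexing keeps the result sparse


# Toy stand-ins for newsgroups.data, .target and .target_names (made-up values).
data = csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]))
target = np.array([0, 1, 0, 1])
target_names = ["rec.sport.baseball", "sci.space"]

subset = rows_for_label(data, target, target_names, "rec.sport.baseball")

# Mean over the selected rows; the result is one dense vector of length n_features.
mean_vector = np.asarray(subset.mean(axis=0)).ravel()

# Euclidean distance from each selected row to the mean vector.
distances = pairwise_distances(subset, mean_vector.reshape(1, -1)).ravel()
```

With the real data you would pass newsgroups.data, newsgroups.target and newsgroups.target_names instead of the toy arrays. The same mean and distance steps should also apply to the vectors matrix built with TfidfVectorizer above.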


Venkatachalam