When I fetch the 20newsgroups_vectorized data with
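import pandas as pd
from sklearn.datasets import fetch_20newsgroups_vectorized

newsgroups = fetch_20newsgroups_vectorized(subset='all')
labels = newsgroups.target_names
target = newsgroups.target
# Map each integer target to its newsgroup name
target = pd.DataFrame([labels[i] for i in target], columns=['label'])
data = newsgroups.data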
data is a <class 'scipy.sparse.csr.csr_matrix'> with shape (18846, 130107).
How can I subset the data by target name (for example, extract only the 'rec.sport.baseball' rows) and apply vector operations to those sparse row vectors (for example, compute the mean vector or the pairwise distances)?
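To make the question concrete, here is a minimal sketch of the kind of operation I have in mind. It runs on a tiny hand-made CSR matrix rather than the real dataset, so the toy arrays and the helper name rows_for_label are assumptions for illustration only, not part of scikit-learn. It selects rows by integer row index and then computes a mean vector and pairwise Euclidean distances on the sparse subset.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances


def rows_for_label(data, target, target_names, name):
    """Return the sparse rows of `data` whose label equals `name`.

    Hypothetical helper: `target` holds integer indices into `target_names`.
    """
    label_id = list(target_names).index(name)
    row_idx = np.flatnonzero(np.asarray(target) == label_id)
    return data[row_idx]  # row indexing keeps the result sparse (CSR)


# Toy stand-in for newsgroups.data / newsgroups.target (assumed values)
toy_names = ['rec.autos', 'rec.sport.baseball']
toy_target = np.array([1, 0, 1, 0])
toy_data = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0],
]))

baseball = rows_for_label(toy_data, toy_target, toy_names, 'rec.sport.baseball')

# Mean over rows; flatten the (1, n_features) result to a 1-D array
mean_vec = np.asarray(baseball.mean(axis=0)).ravel()

# Pairwise Euclidean distances between the selected sparse rows
dists = pairwise_distances(baseball, metric='euclidean')
```

With the real dataset, the same call would presumably be rows_for_label(newsgroups.data, newsgroups.target, newsgroups.target_names, 'rec.sport.baseball'), since newsgroups.target holds integer indices into target_names.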