During an NLP process, I transform a corpus of texts using TF-IDF, which yields a scipy.sparse.csr.csr_matrix.
I then split this data into train and test corpus and resample my train corpus in order to tackle a class imbalance problem.
The issue I'm facing is that when I use the resampled index (from the label, which is of type pandas.Series) to resample the sparse matrix like this:
tfs[Ytr_resample.index]
it takes a long time and then raises this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]
/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
348 csr_sample_values(self.shape[0], self.shape[1],
349 self.indptr, self.indices, self.data,
--> 350 num_samples, row.ravel(), col.ravel(), val)
351 if row.ndim == 1:
352 # row and col are 1d
ValueError: could not convert integer scalar
Following this thread I checked that the biggest element in the index wouldn't be bigger than the number of rows in my sparse matrix.
The problem seems to come from the fact that the index is stored as np.int64 and not as np.int32. Indeed, the following works:
Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]
Therefore I have two questions:
- Is the error actually coming from this conversion (int32 to int64)?
- Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type for some reason)
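For reference, here is a minimal sketch of a vectorized version of the list-comprehension workaround above. It converts the index to a plain NumPy array with an explicit dtype rather than calling astype on the pandas index itself. The toy matrix and labels are made up for illustration, and the sketch assumes the index labels coincide with row positions in tfs; it does not establish whether the dtype is the real cause of the error.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for the TF-IDF matrix and the resampled labels.
dense = np.arange(24).reshape(6, 4)
tfs = csr_matrix(dense)
Ytr_resample = pd.Series(list("aabbcc"), index=[5, 0, 3, 3, 1, 5], name="cat")

# Build a NumPy int32 array from the index labels in one step.
# This bypasses the pandas index type entirely, so the dtype is
# whatever is requested here.
row_idx = np.asarray(Ytr_resample.index, dtype=np.int32)

# Select (possibly repeated) rows of the sparse matrix by position.
Xtr_resample = tfs[row_idx]
```

Here row_idx is an ordinary ndarray, so checking row_idx.dtype shows directly which integer type is passed to the sparse matrix.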
EDIT:
Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:
Int64Index([1484, 753, 1587, 1494, 357, 1484, 84, 823, 424, 424,
...
1558, 1619, 1317, 1635, 537, 1206, 1152, 1635, 1206, 131],
dtype='int64', length=4840)
I created Ytr_resample by resampling Ytr (which is a pandas.Series) such that every category present in Ytr has the same number of elements (by oversampling):
n_samples = Ytr.value_counts(dropna=False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(
    lambda x: x.sample(n_samples, replace=True, random_state=42)).cat
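To make the oversampling step reproducible on its own, here is a small self-contained sketch on made-up labels. Note that it swaps in a different technique from the snippet above: it uses the groupby(...).sample(...) method (available in pandas 1.1 and later, as far as I know) instead of groupby(...).apply(...), and it keeps the original flat index. The toy Series is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical imbalanced labels: 4 x "a", 2 x "b", 1 x "c".
Ytr = pd.Series(["a", "a", "a", "a", "b", "b", "c"], name="cat")

# Target size per category: the size of the largest category.
n_samples = Ytr.value_counts(dropna=False).max()

# Draw n_samples rows with replacement from every category.
# The result keeps the original index labels, with repeats.
Ytr_resample = Ytr.groupby(Ytr, dropna=False).sample(
    n=n_samples, replace=True, random_state=42
)

# Every category should now appear n_samples times.
counts = Ytr_resample.value_counts()
```

Inspecting Ytr_resample.index afterwards shows the repeated row labels that are later used to index the sparse matrix.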