
During an NLP pipeline, I transform a corpus of texts using TF-IDF, which yields a scipy.sparse.csr.csr_matrix.

I then split this data into train and test sets and resample the training set to address a class imbalance problem.

The issue I'm facing arises when I use the resampled index (taken from the labels, which are a pandas.Series) to select rows of the sparse matrix like this:

tfs[Ytr_resample.index]

It takes a long time and then raises this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]

/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
    348         csr_sample_values(self.shape[0], self.shape[1],
    349                           self.indptr, self.indices, self.data,
--> 350                           num_samples, row.ravel(), col.ravel(), val)
    351         if row.ndim == 1:
    352             # row and col are 1d

ValueError: could not convert integer scalar

Following this thread, I checked that the largest value in the index does not exceed the number of rows in my sparse matrix.

The problem seems to be that the index is stored as np.int64 rather than np.int32. Indeed, the following works:

Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]

Therefore I have two questions:

  1. Does the error actually come from the conversion between int64 and int32?
  2. Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type for some reason)
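For the second question, a minimal sketch of one vectorized alternative to the list comprehension: build a plain NumPy int32 array from the index and pass that to the sparse matrix. The small random matrix and the toy label Series below are stand-ins I made up for `tfs` and `Ytr_resample`, and the sketch assumes the index values are row positions in the matrix.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the TF-IDF matrix `tfs` (10 documents, 5 terms).
tfs = sparse.random(10, 5, density=0.5, format="csr", random_state=0)

# Hypothetical stand-in for the resampled labels: the index repeats
# row positions, as oversampling with replacement would.
Ytr_resample = pd.Series(list("aabbab"), index=[7, 2, 2, 9, 0, 7])

# Convert the pandas index to a NumPy array of int32 row positions
# in one step, instead of casting element by element.
rows = np.asarray(Ytr_resample.index, dtype=np.int32)

# Select the (possibly repeated) rows from the sparse matrix.
Xtr_resample = tfs[rows]
```

The conversion happens on the NumPy side rather than through `Index.astype`. In pandas versions of that era, integer indexes were, as far as I know, always stored as int64, which may be why `astype(np.int32)` on the index appeared to do nothing.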

EDIT:

Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:

Int64Index([1484,  753, 1587, 1494,  357, 1484,   84,  823,  424,  424,
        ...
        1558, 1619, 1317, 1635,  537, 1206, 1152, 1635, 1206,  131],
       dtype='int64', length=4840)

I created Ytr_resample by oversampling Ytr (a pandas.Series) so that every category present in Ytr has the same number of elements:

n_samples = Ytr.value_counts(dropna=False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(
    lambda x: x.sample(n_samples, replace=True, random_state=42)).cat
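To make the oversampling step easier to reproduce, here is a small sketch on toy labels. It is not the code above: it swaps the `apply`/`lambda` pattern for `GroupBy.sample`, which assumes pandas 1.1 or later, and the six-element `Ytr` is invented for illustration.

```python
import pandas as pd

# Hypothetical imbalanced labels: three "a", one "b", two "c".
Ytr = pd.Series(["a", "a", "a", "b", "c", "c"], name="cat")

# Size of the largest category; every category is sampled up to this size.
n_samples = int(Ytr.value_counts(dropna=False).max())

# Sample n_samples labels per category with replacement.
# The result keeps the original index values, with repeats.
Ytr_resample = Ytr.groupby(Ytr).sample(
    n=n_samples, replace=True, random_state=42
)
```

The resulting index contains only labels from the original `Ytr.index`, so it can be used to pick the matching rows of the feature matrix.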
ysearka
  • Tell us more about `Ytr_resample.index`. dtype, shape etc. – hpaulj Aug 27 '18 at 16:19
  • @hpaulj I edited my question to add the type and shape of `Ytr_resample.index` as well as the piece of code I use to produce it. Hope it helps! – ysearka Aug 28 '18 at 07:12
  • I don't know `pandas` very well, but `values` is widely used to extract arrays from pandas objects, e.g. `Ytr_resample.index.values`. – hpaulj Aug 28 '18 at 07:29
  • It's a good idea, I tried this and it works. However, I expected some performance gain compared to converting the index to `int32` element by element. Weirdly, it gives the exact same computation time (I tested using `timeit`). – ysearka Aug 28 '18 at 08:26

0 Answers