
I'm trying to predict the category of a news article based on 2 features: author name and article headline.

I have transformed both columns separately using CountVectorizer and TfidfTransformer. What I have now is a 3D array (i.e., an array of lists of arrays), where each row contains the [author_tfid, summary_tfid] of one data instance:

X_train = array([[array([0., 3., 0., ..., 0., 4., 0.]),
                  array([0., 0., 3., ..., 0., 0., 0.])],
                 [array([0., 0., 0., ..., 0., 0., 9.]),
                  array([1., 0., 0., ..., 0., 0., 0.])],
                 [array([2., 0., 0., ..., 0., 0., 0.]),
                  array([0., 0., 0., ..., 0., 5., 0.])],

However, when I call fit_resample on imblearn's RandomOverSampler with this array, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-44-210227188cde> in <module>()
----> 1 X_oversampled, y_oversampled = oversampler.fit_resample(X, y)

4 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     62     # for object dtype data, we only check for NaNs (GH-13254)
     63     elif X.dtype == np.dtype('object') and not allow_nan:
---> 64         if _object_dtype_isnan(X).any():
     65             raise ValueError("Input contains NaN")
     66 

AttributeError: 'bool' object has no attribute 'any'

I searched the forums and Google but couldn't find anyone with this problem. What is going wrong, and what is the correct way to oversample a 3D array?

Brian

1 Answer


You have to pass a 2D array to the oversampler. To get one, concatenate the author and summary features of each sample into a single row:

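import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Toy data: 3 samples, each holding an author vector and a summary vector
X = np.array([[np.array([0., 3., 0., 0., 4., 0.]),
               np.array([0., 0., 3., 0., 0., 0.])],
              [np.array([0., 0., 0., 0., 0., 9.]),
               np.array([1., 0., 0., 0., 0., 0.])],
              [np.array([2., 0., 0., 0., 0., 0.]),
               np.array([0., 0., 0., 0., 5., 0.])]])

# Merge the two vectors of each sample into one row: (3, 2, 6) -> (3, 12)
X = X.reshape(len(X), -1)
y = np.array([0, 1, 1])

oversampler = RandomOverSampler(random_state=42)
X_oversampled, y_oversampled = oversampler.fit_resample(X, y)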
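
The reshape above works because both toy vectors have the same length, so NumPy builds a regular float array. The `X.dtype == np.dtype('object')` branch in the question's traceback suggests the real `X_train` is an object-dtype array, which can happen when the author and summary vectors differ in length. In that case a plain reshape would not produce a numeric 2D array. A minimal sketch of one way to flatten such an array, assuming each row holds two 1D NumPy vectors (`X_nested` and `X_flat` are hypothetical names and the values are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in for X_train: shape (3, 2), dtype=object,
# where the author vector (length 4) and summary vector (length 3) differ in size.
rows = [
    ([0., 3., 0., 4.], [0., 0., 3.]),
    ([0., 0., 0., 9.], [1., 0., 0.]),
    ([2., 0., 0., 0.], [0., 5., 0.]),
]
X_nested = np.empty((len(rows), 2), dtype=object)
for i, (author_vec, summary_vec) in enumerate(rows):
    X_nested[i, 0] = np.array(author_vec)
    X_nested[i, 1] = np.array(summary_vec)

# Join the two vectors of each row end to end, then stack the rows
# into an ordinary 2D float array.
X_flat = np.vstack([np.concatenate(list(row)) for row in X_nested])

print(X_flat.shape, X_flat.dtype)
```

The resulting `X_flat` is a regular 2D array that can be passed to `fit_resample` in place of the nested one.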
Marco Cerliani