
I am trying to do random oversampling on a small dataset for linear regression. However, it seems the scikit-learn-style sampling API (imbalanced-learn) doesn't accept float values as the target variable. Is there any way to solve this?

Here is a sample of my y_train values, which are log-transformed:

3.688879 3.828641 3.401197 3.091042 4.624973

from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
      1 from imblearn.over_sampling import RandomOverSampler

~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     73             The corresponding label of `X_resampled`.
     74         """
---> 75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X, y)
     77         X, y, binarize_y = self._check_X_y(X, y)

~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'
afsharov
DDM
  • Why do you want to re-sample continuous targets? It looks like you have a regression problem at hand, since you want to perform a linear regression. Re-sampling strategies, however, are meant for classification problems with an imbalanced class distribution. That is why the `RandomOverSampler` will not accept `float` targets. – afsharov May 18 '21 at 09:39
  • Hi, thanks. Any idea what I should do if I have a small dataset for a linear regression problem? – DDM May 18 '21 at 10:30
  • A quick search led me to this [article](https://towardsdatascience.com/repurposing-traditional-resampling-techniques-for-regression-tasks-d1a9939dab5d). The [`reg_resampler`](https://github.com/atif-hassan/Regression_ReSampling) package could be useful for your approach. I will leave an answer with an example that uses that package. – afsharov May 18 '21 at 11:29
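
The check that fails in the traceback can be reproduced on its own. The following is a minimal sketch, assuming scikit-learn is installed; it uses `sklearn.utils.multiclass.type_of_target`, the helper that determines the label type reported in the error message, on the sample values from the question and on a made-up list of integer labels (`y_labels` is illustrative, not from the question).

```python
from sklearn.utils.multiclass import type_of_target

# Log-transformed targets from the question: real-valued, so scikit-learn
# treats them as a regression-style target.
y_train = [3.688879, 3.828641, 3.401197, 3.091042, 4.624973]
continuous_kind = type_of_target(y_train)

# Illustrative integer class labels, the kind of target the samplers expect.
y_labels = [0, 1, 1, 0, 1]
label_kind = type_of_target(y_labels)

print(continuous_kind, label_kind)
```

Only targets of a classification type pass `check_classification_targets`, which is why float targets are rejected before any resampling happens.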

1 Answer


Re-sampling strategies are meant for classification problems, not regression problems. Hence, the RandomOverSampler will not accept float targets. There are approaches to re-sample data with continuous targets, though. One example is the reg_resampler package, which can be used as follows:

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression
from reg_resampler import resampler
import numpy as np


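# Create some dummy data for demonstration (make_regression defaults to 100 samples)
X, y = make_regression(n_features=10)
# Append the target as the last column (index 10) so features and target are in one array
df = np.append(X, y.reshape(-1, 1), axis=1)

# Initialize the resampler object and generate pseudo-classes
# target=10 is the column index of the target in df
rs = resampler()
y_classes = rs.fit(df, target=10)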

# Now resample
X_res, y_res = rs.resample(
    sampler_obj=RandomOverSampler(random_state=27),
    trainX=df,
    trainY=y_classes
)

The resampler object will generate pseudo-classes based on your target values and then use a classic re-sampling object from the imblearn package to re-sample your data. Note that the data you pass to the resampler object should contain all data, including the targets.
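
To make the idea of pseudo-classes concrete, here is a rough NumPy-only sketch of the same two steps: bin the continuous target, then randomly duplicate rows until every non-empty bin is equally populated. It is not the `reg_resampler` implementation; the function name `oversample_by_bins`, the equal-width binning, and the demo data are assumptions made for illustration.

```python
import numpy as np


def oversample_by_bins(X, y, n_bins=3, seed=0):
    """Randomly oversample rows so every non-empty target bin has equal size."""
    rng = np.random.default_rng(seed)

    # Pseudo-classes: equal-width bins over the range of the continuous target.
    inner_edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    pseudo = np.digitize(y, inner_edges)  # values in 0 .. n_bins - 1
    counts = np.bincount(pseudo, minlength=n_bins)
    target_size = counts.max()

    indices = [np.arange(len(y))]  # keep every original row
    for cls in np.flatnonzero(counts):
        members = np.flatnonzero(pseudo == cls)
        n_extra = target_size - len(members)
        if n_extra > 0:
            # Draw duplicates with replacement from this bin.
            indices.append(rng.choice(members, size=n_extra, replace=True))

    idx = np.concatenate(indices)
    return X[idx], y[idx], pseudo[idx]


# Demo on skewed dummy data (assumed for illustration).
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = demo_rng.lognormal(size=200)

X_res, y_res, bins_res = oversample_by_bins(X_demo, y_demo)
print(np.bincount(bins_res))
```

The original rows come first in the output and the appended rows are exact copies, so the continuous targets of the duplicated rows are carried along unchanged.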

afsharov
  • Hi, does this also change the original data points, since it's a resampling technique? – DDM May 19 '21 at 06:52
  • Resampling never changes the original data points. If you have any concerns in that regard, you can verify this by resampling a small dataset and checking the result. – afsharov May 19 '21 at 09:38
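
Such a check on a small dataset could look like the sketch below. It uses the five target values from the question, one dummy feature, and a hand-rolled random oversampling step in NumPy instead of an actual sampler; the threshold of 4.0 for the pseudo-classes is an arbitrary assumption. The same two checks can be run on the output of a real sampler.

```python
import numpy as np

# Tiny dataset: the five target values from the question plus one dummy feature.
y = np.array([3.688879, 3.828641, 3.401197, 3.091042, 4.624973])
X = np.arange(5, dtype=float).reshape(-1, 1)
data = np.column_stack([X, y])

# Hypothetical pseudo-classes: 1 if the target is above 4.0, else 0.
pseudo = (y > 4.0).astype(int)

# Random oversampling: append duplicates of minority rows until classes match.
rng = np.random.default_rng(42)
minority = np.flatnonzero(pseudo == 1)
n_extra = int(np.sum(pseudo == 0)) - len(minority)
extra_idx = rng.choice(minority, size=n_extra, replace=True)
resampled = np.vstack([data, data[extra_idx]])

# Check 1: the original rows are still there, unchanged.
originals_kept = np.array_equal(resampled[: len(data)], data)
# Check 2: every row in the result is an exact copy of some original row.
all_rows_are_copies = all((row == data).all(axis=1).any() for row in resampled)

print(resampled.shape, originals_kept, all_rows_are_copies)
```

Check 2 holds for random oversampling specifically. Samplers such as SMOTE add synthetic rows rather than copies, although they also leave the original rows as they are.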