RandomOverSampler doesn't seem to accept log transform as my y target variable

Question

I am trying to do random oversampling on a small dataset for linear regression. However, it seems the imbalanced-learn (`imblearn`) sampling API doesn't work with float values as the target variable. Is there any way to solve this?

This is a sample of my y_train values, which are log transformed.

3.688879 3.828641 3.401197 3.091042 4.624973

from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
      1 from imblearn.over_sampling import RandomOverSampler

~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     73             The corresponding label of `X_resampled`.
     74         """
---> 75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X, y)
     77         X, y, binarize_y = self._check_X_y(X, y)

~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'

Why do you want to re-sample continuous targets? It looks like you have a regression problem at hand, as you want to perform a linear regression. Re-sampling strategies however are meant for classification problems with an imbalance in class distribution. That is why the `RandomOverSampler` will not accept `float` as the type for the targets. Re-sampling is not meant for regression problems after all. — afsharov, May 18 '21 at 09:39
Hi, thanks. Any idea what I should do if I have a small dataset for a linear regression problem? — DDM, May 18 '21 at 10:30
A quick search led me to this [article](https://towardsdatascience.com/repurposing-traditional-resampling-techniques-for-regression-tasks-d1a9939dab5d). The [`reg_resampler`](https://github.com/atif-hassan/Regression_ReSampling) package could be useful in your approach. I will leave an answer with an example that uses that package. — afsharov, May 18 '21 at 11:29

score 2 · Accepted Answer · answered May 18 '21 at 11:35

Re-sampling strategies are meant for classification problems with imbalanced classes, not for regression problems. Hence, `RandomOverSampler` will not accept float-type targets. There are approaches to re-sample data with continuous targets, though. One example is the `reg_resampler` package, which can be used as follows:

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression
from reg_resampler import resampler
import numpy as np


# Create some dummy data for demonstration
X, y = make_regression(n_features=10)
# Append the targets as the last column (index 10); -1 infers the row count
df = np.append(X, y.reshape(-1, 1), axis=1)

# Initialize the resampler object and generate pseudo-classes
rs = resampler()
y_classes = rs.fit(df, target=10)

# Now resample
X_res, y_res = rs.resample(
    sampler_obj=RandomOverSampler(random_state=27),
    trainX=df,
    trainY=y_classes
)

The resampler object will generate pseudo-classes based on your target values and then use a classic re-sampling object from the imblearn package to re-sample your data. Note that the data you pass to the resampler object should contain all data, including the targets; `target=10` is the index of the target column in that array.
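To make the idea of pseudo-classes more concrete, here is a minimal NumPy-only sketch of the same two steps: bin the continuous targets, then oversample the rare bins by duplicating rows. It is not the `reg_resampler` implementation. The dummy data, the choice of 3 equal-width bins, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(27)

# Dummy log-scale targets (assumption): high values are rare
y = np.concatenate([rng.normal(3.5, 0.2, size=90), rng.normal(5.0, 0.2, size=10)])
X = rng.normal(size=(y.size, 3))

# Step 1: bin the continuous targets into pseudo-classes (3 equal-width bins)
n_bins = 3
edges = np.histogram_bin_edges(y, bins=n_bins)
pseudo = np.digitize(y, edges[1:-1])  # integer labels 0 .. n_bins - 1

# Step 2: random oversampling, i.e. draw row indices with replacement until
# every non-empty pseudo-class is as large as the biggest one
counts = np.bincount(pseudo, minlength=n_bins)
extra = [
    rng.choice(np.flatnonzero(pseudo == c), size=counts.max() - counts[c])
    for c in range(n_bins)
    if 0 < counts[c] < counts.max()
]
idx = np.concatenate([np.arange(y.size), *extra]).astype(int)

# Step 3: index the original arrays, so the targets stay continuous
X_res, y_res = X[idx], y[idx]

print("pseudo-class counts before:", counts)
print("pseudo-class counts after: ", np.bincount(pseudo[idx], minlength=n_bins))
```

The pseudo-classes are only used to decide which rows get duplicated; the regression model is still trained on the continuous `y_res`.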

Hi, does this also change the original data points, since it's a resampling technique? — DDM, May 19 '21 at 06:52
Resampling never changes the original data points. If you have any concerns in that regard, you can verify this by resampling a small dataset and checking the result. — afsharov, May 19 '21 at 09:38
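The check suggested in the comment above could look roughly like the following sketch. It uses plain NumPy row duplication as a stand-in for a random oversampler, so the tiny dataset and the variable names are assumptions, not output from `imblearn` or `reg_resampler`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny dataset: 6 rows, 2 features plus the continuous target as last column
data = rng.normal(size=(6, 3))
original = data.copy()  # snapshot to compare against afterwards

# Random oversampling: append rows drawn with replacement from the data
dup_idx = rng.integers(0, len(data), size=4)
resampled = np.vstack([data, data[dup_idx]])

# Pairwise row comparison: matches[i, j] is True if resampled row i
# equals original row j exactly
matches = (resampled[:, None, :] == original[None, :, :]).all(axis=2)

# 1) The input array itself was not modified
assert np.array_equal(data, original)
# 2) Every resampled row is an exact copy of some original row
assert matches.any(axis=1).all()
# 3) All original rows are still present in the resampled data
assert matches.any(axis=0).all()
```

With a real sampler you would run the same three checks on its output. Samplers that synthesize new points (SMOTE, for example) would add rows that fail check 2, though checks 1 and 3 should still hold.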
