I want to integrate outlier removal into a machine learning pipeline with a continuous dependent variable. The challenge is keeping X and y the same length, so I have to remove the outliers from both.

Since this turned out to be difficult or impossible with sklearn alone, I switched to imblearn and its FunctionSampler. Inspired by the documentation, I tried the following code:

from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

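def outlier_rejection(X, y):
    # rng is not defined in this snippet; it is assumed to be a seed or a
    # numpy RandomState created earlier.
    model = IsolationForest(max_samples=100, contamination=0.4, random_state=rng)
    model.fit(X)
    y_pred = model.predict(X)  # 1 for inliers, -1 for outliers

    # Apply the same mask to X and y so they stay the same length.
    return X[y_pred == 1], y[y_pred == 1]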

pipe = make_pipeline(
    FunctionSampler(func = outlier_rejection),
    LinearRegression()
)

pipe.fit(X_train, y_train) # y_train is a continuous variable!

However, when I tried to fit the pipeline, I got the error:

ValueError: Unknown label type: 'continuous'

which I think is because my dependent variable is continuous.
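
To check that suspicion, here is a small sketch of the scikit-learn target-type check that I assume is behind the error. I have not confirmed that imblearn calls exactly this helper, and the toy arrays are made up for illustration:

```python
import numpy as np
from sklearn.utils.multiclass import check_classification_targets, type_of_target

# Hypothetical toy targets: one regression-style, one classification-style.
y_continuous = np.array([0.37, 1.82, 2.05, 3.14])
y_discrete = np.array([0, 1, 1, 0])

# type_of_target reports how scikit-learn interprets a target array.
target_kind_continuous = type_of_target(y_continuous)
target_kind_discrete = type_of_target(y_discrete)

# check_classification_targets raises ValueError for non-classification targets.
try:
    check_classification_targets(y_continuous)
    rejected = False
except ValueError:
    rejected = True

print(target_kind_continuous, target_kind_discrete, rejected)
```

If the sampler runs a check like this on y before calling my function, a float-valued y_train would be rejected regardless of what outlier_rejection does.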

I suspect that imblearn is only compatible with nominal targets. Is that true? If so, is there another way to solve my problem (e.g. with a classic sklearn pipeline)? If not, where is the mistake in the code above?

Flavia Giammarino