
I'm wondering if it is possible to include scikit-learn outlier detections like isolation forests in scikit-learn's pipelines?

So the problem here is that we want to fit such an object only on the training data and do nothing on the test data. Particularly, one might want to use cross-validation here.

What could a solution look like?

Build a class that inherits from TransformerMixin (and BaseEstimator for parameter tuning). Now define a fit_transform method that tracks whether it has been called before. On the first call, it fits the outlier detector and removes the predicted outliers from the data. On any later call, the outlier detection has already been applied to the training data, so we assume we are now seeing the test data, which we simply return unchanged.
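A minimal sketch of that idea might look like the following. Instead of an explicit state flag, it relies on `fit_transform` being called on the training data and `transform` on the test data; the class name and parameters are illustrative, not an existing API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest


class OutlierRemover(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drops outliers during fit, passes test data through."""

    def __init__(self, contamination=0.05, random_state=0):
        self.contamination = contamination
        self.random_state = random_state

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Called on test data: pass all rows through unchanged.
        return X

    def fit_transform(self, X, y=None):
        # Called on training data: fit the detector and drop flagged rows.
        detector = IsolationForest(
            contamination=self.contamination, random_state=self.random_state
        ).fit(X)
        mask = detector.predict(X) == 1  # +1 = inlier, -1 = outlier
        return X[mask]
```

Note the open problem this sketch does not solve: a scikit-learn pipeline passes only `X` between steps, so the corresponding rows of `y` are not dropped, which breaks supervised steps downstream.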

Does such an approach have a chance to work or am I missing something here?

Quickbeam2k1
  • Use "Boosting" methods. The idea is basically to 'boost' weak learners to make a 'strong learner' out of them. During the learning phase it concentrates on misclassified samples. – MMF Oct 27 '16 at 12:06
  • Hmm, I'm not sure if this is what I want. I don't want to concentrate on misclassified samples of the data, I just want to get rid of them. I want to do something like RANSAC but penalizing certain coefficients later. To this end, I just want to determine rows not to consider, train an elastic net on those data, but predict on new data. – Quickbeam2k1 Oct 27 '16 at 12:50
  • I get it, I'll try to write you a comprehensive answer ;) – MMF Oct 27 '16 at 13:17
  • Hi @Quickbeam2k1, did you manage to get this to work? I'm having the same issue. – JPN Aug 25 '17 at 18:32
  • Hey, unfortunately I haven't made any progress on this so far, and I have to admit that I forgot about this issue. But since I'm currently on parental leave, I might find some time to think about it again – Quickbeam2k1 Aug 25 '17 at 18:47
  • Okay, I just skimmed some source code. Yesterday, I had the idea to use `fit_transform` and `transform`. The first is only called during training, the latter in the other case (due to the architecture of the `fit_transform` function). Unfortunately, the current TransformerMixins only return [transformed `X` values](http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html). This is also [used in pipelines](https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/pipeline.py#L586-L595). So for now, there is only a negative answer. – Quickbeam2k1 Aug 26 '17 at 08:55
  • I don't really get why the thread has been closed. A proper answer is: it's not implemented directly in scikit-learn. However, the contrib package `imbalanced-learn` provides a nice modification of the pipelines that allows for the requested use case, and provides an [example](https://imbalanced-learn.org/en/stable/auto_examples/plot_outlier_rejections.html) – Quickbeam2k1 Apr 04 '19 at 10:32
  • See also this [question](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline) – tgrandje Aug 27 '21 at 07:50

1 Answer


Your problem is basically an outlier detection problem. Fortunately, scikit-learn provides some functions to predict whether a sample in your training set is an outlier or not.

How does it work? If you look at the documentation, it basically says:

One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.

sklearn provides some functions that allow you to estimate the shape of your data. Take a look at elliptic envelope and isolation forests.

As far as I am concerned, I prefer to use the IsolationForest algorithm, which returns an anomaly score for each sample in your training set. You can then remove the flagged samples from your training set.
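As a sketch of that suggestion (the data and downstream model are made up for illustration), fit the detector on the training set, keep only the rows it labels as inliers, and train the model on those:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# predict() returns +1 for inliers and -1 for outliers;
# score_samples() would give the raw anomaly score instead.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
inlier_mask = iso.predict(X) == 1

# Train the downstream model on the cleaned training set only.
model = ElasticNet().fit(X[inlier_mask], y[inlier_mask])
```

Note that this filtering happens outside the pipeline, which is exactly the limitation the question is about.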

MMF
  • This is clear to me; however, it does not tell me how to incorporate the outlier detection in a pipeline using a Transformer object. Maybe I should highlight **pipeline** in the title; I hoped **Transformers** was hinting at it enough. In particular, the outlier detection functions do not provide a transform method, which would be required in a pipeline – Quickbeam2k1 Oct 27 '16 at 14:02