Is it possible to include scikit-learn outlier detectors, such as isolation forests, in scikit-learn pipelines?
The problem is that such an object should be fit only on the training data and should leave the test data untouched. In particular, one might want to use it inside cross-validation.
What could a solution look like?
Build a class that inherits from TransformerMixin (and from BaseEstimator, to support parameter tuning). Define a fit_transform method that records whether it has already been called. On the first call, it fits the outlier detector on the data and uses its predictions to drop the detected outliers. On any later call, the outlier detection has already been applied to the training data, so we assume the input is now the test data and simply return it unchanged.
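To make the idea concrete, here is a minimal sketch of what I have in mind. The class name OutlierRemover, the called_ flag, and the choice of IsolationForest with these constructor parameters are only assumptions for illustration, and the class is exercised on its own rather than inside a Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest


class OutlierRemover(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: drops outliers on the first fit_transform call only."""

    def __init__(self, contamination="auto", random_state=None):
        self.contamination = contamination
        self.random_state = random_state

    def fit_transform(self, X, y=None):
        X = np.asarray(X)
        if getattr(self, "called_", False):
            # Called before: assume this is test data and return it unchanged.
            return X
        self.detector_ = IsolationForest(
            contamination=self.contamination, random_state=self.random_state
        )
        # fit_predict labels inliers as +1 and outliers as -1.
        labels = self.detector_.fit_predict(X)
        self.called_ = True
        return X[labels == 1]


# Toy data (made up for this sketch): a Gaussian blob plus two far-away points.
rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0], [-12.0, 9.0]]])
X_test = rng.normal(size=(20, 2))

remover = OutlierRemover(contamination=0.05, random_state=0)
X_train_clean = remover.fit_transform(X_train)  # first call: fit and filter
X_test_out = remover.fit_transform(X_test)      # second call: pass-through
print(X_train.shape, X_train_clean.shape, X_test_out.shape)
```

Note that this sketch only filters X and relies on calling fit_transform twice by hand. Keeping y aligned with the filtered rows is not addressed here.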
Does such an approach have a chance of working, or am I missing something here?