
I have been trying to find a solution to this, but without success so far.

I am working with some data for which I need to adopt a resampling procedure within a (scikit-learn/imblearn) pipeline, meaning that the number of both samples and targets has to change within the pipeline. To do this I am using FunctionSampler from imblearn.

My problem is that the main pipeline is composed of steps which are, in fact, pipelines themselves, and this is causing problems. The code below shows an extremely simplified version of my scenario. Please note this is not the actual code I am using (the transformers/classifiers are different, and there are many more of them in the original code); only the structure is similar.

# pipeline definition
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler

def outlier_extractor(X, y):
    # just an example: must return the (re)sampled X and y
    return X, y

pipe = Pipeline(steps=[("feature_engineering", PolynomialFeatures()),
                       ("variance_threshold", VarianceThreshold()),
                       ("outlier_correction", FunctionSampler(func=outlier_extractor)),
                       ("classifier", QuadraticDiscriminantAnalysis())]) 

# definition of the feature engineering options
feature_engineering_options = [
                               Pipeline(steps=[      
                                               ("scaling", StandardScaler()), 
                                               ("PCA", PCA(n_components=3))
                                               ]),                   
                               
                               Pipeline(steps=[      # add div and prod features
                                               ("polynomial", PolynomialFeatures()), 
                                               ("kBest", SelectKBest())
                                               ])
                               ]

                               
outlier_correction_options = [
                              FunctionSampler(func=outlier_extractor),
                              
                              Pipeline(steps=[  
                                              ("center_scaling", StandardScaler()), 
                                              ("normalisation", Normalizer(norm="l2"))
                                              ])
                              ]

# definition of the parameters to optimize in the pipeline
params = [      # support vector machine
          {"feature_engineering": feature_engineering_options,        
           "variance_threshold__threshold": [0, 0.5, 1],
           "outlier_correction": outlier_correction_options,
           "classifier": [SVC()],     
           "classifier__C": [0.1, 1, 10, 50],
           "classifier__kernel": ["linear", "rbf"],
         },
                # quadratic discriminant analysis
          {"feature_engineering": feature_engineering_options,
           "variance_threshold__threshold": [0, 0.5, 1],
           "outlier_correction": outlier_correction_options,
           "classifier": [QuadraticDiscriminantAnalysis()]        
         }
          ]
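
For concreteness, a FunctionSampler func has to take X and y and return a (possibly smaller) X, y pair. The outlier_extractor above is only a placeholder; an illustrative version (not my actual code, just a simple z-score filter) could look like:

```python
import numpy as np

def outlier_extractor(X, y):
    # Illustrative only: keep rows whose features all lie within 3 standard
    # deviations of the column mean, and drop the rest together with their
    # targets, so X and y shrink consistently.
    X, y = np.asarray(X), np.asarray(y)
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    mask = (z < 3).all(axis=1)
    return X[mask], y[mask]
```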

When using `GridSearchCV(pipe, param_grid=params)` I receive the error `TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample`. I know that I should unpack the pipelines, and I have also tried to follow this and this to solve the problem, but my case seems (to me, at least) more complicated, and I could not get these workarounds to work. Any help/suggestion is very much appreciated. Thanks
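If it helps, the message appears to come from the step validation in imblearn's Pipeline. Roughly paraphrased (this is a sketch, not the library's exact code; I believe the real check is stricter and also rejects a step implementing both transform and fit_resample, which a nested imblearn Pipeline may hit), each intermediate step is checked like:

```python
def check_intermediate_step(est):
    # Rough paraphrase (not the exact library code) of the validation that
    # produces the TypeError above: an intermediate step must be either a
    # transformer (fit + transform) or a sampler (fit_resample).
    is_transformer = hasattr(est, "fit") and hasattr(est, "transform")
    is_sampler = hasattr(est, "fit_resample")
    if not (is_transformer or is_sampler):
        raise TypeError(
            "All intermediate steps of the chain should be estimators that "
            "implement fit and transform or fit_resample"
        )
```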

DrMaga
  • Another SO post about the general topic: https://stackoverflow.com/q/65652054/10495893 I would encourage you to give (preferably a slimmed-down version of) your example on the github issue you linked (https://github.com/scikit-learn-contrib/imbalanced-learn/issues/793), and to post it here too. Setting one of the pipelines as a step in a hyperparameter search seems not to admit an easy "flattening" of the pipeline as suggested elsewhere. – Ben Reiniger Jan 13 '22 at 15:49
  • @BenReiniger will do, but I do not see any possibility of further simplifying the snippet of code, as all that I have included serves the purpose of showing the minimum level of complexity required and excluding solutions that could only work in simpler scenarios. – DrMaga Jan 17 '22 at 21:43
  • I think the "setting a pipeline step in a search to be a pipeline" can be served with just the `feature_engineering` step and its `_options`, so you might not need `variance_threshold` or `outlier_correction` in a MWE. (It _might_ turn out that there's a slick way to fix that issue that wouldn't extend to your full example, but I don't think so?) – Ben Reiniger Jan 18 '22 at 00:04

0 Answers