I have been trying to find a solution to this, so far without success.
I am working with some data for which I need a resampling procedure within a (scikit-learn/imblearn) pipeline, meaning that the number of samples in both the data and the targets has to change within the pipeline. To do this I am using FunctionSampler from imblearn.
My problem is that the main pipeline is composed of steps that are themselves pipelines, and this is causing trouble. The code below shows an extremely simplified version of my scenario. Please note that this is not the actual code I am using (the original has different, and many more, transformers/classifiers); only the structure is similar.
# pipeline definition
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler


def outlier_extractor(X, y):
    # just an example
    return X, y


pipe = Pipeline(steps=[("feature_engineering", PolynomialFeatures()),
                       ("variance_threshold", VarianceThreshold()),
                       ("outlier_correction", FunctionSampler(func=outlier_extractor)),
                       ("classifier", QuadraticDiscriminantAnalysis())])

# definition of the feature engineering options
feature_engineering_options = [
    Pipeline(steps=[
        ("scaling", StandardScaler()),
        ("PCA", PCA(n_components=3))
    ]),
    Pipeline(steps=[  # add div and prod features
        ("polynomial", PolynomialFeatures()),
        ("kBest", SelectKBest())
    ])
]

outlier_correction_options = [
    FunctionSampler(func=outlier_extractor),
    Pipeline(steps=[
        ("center_scaling", StandardScaler()),
        ("normalisation", Normalizer(norm="l2"))
    ])
]

# definition of the parameters to optimize in the pipeline
params = [  # support vector machine
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [SVC()],
     "classifier__C": [0.1, 1, 10, 50],
     "classifier__kernel": ["linear", "rbf"],
     },
    # quadratic discriminant analysis
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [QuadraticDiscriminantAnalysis()]
     }
]
When I run GridSearchCV(pipe, param_grid=params), I get the error TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. I know that I should unpack the nested pipelines, and I have tried to follow this and this to solve the problem, but my case seems (to me, at least) more complicated and I could not get these workarounds to work.
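For reference, this is roughly what I understand by "unpacking": flattening a nested list of steps into a single flat list. It is only a sketch under my own assumptions. The helper flatten_steps and its naming scheme (joining names with a single underscore) are hypothetical and not part of scikit-learn or imblearn. It relies on imblearn's Pipeline being a subclass of scikit-learn's Pipeline, so one isinstance check should cover both.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import Normalizer, StandardScaler


def flatten_steps(steps, prefix=""):
    """Hypothetical helper: recursively expand nested pipelines into flat steps."""
    flat = []
    for name, estimator in steps:
        full_name = f"{prefix}{name}"
        if isinstance(estimator, SkPipeline):
            # A single underscore is used because step names may not contain "__"
            flat.extend(flatten_steps(estimator.steps, prefix=full_name + "_"))
        else:
            flat.append((full_name, estimator))
    return flat


nested_steps = [
    ("feature_engineering", SkPipeline(steps=[
        ("scaling", StandardScaler()),
        ("PCA", PCA(n_components=3)),
    ])),
    ("normalisation", Normalizer(norm="l2")),
]

flat = flatten_steps(nested_steps)
flat_pipe = SkPipeline(steps=flat)
print([name for name, _ in flat])
```

This works for one fixed configuration, but I do not see how to carry it over to the grid search above, where whole sub-pipelines are swapped in and out as parameter values.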
Any help/suggestion is very much appreciated. Thanks