About Sklearn double cross validation with wrapper feature_selection

Question

This is a question about double CV (nested CV) in scikit-learn.

The simplest example would be:

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

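# X, y: feature matrix and target, assumed to be defined already.
# Inner loop: GridSearchCV tunes n_estimators. Outer loop: cross_val_score.
gcv = GridSearchCV(RandomForestRegressor(), param_grid={"n_estimators": [5, 10]})
score_ = cross_val_score(gcv, X, y, cv=5)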

I have no questions about this part.

Wrapper-type feature selection can be evaluated either with CV (RFECV) or on all the data passed to it (RFE). My first question: is it correct to use RFE when it sits inside a pipeline, as below?

from sklearn.feature_selection import RFE, RFECV

rfr = RandomForestRegressor()
pipe = Pipeline([("selector", RFE(estimator=rfr)), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [5, 10]})
score_ = cross_val_score(gcv, X, y, cv=5)
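
To see what the pipeline version does in practice, here is a minimal self-contained sketch. It assumes synthetic data from `make_regression` and small, arbitrary settings (`n_features_to_select=3`, 3 folds) that do not come from the question. Because `cross_validate` clones and refits the whole pipeline on each training split, RFE only sees that split's training rows, and the selected features may differ from fold to fold.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic data (assumption: stands in for the question's X, y).
X, y = make_regression(n_samples=80, n_features=8, n_informative=3, random_state=0)

pipe = Pipeline([
    ("selector", RFE(RandomForestRegressor(n_estimators=10, random_state=0),
                     n_features_to_select=3)),
    ("estimator", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# return_estimator=True keeps the pipeline that was fitted on each training split.
result = cross_validate(pipe, X, y, cv=3, return_estimator=True)

# One boolean mask per fold: the features RFE kept using that fold's training rows.
fold_masks = [est.named_steps["selector"].support_ for est in result["estimator"]]
for i, mask in enumerate(fold_masks):
    print(f"fold {i}: selected feature indices {mask.nonzero()[0].tolist()}")
```

Wrapping the same pipeline in `GridSearchCV` and passing that to `cross_val_score` keeps this property, since the search is itself refitted on each outer training split.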

I suspect that the code below, which uses RFECV instead, results in a triple CV and increases the amount of computation.

from sklearn.feature_selection import RFE, RFECV
pipe = Pipeline([("selector", RFECV(rfr, cv=5)), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [5, 10]})
score_ = cross_val_score(gcv, X, y, cv=5)
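
The extra cost can be estimated with simple arithmetic. The helper below is a hypothetical sketch (the function name and its assumptions are illustrative, not part of scikit-learn). It assumes `GridSearchCV` refits the best candidate once per outer fold (`refit=True`, the default) and that each `RFECV.fit` performs one elimination pass per selector fold plus one final pass on all the rows it received.

```python
def nested_fit_counts(n_candidates, outer_cv=5, search_cv=5, selector_cv=5):
    """Rough fit counts for cross_val_score(GridSearchCV(Pipeline([RFECV, model]))).

    Hypothetical helper for illustration only.
    """
    # Per outer fold: every candidate on every search fold, plus one refit.
    pipeline_fits = outer_cv * (n_candidates * search_cv + 1)
    # Per pipeline fit: one RFE pass per selector fold, plus one final pass.
    rfe_passes = pipeline_fits * (selector_cv + 1)
    return pipeline_fits, rfe_passes


pipeline_fits, rfe_passes = nested_fit_counts(n_candidates=2)
print(pipeline_fits, rfe_passes)  # 55 330
```

With plain `RFE` in the pipeline, each pipeline fit runs a single elimination pass, so under the same assumptions the second number would equal the first.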

Next, SequentialFeatureSelector only offers CV-based evaluation. What code is correct for double CV in that case?

from sklearn.feature_selection import SequentialFeatureSelector

estimator_in_selector = RandomForestRegressor()

sfs = SequentialFeatureSelector(estimator_in_selector, cv=5)
pipe = Pipeline([("selector", sfs), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [5, 10]}, cv=5)
score_ = cross_val_score(gcv, X, y, cv=5)

A more complicated case, where the selector's own parameters are tuned as well:

from sklearn.feature_selection import SequentialFeatureSelector

estimator_in_selector = RandomForestRegressor()
sfs = SequentialFeatureSelector(estimator_in_selector, cv=5)
pipe = Pipeline([("selector", sfs), ("estimator", rfr)])

param_grid = {"selector__n_features_to_select": [3, 5],
              "selector__estimator__n_estimators": [10, 50],
              "estimator__n_estimators": [10, 50]}
gcv = GridSearchCV(pipe, param_grid=param_grid)
# Pass the search object (not the bare pipeline) so the grid search runs inside each outer fold.
score_ = cross_val_score(gcv, X, y, cv=5)
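
A runnable sketch of this kind of setup is below. It assumes synthetic data and deliberately tiny settings (few samples, few trees, 2 to 3 folds) so that it finishes quickly; those numbers are placeholders, not recommendations. The point it illustrates is that the outer `cross_val_score` receives the `GridSearchCV` object, so feature selection and tuning are redone inside every outer training split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption: stands in for the question's X, y).
X, y = make_regression(n_samples=60, n_features=6, n_informative=3, random_state=0)

# The selector gets its own estimator instance and its own internal CV.
sfs = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=5, random_state=0),
    n_features_to_select=2,
    cv=2,
)
pipe = Pipeline([
    ("selector", sfs),
    ("estimator", RandomForestRegressor(random_state=0)),
])

# Tune a selector parameter and a final-estimator parameter together.
param_grid = {
    "selector__n_features_to_select": [2, 3],
    "estimator__n_estimators": [5, 10],
}
gcv = GridSearchCV(pipe, param_grid=param_grid, cv=2)  # inner loop

# Outer loop: one score per outer fold, each from a freshly run grid search.
scores = cross_val_score(gcv, X, y, cv=3)
print(scores)
```

Here the `cv` inside `SequentialFeatureSelector` acts as a third level of splitting, applied to whatever rows the selector is handed by the inner search.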

The same question also applies when using a genetic algorithm for feature selection:

from sklearn_genetic import GAFeatureSelectionCV 
selector = GAFeatureSelectionCV(rfr, cv=5)
