
I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df[my_features]  # all my features
y = df['gold_standard']  # labels

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
features = list(X.columns[rfecv.support_])

I am also performing GridSearchCV as follows to tune the hyperparameters of the RandomForestClassifier.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

X = df[my_features]  # all my features
y = df['gold_standard']  # labels

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)

pred = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred))

However, I am not clear on how to combine the feature selection (RFECV) with GridSearchCV.

EDIT:

When I ran the answer suggested by @Gambit, I got the following error:

ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators='warn', n_jobs=None, oob_score=False,
            random_state=42, verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
   verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.

I could resolve the above issue by prefixing the parameter names with `estimator__` in the param_grid.
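For example, the grid then becomes:

param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}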


My question now is: how do I apply the selected features and the tuned parameters to x_test to verify that the model works well on unseen data? How can I obtain the selected best features and train the model with the optimal hyperparameters?

I am happy to provide more details if needed.


3 Answers


Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (also with cross-validation).

The Pipeline object is meant exactly for this purpose: assembling the data transformation and then applying the estimator.

You could even use a different model (GradientBoostingClassifier, etc.) for your final classification. It is possible with the following approach:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)


from sklearn.pipeline import Pipeline

#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30, 
                                        random_state=42,
                                        class_weight="balanced") 
rfecv = RFECV(estimator=clf_featr_sele, 
              step=1, 
              cv=5, 
              scoring = 'roc_auc')

# you can use a different classifier for your final classification
clf = RandomForestClassifier(n_estimators=10, 
                             random_state=42,
                             class_weight="balanced") 
CV_rfc = GridSearchCV(clf, 
                      param_grid={'max_depth':[2,3]},
                      cv= 5, scoring = 'roc_auc')

pipeline  = Pipeline([('feature_sele',rfecv),
                      ('clf_cv',CV_rfc)])

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

Now you can apply this pipeline (including feature selection) to the test data.
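For example (a minimal sketch, assuming a binary target), you can score the held-out data and inspect which features the RFECV step kept:

from sklearn.metrics import roc_auc_score

# predict_proba is delegated to the tuned final classifier
pred = pipeline.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, pred))

# the fitted RFECV step holds the feature selection results
fitted_rfecv = pipeline.named_steps['feature_sele']
print("Optimal number of features:", fitted_rfecv.n_features_)
print("Selected feature mask:", fitted_rfecv.support_)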

  • thanks a lot for the great answer. why do you think it is important to do feature selection using a different classifier? Is there any reason for it? Looking forward to hearing from you. thank you very much :) – EmJ Apr 12 '19 at 02:10
  • As you know, feature selection can be done by a comparatively simple classifier. But when you want to do the final classification, you would be more interested in performance, and hence you might go for an MLP classifier or something like that. – Venkatachalam Apr 12 '19 at 03:16
  • thanks a lot. just a quick question. what are the `simple classifiers` that you would recommend for feature selection? Looking forward to hearing from you :) – EmJ Apr 12 '19 at 10:47
  • I would start with LogisticRegression, then SGDClassifier, RidgeClassifier, DecisionTree, etc. – Venkatachalam Apr 12 '19 at 11:38
  • thanks a lot. what algorithms would you recommend for parameter tuning? Moreover, could you please tell me if you know answers for the following question https://stackoverflow.com/questions/55649352/how-to-run-rfecv-with-svc-in-sklearn – EmJ Apr 12 '19 at 11:42
  • Is it possible to get the `f1` score of `pipeline.fit(X_train, y_train)`? Looking forward to hearing from you. :) – EmJ Apr 15 '19 at 22:12
  • Hello, I applied this method but I see that the model, after running the pipeline, has selected more features than what actually came from `rfecv`. – aasthetic Dec 06 '20 at 14:44
  • Shouldn't it be: `pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])`, then `CV_rfc = GridSearchCV(pipeline, param_grid={'clf_cv__max_depth': [2, 4]}, ...)`, `CV_rfc.fit(X_train, Y_train)`, `CV_rfc.predict(X_test)`? – rajesh May 30 '21 at 16:24
  • RFE is a wrapper for an estimator; I think doing what this answer does actually does not influence the final model. In other words, RFE is not passing anything to the MODEL part. You would have to validate by the number of features chosen: forcibly select say only 1 by changing `n_features_` to 1 and then expand to whatever number, say 10; if the scores are the same then this pipeline is not working. – Yev Guyduy Jun 22 '22 at 15:15
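To illustrate the suggestion in the comments above, here is a minimal sketch (not part of the original answer) where a cheap linear model drives the RFECV step and a separately tuned GradientBoostingClassifier does the final classification:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# cheap linear model used only for recursive feature elimination
selector = RFECV(LogisticRegression(max_iter=1000, class_weight="balanced"),
                 step=1, cv=5, scoring='roc_auc')

# a stronger model, tuned with GridSearchCV, does the final classification
final_clf = GridSearchCV(GradientBoostingClassifier(random_state=42),
                         param_grid={'max_depth': [2, 3]},
                         cv=5, scoring='roc_auc')

pipe = Pipeline([('feature_sele', selector), ('clf_cv', final_clf)])
pipe.fit(X_train, y_train)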

You can do what you want by prefixing the names of the parameters you want to pass to the underlying estimator with `estimator__`.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split

X = df[my_features]
y = df['gold_standard']

clf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')

param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')

X_train, X_test, y_train, y_test = train_test_split(X, y)

CV_rfc.fit(X_train, y_train)

Output on fake data I made:

{'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
0.5653035605690997
RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=6, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
   verbose=0)
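
To evaluate on the held-out split and to see which features the refitted RFECV kept, something along these lines should work (a sketch, assuming a binary target):

from sklearn.metrics import roc_auc_score

# best_estimator_ is the RFECV refitted on the training split with the best parameters
best_rfecv = CV_rfc.best_estimator_
print("Optimal number of features:", best_rfecv.n_features_)
print("Selected feature mask:", best_rfecv.support_)

# predict_proba is delegated through RFECV to the underlying random forest
pred = CV_rfc.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, pred))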
  • thanks a lot for your great answer. could you please tell me how to use `X_test` to validate the results? Looking forward to hearing from you. Thank you very much :) – EmJ Apr 12 '19 at 00:27
  • `roc_auc_score(y_test, CV_rfc.predict_proba(X_test))`? – gmds Apr 12 '19 at 00:28
  • thanks a lot. one last question. I would like to see which features were selected through this process. Is it possible to get those selected features? :) – EmJ Apr 12 '19 at 00:29
  • is it correct to get the selected number of features as `rfecv.n_features_`. please kindly correct me if I am wrong. Looking forward to hearing from you. Thank you very much :) – EmJ Apr 12 '19 at 01:24

You just need to pass the recursive feature elimination estimator (RFECV) directly into the GridSearchCV object. Something like this should work:

X = df[my_features] #all my features
y = df['gold_standard'] #labels

clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')

param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

#------------- Just pass your RFECV object as estimator here directly --------#

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')


CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)
  • thanks a lot for the great answer. Is there a way to get the selected features from `rfecv`? Moreover, how can we validate `X_test` using the selected features? Looking forward to hearing from you. Thank you very much once again :) – EmJ Apr 10 '19 at 10:09
  • I tried to run your code. However, I got the following error: `ValueError: Invalid parameter criterion for estimator`. Can you please tell me how to resolve this issue? Thank you very much :) – EmJ Apr 10 '19 at 13:00