
I am performing hyperparameter tuning of a RandomForestClassifier with GridSearchCV, as follows.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.array(df[features])          # all features
y = np.array(df['gold_standard'])   # labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}

rfc = RandomForestClassifier(random_state=42)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)

The result I got is as follows.

{'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'n_estimators': 200}

Afterwards, I retrain a model with the tuned parameters and evaluate it on x_test, as follows.

from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

rfc = RandomForestClassifier(random_state=42, criterion='gini', max_depth=6,
                             max_features='auto', n_estimators=200, class_weight='balanced')
rfc.fit(x_train, y_train)
pred = rfc.predict(x_test)
print(precision_recall_fscore_support(y_test, pred))
print(roc_auc_score(y_test, pred))

However, I am still not clear on how to use GridSearchCV with 10-fold cross-validation (i.e. not just applying the tuned parameters to x_test), i.e. something like the following:

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # ... tune/fit and evaluate within each fold?

OR

Since GridSearchCV uses cross-validation, can we use all of X and y and take the best result as the final result?

I am happy to provide more details if needed.

EmJ
  • You're asking whether you can use your test set as part of GridSearch if you do cross-validation? Doing this will ultimately provide a biased classification performance, overestimating the generalisation capabilities of your trained classifier. Imo, your code as you have it at the moment provides the best estimate of generalisation ability. So I wouldn't change anything. – JimmyOnThePage Apr 10 '19 at 05:41

2 Answers


You should not perform a grid search in this scenario.

Internally, GridSearchCV splits the dataset given to it into various training and validation subsets, and, using the hyperparameter grid provided to it, finds the single set of hyperparameters that give the best score on the validation subsets.

The point of a train-test split is then, after this process is done, to perform one final scoring on the test data, which has so far been unknown to the model, to see if your hyperparameters have been overfit to the validation subsets. If it does well, then the next step is putting the model into production/deployment.
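For concreteness, here is a minimal sketch of that final scoring step, reusing the names from the question (`CV_rfc`, `x_test`, `y_test`). Because `GridSearchCV` refits the best parameter combination on the whole training set by default (`refit=True`), the fitted search object can be used directly for prediction:

from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# CV_rfc has only ever seen x_train/y_train; with refit=True (the default),
# predicting through it uses the best estimator retrained on all of x_train.
pred = CV_rfc.predict(x_test)
proba = CV_rfc.predict_proba(x_test)[:, 1]   # probability of the positive class

print(precision_recall_fscore_support(y_test, pred))
print(roc_auc_score(y_test, proba))          # ROC-AUC expects probabilities/scores, not labels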

If you perform a grid search within cross-validation, then you will have multiple sets of hyperparameters, each of which did the best on their grid-search validation sub-subset of the cross-validation split. You cannot combine these sets into a single coherent hyperparameter specification, and therefore you cannot deploy your model.
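If the aim were only to estimate generalisation performance (rather than to produce one deployable model), the usual construction is nested cross-validation. A hedged sketch, assuming the `param_grid`, `X` and `y` from the question:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # hyperparameter search
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # performance estimate

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid=param_grid, cv=inner_cv)
outer_scores = cross_val_score(grid, X, y, cv=outer_cv)

# Each outer fold may settle on different hyperparameters, so you get a score
# distribution rather than a single model you could deploy.
print(outer_scores.mean(), outer_scores.std())

This does not contradict the point above: nested CV gives an honest estimate of how the tuning procedure performs, not a single tuned model.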

gmds
  • thanks a lot for the great answer. One quick question: what result should be reported if we are writing a research paper? Is it `CV_rfc.best_score_` or the value of `CV_rfc.predict(x_test)`? Looking forward to hearing from you. Thank you very much :) – EmJ Apr 10 '19 at 06:15
  • 1
    @Emi I would say it depends on your exact use case. Note that `predict` will return *predictions*, not a single score. – gmds Apr 10 '19 at 06:17
  • So, if we want to get a single score from the predictions, can we use it like this? `pred = CV_rfc.predict(x_test)` `print(roc_auc_score(y_test, pred))`. Please kindly correct me if I am wrong :) – EmJ Apr 10 '19 at 06:22
  • 1
    In the classification case, yes, I suppose. You could also look into other metrics, like f-score etc.. This question might be better asked on the Data Science or Cross Validated SE site, actually. – gmds Apr 10 '19 at 06:29
  • 1
    @Emi Actually, you should use `predict_proba` instead of `predict`, since the ROC-AUC score requires probabilities. – gmds Apr 10 '19 at 06:55
  • when I run it with `predict_proba` I get an error `ValueError: bad input shape (109, 2)`. The reason is that `predict_proba` has 2 values in a list (e.g., `array([[0.71859634, 0.28140366], [0.73036337, 0.26963663], [0.57230174, 0.42769826], [0.66975595, 0.33024405], [0.62535185, 0.37464815], [0.33822691, 0.66177309]`). Is there a way to resolve this issue? – EmJ Apr 10 '19 at 07:15
  • 1
    @Emi That's because `predict_proba` produces an array of shape `(len(x), n_classes)`. You can just slice the array with `[:1]`. – gmds Apr 10 '19 at 07:16
  • thanks a lot for your valuable feedback. I really appreciate it. Thank you very much once again :) – EmJ Apr 10 '19 at 07:29
  • please let me know if you know an answer for this: https://stackoverflow.com/questions/55609339/how-to-perform-feature-selection-with-gridsearchcv-in-python looking forward to hearing from you. Thank you very much :) – EmJ Apr 10 '19 at 09:38

Since GridSearchCV uses cross-validation, can we use all of X and y and take the best result as the final result?

No, you should not tune your hyperparameters on the full dataset (whether with GridSearchCV or a single manual grid search), because the model would then be choosing the hyperparameters that also happen to work best on the test data. That defeats the real purpose of the test data: the measured performance is not generalizable, since the model has already seen this data during hyperparameter tuning.
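As a minimal sketch of the recommended ordering (reusing the names from the question; `stratify=y` and `cv=10` are just illustrative choices): split first, let `GridSearchCV` see only the training portion, and touch the test portion exactly once at the end.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold the test set out *before* any tuning; stratify keeps the class balance.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                     random_state=42, stratify=y)

CV_rfc = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid, cv=10)   # 10-fold CV on training data only
CV_rfc.fit(x_train, y_train)

# One final evaluation on data the search never saw.
print(CV_rfc.score(x_test, y_test))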

Look at the scikit-learn documentation on cross-validation for a better understanding of hyperparameter tuning and cross-validation.

The diagrams in that documentation illustrate the train/validation/test split and the cross-validation workflow.

Venkatachalam
  • can you please tell me how to do the `final evaluation` using the tuned-parameter model `best_estimator_`? :) – EmJ Apr 10 '19 at 06:44
  • 1
    use `roc_auc_score(y_test, rfc.predict_proba(x_test))` – Venkatachalam Apr 10 '19 at 06:52
  • when I run it with `predict_proba` I get an error `ValueError: bad input shape (109, 2)`. The reason is that `predict_proba` has 2 values in a list (e.g., `array([[0.71859634, 0.28140366], [0.73036337, 0.26963663], [0.57230174, 0.42769826], [0.66975595, 0.33024405], [0.62535185, 0.37464815], [0.33822691, 0.66177309]`). Is there a way to resolve this issue? – EmJ Apr 10 '19 at 07:16
  • 1
    oohh, ya sorry for not giving complete answer. Try `rfc.predict_proba(x_test)[:,1]` – Venkatachalam Apr 10 '19 at 07:17
  • 1
    I am assuming that second class is the positive class and the one for which you need the `auc_roc_score` – Venkatachalam Apr 10 '19 at 07:18
  • my data is labeled as `0` and `1`. Actually `1` is the minority class of my imbalanced dataset. Is it possible to check which one is `0` and which is `1` in `predict_proba`? Thank you :) – EmJ Apr 10 '19 at 07:24
  • 1
    `rfc.classes_ ` would give the order of classes inside the model – Venkatachalam Apr 10 '19 at 07:25
  • thanks a lot. One last question: if we ran `rfc.predict_proba(x_test)[:,0]`, what does that give? :) – EmJ Apr 10 '19 at 07:27
  • It picks the probability of the class at index `0`, which in your case is the majority class; `[:,1]` picks the probability of the minority class (see the consolidated sketch after these comments).
  • 1
    when I ran `rfc.classes_` I got `array([0, 1], dtype=int64)`. Actually `1` is my minority class. So, if I pick `rfc.predict_proba(x_test)[:,1]` I assumed that I am picking my minority class (the class that I am interested in). Please let me know if I am wrong. Looking forward to hearing from you :) – EmJ Apr 10 '19 at 07:32
  • thanks a lot. please let me know if you know an answer for this: https://stackoverflow.com/questions/55609339/how-to-perform-feature-selection-with-gridsearchcv-in-python looking forward to hearing from you :) – EmJ Apr 10 '19 at 09:37
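Pulling the suggestions in these comments together, a short sketch of the final ROC-AUC evaluation, assuming the fitted `rfc` and the `x_test`/`y_test` split from the question, with class `1` as the minority/positive class:

from sklearn.metrics import roc_auc_score

print(rfc.classes_)                           # e.g. array([0, 1]): the column order of predict_proba
proba_pos = rfc.predict_proba(x_test)[:, 1]   # probability of class 1 (the minority class here)
print(roc_auc_score(y_test, proba_pos))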