
Context of my problem:

I'm performing hyperparameter tuning using GridSearchCV from scikit-learn on my random forest regressor. To alleviate overfitting, I found that maybe I should use a pruning technique. I checked the docs and found the ccp_alpha parameter, which refers to pruning; I also found this example about pruning in decision trees.

My question:

Since I'm looking for the best parameters of the random forest (GridSearchCV), how should I input the ccp_alpha value? Should I include it before or after GridSearchCV, considering that every time I perform GridSearchCV the structure of the model changes? Do you have any references or articles?

My point of view:

To me it makes more sense to perform hyperparameter tuning first and then add ccp_alpha (pruning) before training and testing this "best model", but I'm not sure...
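A minimal sketch of this two-step idea (tune the other hyperparameters first, then search ccp_alpha for the resulting model). The dataset, parameter names, and grid values below are purely illustrative, not recommendations:

```python
# Sketch of the "tune first, prune after" approach (illustrative values).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: tune structure-related parameters with pruning disabled (ccp_alpha=0).
step1 = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="neg_mean_absolute_error",
    cv=3,
).fit(X, y)

# Step 2: keep those parameters fixed and search only the pruning strength.
step2 = GridSearchCV(
    RandomForestRegressor(random_state=0, **step1.best_params_),
    {"ccp_alpha": [0.0, 0.01, 0.1]},
    scoring="neg_mean_absolute_error",
    cv=3,
).fit(X, y)
print(step2.best_params_)
```

Note that the accepted answer below argues against splitting the search this way, since the optimal values of the other hyperparameters can depend on ccp_alpha.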

    Pls notice that SO is about *specific coding* questions; non-coding questions about machine learning theory & methodology are off-topic here, and should be posted at [Cross Validated](https://stats.stackexchange.com/help/on-topic) instead. Notice the **NOTE** in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). Also, pls **re-read** [What topics can I ask about here?](https://stackoverflow.com/help/on-topic), and notice that questions asking us to *recommend or find a book, tool, software library, tutorial or other off-site resource* are off-topic for SO. – desertnaut Aug 26 '20 at 16:42
  • That said, the answer below is **correct**, i.e. if you are to experiment with different `ccp_alpha` values, you should make them part of your CV procedure - after all, it *is* a hyperparameter (notice that the original RF algorithm was proposed with unpruned trees). Your primary objective is a low test MSE/MAE, *not* to avoid overfitting (plus there's always the danger of *underfitting*). – desertnaut Aug 26 '20 at 16:56

1 Answer


Since ccp_alpha is also a parameter to tune, it should be part of your CV procedure; the optimal values of your other parameters depend on it too.

It is a regularization parameter (like lambda in Lasso/Ridge regression): a high value gives you very small trees.
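A minimal sketch of putting ccp_alpha into the grid so it is tuned jointly with the other hyperparameters (the dataset and grid values are illustrative, not recommendations):

```python
# Sketch: tune ccp_alpha jointly with the other hyperparameters.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "ccp_alpha": [0.0, 0.01, 0.1],  # pruning strength, treated like any other hyperparameter
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Scoring with negative MAE matches the error metric discussed in the comments; cross-validation then selects the ccp_alpha that generalizes best rather than the one that fits the training set best.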

  • Thanks for your answer @CutePoison!! But I was thinking that using ccp_alpha as a parameter, the best model will always be the one with the smallest value of ccp_alpha. I tried many times and verified it. But using the smallest ccp_alpha I get the best estimation while the model continues to overfit... If I increase ccp_alpha, the overfitting is reduced, but the result gets a little worse. So maybe the best approach is to tune without the ccp_alpha parameter, allowing deep trees, and then prune after that, instead of pruning and tuning at the same time, but I'm not sure – Paulo Nishimoto Aug 17 '20 at 14:26
  • Your training error would of course be lowest with the smallest value - is it your validation error? – CutePoison Aug 17 '20 at 14:40
  • My validation error follows the behavior of the test set... So, with small values of ccp_alpha, I got small values for the MAE metric on the train set and high MAE values on the validation and test sets. – Paulo Nishimoto Aug 18 '20 at 13:25
  • Is it a classification/regression task? And what is your error-value exactly? – CutePoison Aug 18 '20 at 13:32
  • It's a regression task. With the default model, "ccp_alpha = 0" I got MAE for train set = 0.52 and for test set MAE =1.55. Increasing the ccp_alpha to 0.05 I can achieve MAE for train set = 1.19 and for test set MAE = 1.6...If I increase ccp_alpha even more, I will reduce the difference between train and test, of course, but eventually, I'm gonna get worse results in test set as well because the model tends to be more "conservative" – Paulo Nishimoto Aug 18 '20 at 17:58
  • As I wrote in the description of my question, I believe that since this is a post-pruning approach, I should tune the hyperparameters first, not considering ccp_alpha... Then I use ccp_alpha to prune my "best model"... I'm just not sure if this is right :x – Paulo Nishimoto Aug 18 '20 at 18:04