2

I'm building a random forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified train/test split, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose to get the most practically useful random forest classifier? I plotted the accuracy vs n_estimators curve using the code snippet below. x_train and y_train are the features and target labels of the training set, and x_test and y_test are the features and target labels of the test set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
scores =[]
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt
%matplotlib inline

# plot the relationship between K and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')

[Figure: accuracy vs n_estimators]

Here it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even for nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.

molbdnilo
keenlearner
  • You should choose a value around the point where performance starts to stabilize on the curve. You shouldn't try to pick one particular value: the differences in performance between two close values of n_estimators come from variability due to randomness and will not replicate on new data – ThomaS Mar 20 '20 at 07:46
  • Stepwise refinement is one way to find efficiency improvements. Try using GridSearchCV with cross-validation to find the best parameters – Golden Lion Mar 08 '21 at 12:53
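
The two comments above can be sketched roughly as follows: score only a handful of n_estimators values with stratified cross-validation and compare the fold-averaged results, rather than reading single noisy points off the curve. This is a minimal sketch under assumptions that are not in the question: the data is a synthetic stand-in generated with make_classification (about 78% / 22% class balance), the candidate values in n_values are arbitrary, and ROC AUC is used as the metric because accuracy can look flattering on imbalanced classes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real dataset (assumption): roughly 78% / 22% classes.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Arbitrary candidate values, chosen only for illustration.
n_values = [10, 25, 50, 100]

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": n_values},
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

# Mean and spread of the score across folds for each candidate.
mean_scores = search.cv_results_["mean_test_score"]
std_scores = search.cv_results_["std_test_score"]
for n, mean, std in zip(n_values, mean_scores, std_scores):
    print(f"n_estimators={n:>3}: ROC AUC = {mean:.3f} +/- {std:.3f}")
```

If the means of neighbouring candidates differ by less than the fold-to-fold spread, the difference is probably noise, and the smallest value on the plateau is a reasonable choice.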

4 Answers

0

See this RandomizedSearchCV example: https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb

I used RandomizedSearchCV to find the best parameters for the random forest classifier.

n_estimators is the number of decision trees to use.

You could also try XGBoost, which may give better accuracy.

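from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV

# Candidate values to sample from
parameter_grid = {
    'n_estimators': [1, 2, 3, 4, 5],
    'max_depth': [2, 4, 6, 8, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1, 2, 3, 4, 5, 6, 7, 8],
}

number_models = 4  # number of random parameter combinations to try
random_RandomForest_class = RandomizedSearchCV(
    estimator=pipeline['clf'],  # pipeline is assumed to be defined earlier, as in the linked notebook
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)

random_RandomForest_class.fit(X_train, y_train)
# If X and y are the full dataset (as the names suggest), this score includes the training rows
predictions = random_RandomForest_class.predict(X)

print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)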
Golden Lion
0

It is natural that a random forest stabilizes after some number of n_estimators, because, unlike boosting, there is no mechanism to "slow down" the fitting. Since there is little benefit to adding more trees beyond that point, you can choose a value around 50.
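
One way to see where this stabilization happens is sketched below. It grows a single forest incrementally with warm_start=True and records the out-of-bag (OOB) error after each batch of trees. This is an illustration under assumptions, not the answerer's method: the data is a synthetic stand-in from make_classification, and the step of 25 trees up to 200 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset (assumption): roughly 78% / 22% classes.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0
)

# warm_start=True keeps the trees already fitted and only adds new ones
# each time n_estimators is increased and fit() is called again.
rfc = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_errors = {}
for n in range(25, 201, 25):
    rfc.set_params(n_estimators=n)
    rfc.fit(X, y)
    # oob_score_ is the accuracy on samples each tree did not see in its bootstrap.
    oob_errors[n] = 1.0 - rfc.oob_score_
    print(f"n_estimators={n:>3}: OOB error = {oob_errors[n]:.3f}")
```

Because the same forest is extended rather than refitted from scratch, the curve is less jumpy than one built from independently seeded models, and the point where the OOB error flattens suggests a sufficient number of trees.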

0

Don't use grid search for this case; it is overkill. Also, since you set the candidate values arbitrarily, you may not end up with the optimal number.

scikit-learn's gradient boosting estimators have a staged_predict method, which lets you measure the validation error at each stage of training to find the optimal number of trees.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X and y are the full feature matrix and target
X_train, X_val, y_train, y_val = train_test_split(X, y)

# try a big number for n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)

# calculate the validation error after each boosting stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]

# stage i (0-based) corresponds to i + 1 trees
bst_n_estimators = np.argmin(errors) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
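
Since the question is about binary classification, the same idea can be sketched with GradientBoostingClassifier, whose staged_predict yields class predictions after each boosting stage. Note that this is a boosting model, not a random forest, and the synthetic data, the stratified split, and the choice of accuracy as the metric are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption): roughly 78% / 22% classes.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

# Fit once with a generous number of stages.
gbc = GradientBoostingClassifier(max_depth=2, n_estimators=100, random_state=0)
gbc.fit(X_train, y_train)

# Validation accuracy after 1, 2, ..., 100 boosting stages.
val_acc = [accuracy_score(y_val, y_pred) for y_pred in gbc.staged_predict(X_val)]

# argmax returns the first stage reaching the best accuracy; stages are 1-based.
best_n = int(np.argmax(val_acc)) + 1
print(f"best number of stages: {best_n}, validation accuracy: {val_acc[best_n - 1]:.3f}")
```

The stage count found this way is tuned on one validation split, so it is worth confirming on data that played no part in the selection.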
Areza
  • 1
    When you say `don't use gridsearch for this case - it is an overkill`, do you mean any time we want to choose `n_estimators` for `RandomForestClassifier` or do you mean `n_estimators` for **any** classifier? Is `staged_predict` only for `n_estimators` or can it be used for `max_depth` too? When is `gridsearch` not overkill? e.g. I have 130,000 rows. – Edison Jun 28 '22 at 10:04
  • 1
    1) Definitely not all classifiers; not all classifiers have n_estimators. 2) I said 'overkill' because in the end what you are after is figuring out when you perform best on the validation set given one parameter (n_estimators). Grid search comes in handy when you have multiple parameters to search over. 3) Since your data is big, perhaps just one validation set is enough; you can't computationally afford 5-fold CV. staged_predict is a lesser-known function, but it is as halal/kosher as cross-validation through grid search – Areza Jun 28 '22 at 19:25
  • Thanks. So I'm performing GridSearchCV using XGBoostClassifier with 6 parameters. It's taking forever on an old MacBook. Do I stop it? Only do a search for `n_estimators` and `max_depth`? Use `staged_predict` just for `n_estimators`? Use `staged_predict` for `max_depth` as well? Would CatBoost or LightGBM take just as long? Forget everything and move to Colab? lol – Edison Jun 29 '22 at 00:33
  • I don't know much about your case. Your MacBook is not an iron; use Colab. It depends on your goal: are you eventually turning it into an API? If so, go for CatBoost. You can use 2-fold cross-validation too. Welcome to the DS world btw :-) – Areza Jun 29 '22 at 07:03
  • Thanks. It's binary classification (CSAT) with mixed inputs. I've switched over to Colab Pro+ and am trying the `gpu_hist` param on all these classifiers. The biggest bottlenecks appear to be the number of estimators, then the depth, then the learning rate. The main challenge is not having any idea of what initial params to use in the GridSearchCV `param_grid` dictionary. Everyone says it's domain knowledge and trial and error, but it seems like it would be trivial to make a chart with basic classifier param settings based on size, shape, target, domain etc. Just to get beginners up and running. – Edison Jun 29 '22 at 11:26
  • Grid search does create that table for you; perhaps run it on a subset of the data to get familiar with it. Start with a very small range of parameters so you get output quickly: use 10% of your data, test two parameters, and use two cross-validation folds. It should be a Kaggle competition where you fight for a 0.00001 improvement! – Areza Jun 29 '22 at 11:33
  • Unless there is something I am overlooking, GridSearch produces the recommended estimators, depth etc. only **after** the parameter dictionary is created, e.g. `param_grid={"colsample_bytree":[0.5, 0.75, 1], "max_depth":[2, 6, 12], "learning_rate":[0.3, 0.1, 0.03], "n_estimators":[100]}`. Beginners have no way of knowing what those values should be. It is clear that size, shape, classification/regression type and analytics goal determine what range/linspace is used. It's surprising there are no tables/charts to help beginners. – Edison Jun 29 '22 at 12:14
-1

Is it only me, or do the existing answers not really answer your question? In case you are still looking for how to get the best accuracy score and the n_estimators value that produced it, maybe I can help.

First, you already answered it in your own code, in these lines.

scores =[]
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

As you can see, you already saved each accuracy_score in scores. So you just need to find the maximum value in the scores list.

maxs = max(scores)
# k starts at 1, so list index i corresponds to n_estimators = i + 1
maxs_idx = scores.index(maxs) + 1

Then just put the print command in the final lines.

print(f"Accuracy Score: {maxs} with n_estimators: {maxs_idx}")

I hope your problem has already been solved. Thanks to you as well: your code helped me find a way to choose the best number of estimators too.

theDreamer911