
I am applying a decision tree to a data set, using sklearn.

In sklearn there is a parameter to set the maximum depth of the tree: dtree = DecisionTreeClassifier(max_depth=10).

My question is how the max_depth parameter affects the model. How does a high or low max_depth help in predicting the test data more accurately?

skoundin

2 Answers


max_depth is what the name suggests: the maximum depth you allow the tree to grow to. The deeper you allow it to grow, the more complex your model becomes.

For training error, it is easy to see what will happen. If you increase max_depth, training error will always go down (or at least not go up).

For testing error, it is less obvious. If you set max_depth too high, the decision tree may simply overfit the training data without capturing the useful patterns we care about; this will cause the testing error to increase. But setting it too low is not good either: then you may be giving the decision tree too little flexibility to capture the patterns and interactions in the training data. This will also cause the testing error to increase.
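To make the trade-off concrete, here is a minimal sketch comparing a very shallow and a very deep tree; the synthetic dataset and the specific depths are illustrative assumptions, not something from the original answer:

# Sketch: training vs. test accuracy for an underfitting and an overfitting depth.
# The synthetic dataset and the depth values 2 and 30 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 30):  # one likely-underfitting, one likely-overfitting depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc={tree.score(X_train, y_train):.3f}, "
          f"test acc={tree.score(X_test, y_test):.3f}")

Typically the deep tree scores near-perfectly on the training set while its test accuracy stops improving or drops, which is exactly the overfitting behaviour described above.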

There is a sweet spot between the extremes of too high and too low. Usually the modeller treats max_depth as a hyperparameter and uses some sort of grid/random search with cross-validation to find a good value for it, as in the sketch below.
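A minimal cross-validated grid search over max_depth might look like this; the candidate depths, cv=5, and the assumption that X_train and y_train already exist are placeholder choices, not recommendations:

# Sketch: tune max_depth with a cross-validated grid search.
# The candidate depths and cv=5 are placeholder choices.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [2, 4, 6, 8, 10, None]}  # None = grow until leaves are pure
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # assumes X_train, y_train already exist

print(search.best_params_)  # best depth found by cross-validation
print(search.best_score_)   # its mean cross-validated accuracy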

Cihan
  • @CihanCeyhan - Is it possible to print `max_depth` to understand what the default value is when it's not set? – Chetan Arvind Patil Sep 29 '18 at 23:25
  • @ChetanArvindPatil the default is no limitation on the `max_depth`, as explained in the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). "The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples." – Cihan Sep 30 '18 at 01:29
  • @CihanCeyhan - I read the documentation. Currently I don't provide `max_depth` for my model, so the tree grows until the leaves are pure. To test fitting, I want to assign a `max_depth` value, but I need to know the maximum depth that gets generated by default. That way I can vary `max_depth` from minimum to median to maximum and test the model. Is there a way to `print` this `max_depth` when the value is not assigned? Hope my question is clear. – Chetan Arvind Patil Sep 30 '18 at 02:55
  • @CihanCeyhan, I had the same problem. See the answer here: https://stackoverflow.com/questions/54499114/using-sklearn-how-do-i-find-depth-of-a-decision-tree – Mel Feb 03 '19 at 03:27
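
Following up on the comment thread above, one way to see how deep an unrestricted tree actually grew is to read the depth back from the fitted estimator. A minimal sketch, assuming X_train and y_train already exist (get_depth() is available in newer scikit-learn versions):

# Sketch: inspect the depth a tree grew to when max_depth was left at its default (None).
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()  # no max_depth limit
dtree.fit(X_train, y_train)       # assumes X_train, y_train already exist
print(dtree.get_depth())          # actual depth of the fitted tree
print(dtree.tree_.max_depth)      # same value, via the underlying tree object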

If you are interested in the best accuracy as a function of max_depth, you can try something like this:

# Assumes an existing train/test split and an upper bound n on the depths to try.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

scores = []
for depth in range(1, n):  # n: largest depth you are willing to try
    dtree = DecisionTreeClassifier(max_depth=depth)
    dtree.fit(X_train, y_train)
    y_pred = dtree.predict(X_test)
    scores.append(round(accuracy_score(y_test, y_pred), 4))

print(scores.index(max(scores)) + 1)  # depth with the best test accuracy (+1 because depths start at 1)
print(max(scores))                    # the best test accuracy itself
  • `n` is up to you: it is the largest depth you are willing to try. To avoid overfitting, I advise you not to increase this value too much.
  • Great way to get a list of accuracies depending on the depth. Helpful for deciding which depth to select. – mrCatlost Mar 28 '23 at 06:08