I just got into learning about Decision Trees. So the questions might be a bit silly.

The idea of selecting the root node is a bit confusing. Why can't we randomly select the root node? The only difference it seems to make is that it would make the Decision Tree longer and more complex, but it would get the same result eventually.

Also, just as an extension of the feature selection process in Decision Trees, why can't we use something as simple as the correlation between the features and the target, or a Chi-Square test, to figure out which feature to start off with?

1 Answer


Why can't we randomly select the root node?

We can, but the same question then applies to its child nodes, to the child nodes of those child nodes, and so on...

The only difference it seems to make is that it would make the Decision Tree longer and more complex, but it would get the same result eventually.

The more complex the tree is, the higher its variance will be, which means two things:

  • small changes in the training dataset can greatly affect the shape of the tree
  • it overfits the training set

Neither of these is good, and even if you make a sensible choice at each step, based on entropy or the Gini impurity index, you will still probably end up with a larger tree than you would like. Yes, that tree might have good accuracy on the training set, but it will probably overfit it.
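
To make the "sensible choice at each step" concrete, here is a minimal NumPy-only sketch of how a greedy tree picks its root split: it scores every candidate feature/threshold by the Gini impurity reduction and takes the best one, rather than picking at random. The toy data and function names are purely illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(X, y, feature, threshold):
    """Impurity reduction achieved by splitting on feature <= threshold."""
    left = X[:, feature] <= threshold
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return 0.0
    n = len(y)
    weighted = (left.sum() / n) * gini(y[left]) + (right.sum() / n) * gini(y[right])
    return gini(y) - weighted

def best_root_split(X, y):
    """Greedy choice: the feature/threshold with the largest impurity reduction."""
    best = (None, None, -np.inf)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            gain = split_gain(X, y, feature, threshold)
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

# Toy data: feature 0 tracks the class, feature 1 is pure noise,
# so the greedy rule should pick feature 0 as the root split.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([y + rng.normal(0, 0.3, 100), rng.normal(0, 1, 100)])
print(best_root_split(X, y))
```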

Most algorithms that use decision trees have their own ways of combating this variance, in one way or another. If you consider the plain decision tree algorithm itself, the way to reduce the variance is to first grow the tree and then prune it afterwards, making it smaller and less prone to overfitting. Random forests address it by averaging over a large number of trees while randomly restricting which predictors can be considered for a split every time that decision has to be made.
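
As a rough sketch of those two variance-reduction strategies, here is how they might look in scikit-learn (my choice of library; the synthetic dataset and parameter values are illustrative, not a recommendation): cost-complexity pruning of a single tree, and a random forest that restricts the features considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Single tree, pruned after growing (larger ccp_alpha => smaller tree).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# Random forest: many trees, each split only sees a random subset of the features.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

print("pruned tree test accuracy  :", pruned.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```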

So, randomly picking the root node will lead to the same result eventually, but only on the training set, and only once the overfitting is so extreme that the tree simply predicts every training sample with 100% accuracy. But the more the tree overfits the training set, the lower its accuracy will generally be on a test set, and we care about accuracy on the test set, not on the training set.
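
You can see this gap for yourself with a quick experiment (again a sketch on assumed synthetic data, with label noise added so the effect shows up): an unconstrained tree typically fits the training set perfectly, while a modestly constrained tree tends to do better on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which an unconstrained tree will happily memorize.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
small = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)

print("unconstrained train/test:", full.score(X_train, y_train),
      full.score(X_test, y_test))
print("depth-limited train/test:", small.score(X_train, y_train),
      small.score(X_test, y_test))
```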

Matus Dubrava
  • Hey @Matus, that clarifies a lot. Thank you for the elaborate explanation! – Ashreet Sangotra Jul 05 '20 at 16:07
  • @DaviedZuhraph The other post is stating the same thing that I do. It just doesn't explain the implications of such a poor initial choice, and it only considers the training set. If you only consider the training set (the data that you are using to fit the model), then no matter the choice of the initial split, it will lead to a consistent hypothesis (under the assumption of an unconstrained tree). Actually, you can make all the splits random and it will still lead to a consistent hypothesis because the algorithm is allowed to have as many leaf nodes as there are samples in your dataset. – Matus Dubrava Nov 27 '20 at 12:40
  • @DaviedZuhraph Yes but only if the tree is unconstrained. If, for example, you place a constraint on your tree such that if a node has, let's say, 10 or fewer samples in it, it cannot be split anymore, then this doesn't hold anymore and you can get different results based on the choice of the initial split. Also, note that getting a consistent hypothesis on the training set doesn't tell you anything about the goodness of the model. As I have stated, such a model that categorizes each sample individually will have 100% accuracy on the training set but it will most likely be completely useless. – Matus Dubrava Nov 27 '20 at 12:56