
How do we find maximum depth of Random Forest if we know the number of features ?

This is needed for regularizing random forest classifier.

mach
  • Why would you need to regularize the random forest classifier? How would you regularize? Some people get this part wrong. Fully grown trees are not overfitted, because bagging and random feature selection prevent that. – Soren Havelund Welling Oct 07 '15 at 07:31
  • When I trained it on 100% of the data and tested on the same full data, the accuracy was 1. That is only possible in the case of overfitting, so I thought of regularizing it with the max_depth parameter, and yes, it solved the problem; the accuracy improved – mach Oct 07 '15 at 07:42
  • Hey, you need to test it on a cross-validation set. Obviously, if you evaluate the classifier on the training set it was trained on, it will be quite close to 100%. Please divide your training set into two parts, training and cross-validation, to check the performance. Also check the correlation between features, as that can also lead to overfitting, but your method of testing is erroneous in my humble opinion – Aditya Patel Oct 07 '15 at 10:30
  • Correlation between features is generally not a problem for random forest. Most random forest packages come with an out-of-bag (OOB) cross-validation which you get for free during training. Which package are you using? – Soren Havelund Welling Oct 07 '15 at 11:07
  • Limit or prune single trees, not forests. Only if your data set is very large might you limit max depth to speed up training. But then it is better to lower the bootstrap sample size, as you gain both a speed-up and increased tree decorrelation. – Soren Havelund Welling Oct 07 '15 at 11:24
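
To illustrate the evaluation point raised in the comments above, here is a minimal sketch (scikit-learn is assumed, since a max_depth parameter is mentioned; the data set is synthetic): the accuracy on the training set itself is close to 1 regardless of overfitting, while the out-of-bag score and a held-out split give meaningful estimates.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Made-up data; the point is how the forest is evaluated, not the data itself.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X_train, y_train)

    print(rf.score(X_train, y_train))  # near 1.0 -- not a meaningful estimate
    print(rf.oob_score_)               # out-of-bag estimate, free during training
    print(rf.score(X_test, y_test))    # held-out estimate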

1 Answer


I have not thought about this before. In general, the trees are non-deterministic. Instead of asking what the maximum depth is, you may want to know the average depth, or the chance that a tree has depth 20... Anyway, it is possible to calculate some bounds on the maximum depth. A node stops splitting when it runs out of either (a) in-bag samples or (b) possible splits.

(a) If the number of in-bag samples (N) is the limiting factor, one could imagine a classification tree where, at every split, all samples except one are forwarded to the left child. Then the maximum depth is N-1. This outcome is highly unlikely, but possible. In the minimal-depth tree, where the child nodes at every split are equally large, the depth would be ~log2(N), e.g. 16, 8, 4, 2, 1. In practice the tree depth will be somewhere between the maximum and the minimum. Settings controlling the minimal node size would reduce the depth.
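
As a quick illustration of these two bounds (a hypothetical helper, not part of any package):

    import math

    def depth_bounds(n):
        """Illustrative depth bounds for one fully grown tree on n in-bag samples,
        assuming splitting continues down to single-sample leaves."""
        max_depth = n - 1                    # worst case: one sample peeled off per split
        min_depth = math.ceil(math.log2(n))  # best case: perfectly balanced splits
        return min_depth, max_depth

    print(depth_bounds(16))  # (4, 15)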

(b) To check whether the features are limiting the tree depth, and you know the training set beforehand, count how many training samples are unique. Identical samples cannot be separated by any split, so only the unique samples (U) matter. Due to bootstrapping, only ~0.63 of the samples will be selected for every tree, so N ~ U * 0.63. Then use the rules from section (a). All unique samples could be selected during bootstrapping, but that is unlikely too.
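
The ~0.63 factor comes from the standard bootstrap argument: the probability that a given sample is never drawn in N draws with replacement is (1 - 1/N)^N, which approaches 1/e, so about 1 - 1/e ≈ 0.632 of the distinct samples end up in-bag. A quick sketch (NumPy assumed) confirms this empirically:

    import numpy as np

    # A bootstrap sample of size N drawn with replacement contains roughly
    # 1 - 1/e ~ 0.632 of the distinct original samples.
    rng = np.random.default_rng(0)
    N = 10_000
    inbag = rng.integers(0, N, size=N)   # in-bag indices, drawn with replacement
    print(len(np.unique(inbag)) / N)     # ~ 0.632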

If you do not know your training set, try to estimate how many levels (L[i]) could possibly be found in each feature i out of the d features. For categorical features the answer may be given directly by the number of categories. For numeric features drawn from a real distribution, there would be as many levels as there are samples. The number of possible unique samples would be U = L[1] * L[2] * L[3] * ... * L[d].
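
Putting (a) and (b) together, a small sketch with made-up feature cardinalities (the numbers are purely illustrative):

    import math

    # Hypothetical cardinalities for d = 3 categorical features
    levels = [4, 2, 10]
    U = math.prod(levels)            # at most 4 * 2 * 10 = 80 unique samples
    N = round(0.63 * U)              # expected distinct in-bag samples, per (b)
    print(N - 1)                     # upper bound on tree depth, per (a): 49
    print(math.ceil(math.log2(N)))   # lower bound on tree depth, per (a): 6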

Soren Havelund Welling