I am analyzing RandomForestClassifier and need some help.

The max_features parameter sets the maximum number of features considered for a split in a random forest, and it is commonly set to sqrt(n_features). If m = sqrt(n), then the number of feature combinations available for building a decision tree is nCm. What if nCm is less than n_estimators (the number of decision trees in the forest)?

Example: for n = 7 and max_features = 3, nCm is 35, meaning there are 35 unique combinations of features for the decision trees. With n_estimators = 100, will the remaining 65 trees reuse feature combinations? If so, won't the trees be correlated, introducing bias in the answer?

Venkatachalam

1 Answer

  1. The max_features parameter sets the maximum number of features to be used at each split. Hence, if there are p nodes in a tree, the feature subset is drawn separately at each of those p nodes, not once per tree.

  2. The max_samples parameter controls how many data points are sampled from X for each tree. By default, each sample has the same size as X.
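
To illustrate the first point, here is a toy simulation using only the standard library. It is not scikit-learn's implementation; the function name `simulate_tree_feature_draws` and the fixed seed are assumptions made for this sketch. It draws a fresh subset of `max_features` features at every node, so a single tree can end up looking at more than `max_features` distinct features overall.

```python
import random


def simulate_tree_feature_draws(n_features, max_features, n_nodes, seed=0):
    """Draw one random feature subset per node (hypothetical helper).

    Mimics the idea that the candidate features are re-drawn at each
    split rather than fixed once for the whole tree.
    """
    rng = random.Random(seed)
    return [
        tuple(sorted(rng.sample(range(n_features), max_features)))
        for _ in range(n_nodes)
    ]


# Values from the question: 7 features, max_features = 3, and 10 nodes.
draws = simulate_tree_feature_draws(n_features=7, max_features=3, n_nodes=10)

# Union of all features that were candidates somewhere in the tree.
features_seen = set().union(*draws)

for node, subset in enumerate(draws):
    print(f"node {node}: candidate features {subset}")
print("distinct features seen in this tree:", len(features_seen))
```

Because the subset changes from node to node, two trees would have to match at every node (and on their bootstrap sample) to be identical.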

From the documentation:

max_samples : int or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

If None (default), then draw X.shape[0] samples.

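Hence, the number of unique trees that can be formed is p! * nCm * (2N-1)! / (N! * (N-1)!), where p is the number of nodes, n is the number of features, m is max_features, and N is the number of samples in X.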
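
The last factor in the formula above is the number of distinct bootstrap samples (draws with replacement, order ignored) of the same size as the dataset. As a sanity check, the sketch below enumerates them by brute force for small datasets and compares the count with the closed form. The helper name `count_bootstrap_samples` is an assumption made for this example.

```python
from itertools import combinations_with_replacement
from math import comb


def count_bootstrap_samples(n_samples):
    """Count distinct size-n_samples multisets drawn from n_samples rows."""
    rows = range(n_samples)
    return sum(1 for _ in combinations_with_replacement(rows, n_samples))


# Brute-force count versus the closed form (2k-1)! / (k! * (k-1)!),
# which equals the binomial coefficient C(2k-1, k).
for k in range(1, 8):
    brute_force = count_bootstrap_samples(k)
    closed_form = comb(2 * k - 1, k)
    print(k, brute_force, closed_form)
    assert brute_force == closed_form
```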

For your example, suppose there are 10 nodes in each tree and 10 samples in X.

10! * 7C3 * (19! / (10! * 9!))
= 11732745024000
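
The arithmetic can be reproduced with Python's standard library. The variable names below are chosen for this sketch; the values (10 nodes, 7 features, max_features of 3, 10 samples) are the ones assumed in the example above.

```python
from math import comb, factorial

n_nodes = 10       # p: nodes per tree (assumed in the example)
n_features = 7     # n: total features
max_features = 3   # m: features considered per split
n_samples = 10     # rows in X (assumed in the example)

node_orderings = factorial(n_nodes)                     # 10!
feature_subsets = comb(n_features, max_features)        # 7C3 = 35
bootstrap_samples = comb(2 * n_samples - 1, n_samples)  # 19! / (10! * 9!)

total = node_orderings * feature_subsets * bootstrap_samples
print(total)  # 11732745024000
```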

Hence, for a reasonably sized dataset, the number of possible trees far exceeds n_estimators, so repeated trees will not introduce bias.

Venkatachalam