I've been training sklearn Random Forests on a regular basis over the last few months. I've noticed that when exporting the model to a file using joblib, the file size has increased dramatically, from 2.5 GB up to 11 GB. All the parameters have remained the same, and the number of training features has remained fixed. The only difference is that the number of examples in the training data has increased.

Given that the parameters have remained fixed, and that the number of estimators and the depth of each tree are specified, why would increasing the number of examples increase the size of the Random Forest?

Here are the parameters for the model:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
        max_depth=None, max_features='sqrt', max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=20, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
        oob_score=False, random_state=123, verbose=0, warm_start=False)
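
In case it helps pin down where the extra size is going, here is a sketch of how I can inspect the fitted trees (clf stands for the fitted forest above; the snippet is only illustrative):

    # clf is assumed to be the fitted RandomForestClassifier configured above.
    node_counts = [est.tree_.node_count for est in clf.estimators_]
    depths = [est.tree_.max_depth for est in clf.estimators_]

    print("trees:", len(clf.estimators_))
    print("total nodes:", sum(node_counts))
    print("mean nodes per tree:", sum(node_counts) / len(node_counts))
    print("mean tree depth:", sum(depths) / len(depths))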
thornate
  • In short, each tree in the forest contains all of your training samples at its leaves, since you set max_depth to None. So as the training data grows, each tree goes deeper and hence gets bigger in terms of storage. – Quang Hoang Jan 04 '19 at 23:05
  • Not all of the samples; min_samples_leaf=20 will limit the depth a little bit. – Jon Nordby Jan 08 '19 at 23:28

1 Answer

I would set min_samples_leaf as a float; it is then interpreted as a fraction of your training dataset. For instance, min_samples_leaf=0.01 requires at least 1% of the samples in each leaf.
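
For example, something along these lines (the 0.01 value is just an illustration, not a tuned choice):

    from sklearn.ensemble import RandomForestClassifier

    # A float min_samples_leaf is interpreted as a fraction of the training set,
    # so the minimum leaf size grows with the data instead of staying fixed at 20 samples.
    clf = RandomForestClassifier(
        n_estimators=1000,
        max_features='sqrt',
        min_samples_leaf=0.01,  # at least 1% of samples in every leaf (illustrative)
        n_jobs=-1,
        random_state=123,
    )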

To optimize the size of your model you can run a GridSearchCV over min_samples_leaf and n_estimators. Unless you have a very large number of classes and features, you can probably reduce the model size by a couple of orders of magnitude.
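
A rough sketch of such a search (the grid values are only examples, and X_train/y_train are placeholders for your training data):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Grid values below are illustrative, not recommendations.
    param_grid = {
        'min_samples_leaf': [0.001, 0.005, 0.01, 0.05],
        'n_estimators': [100, 300, 1000],
    }
    search = GridSearchCV(
        RandomForestClassifier(max_features='sqrt', random_state=123),
        param_grid,
        cv=3,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    print(search.best_params_)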

Jon Nordby