
Model fitting with a Random Forest regressor uses up all the RAM, which causes the hosted notebook environment (Google Colab or Kaggle kernel) to crash. Could you help me optimize the model?

I already tried tuning the hyperparameters, e.g. reducing the number of estimators, but it doesn't help. df.info() shows 4446965 records for the train data, which take up ~1 GB of memory.

I can't post the whole notebook code here as it would be too long, but please check this link for reference. I've provided some information below about the dataframe used for training.

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, min_samples_split=3, max_features=0.5, n_jobs=-1)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)

train_X.info() shows 3557572 records taking up almost 542 MB of memory

I'm still getting started with ML and any help would be appreciated. Thank you!

specbug

1 Answer


Random Forest by nature puts a massive load on the CPU and RAM, and that's one of its well-known drawbacks! So there is nothing unusual in your question.

More specifically, there are several factors that contribute to this issue, to name a few:

  1. The Number of Attributes (features) in the Dataset.
  2. The Number of Trees (n_estimators).
  3. The Maximum Depth of the Tree (max_depth).
  4. The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).

Moreover, Scikit-learn clearly documents this issue, and I am quoting here:

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.


What to Do?

There's not too much that you can do, especially since Scikit-learn does not offer an option to manage the memory footprint on the fly (as far as I am aware).

Rather, you need to change the values of the above-mentioned parameters, for example:

  1. Try to keep only the most important features if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees); a rough sketch of this is given after this list.

  2. Try to reduce the number of estimators.

  3. max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  4. min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

So try to change these parameters while understanding their effect on performance; the reference you need is this. As an example, the sketch below shows one possible memory-friendlier configuration.
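The following is only a minimal sketch, assuming the same train_X, train_y and val_X variables from your question; the specific values are illustrative placeholders that you would still need to tune against your validation score:

from sklearn.ensemble import RandomForestRegressor

# Cap the number and the size of the trees so each one stays small in memory,
# instead of growing fully unpruned trees (the numbers are placeholders).
clf = RandomForestRegressor(
    n_estimators=40,       # point 2: fewer trees
    max_depth=20,          # point 3: limit tree depth
    min_samples_leaf=50,   # point 4: larger leaves, far fewer nodes per tree
    max_features=0.5,
    n_jobs=-1,
)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)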

  5. The final and last option you have is to create your own customized Random Forest from scratch and offload the metadata to the hard disk, etc., or apply any other optimization. It's awkward, but just to mention such an option, here is an example of a basic implementation!
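Going back to the first point above, here is a minimal sketch of keeping only the strongest features by importance. It assumes a forest clf has already been fit on the full feature set, that train_X and val_X are pandas DataFrames, and that the cut-off of 10 features is arbitrary:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Rank features by the fitted forest's importances and keep only the top ones.
importances = clf.feature_importances_
top_cols = train_X.columns[np.argsort(importances)[::-1][:10]]

# Refit on the reduced feature set; fewer features generally means smaller trees.
clf_small = RandomForestRegressor(n_estimators=40, max_depth=20, min_samples_leaf=50, n_jobs=-1)
clf_small.fit(train_X[top_cols], train_y)
pred = clf_small.predict(val_X[top_cols])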

Side-Note:

In practice, on my Core i7 laptop, I found that setting the parameter n_jobs to -1 overwhelms the machine; I always find it more efficient to keep the default setting n_jobs=None, although theoretically speaking it should be the opposite!

Yahya
  • Thanks a lot for the detailed and helpful reply. I was able to get it running, at almost full capacity, by greatly reducing `n_estimators`! But I will keep the above in mind when using RF from Scikit-learn. – specbug Nov 24 '18 at 06:43
  • @specbug Glad I could help :) – Yahya Nov 24 '18 at 10:16