`Random Forest` by nature puts a massive load on the CPU and RAM, and that's one of its well-known drawbacks! So there is nothing unusual in your question.
More specifically, several different factors contribute to this issue, to name a few:
- The Number of Attributes (features) in the Dataset.
- The Number of Trees (`n_estimators`).
- The Maximum Depth of the Tree (`max_depth`).
- The Minimum Number of Samples required to be at a Leaf Node (`min_samples_leaf`).
Moreover, Scikit-learn clearly states this issue in its documentation, and I am quoting here:
> The default values for the parameters controlling the size of the trees (e.g. `max_depth`, `min_samples_leaf`, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
What to Do?
There's not much you can do, especially since Scikit-learn does not offer an option to manage the storage issue on the fly (as far as I am aware). Rather, you need to change the values of the above-mentioned parameters, for example:
- Try to keep only the most important features if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees); a minimal sketch follows after this list.
- Try to reduce the number of estimators (`n_estimators`).
- `max_depth` is `None` by default, which means the nodes are expanded until all leaves are pure or until all leaves contain less than `min_samples_split` samples.
- `min_samples_leaf` is `1` by default: a split point at any depth will only be considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- So try to change these parameters with an understanding of their effects on performance; the reference you need is this, and the sketches after this list show such settings in practice.
- The final option you have is to create your own customized `Random Forest` from scratch and load the metadata to the hard disk, etc., or do any other optimization. It's awkward, but just to mention such an option: here is an example of the basic implementation, and a disk-backed sketch follows below!
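For the feature-selection suggestion, here is a minimal sketch (synthetic data; the default mean-importance threshold is just one possible choice) using `SelectFromModel` with the forest's own feature importances:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Placeholder data with many uninformative features.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           random_state=0)

# Keep only features whose importance exceeds the mean importance (the default).
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)

# Retrain the final, smaller forest on the reduced feature set.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_reduced, y)
```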
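And for the tree-size parameters, a minimal sketch (the values here are illustrative, not recommendations; tune them against your own performance metric):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

model = RandomForestClassifier(
    n_estimators=50,      # fewer trees than the default of 100
    max_depth=12,         # stop growing instead of expanding until leaves are pure
    min_samples_leaf=5,   # forbid tiny leaves; also smooths the model
    random_state=0,
).fit(X, y)
```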
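Finally, for the from-scratch option, here is a minimal disk-backed sketch (a simplified bagging of `DecisionTreeClassifier`s, not scikit-learn's actual implementation; paths and sizes are placeholders) that keeps only one tree in RAM at a time during prediction:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
tmpdir = tempfile.mkdtemp()
n_trees, rng = 20, np.random.default_rng(0)

# Fit each tree on a bootstrap sample and immediately dump it to disk.
paths = []
for i in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=10, random_state=i).fit(X[idx], y[idx])
    path = os.path.join(tmpdir, f"tree_{i}.joblib")
    joblib.dump(tree, path)
    paths.append(path)

# Predict by majority vote, loading one tree at a time from disk.
votes = sum(joblib.load(p).predict(X) for p in paths)
y_pred = (votes > n_trees / 2).astype(int)
print("train accuracy:", (y_pred == y).mean())
```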
Side-Note:
Practically, I experienced on my `Core i7` laptop that setting the parameter `n_jobs` to `-1` overwhelms the machine; I always find it more efficient to keep the default setting of `n_jobs=None`, although theoretically speaking it should be the opposite!