
I'm using scikit-learn's Random Forest to fit training data (~30 MB) and my laptop keeps crashing because it runs out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air, 2 GHz with 8 GB of memory.

What are some of the ways to deal with this?

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation  # renamed to sklearn.model_selection in later scikit-learn releases

rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc'))
ananuc
  • Which version of scikit-learn are you using? Version 0.15 has some major improvements in memory consumption in the forests. – Andreas Mueller Jan 05 '15 at 23:12
  • '0.15.2'. I tried switching to GBRT, which is built sequentially, but somehow it runs out of memory too. Does that mean I really need to try running on an EC2 cluster, or do random sampling? – ananuc Jan 06 '15 at 07:46
  • @AndreasMueller: thanks for the useful talk on Advanced Sklearn. Maybe I can try some ideas from there. I haven't had the chance to go through ogrisel's parallel ML tutorial on EC2 yet. I wonder beyond which point we need to consider spinning up EC2 clusters? – ananuc Jan 06 '15 at 07:55
  • Glad you liked it. As @Timo suggested, you would need to adjust your parameters to work on this box. I would recommend some regularization such as "max_depth" or "max_leaf_nodes"; that should reduce the memory consumption quite a bit, or you can reduce n_estimators. Another option would be to switch to GradientBoostingClassifier, for which you might need fewer or less deep estimators, and which trains sequentially. – Andreas Mueller Jan 06 '15 at 20:52
  • When should you go to EC2? If you start with low "max_depth" and "n_estimators" it will work on your laptop. Plot how accuracy improves with more estimators or deeper trees. If it looks like it will improve further with more memory, or if it takes too long, consider EC2. Trying EC2 is cheap and easy, by the way. – Andreas Mueller Jan 06 '15 at 20:54

2 Answers


Your best option is to tune the arguments you pass to RandomForestClassifier and cross_val_score.

n_jobs=4

Since n_jobs is set on the forest, this makes scikit-learn build four trees in parallel. The parallel jobs run in separate Python processes, so the full dataset is also copied for each of them. Try reducing n_jobs to 2 or 1 to save memory: n_jobs=4 uses roughly four times the memory that n_jobs=1 does.

cv=20

This splits the data into 20 folds, so the code runs 20 train-test iterations, and in each iteration the training set is 19/20 of the original data. You can quite safely reduce it to 10; your accuracy estimate may get slightly worse, and it won't save much memory, but it roughly halves the runtime.

n_estimators = 100

Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.

To sum up, I'd recommend reducing n_jobs to 2 to save memory (roughly a 2-fold increase in runtime). To compensate for the runtime, I'd suggest changing cv to 10 (roughly a 2-fold saving in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (roughly twice as fast again).
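
For concreteness, here is a minimal sketch of the lower-memory configuration described above. It assumes X_train_a and y_train from the question are in scope, uses the current sklearn.model_selection import path (the question used the older sklearn.cross_validation module), and the values are simply the recommendations above, not tuned results.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Lower-memory variant of the snippet in the question (illustrative, not tuned):
# n_jobs=2 halves the number of parallel workers holding the data, cv=10 halves
# the number of CV fits, and n_estimators=50 halves the size of each forest.
rf = RandomForestClassifier(n_estimators=50, n_jobs=2)
scores = cross_val_score(rf, X_train_a, y_train, cv=10, scoring='roc_auc')
print("10 Fold CV Score:", np.mean(scores))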

Timo
  • As of 2021 `n_jobs` has little effect on RAM usage, since there have been improvements in the sklearn library, so reducing `n_jobs` does not provide much benefit. https://stackoverflow.com/questions/23118309/scikit-learn-randomforest-memory-error – Peter Jan 22 '21 at 09:13

I was dealing with a ~4 MB dataset, and a Random Forest from scikit-learn with default hyper-parameters was ~50 MB (so more than 10 times the size of the data). By setting max_depth=6, the memory consumption decreased 66 times. The performance of the shallow Random Forest on my dataset improved! I wrote this experiment up in a blog post.

In my experience, memory usage can grow even more in regression tasks, so it is important to control the tree depth. The depth can be limited directly with max_depth, or indirectly by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, or max_leaf_nodes.

The memory footprint of the Random Forest can, of course, also be controlled with the number of trees in the ensemble (n_estimators).
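
To see this kind of effect yourself, here is a rough sketch that compares the serialized size of a forest with and without a depth cap on synthetic data. Pickle size is only a convenient proxy for memory use, and the dataset and numbers here are illustrative, not the ones from the blog post above.

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; any dataset shows the same trend.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for depth in (None, 6):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    rf.fit(X, y)
    size_mb = len(pickle.dumps(rf)) / 1e6  # serialized size as a rough proxy for the trees' RAM
    print("max_depth=%s -> ~%.1f MB" % (depth, size_mb))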

pplonski