
I have been playing with XGBoost (specifically, the XGBRegressor) in Python. I used it to create a model with 200 estimators and a max_depth of 14. It was trained on about 2M training data points, and it only has 3 features and 1 output. The model is extremely accurate and I also checked that it is not overfitting. But when I save the model, it is huge! It takes up 160 MB on disk, and when I convert it to C (using Treelite) it is 490 MB on disk. I ultimately have to deploy it on machines without Python, and it has to interface with other software that can only load C files. Such a huge size is a major implementation challenge.

Is there something I am not doing right? Or are XGBoost models typically this large? I have looked around on the web, but haven't been able to find anything about this issue.

I am using Python 3.7 and Anaconda to build my models. I am not sure what other details I should post.
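
For reference, here is a minimal sketch of how the model is built and saved (variable names and the file path are placeholders; hyperparameters other than `n_estimators` and `max_depth` are left at their defaults):

```python
import os
import xgboost as xgb

# X_train has 3 feature columns, y_train is the single output
# (placeholders for the ~2M-row training set).
model = xgb.XGBRegressor(n_estimators=200, max_depth=14)
model.fit(X_train, y_train)

# Save the underlying booster and check its size on disk.
model.get_booster().save_model("model.bin")
print(os.path.getsize("model.bin") / 1e6, "MB")
```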

akgopan
  • Yes. Having a large number of estimators and a high `max_depth` can create large models in my experience. I would try lowering the max depth as much as possible (without sacrificing accuracy); a rough sketch of this tuning is included after these comments. For converting to `C` take a look at [this](https://stackoverflow.com/a/61468101/13253198) – gnodab May 04 '20 at 12:46
  • One more thing. 200 estimators may be overkill. You could try increasing the learning rate and decreasing the number of estimators. This has worked well for me in the past. – gnodab May 04 '20 at 12:50
  • Thanks gnodab. The max_depth came out of an exhaustive grid search in the vicinity of 14. I thought early stopping in the xgboost model would cap n_estimators once accuracy stopped improving. But this is good information. I will see how much room I have to sacrifice accuracy to get the model into a reasonable shape. – akgopan May 04 '20 at 13:05
  • You may not sacrifice any accuracy. In fact it is possible that it improves on your test/validation hold-out set. Also, even though the model is very large, it is still extremely fast after porting to `C`. – gnodab May 04 '20 at 13:19
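
A rough sketch of the tuning suggested in the comments (shallower trees, a higher learning rate, and early stopping to cut the number of trees). The `max_depth` and `learning_rate` values and the validation split are illustrative, and it assumes an XGBoost release (circa 1.x) where `early_stopping_rounds` is passed to `fit()`:

```python
import xgboost as xgb

# Shallower trees plus a higher learning rate; early stopping trims the
# number of boosting rounds once the validation error stops improving.
model = xgb.XGBRegressor(
    n_estimators=200,     # upper bound; early stopping usually ends sooner
    max_depth=8,          # illustrative value, tuned down from 14
    learning_rate=0.3,    # illustrative value
)

# X_train/y_train and X_val/y_val are placeholder train/validation splits.
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=20,
    verbose=False,
)

# The resulting booster should be markedly smaller on disk.
model.get_booster().save_model("model_small.bin")
```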

0 Answers