I am using auto-sklearn to generate a regression model based on some data. After running for several hours, I save the generated model to disk for later use with joblib and the generated file has a size of 2.5 GiB.

How can I reduce the file size of the saved model? I only need to be able to make predictions with the model in the future.

Hawkings

1 Answer

Depending on the kind of model you use, there is a good chance that you can't. A model this large is probably a neural network or a random forest. Unfortunately, there is no easy way to shrink either one, and if you do, you will most likely lose accuracy.

For neural networks, the only option is to reduce the complexity of the network. For a random forest, you can look into tree pruning, but I don't think it will save a significant amount of memory.
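
To get a rough feel for how tree count and depth drive the size of a forest, here is a back-of-the-envelope sketch. It is only an upper bound: it assumes every tree is a full binary tree, and the default `bytes_per_node` of 64 is an assumed figure for illustration, not a value taken from scikit-learn or auto-sklearn. The function names are hypothetical.

```python
def max_nodes(max_depth: int) -> int:
    """Upper bound on the node count of one binary tree of the given depth."""
    return 2 ** (max_depth + 1) - 1


def forest_size_upper_bound(n_trees: int, max_depth: int,
                            bytes_per_node: int = 64) -> int:
    """Rough upper bound, in bytes, on the storage needed for a forest.

    bytes_per_node is an assumed per-node cost, not a measured one.
    """
    return n_trees * max_nodes(max_depth) * bytes_per_node


if __name__ == "__main__":
    # Compare a few (n_trees, max_depth) settings.
    for n_trees, depth in [(100, 20), (100, 15), (50, 15)]:
        size = forest_size_upper_bound(n_trees, depth)
        print(f"{n_trees} trees, depth {depth}: at most {size / 2**20:.1f} MiB")
```

The bound is linear in the number of trees but roughly doubles with every extra level of depth, so limiting depth usually matters more. Real trees are rarely full, so actual files tend to be smaller than this estimate.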

If your question was whether the model contains anything that is only useful for training and can be deleted: maybe a few variables, but nothing big enough to be worth your time (a few KB at most).
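
If you want to check this for your own model rather than take it on faith, one possible approach is to pickle each attribute separately and compare the sizes. The helper below is a minimal sketch using only the standard library; `attribute_sizes` and `ToyModel` are hypothetical names for illustration, and it assumes the object keeps its state in `__dict__` and that every attribute can be pickled.

```python
import pickle


def attribute_sizes(obj):
    """Return (attribute name, pickled size in bytes) pairs, largest first."""
    sizes = {name: len(pickle.dumps(value)) for name, value in vars(obj).items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


class ToyModel:
    """Stand-in for a fitted model; not a real estimator."""

    def __init__(self):
        self.weights = list(range(10_000))   # plays the role of the fitted parameters
        self.training_log = ["epoch"] * 10   # plays the role of training-only state


if __name__ == "__main__":
    for name, size in attribute_sizes(ToyModel()):
        print(f"{name}: {size} bytes")
```

Because shared objects are counted once per attribute, the individual sizes can add up to more than the size of the whole pickled object, so treat the numbers as a ranking rather than an exact breakdown.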

Jonathan DEKHTIAR
  • This might be the issue, as the model is a random forest. I will try to restrict neural networks and random forest to force the generation of a simpler model. I will mark your answer as accepted if it works. – Hawkings Feb 02 '18 at 11:16
  • There are two factors to take into account for a random forest: the number of trees and the depth of each tree. Decrease both. – Jonathan DEKHTIAR Feb 02 '18 at 11:22
  • I disabled the use of random forests and neural networks and ran it again, and the new model has a size of 5.6 GiB. Unless I'm missing something, I think your answer is wrong. – Hawkings Feb 06 '18 at 10:43
  • Your answer is nonsense. You can't disable the use of random forests or neural networks. It doesn't mean anything. I think you have misunderstood my point. – Jonathan DEKHTIAR Feb 06 '18 at 12:54
  • Yes I can. I use the parameter [exclude_estimators](http://automl.github.io/auto-sklearn/stable/manual.html#restricting-the-searchspace) of AutoSklearnRegressor to disable their use. This way autosklearn cannot use them and only tries the rest of the algorithms. – Hawkings Feb 07 '18 at 09:56
  • Oh, I didn't understand that you were using this automatic model. The same principles apply. You have to understand which kind of model is used, and then you can limit its memory footprint. By the way, an automatic pipeline is clearly not the way to customize the results you obtain. If you want a model that complies with constraints A and B, just build it manually. If you don't want to do it manually, just accept the result you get. – Jonathan DEKHTIAR Feb 08 '18 at 15:05