
How do I retrain my existing machine learning model in scikit-learn (Python)?

I have thousands of records with which I trained my model and dumped it as a .pkl file using pickle. When training the model for the first time, I used the warm_start=True parameter while creating the logistic regression object.

Sample Code:

 log_regression_model = linear_model.LogisticRegression(warm_start=True)
 log_regression_model.fit(X, Y)
 # Saved this model as a .pkl file on the filesystem, e.g.:
 # pickle.dump(model, open('model.pkl', 'wb'))

I want to keep this model up to date with the new data I will be getting daily. To do that, I open the existing model file, get the new data from the last 24 hours, and train it again.

Sample Code:

# Open the existing model from the filesystem
log_regression_model = pickle.load(open('model.pkl', 'rb'))
log_regression_model.fit(X, Y)  # new X, Y here is the last 24 hours of data only (a few hundred records)

But when I retrain the model by loading it from the filesystem, it seems to erase the existing model that was trained on thousands of records and create a new one from just the few hundred records of the last 24 hours (the model trained on thousands of records is 3 MB on the filesystem, while the new retrained model is only 67 KB).

I have tried using the warm_start option. How do I retrain my LogisticRegression model?

ajay_t
  • A question: can you not add the new data to the original one and retrain on the entire data set? As a side note, I would check the following link: http://scikit-learn.org/stable/modules/scaling_strategies.html. Then I would consider the mini-batch strategy that is often used in neural networks (you need to implement the gradient descent yourself), which for logistic regression would be very easy (check https://udata.science/2017/08/31/python-implementation-of-logistic-regression-from-scratch/). But even with this strategy you would need to do a few passes over the entire dataset... – Umberto Sep 19 '17 at 08:34
  • It's not efficient to train the model again with new and old data, as the data is huge and, with current resources, it takes more than 24 hours to train the model. – ajay_t Sep 19 '17 at 17:41

2 Answers


When you call fit on a trained model, you basically discard all the previous information.

Scikit-learn has some models whose partial_fit method can be used for incremental training, as described in the documentation.

I don't remember if it's possible to retrain LogisticRegression incrementally in sklearn, but sklearn has SGDClassifier, which with loss='log' runs logistic regression with stochastic gradient descent optimization, and it has a partial_fit method.
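
For example, a minimal sketch of incremental training with SGDClassifier and partial_fit (the X_initial/y_initial and X_new/y_new names are placeholders for your bulk data and the daily batch, not from the question):

import numpy as np
from sklearn.linear_model import SGDClassifier

np.random.seed(0)
X_initial, y_initial = np.random.randn(1000, 3), np.random.randint(2, size=1000)
X_new, y_new = np.random.randn(100, 3), np.random.randint(2, size=100)

# loss='log' gives logistic regression; on scikit-learn >= 1.1 use loss='log_loss'
clf = SGDClassifier(loss='log')

# The first call to partial_fit must be given every class that can ever appear
clf.partial_fit(X_initial, y_initial, classes=np.array([0, 1]))

# Later, update the same model with only the new day's records
clf.partial_fit(X_new, y_new)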

Jakub Bartczuk
  • There's been a minor update to sklearn's docs, 'The loss ‘log’ was deprecated in v1.1 and will be removed in version 1.3. Use loss='log_loss' which is equivalent.' – Kimbo Feb 03 '23 at 18:14

The size of the LogisticRegression object isn't tied to how many samples are used to train it.

import numpy as np
from sklearn.linear_model import LogisticRegression
import pickle
import sys

np.random.seed(0)
X, y = np.random.randn(100000, 1), np.random.randint(2, size=(100000,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))

np.random.seed(0)
X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))

results in

1230
1233

You might be saving the wrong model object. Make sure you're saving log_regression_model.

pickle.dump(log_regression_model, open('model.pkl', 'wb'))

With the model sizes so different, and given that LogisticRegression objects don't change size with the number of training samples, it looks like different code was used to generate your saved model and this new "retrained" model.

All that said, it also looks like warm_start isn't doing anything here:

np.random.seed(0)
X, y = np.random.randn(200, 1), np.random.randint(2, size=(200,))

# Warm start: fit on the first half, then fit the same object again on the second half
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X[:100], y[:100])
print(log_regression_model.intercept_, log_regression_model.coef_)

log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

# Cold start: fit a fresh model on the second half only
log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)

# Reference: fit a fresh model on all of the data
log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X, y)
print(log_regression_model.intercept_, log_regression_model.coef_)

gives:

(array([ 0.01846266]), array([[-0.32172516]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.09707612]), array([[ 0.01501025]]))

Based on this other question, warm_start will have some effect if you use another solver (e.g. LogisticRegression(warm_start=True, solver='sag')), but it still won't be the same as re-training on the entire dataset with the new data added. For example, the above four outputs become:

(array([ 0.01915884]), array([[-0.32176053]]))
(array([ 0.17973458]), array([[ 0.33708208]]))
(array([ 0.17968324]), array([[ 0.33707362]]))
(array([ 0.09903978]), array([[ 0.01488605]]))

You can see the middle two lines are different, but only slightly. All warm_start does is use the parameters of the previous model as a starting point for re-training on the new data. It sounds like what you want to do is save the data, and re-train on the old and new data combined every time you add data.
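
For completeness, a minimal sketch of that combine-and-retrain loop, assuming the accumulated training data is stored alongside the model (the training_data.pkl file and the X_new/y_new names are illustrative assumptions, not from the question):

import numpy as np
import pickle
from sklearn.linear_model import LogisticRegression

# Load the previously accumulated training data (assumed stored as a pair of arrays)
with open('training_data.pkl', 'rb') as f:
    X_all, y_all = pickle.load(f)

# X_new, y_new stand in for the last 24 hours of records
X_new, y_new = np.random.randn(100, 3), np.random.randint(2, size=100)

# Append the new batch to the full history
X_all = np.vstack([X_all, X_new])
y_all = np.concatenate([y_all, y_new])

# Re-train from scratch on the combined dataset, then persist both data and model
log_regression_model = LogisticRegression()
log_regression_model.fit(X_all, y_all)

with open('training_data.pkl', 'wb') as f:
    pickle.dump((X_all, y_all), f)
with open('model.pkl', 'wb') as f:
    pickle.dump(log_regression_model, f)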

Jeremy McGibbon