The size of a LogisticRegression
object isn't tied to how many samples are used to train it.
from sklearn.linear_model import LogisticRegression
import numpy as np
import pickle
import sys
np.random.seed(0)
X, y = np.random.randn(100000, 1), np.random.randint(2, size=(100000,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))
np.random.seed(0)
X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X, y)
print(sys.getsizeof(pickle.dumps(log_regression_model)))
results in
1230
1233
You might be saving the wrong model object. Make sure you're saving log_regression_model.
pickle.dump(log_regression_model, open('model.pkl', 'wb'))
Since your saved model and your "retrained" model are such different sizes, and LogisticRegression
objects don't change size with the number of training samples, it looks like different code is being used to generate your saved model and this new "retrained" model.
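If it helps, a quick way to rule out a save/load mix-up is a round trip: dump the fitted model, load it back, and check that the loaded copy predicts identically. (The file name model.pkl here is just an example.)

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))

model = LogisticRegression()
model.fit(X, y)

# Save the fitted model, then load it back from disk.
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    loaded = pickle.load(f)

# The loaded copy should behave exactly like the original.
assert (model.predict(X) == loaded.predict(X)).all()
```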
All that said, it also looks like warm_start isn't doing anything here:
np.random.seed(0)
X, y = np.random.randn(200, 1), np.random.randint(2, size=(200,))
log_regression_model = LogisticRegression(warm_start=True)
log_regression_model.fit(X[:100], y[:100])
print(log_regression_model.intercept_, log_regression_model.coef_)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)
log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X[100:], y[100:])
print(log_regression_model.intercept_, log_regression_model.coef_)
log_regression_model = LogisticRegression(warm_start=False)
log_regression_model.fit(X, y)
print(log_regression_model.intercept_, log_regression_model.coef_)
gives:
(array([ 0.01846266]), array([[-0.32172516]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.17253402]), array([[ 0.33734497]]))
(array([ 0.09707612]), array([[ 0.01501025]]))
Based on this other question, warm_start
will have some effect if you use another solver (e.g. LogisticRegression(warm_start=True, solver='sag')
), but it still won't be the same as re-training on the entire dataset with the new data added. For example, the above four outputs become:
(array([ 0.01915884]), array([[-0.32176053]]))
(array([ 0.17973458]), array([[ 0.33708208]]))
(array([ 0.17968324]), array([[ 0.33707362]]))
(array([ 0.09903978]), array([[ 0.01488605]]))
You can see the middle two lines are different, but not by much. All warm_start does is use the parameters from the last fit as a starting point for re-fitting on the new data; it does not accumulate the old data's information. It sounds like what you actually want is to save the data itself, and re-train on the old and new data combined every time you add more.
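A minimal sketch of that approach (variable names are illustrative): keep the raw training data around, append new samples to it, and re-fit on the full set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X_old, y_old = np.random.randn(100, 1), np.random.randint(2, size=(100,))
X_new, y_new = np.random.randn(50, 1), np.random.randint(2, size=(50,))

# Combine the stored data with the newly arrived data...
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

# ...and re-train from scratch on everything.
model = LogisticRegression()
model.fit(X_all, y_all)
```

This is more expensive than a warm start, but it's the only way (with this estimator) to get a model equivalent to one trained on the entire dataset.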