I've got a scikit-learn classifier clf which has already been fitted with some data, and there are enough observations that the initial .fit(X, y) took a while (>30s).
I get new datapoints every second, and I'd like to update the model with them. Is it possible to train the model on just those new datapoints, or do I have to add them to the old data set and retrain the model from scratch on all the data?
I'd like to avoid retraining from scratch, because new data arrives faster than the model can be retrained (I can expand further if there are questions, but I don't think the details are relevant to the question).
Example
Training the model with the old data:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
# X and y have enough observations that fitting takes >30s on my machine
clf.fit(X, y)
Obtaining new data and then retraining clf:
new_X, new_y = gather_new_data()
# What method can I call here so I don't have to wait >30s again?
clf.update_with_new_data(new_X, new_y)
Ideally, after the second code block is run, clf should behave the same as if it had been retrained on both the old and the new data. The only difference should be the training time: long for the initial fit, but short (<1s ideally) for every subsequent update.
I've had a look at .partial_fit(), but it doesn't seem to work properly. The resulting model is heavily biased towards the new observations, and the documentation for the method is just a single line.
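For reference, here is a minimal sketch of the kind of update I have in mind, on synthetic data. The make_classification data, the layer size, and the replay size are placeholders, not my real setup. Mixing a random sample of old rows into each new batch is only my guess at reducing the bias towards new observations; it is not something the documentation prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real data (assumption: binary labels, 20 features).
X_all, y_all = make_classification(n_samples=2000, n_features=20, random_state=0)
X, y = X_all[:1500], y_all[:1500]                  # "old" data
new_X, new_y = X_all[1500:1600], y_all[1500:1600]  # one "new" batch

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=0)
clf.fit(X, y)  # the slow initial fit on the old data

# Mix the new batch with a random sample of old rows so that the update
# is not driven by the new observations alone (my assumption, see above).
rng = np.random.default_rng(0)
replay_idx = rng.choice(len(X), size=len(new_X), replace=False)
batch_X = np.vstack([new_X, X[replay_idx]])
batch_y = np.concatenate([new_y, y[replay_idx]])

# One incremental pass over the mixed batch. classes_ was already set by
# fit(), so the classes argument is not needed here.
clf.partial_fit(batch_X, batch_y)
```

This runs quickly, but I don't know whether the result is close to what a full retrain on the old and new data would give.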
This answer's link is broken.