Sklearn update a classifier with one or two new observations

Question

I've got a scikit-learn classifier clf which has already been fitted with some data, and there are enough observations that the initial .fit(X, y) took a while (>30s).

I get new datapoints every second, with which I'd like to update the model. Is it possible to just retrain the model on those new datapoints? Or do I have to add the new datapoints to the old data set and then retrain the model from scratch with all the data?

I'd like to avoid retraining the model from scratch, because I'm getting new data faster than the model can be retrained. (I can expand further if there are questions, but I don't think the details are relevant here.)

Example

Training the model with the old data:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
# X and y have enough observations that fitting takes >30s on my machine
clf.fit(X, y)

Obtaining new data and then retraining clf:

new_X, new_y = gather_new_data()
# What method can I call here so I don't have to wait >30s again?
clf.update_with_new_data(new_X, new_y)

Ideally, after the second code block is run, clf should behave the same as if it had been retrained on both the old and the new data. The only difference should be the training time: long for the initial fit, but short (<1s ideally) for every subsequent update.

I've had a look at .partial_fit(), but it doesn't seem to work properly. The resultant model is heavily biased towards the new observations and the documentation on the method is just a single line.

This answer's link is broken.

score 1 · Accepted Answer · answered Aug 11 '22 at 10:33

The new model is heavily biased toward the newer observations because of a phenomenon known as catastrophic forgetting. Basically, if you continue training a model on new observations only, its weights are updated to fit those observations, and it tends to forget what it learned from the older ones. Mitigating this is an open and actively researched topic (usually called incremental or continual learning), and to my knowledge no existing work can guarantee that a model retains all of the information it gained from older samples.

This is also why partial_fit() is not working as expected for you: each call updates the model using only the data passed to it. I suggest that you retrain your model periodically (once every x hours or days) on all the data available at that point in time. That is the best tradeoff you can make here.
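As a rough illustration of the periodic retraining idea, here is a minimal sketch. The PeriodicRetrainer class, its retrain_every parameter, and the synthetic data from make_classification are all my own assumptions for the example, not part of scikit-learn or of your setup. It triggers on a count of new observations rather than on a timer, purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier


class PeriodicRetrainer:
    """Hypothetical helper: buffers new observations and refits on ALL data
    once `retrain_every` new samples have arrived."""

    def __init__(self, make_model, X, y, retrain_every=50):
        self.make_model = make_model      # factory returning an unfitted model
        self.X, self.y = np.asarray(X), np.asarray(y)
        self.retrain_every = retrain_every
        self.pending = 0                  # samples seen since the last full fit
        self.n_fits = 0
        self._refit()

    def _refit(self):
        # Fit a fresh model on everything collected so far, then swap it in.
        model = self.make_model()
        model.fit(self.X, self.y)
        self.model = model
        self.pending = 0
        self.n_fits += 1

    def add(self, new_X, new_y):
        # Appending is cheap; the expensive fit only runs once per threshold.
        self.X = np.vstack([self.X, new_X])
        self.y = np.concatenate([self.y, new_y])
        self.pending += len(new_y)
        if self.pending >= self.retrain_every:
            self._refit()


# Synthetic stand-in data (assumption); replace with your own X and y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
trainer = PeriodicRetrainer(
    lambda: MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
    X[:200], y[:200], retrain_every=50,
)
for i in range(200, 300, 10):             # stand-in for gather_new_data()
    trainer.add(X[i:i + 10], y[i:i + 10])

print(trainer.n_fits, trainer.X.shape)
```

Between retrains, trainer.model keeps serving predictions from the last full fit, so the newest observations are only reflected after the next refit. In a real service you would probably run the refit in a background thread or process so that incoming data is not blocked for 30 seconds.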
