
I tried to predict the label of my newly added data through `SGDClassifier.partial_fit`, as below:

from sklearn import neighbors, linear_model
import numpy as np

def train_predict():

    X = [[1, 1], [2, 2.5], [2, 6.8], [4, 7]]
    y = [1, 2, 3, 4]

    sgd_clf = linear_model.SGDClassifier(loss="log")
    sgd_clf.fit(X, y)

    X1 = [[6, 9]]
    y1 = [5]

    f1 = sgd_clf.partial_fit(X1, y1)
    f1.predict([[6, 9]])

    return f1


if __name__ == "__main__":
    clf = train_predict()

`fit` predicts the labels perfectly. However, prediction after `partial_fit` fails with:

in compute_class_weight
    raise ValueError("classes should include all valid labels that can be in y")

Similar to *Sklearn SGDC partial_fit ValueError: classes should include all valid labels that can be in y*, I read the `partial_fit` documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit

But I am still unable to figure out how to set the parameters of `partial_fit` so that I can predict data added on the fly.

Any references or ideas?

user1

1 Answer


The underlying problem seems to be that your input data to `partial_fit` is not a subset of your original data (the data passed to `.fit()`).

That requirement is at least how I interpret the documentation for `X` and `y` in `partial_fit()`:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Subset of the training data

y : numpy array, shape (n_samples,)

Subset of the target values

It also becomes apparent from the error when you use your `X1` and `y1` with `classes=np.unique(y1)` (as suggested in the documentation), which yields:

ValueError: `classes=array([5])` is not the same as on last call to
    partial_fit, was: array([1, 2, 3, 4])

This indicates that `partial_fit` is used by `fit` under the hood.

The following example works:

# Reuse a subset of the original training data
X1 = X[2:3]
y1 = y[2:3]

classes = np.unique(y)
f1 = sgd_clf.partial_fit(X1, y1, classes=classes)

So make sure `X1` and `y1` are included in your original data sets.
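Putting the pieces together, here is a self-contained version of the fix, reusing the question's data. (Two small deviations from the question's code, for illustration only: `random_state=0` is added for reproducibility, and the default loss is used because `loss="log"` was renamed to `"log_loss"` in recent scikit-learn versions.)

```python
from sklearn import linear_model
import numpy as np

X = [[1, 1], [2, 2.5], [2, 6.8], [4, 7]]
y = [1, 2, 3, 4]

sgd_clf = linear_model.SGDClassifier(random_state=0)
sgd_clf.fit(X, y)

# partial_fit with a subset of the original data and the full class list;
# classes matches what fit() saw, so no ValueError is raised
X1 = X[2:3]
y1 = y[2:3]
f1 = sgd_clf.partial_fit(X1, y1, classes=np.unique(y))

print(f1.predict([[6, 9]]))
```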

  • Thanks @Evert, As suggested, I added data to X, y here - https://ideone.com/Tvm40m. This resulted in error `classes=array([1, 2, 3, 4, 5])` is not the same as on last call to partial_fit, was: array([1, 2, 3, 4]) – user1 Feb 21 '18 at 10:48
  • If you run the code at https://ideone.com/Tvm40m, do you get also the similar error or its working fine ? – user1 Feb 21 '18 at 11:01
  • @krishnadamarla I only see an `ImportError` for `sklearn`; that's not too helpful. –  Feb 21 '18 at 11:55
  • @krishnadamarla Anyway, the way you're doing it there is incorrect: you're adding values *after* you have done the initial fit. So the classifier will still complain about new classes/non-matching classes that weren't in the initial fit. You need to have the relevant data *before* the first call to `fit` or `partial_fit`. –  Feb 21 '18 at 11:57
  • @krishnadamarla Aside, if you are interested only in the last elements, you probably don't want to use the `[2:3]` indices, but rather `X1[-1:]` and `y[-1:]`. Lastly, `.predict()` needs a 2D array here: `f1.predict([[6, 9]])`. –  Feb 21 '18 at 11:59
  • Thanks @Evert, I changed it as you specified and it works fine now. But what is the purpose of `partial_fit` in terms of online learning if I give all my incoming data before the `fit()` method? Now I can predict my new data even without `partial_fit`, as shown here: https://ideone.com/SFwdI5. I actually want to retrain my model built with `fit()` using `partial_fit`, instead of from scratch. If I give all new data before the `fit()` method, isn't that retraining on all my data from scratch? I just wanted to share the code with you via ideone; we cannot run it there, as they don't seem to support the sklearn package. – user1 Feb 21 '18 at 12:19
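Regarding the last comment: the usual online-learning pattern is to skip `fit()` entirely and declare *every* label the stream may ever produce on the first call to `partial_fit`; later calls then update the model in place without retraining from scratch. A minimal sketch of that pattern, assuming the full label set ([1..5] here) is known upfront (`random_state=0` added only for reproducibility):

```python
from sklearn import linear_model
import numpy as np

# All labels the data stream may ever produce must be declared on the
# FIRST call to partial_fit -- this is the only global information needed.
all_classes = np.array([1, 2, 3, 4, 5])

sgd_clf = linear_model.SGDClassifier(random_state=0)

# Initial batch
sgd_clf.partial_fit([[1, 1], [2, 2.5], [2, 6.8], [4, 7]],
                    [1, 2, 3, 4],
                    classes=all_classes)

# A sample with a previously unseen (but declared) label arrives later;
# the model is updated incrementally, with no ValueError raised.
sgd_clf.partial_fit([[6, 9]], [5])
print(sgd_clf.predict([[6, 9]]))
```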