
I have built an incremental learning model, but I am not sure whether it is right or wrong. I have two training files: the first has 20,000 rows and the second has 10,000 rows, both with two columns, description and id. With offline learning my model works fine; it classifies the correct id for a given description. datafile_train is the first training file and datafile_train1 is the second. I am using SGDClassifier and the partial_fit method for incremental learning.

1) CountVectorizer, TfidfTransformer, and partial_fit

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import linear_model
import numpy as np

vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train = vectorizer.fit_transform(datafile_train.loc[:, 'description'])
X_train_tfidf = tfidf_transformer.fit_transform(X_train)
clf = linear_model.SGDClassifier(penalty='l2', loss='hinge')
# The first partial_fit call must receive the full set of classes
prd = clf.partial_fit(X_train_tfidf, datafile_train.loc[:, 'taxonomy_id'],
                      classes=np.unique(datafile_train.loc[:, 'taxonomy_id']))

After this I pickled the classifier and unpickled it again to use in the next partial_fit call for incremental learning.

2) Pickling and unpickling the classifier

import pickle

def store(prd):
    # Serialize the trained classifier to disk
    with open("incremental", 'wb') as f:
        pickle.dump(prd, f)

store(prd)

def train_data():
    # Load the previously trained classifier back from disk
    with open('incremental', 'rb') as f:
        return pickle.load(f)

clfp = train_data()

3) CountVectorizer, TfidfTransformer, and partial_fit again for the new data

# A brand-new CountVectorizer and TfidfTransformer are fitted here, so the
# feature space differs from the first batch (this causes the error below)
vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train1 = vectorizer.fit_transform(datafile_train1.loc[:, 'description'])
X_train_tfidf1 = tfidf_transformer.fit_transform(X_train1)
# clfp is the previously trained classifier, unpickled above
prd1 = clfp.partial_fit(X_train_tfidf1, datafile_train1.loc[:, 'taxonomy_id'])

I have built the model like this, but when I checked the size of the pickle file after the first training it was 5 MB. When I trained on the new data using this model (clfp in the second partial_fit above) and pickled it again, the file was still only 5 MB. I expected it to grow, since I am training new data on top of the previously trained model. Is this a correct way to achieve incremental/online learning? Please help; I am new to machine learning, so it would be good if you could explain using code.

Running the second partial_fit throws this error:

ValueError: Number of features 125897 does not match previous data 124454.
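This mismatch comes from fitting a fresh CountVectorizer on each batch: every fit_transform builds a new vocabulary from whatever corpus it sees, so the two batches end up with different feature counts. A minimal sketch, using made-up toy sentences rather than the real data:

from sklearn.feature_extraction.text import CountVectorizer

# Two toy corpora standing in for the two training files
batch1 = ["red apple pie", "green apple tart"]
batch2 = ["blue cheese sandwich", "fresh basil pesto pasta"]

# Each fit builds its own vocabulary from its own corpus
v1 = CountVectorizer().fit(batch1)
v2 = CountVectorizer().fit(batch2)

print(len(v1.vocabulary_))  # 5 features (red, apple, pie, green, tart)
print(len(v2.vocabulary_))  # 7 features -- a different width, so
                            # partial_fit on the old model fails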

Edit (using HashingVectorizer)

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer()
X_train_hashing = hashing.fit_transform(datafile_train.loc[:, 'description'])
clf = linear_model.SGDClassifier(penalty='l2', loss='hinge')
prd = clf.partial_fit(X_train_hashing, datafile_train.loc[:, 'taxonomy_id'],
                      classes=np.unique(datafile_train.loc[:, 'taxonomy_id']))

def store(prd):
    # Serialize the first trained model
    with open("inc", 'wb') as f:
        pickle.dump(prd, f)

store(prd)

def train_data():
    # Load the first trained model back
    with open('inc', 'rb') as f:
        return pickle.load(f)

clfp = train_data()

Now I am using the unpickled clfp model for the next partial_fit:

import time

# The same hashing vectorizer maps the new batch into the same fixed
# feature space, so partial_fit accepts it without a shape mismatch
X_train_hashing1 = hashing.transform(datafile_train1.loc[:, 'description'])
prd1 = clfp.partial_fit(X_train_hashing1, datafile_train1.loc[:, 'taxonomy_id'])

def store(prd1):
    # Save the updated model under a timestamped file name
    timestr = time.strftime("%Y%m%d-%H%M%S")
    filename = "Train-" + timestr + ".pickle"
    with open(filename, 'wb') as f:
        pickle.dump(prd1, f)

store(prd1)

With this edit no error is thrown, but both pickle files have the same size, 25.2 MB. I expected the second pickle to be larger than the first, because I am training the first model further on new data.
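The reason the ValueError has gone away: HashingVectorizer is stateless. It hashes tokens into a fixed-width feature space (n_features, which defaults to 2**20 = 1,048,576), so every batch produces a matrix of the same width regardless of its vocabulary, and fit_transform and transform behave identically. A minimal sketch with toy sentences, illustrative only:

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer()  # stateless; default n_features = 2**20

# Two unrelated batches still map into the same fixed-width space
X1 = hashing.transform(["red apple pie", "green apple tart"])
X2 = hashing.transform(["blue cheese sandwich"])
print(X1.shape[1], X2.shape[1])  # 1048576 1048576 -- always equal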

  • Your efforts to use `partial_fit()` on SGDClassifier are undone by re-fitting the CountVectorizer and TfidfTransformer. You need to save the original ones and reuse them for the second batch of data. But they don't have a `partial_fit()` method! So if you want to do incremental training, you need to switch them for other transformers. – Vivek Kumar Oct 10 '17 at 12:03
  • @VivekKumar Thank you for your response. I have edited my code; can you please check? – outlier Oct 10 '17 at 13:03

1 Answer


I don't think the saved model's size should increase much, or at all.

The model does not store the new data passed to partial_fit(); it only updates its attributes based on that data. Once those attributes are allocated storage based on their dtype (float32, float64, etc.), they occupy that much space regardless of their values.

The notable attributes that change in SGDClassifier are:

coef_ : array, shape (1, n_features) if n_classes == 2 else (n_classes, n_features). Weights assigned to the features.

intercept_ : array, shape (1,) if n_classes == 2 else (n_classes,). Constants in the decision function.

When you initialize the model, these are either unassigned or initialized to 0. Once you pass your first data to partial_fit(), they are updated so as to minimize the loss over the predictions.

When you pass the new data, these values are updated again, but they still occupy the same storage space designated by their dtype (float32, float64, etc.).

That is why the saved model's size does not change.
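To make this concrete, here is a minimal sketch (toy data; the HashingVectorizer setup mirrors the edit above) showing that a second partial_fit changes the values of coef_ but not its shape, which is why the pickled size stays constant:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

hashing = HashingVectorizer(n_features=2**10)  # small space for the demo
clf = SGDClassifier(penalty='l2', loss='hinge')

X1 = hashing.transform(["red apple", "blue cheese"])
clf.partial_fit(X1, [0, 1], classes=[0, 1])
coef_before = clf.coef_.copy()

X2 = hashing.transform(["green apple", "old cheese"])
clf.partial_fit(X2, [0, 1])

# Same shape (hence the same pickle size), but the values have moved
print(clf.coef_.shape == coef_before.shape)  # True
print(np.allclose(clf.coef_, coef_before))   # False: the weights updated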

– Vivek Kumar
  • So according to your answer my model is working incrementally? Is there any way to check whether my model has been trained on both training files? I tried to print attributes for both trained models. 1) For the first model, print(prd.coef_) gives [[ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.]] and print(prd.intercept_) gives [-0.26029437 0.07395285 -0.2551581 ] – outlier Oct 11 '17 at 12:13
  • 2) For the second model, print(prd1.coef_) gives [[ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.]] and print(prd1.intercept_) gives [-0.2859673 0.0863299 -0.26940698]. What do you think, looking at these attributes of both trained models? – outlier Oct 11 '17 at 12:19
  • @outlier Since you have used `partial_fit()`, and assuming there was no error during dumping and loading the model, this is incremental training. To verify, you can take test sets from both data files, predict the labels, and calculate metrics on them. Anyway, without the actual data and the complete output of `coef_` I cannot say anything more. – Vivek Kumar Oct 11 '17 at 12:38
  • I used the same test file on the first partial_fit and the final partial_fit, and the results were slightly different: the first partial_fit gives very good accuracy, while the final partial_fit misclassifies a few records. Any suggestions or tuning methods to increase the accuracy of the final partial_fit? – outlier Oct 11 '17 at 13:36
  • If you want, I can give you the complete output of coef_; just give me your email id so I can send it. – outlier Oct 11 '17 at 13:37