I have built an incremental learning model, but I am not sure whether it is right or wrong. I have two training sets: the first consists of 20,000 rows and the second of 10,000 rows. Both have two columns, description and id. With offline learning my model works fine: it predicts the correct id for a given description. datafile_train is the first training set and datafile_train1 is the second. I am using SGDClassifier and its partial_fit method for incremental learning.
1) CountVectorizer, TF-IDF and partial_fit
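import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train = vectorizer.fit_transform(datafile_train.loc[:, 'description'])
X_train_tfidf = tfidf_transformer.fit_transform(X_train)
clf = linear_model.SGDClassifier(penalty='l2', loss='hinge')
# the first partial_fit call needs the full list of classes
prd = clf.partial_fit(X_train_tfidf, datafile_train.loc[:, 'taxonomy_id'], classes=np.unique(datafile_train.loc[:, 'taxonomy_id']))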
After this I pickled the classifier and unpickled it again, to use it in the next partial_fit call for incremental learning.
2) Pickling and unpickling the classifier
import pickle

def store(prd):
    # save the trained classifier to disk
    with open("incremental", 'wb') as f:
        pickle.dump(prd, f)

store(prd)

def train_data():
    # load the previously trained classifier back from disk
    with open('incremental', 'rb') as f:
        return pickle.load(f)

clfp = train_data()
3) CountVectorizer, TF-IDF and partial_fit again for the new data
vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train1 = vectorizer.fit_transform(datafile_train1.loc[:,'description'])
X_train_tfidf1 = tfidf_transformer.fit_transform(X_train1)
prd1=clfp.partial_fit(X_train_tfidf1, datafile_train1.loc[:,'taxonomy_id'])
# here clfp is the previously trained classifier, unpickled above
I built the model like this. When I checked the size of the pickle file from the first training run, it was 5 MB. I then used this 5 MB model (clfp) to train on the new data, as you can see in the second partial_fit call. After training on the new data I pickled the classifier again, and the file is still only 5 MB. I expected it to grow, because I am training new data on top of the previously trained model. Is this the correct way to achieve incremental/online learning? Please help. I am new to machine learning, so it would be good if you could explain using code.
This error is also thrown:
ValueError: Number of features 125897 does not match previous data 124454.
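A minimal sketch of what I suspect triggers this error, using made-up toy descriptions rather than my real data. The batch contents and variable names below are assumptions for illustration only. It compares fitting a fresh CountVectorizer per batch (as in step 3) with fitting once and only calling transform() on later batches:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the two description columns (hypothetical data).
batch_1 = ["red cotton shirt", "blue denim jeans", "leather wallet"]
batch_2 = ["wool winter scarf", "red leather boots"]

# A fresh vectorizer per batch learns a different vocabulary each time,
# so the two feature matrices end up with different numbers of columns.
width_1 = CountVectorizer().fit_transform(batch_1).shape[1]
width_2 = CountVectorizer().fit_transform(batch_2).shape[1]
print("separate vectorizers:", width_1, "vs", width_2)

# Fitting once and reusing the same vectorizer keeps the column count fixed.
shared = CountVectorizer()
X_1 = shared.fit_transform(batch_1)
X_2 = shared.transform(batch_2)  # transform only, no refit
print("shared vectorizer:", X_1.shape[1], "vs", X_2.shape[1])
```

With the shared vectorizer the widths match, but words that only appear in the second batch are silently dropped, because they are not in the vocabulary learned from the first batch.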
Edit (using HashingVectorizer)
import pickle
import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer()
X_train_hashing = hashing.fit_transform(datafile_train.loc[:, 'description'])
clf = linear_model.SGDClassifier(penalty='l2', loss='hinge')
prd = clf.partial_fit(X_train_hashing, datafile_train.loc[:, 'taxonomy_id'], classes=np.unique(datafile_train.loc[:, 'taxonomy_id']))

def store(prd):
    # save the trained classifier to disk
    with open("inc", 'wb') as f:
        pickle.dump(prd, f)

store(prd)

def train_data():
    # load the previously trained classifier back from disk
    with open('inc', 'rb') as f:
        return pickle.load(f)

clfp = train_data()
Now I am using the unpickled model clfp for the next partial_fit:
import time

X_train_hashing1 = hashing.transform(datafile_train1.loc[:, 'description'])
prd1 = clfp.partial_fit(X_train_hashing1, datafile_train1.loc[:, 'taxonomy_id'])

def store(prd1):
    # save the updated classifier under a timestamped file name
    timestr = time.strftime("%Y%m%d-%H%M%S")
    filename = "Train-" + timestr + ".pickle"
    with open(filename, 'wb') as f:
        pickle.dump(prd1, f)

store(prd1)
With this edit there is no error, but both pickle files have the same size, 25.2 MB. I expected the second pickle to be larger than the first, because I am training the first model further on new data.
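To check whether the unchanged file size might simply reflect a fixed-size weight matrix, here is a small sketch on toy data. The texts, labels and the reduced n_features=2**10 are assumptions for the demo and not my real setup. It compares the shape of coef_ and the pickled size before and after a second partial_fit:

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the two training sets (hypothetical data).
texts_1 = ["red cotton shirt", "blue denim jeans", "leather wallet", "silk tie"]
labels_1 = [1, 2, 3, 1]
texts_2 = ["wool winter scarf", "red leather boots"]
labels_2 = [1, 3]

# HashingVectorizer is stateless: the output width is set by n_features,
# not by the words it has seen. A small width is used here for the demo.
hashing = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(loss="hinge", penalty="l2")

clf.partial_fit(hashing.transform(texts_1), labels_1, classes=np.unique(labels_1))
shape_before = clf.coef_.shape
size_before = len(pickle.dumps(clf))

# The second batch updates the existing weights in place.
clf.partial_fit(hashing.transform(texts_2), labels_2)
shape_after = clf.coef_.shape
size_after = len(pickle.dumps(clf))

print("coef_ shape:", shape_before, "->", shape_after)
print("pickle bytes:", size_before, "->", size_after)
```

If this carries over to my case, the classifier only stores one weight per (class, feature) pair, so more training data changes the values of the weights but not how many there are. With the default n_features of 2**20 and float64 weights, that would be about 8 MB per row of coef_.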