I am trying to store the TfIdf vectorizer (or model; I am not sure which is the right term) obtained after training on a dataset, and then load the stored vectorizer to apply it to a new dataset. The model is stored and loaded using pickle.

I have stored the TfIdf vocabulary obtained during the training phase. Then I load the stored vocabulary into a new vectorizer to transform the test data:

def Savetfidf(df):
    vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2))
    X = pd.SparseDataFrame(vectorizer.fit_transform(df), columns = vectorizer.get_feature_names(), default_fill_value = 0)
    pickle.dump(vectorizer.vocabulary_, open("features.pkl", "wb"))
    return X

def Loadtfidf(df):
    vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2))
    vocabulary = pickle.load(open("features.pkl", 'rb'))
    vectorizer.vocabulary_ = vocabulary
    X = pd.SparseDataFrame(vectorizer.transform(df), columns = vectorizer.get_feature_names(), default_fill_value = 0)
    return X

I am getting this error:

"sklearn.exceptions.NotFittedError: idf vector is not fitted"

As far as I understand, the vectorizer keeps its fitted state separately in idf_ and vocabulary_. But I just want to store the model/vectorizer (I am not sure of the right term) so that the next time I load it, I only need to apply it to the test data, without calling fit_transform() on the training data again. Is there any way to do that?

1 Answer

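Following the instructions here, you can pickle and unpickle the fitted vectorizer object directly, and it will take care of correct (de)serialization on its own. The error occurs because restoring vocabulary_ alone does not restore the fitted idf weights; pickling the whole vectorizer keeps both, so after loading you only need to call transform() on the new data.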

BlackBear
  • I am trying to transform a dataset from strings to numeric features; the link you posted shows how to save and load a classifier model, which is quite different from what I am trying to do – Harsh Bhagwani Jan 24 '19 at 10:06
  • @HarshBhagwani that is just an example; it works with all scikit-learn models – BlackBear Jan 24 '19 at 10:08