I have already trained a model for topic classification. When I try to transform new data into vectors for prediction, it fails with "NotFittedError: CountVectorizer - Vocabulary wasn't fitted." However, prediction works when I split the training data into training and test sets and predict on the test split with the trained model. Here is the code:

from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

# read new dataset
testdf = pd.read_csv('C://Users/KW198/Documents/topic_model/training_data/testdata.csv', encoding='cp950')

testdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 2 columns):
keywords    1800 non-null object
topics      1800 non-null int64
dtypes: int64(1), object(1)
memory usage: 28.2+ KB

# read columns
kw = testdf['keywords']
label = testdf['topics']

# Transform the prediction data into vectors
vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_testkw_vec = vectorizer.transform(kw)

Here is the error:

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-93-cfcc7201e0f8> in <module>()
      1 # Transform the prediction data into vectors
      2 vectorizer = CountVectorizer(min_df=1, stop_words='english')
----> 3 x_testkw_vec = vectorizer.transform(kw)

~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
    918             self._validate_vocabulary()
    919 
--> 920         self._check_vocabulary()
    921 
    922         # use the same matrix-building strategy as fit_transform

~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in _check_vocabulary(self)
    301         """Check if vocabulary is empty or missing (not fit-ed)"""
    302         msg = "%(name)s - Vocabulary wasn't fitted."
--> 303         check_is_fitted(self, 'vocabulary_', msg=msg),
    304 
    305         if len(self.vocabulary_) == 0:

~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
    766 
    767     if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768         raise NotFittedError(msg % {'name': type(estimator).__name__})
    769 
    770 

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Vivek Kumar
Ken Hsieh

1 Answer

You need to call vectorizer.fit() so that the count vectorizer builds its vocabulary before you call vectorizer.transform(). Alternatively, call vectorizer.fit_transform(), which combines both steps.

However, you should not create a new vectorizer for testing or any other kind of inference. Use the same vectorizer you used when training the model; otherwise your results will be effectively random, because the vocabularies differ (some words are missing, the column indices do not line up, etc.).
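To make the misalignment concrete, here is a minimal sketch. The toy documents and variable names are made up for illustration and are not from the question. It compares a freshly fitted vectorizer with the training vectorizer on the same new data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, standing in for the real training and new datasets.
train_docs = ["stock market rises", "football match tonight"]
new_docs = ["market crash tonight", "election results"]

# Vectorizer fitted on the training data (the one the model was trained with).
train_vec = CountVectorizer(stop_words="english")
X_train = train_vec.fit_transform(train_docs)

# Wrong: a new vectorizer fitted on the new data learns a different vocabulary.
fresh_vec = CountVectorizer(stop_words="english")
X_wrong = fresh_vec.fit_transform(new_docs)

# Right: reuse the training vectorizer and only call transform().
X_right = train_vec.transform(new_docs)

print(sorted(train_vec.vocabulary_))  # words the model knows about
print(sorted(fresh_vec.vocabulary_))  # a different set of words
print(X_train.shape, X_wrong.shape, X_right.shape)
```

Here the fresh vectorizer produces a matrix with a different number of columns than the training matrix, which is the kind of situation that leads to a "dimension mismatch" error at prediction time. Even when the column counts happen to match, the columns would refer to different words.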

To do that, pickle the vectorizer used in training and load it at inference/test time.
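A rough sketch of that workflow follows. The toy documents and the file name `vectorizer.pkl` are placeholders; in practice the first half would live in the training script and the second half in the prediction script, with the file stored somewhere both can reach:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# --- training script ---
# Hypothetical toy data; in the question this would be the training keywords.
train_docs = ["stock market rises", "football match tonight"]

vectorizer = CountVectorizer(min_df=1, stop_words="english")
X_train = vectorizer.fit_transform(train_docs)  # fit ONLY on training data

# Placeholder location; a temp directory is used so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# --- prediction script ---
with open(path, "rb") as f:
    loaded_vectorizer = pickle.load(f)

# Only transform() here: no fit() and no fit_transform().
new_docs = ["market tonight", "unseen words only"]
X_new = loaded_vectorizer.transform(new_docs)

print(X_new.shape)      # same number of columns as X_train
print(X_new.toarray())  # out-of-vocabulary words are simply ignored
```

The trained model can be saved and loaded the same way, and `X_new` is what you would pass to `model.predict()`. Pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that created them, and only from sources you trust.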

umutto
  • I have tried that, but when I ran model.predict() I got an error: ValueError: dimension mismatch. Then I found an answer on Stack Overflow saying that fit_transform should be called only on the training part of the data, not the test part. https://stackoverflow.com/questions/28093984/scikit-learn-valueerror-dimension-mismatch – Ken Hsieh Mar 29 '18 at 04:01
  • Actually, I am going to deploy this model, and I think I need to convert words to vectors before prediction. I have separated the prediction function and the trained model into two files. – Ken Hsieh Mar 29 '18 at 04:05
  • @KenHsieh Yes, you should fit the vectorizer on your training data and use that same vectorizer on both training and test (you can fit on both sets, but ideally you shouldn't). The problem here is that you are creating a new vectorizer and calling transform before the vocabulary has been built. I'm sorry if the first part of my answer is confusing: it fixes the error you've posted, but you need to change more to fix the real problem, which is using separate vectorizers. Oh, and you may have some problems with OOV (out-of-vocabulary) words, but that's something you have to solve separately. – umutto Mar 29 '18 at 04:07
  • Got it! Now I know what the issue is. Thanks. To fix this problem I need to use the same vectorizer as the trained model and find a way to import(?) it into the prediction file. – Ken Hsieh Mar 29 '18 at 05:46
  • @KenHsieh You can save (pickle) the vectorizer to a file and then load it at prediction time. – Vivek Kumar Mar 29 '18 at 05:59
  • @KenHsieh Yes, you need to use the same vectorizer; the mapping from words to indices must be the same for correct test results. You can check the `vocabulary_` attribute of that vectorizer to see the dictionary of mappings. If you mean vectorizing the prediction files, you can use pickle to save the training vectorizer to disk and load it for testing. Bear in mind that you only need to call `transform()` on your test files, not `fit` or `fit_transform`. It needs to use the same vocabulary for correct results. – umutto Mar 29 '18 at 05:59
  • To save the vectorizer (specific to CountVectorizer), you can just save your vocabulary as a list of words in a txt file. You can get the fitted vocabulary from the `vectorizer.get_feature_names()` method (`get_feature_names_out()` in newer scikit-learn versions). Next time, pass the words as a list to the `vocabulary` parameter when constructing the `CountVectorizer`. – Itachi Mar 29 '18 at 06:14
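The plain-text alternative from the last comment could look roughly like this. The toy documents and the file name `vocab.txt` are placeholders. The sketch reads the words from the `vocabulary_` dictionary, sorted by column index, rather than from `get_feature_names()`, so that it does not depend on the scikit-learn version:

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the training keywords.
train_docs = ["stock market rises", "football match tonight"]

vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(train_docs)

# vocabulary_ maps word -> column index; sort the words by that index.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Placeholder location; one word per line.
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(words))

# --- later, in the prediction script ---
with open(path, encoding="utf-8") as f:
    saved_words = f.read().splitlines()

# A vectorizer built with a fixed vocabulary can transform() without fit().
restored = CountVectorizer(vocabulary=saved_words, stop_words="english")
X_new = restored.transform(["market tonight"])
print(X_new.toarray())
```

Unlike pickling, this stores only the vocabulary. Any other settings used at training time (stop words, tokenizer, n-gram range, and so on) have to be passed again by hand when the vectorizer is recreated.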