I trained an estimator with this:
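from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def train_estimator(feature_list, expected_values, k=5):
    # The vectorizer reads each document from a file path, then feeds counts to Naive Bayes
    pipeline = Pipeline([('vect', CountVectorizer(input='filename', stop_words='english')),
                         ('clf', MultinomialNB())])
    parameters = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
                  'vect__min_df': [0.001, 0.01, 0.02, 0.05, 0.1],
                  'vect__max_df': [0.85, 0.90, 0.95, 0.99, 1.0],
                  'clf__alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 1.0]}
    # k-fold grid search scored by ROC AUC; refit=True retrains the best pipeline on all data
    gs_clf = GridSearchCV(pipeline, parameters, n_jobs=6, cv=k, verbose=1, refit=True, scoring='roc_auc')
    gs_clf.fit(feature_list, expected_values)
    return gs_clf.best_estimator_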
Now I need to classify some text with this estimator, but it's not clear how to vectorize the text properly. I need to vectorize text and then call estimator.predict() with the resulting vector. The catch is that this vector must agree with the vectors used to train the estimator (the word foobar must have the same index it had in the training vectors). It's not clear from the documentation how to vectorize text in such a fashion.
How do I write this predict() function?
def predict(estimator, text):
    # Vectorize text and call estimator.predict()
EDIT
feature_list and expected_values were made as follows:
from random import random

import numpy as np

def fetch_training_set(doc_iterator):
    files, labels = list(), list()
    for row in doc_iterator:
        # Write each document to its own file, since the vectorizer uses input='filename'
        filename = 'somepath/%s.txt' % random()
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(row['text'])
        files.append(filename)
        labels.append(row['label'])
    feature_list = np.array(files)
    expected_values = np.array(labels)
    return feature_list, expected_values
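
Since the training inputs are file paths, a predict() along the following lines might be enough. This is only a sketch under assumptions: it assumes estimator is the fitted Pipeline returned by train_estimator, so its 'vect' step already holds the training vocabulary and the word-to-index mapping is reused automatically. It also assumes the vectorizer is still configured with input='filename', which is why the text goes through a temporary file first. The tempfile handling is my own choice here, not something the pipeline requires.

```python
import os
import tempfile


def predict(estimator, text):
    """Classify one raw string with a fitted Pipeline whose vectorizer reads file paths."""
    # Assumption: estimator is a fitted Pipeline, so no manual vectorizing is needed.
    # The pipeline applies its own fitted vectorizer before the classifier.
    fd, path = tempfile.mkstemp(suffix='.txt')
    try:
        # Write the text to a temporary file because the vectorizer expects a filename.
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(text)
        # predict() takes an iterable of documents and returns one label per document.
        return estimator.predict([path])[0]
    finally:
        # Clean up the temporary file whether or not prediction succeeded.
        os.remove(path)
```

Usage would then look like label = predict(train_estimator(feature_list, expected_values), 'some text'). Switching the fitted vectorizer to raw strings with estimator.set_params(vect__input='content') should also work and would avoid the temporary file, but it changes the estimator in place.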