I trained an estimator with this:

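from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def train_estimator(feature_list, expected_values, k=5):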
    pipeline = Pipeline([('vect', CountVectorizer(input='filename', stop_words='english')),
                         ('clf', MultinomialNB())])

    parameters = {'vect__ngram_range':[(1, 1), (1, 2), (1, 3)],
                  'vect__min_df':[0.001, 0.01, 0.02, 0.05, 0.1],
                  'vect__max_df':[0.85, 0.90, 0.95, 0.99, 1.0],
                  'clf__alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 1.0]}

    gs_clf = GridSearchCV(pipeline, parameters, n_jobs=6, cv=k, verbose=1, refit=True, scoring='roc_auc')
    gs_clf.fit(feature_list, expected_values)

    return gs_clf.best_estimator_

Now I need to classify some text with this estimator, but it's not clear how to vectorize the text properly.

I need to vectorize the text and then call estimator.predict() with the resulting vector. The catch is that this vector must agree with the vectors used to train the estimator (the word foobar must map to the same index as it did during training). It's not clear from the documentation how to vectorize text this way.

How do I write this predict() function?

def predict(estimator, text):
    # Vectorize text and call estimator.predict()

EDIT

feature_list and expected_values were made as follows:

from random import random

import numpy as np

def fetch_training_set(doc_iterator):
    files, labels = list(), list()
    for row in doc_iterator:
        filename = 'somepath/%s.txt' % random()
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(row['text'])

        files.append(filename)
        labels.append(row['label'])

    feature_list = np.array(files)
    expected_values = np.array(labels)

    return feature_list, expected_values
Jay

1 Answer

I think wrapping this in the extra functions train_estimator and predict makes things more complicated than they need to be.

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=6, cv=k, verbose=1, refit=True, scoring='roc_auc')
gs_clf.fit(feature_list, expected_values)
gs_clf.predict(your_data)

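will do the job (the last line makes the prediction). Because refit=True, GridSearchCV refits the pipeline on the full training set with the best parameters the grid search found. gs_clf.predict then delegates to that refit pipeline, which runs the fitted vectorizer's transform followed by the classifier's predict, so each word maps to the same index it had during training. Note that the vectorizer was created with input='filename', so your_data has to be a sequence of file paths rather than raw strings.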
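
If you still want a predict(estimator, text) helper that takes a raw string, one possible sketch is below. It assumes the estimator is the pipeline from the question (or gs_clf.best_estimator_), whose vectorizer reads documents from disk, so it writes the text to a temporary file first. fit_toy_estimator and its four made-up documents are hypothetical and exist only so the example runs on its own; a plain Pipeline stands in for the grid-searched one.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def predict(estimator, text):
    """Classify one raw string with a pipeline whose vectorizer expects filenames."""
    # The vectorizer was built with input='filename', so it reads each
    # document from disk; write the text to a temporary file first.
    fd, path = tempfile.mkstemp(suffix='.txt')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(text)
        # Pipeline.predict runs the fitted vectorizer's transform() and then
        # the classifier's predict(), reusing the training vocabulary.
        return estimator.predict([path])[0]
    finally:
        os.remove(path)


def fit_toy_estimator(directory):
    """Hypothetical stand-in for the trained estimator, fitted on made-up data."""
    docs = ['buy cheap pills now', 'buy cheap watches',
            'meeting agenda attached', 'project meeting tomorrow']
    labels = [1, 1, 0, 0]
    paths = []
    for i, doc in enumerate(docs):
        path = os.path.join(directory, '%d.txt' % i)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(doc)
        paths.append(path)
    pipeline = Pipeline([('vect', CountVectorizer(input='filename', stop_words='english')),
                         ('clf', MultinomialNB())])
    return pipeline.fit(paths, labels)
```

With the real model this would be called as predict(gs_clf.best_estimator_, 'some text') or predict(gs_clf, 'some text'). There is no separate vectorizing step to write, because the fitted CountVectorizer inside the pipeline already holds the training vocabulary.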

geompalik