How to classify text with an estimator?

Question

I trained an estimator with this:

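from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def train_estimator(feature_list, expected_values, k=5):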
    pipeline = Pipeline([('vect', CountVectorizer(input='filename', stop_words='english')),
                         ('clf', MultinomialNB())])

    parameters = {'vect__ngram_range':[(1, 1), (1, 2), (1, 3)],
                  'vect__min_df':[0.001, 0.01, 0.02, 0.05, 0.1],
                  'vect__max_df':[0.85, 0.90, 0.95, 0.99, 1.0],
                  'clf__alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 1.0]}

    gs_clf = GridSearchCV(pipeline, parameters, n_jobs=6, cv=k, verbose=1, refit=True, scoring='roc_auc')
    gs_clf.fit(feature_list, expected_values)

    return gs_clf.best_estimator_

Now I need to classify some text with this estimator, but it's not clear how to vectorize the text properly.

I need to vectorize the text and then call estimator.predict() with the vector. The thing is, this vector must agree with the vectors used to train the estimator (the word foobar must map to the same index as it did in the training vectors). It's not clear from the documentation how to vectorize text in such a fashion.

How do I write this predict() function?

def predict(estimator, text):
    # Vectorize text and call estimator.predict()

EDIT

feature_list and expected_values were made as follows:

from random import random

import numpy as np

def fetch_training_set(doc_iterator):
    files, labels = list(), list()
    for row in doc_iterator:
        filename = 'somepath/%s.txt' % random()
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(row['text'])

        files.append(filename)
        labels.append(row['label'])

    feature_list = np.array(files)
    expected_values = np.array(labels)

    return feature_list, expected_values

@Vivek Kumar `feature_list` is a list of file names. Each file contains text. — Jay, Mar 09 '17 at 18:50
@Vivek Kumar `feature_list` is actually an `np.array()` of file names. I updated the question with more info. — Jay, Mar 09 '17 at 19:05

score 0 · Answer 1 · answered Mar 10 '17 at 10:46

I think wrapping this in your extra functions train_estimator and predict makes things more complex than they need to be.

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=6, cv=k, verbose=1, refit=True, scoring='roc_auc')
gs_clf.fit(feature_list, expected_values)
gs_clf.predict(your_data)

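will do the job (the last line does the prediction). Since you set refit=True, GridSearchCV refits the pipeline on the full training set with the best parameters it found, and gs_clf.predict delegates to that refitted pipeline. The pipeline calls transform on each transformer step (here the CountVectorizer, which reuses the vocabulary learned during fit) and then predict on the final step (MultinomialNB). So you do not need to vectorize anything yourself: the word-to-index mapping is already stored in the fitted vectorizer.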
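To make that concrete, here is a minimal sketch of the predict() function from the question, assuming the pipeline keeps CountVectorizer(input='filename') as in the question. Because the vectorizer then expects file paths, the sketch writes the text to a temporary file first. fit_toy_pipeline and its four tiny documents are hypothetical, only there to make the example runnable; a best_estimator_ returned by train_estimator would be passed to predict() the same way.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def predict(estimator, text):
    """Classify one raw string with a pipeline whose vectorizer reads files."""
    # The fitted CountVectorizer(input='filename') expects paths, not strings,
    # so the text is written to a temporary file first.
    fd, path = tempfile.mkstemp(suffix='.txt')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(text)
        # The pipeline vectorizes with the vocabulary learned during fit,
        # then passes the resulting vector to the classifier.
        return estimator.predict([path])[0]
    finally:
        os.remove(path)


def fit_toy_pipeline(directory):
    """Hypothetical stand-in for train_estimator: fits on four tiny files."""
    docs = [("good great excellent film", 1),
            ("great excellent acting", 1),
            ("bad awful boring film", 0),
            ("awful boring plot", 0)]
    files, labels = [], []
    for i, (text, label) in enumerate(docs):
        path = os.path.join(directory, 'doc%d.txt' % i)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        files.append(path)
        labels.append(label)
    pipeline = Pipeline([('vect', CountVectorizer(input='filename')),
                         ('clf', MultinomialNB())])
    return pipeline.fit(files, labels)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        toy_estimator = fit_toy_pipeline(tmp)
        print(predict(toy_estimator, "great excellent"))
```

The temporary file is only needed because of input='filename'. If the text is already on disk, passing a list of its paths to estimator.predict directly should be enough.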

geompalik


what's `your_data`? is it text or a feature vector? – Jay Mar 10 '17 at 19:32
Normally text. But since your CountVectorizer is set to input='filename', I would try a filename as well. – geompalik Mar 11 '17 at 15:35
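On the text-versus-filename point, one possible workaround (a sketch, not something confirmed in this thread) is to copy the fitted pipeline and switch the vectorizer's input parameter to 'content', so the copy accepts raw strings while reusing the fitted vocabulary. This relies on my understanding that CountVectorizer reads its input setting at transform time. as_text_estimator and fit_toy_pipeline are hypothetical helper names, and the step name 'vect' is taken from the question's Pipeline.

```python
import copy
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def as_text_estimator(estimator):
    """Return a copy of a fitted pipeline that accepts raw strings."""
    # deepcopy keeps the fitted vocabulary and classifier state;
    # sklearn.base.clone would discard them.
    text_estimator = copy.deepcopy(estimator)
    # 'vect' is the vectorizer's step name in the question's Pipeline.
    text_estimator.set_params(vect__input='content')
    return text_estimator


def fit_toy_pipeline(directory):
    """Hypothetical stand-in for train_estimator: fits on four tiny files."""
    docs = [("good great excellent film", 1),
            ("great excellent acting", 1),
            ("bad awful boring film", 0),
            ("awful boring plot", 0)]
    files, labels = [], []
    for i, (text, label) in enumerate(docs):
        path = os.path.join(directory, 'doc%d.txt' % i)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        files.append(path)
        labels.append(label)
    pipeline = Pipeline([('vect', CountVectorizer(input='filename')),
                         ('clf', MultinomialNB())])
    return pipeline.fit(files, labels)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        file_estimator = fit_toy_pipeline(tmp)
        text_estimator = as_text_estimator(file_estimator)
        # The copy takes strings; the original still expects file paths.
        print(text_estimator.predict(["great excellent"]))
```

Copying first leaves the original estimator untouched, so it can still be used with file names.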
