
I'm using GridSearchCV and a pipeline to classify some text documents. A code snippet:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search before 0.18
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])
parameters = {'vect__ngram_range': [(1, 2)], 'vect__min_df': [2], 'vect__stop_words': ['english'],
              'vect__lowercase': [True], 'vect__norm': ['l2'], 'vect__analyzer': ['word'], 'vect__binary': [True],
              'clf__kernel': ['rbf'], 'clf__C': [100], 'clf__gamma': [0.01], 'clf__probability': [True]}
grid_search = GridSearchCV(clf, parameters, n_jobs=-2, refit=True, cv=10)
grid_search.fit(corpus, labels)

My problem is that when I call grid_search.predict_proba(new_doc) and then try to find out which classes the probabilities correspond to with grid_search.classes_, I get the following error:

AttributeError: 'GridSearchCV' object has no attribute 'classes_'

What have I missed? I thought that if the last "step" in the pipeline was a classifier, then the return of GridSearchCV is also a classifier. Hence one can use the attributes of that classifier, e.g. classes_.

ouflak
Josefine

2 Answers


As mentioned in the comments, grid_search.best_estimator_.classes_ raised an error for me: best_estimator_ is a Pipeline, and in my version the pipeline has no classes_ attribute. However, by first accessing the classifier step of the pipeline, I was able to read its classes_ attribute. Here is the solution:

grid_search.best_estimator_.named_steps['clf'].classes_
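
To make the lookup concrete, here is a minimal sketch that pairs each probability with its class label. The toy corpus, the labels, the reduced parameter grid, and the variable names are assumptions made up for illustration; they are not the data from the question. The import path assumes a scikit-learn release that provides sklearn.model_selection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, only for illustration.
corpus = [
    "buy cheap pills now", "cheap watches buy now", "buy cheap offer today",
    "win money buy now", "cheap money offer now", "buy pills win money",
    "meeting agenda for monday", "project meeting notes monday",
    "agenda for project review", "notes from monday review",
    "project review meeting agenda", "monday notes for project",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SVC(probability=True, random_state=0)),
])
grid_search = GridSearchCV(pipe, {"clf__C": [1, 100]}, cv=3, refit=True)
grid_search.fit(corpus, labels)

# The fitted classifier is the 'clf' step of the refitted pipeline.
classes = grid_search.best_estimator_.named_steps["clf"].classes_

# Column i of predict_proba corresponds to classes[i].
probabilities = grid_search.predict_proba(["buy cheap pills"])[0]
class_probabilities = dict(zip(classes, probabilities))
print(class_probabilities)
```

The columns returned by predict_proba follow the order of classes_, so zipping the two gives a label-to-probability mapping for a single document.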
Josefine

Try grid_search.best_estimator_.classes_.

The return of GridSearchCV is a GridSearchCV instance which is not really an estimator itself. Rather, it instantiates a new estimator for each parameter combination it tries (see the docs).

You may think the return value is a classifier because you can use methods such as predict or predict_proba when refit=True, but the GridSearchCV.predict_proba actually looks like (spoiler from the source):

def predict_proba(self, X):
    """Call predict_proba on the estimator with the best found parameters.
    Only available if ``refit=True`` and the underlying estimator supports
    ``predict_proba``.
    Parameters
    -----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.
    """
    return self.best_estimator_.predict_proba(X)
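
The delegation pattern in that snippet can be sketched in plain Python. TinyClassifier and TinySearch below are hypothetical stand-ins, not scikit-learn classes; they only illustrate why a forwarded method works on the wrapper while an attribute that is not forwarded does not.

```python
class TinyClassifier:
    """Hypothetical stand-in for a fitted classifier."""

    classes_ = ["ham", "spam"]

    def predict_proba(self, X):
        # Return a fixed, uniform probability row per input sample.
        return [[0.5, 0.5] for _ in X]


class TinySearch:
    """Hypothetical wrapper that forwards one method explicitly."""

    def __init__(self, estimator):
        self.best_estimator_ = estimator

    def predict_proba(self, X):
        # Explicit delegation, as in the predict_proba shown above.
        return self.best_estimator_.predict_proba(X)


search = TinySearch(TinyClassifier())
probabilities = search.predict_proba(["some document"])

# classes_ is not forwarded by the wrapper, so it must be read
# from the wrapped estimator instead.
has_classes = hasattr(search, "classes_")
classes = search.best_estimator_.classes_
```

In this sketch, search.classes_ would raise an AttributeError of the same kind as in the question, while search.best_estimator_.classes_ succeeds.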

Hope this helps.

ldirer
  • `grid_search.best_estimator_.classes_` did not work. I got an error saying the pipeline did not have an attribute called classes_. However, I managed to find a solution (see the answer). – Josefine Jul 21 '15 at 08:17
  • Ok. I thought this would be the case but it turned out to work for me with an example similar to yours. `grid_search.best_estimator_` is a Pipeline object but I can still get `grid_search.best_estimator_.classes_`. I am using the development version though. Alternatively you can access each step of a pipeline using the `steps` attribute: `dict(grid_search.best_estimator_.steps)["clf"].classes_` should work for you. – ldirer Jul 21 '15 at 08:26
  • Ok, then maybe that's the difference. The solution I found earlier was almost the same: I used named_steps directly instead of building a dict from the steps attribute (see the answer). Thanks for the help! – Josefine Jul 21 '15 at 08:52