Python sklearn : fit_transform() does not work for GridSearchCV

Question

I am creating a GridSearchCV classifier as follows:

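from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

parameters = {}

gridSearchClassifier = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
# Fit/train the gridSearchClassifier on the training set
gridSearchClassifier.fit(Xtrain, ytrain)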

This works well, and I can predict. However, now I want to retrain the classifier. For this I want to do a fit_transform() on some feedback data.

    gridSearchClassifier.fit_transform(Xnew, yNew)

But I get this error

AttributeError: 'GridSearchCV' object has no attribute 'fit_transform'

Basically, I am trying to call fit_transform() on the classifier's internal TfidfVectorizer. I know that I can access the Pipeline's internal components using the named_steps attribute. Can I do something similar for the gridSearchClassifier?

lejlot · Accepted Answer · 2015-12-31T16:03:09.220

4

Just call them step by step.

gridSearchClassifier.fit(Xnew, yNew)
transformed = gridSearchClassifier.transform(Xnew)

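fit_transform is nothing more than these two lines of code; it is simply not implemented as a single method for GridSearchCV. Note that transform is only available here when the refitted best estimator itself implements transform.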

update

From the comments it seems that you are a bit lost about what GridSearchCV actually does. It is a meta-method for fitting a model over multiple hyperparameter settings. Thus, once you call fit, you get an estimator in the best_estimator_ attribute of your object. In your case it is a pipeline, and you can extract any part of it as usual:

gridSearchClassifier.fit(Xtrain, ytrain)
clf = gridSearchClassifier.best_estimator_
# do something with clf, its elements etc.
# for example: print(clf.named_steps['vect'])

You should not use GridSearchCV as a classifier; it is only a method for fitting hyperparameters, and once you have found them you should work with best_estimator_ instead. However, remember that if you refit the TF-IDF vectorizer, your classifier will be useless: you cannot change the data representation and expect the old model to work well. You have to refit the whole classifier once your data changes (unless this is a carefully designed change and you make sure the old dimensions mean exactly the same thing; sklearn does not support such operations, so you would have to implement this from scratch).
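To make the point about the data representation concrete, here is a minimal sketch. The toy data and the small `clf__C` grid are assumptions made up for illustration, and `stop_words` is left out to keep the vocabulary easy to count. It pulls the fitted steps out of best_estimator_ and compares the feature space of the fitted vectorizer with that of a vectorizer refitted on new data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch only.
X_train = ["good movie", "great film", "bad movie", "awful film",
           "great acting", "awful plot", "good plot", "bad acting"]
y_train = [1, 1, 0, 0, 1, 0, 1, 0]
X_new = ["terrible soundtrack", "wonderful soundtrack"]

pipeline = Pipeline([
    ("vect", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression()),
])
# Hypothetical grid; the question used an empty parameter dict.
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0]}, cv=2, scoring="accuracy")
grid.fit(X_train, y_train)

# The refitted pipeline and its already fitted steps.
best = grid.best_estimator_
vect = best.named_steps["vect"]
clf = best.named_steps["clf"]

# transform() reuses the learned vocabulary, so the width stays the same.
n_features_old = len(vect.vocabulary_)
X_new_vec = vect.transform(X_new)

# A vectorizer refitted on the new data learns a different vocabulary,
# so its columns no longer line up with the weights in clf.coef_.
refit_vect = TfidfVectorizer(sublinear_tf=True).fit(X_new)
n_features_new = len(refit_vect.vocabulary_)

print(n_features_old, n_features_new, clf.coef_.shape)
```

The classifier's weight vector has one entry per column of the original vocabulary, which is why a refitted vectorizer cannot be paired with the old classifier.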

edited Dec 31 '15 at 16:03

answered Dec 31 '15 at 15:51

lejlot


This will refit the whole model, exactly as 'fit_transform' would. – lejlot Dec 31 '15 at 15:58
I did gridSearchClassifier.fit(Xtrain, ytrain) followed by gridSearchClassifier = gridSearchClassifier.transform(Xtrain), and when I tried to access gridSearchClassifier.best_score_ I got the error AttributeError: best_score_ not found – AbtPst Dec 31 '15 at 16:00
Is there a way, like named_steps, to access the internal TfidfVectorizer? – AbtPst Dec 31 '15 at 16:01
Thanks, now it makes more sense. I guess it would be better if I use the TfidfVectorizer separately. – AbtPst Dec 31 '15 at 16:07
So here is my problem. I fit my GridSearchCV object on the training set, and let's say it takes 2 hours. Now I want to update the training set with some new examples. If what you are saying is correct, then I will have to fit my GridSearchCV object again on the original training set plus the new examples. Is there any way to avoid this? I would really like to fit only on the new examples. – AbtPst Dec 31 '15 at 19:37

You would need a classifier with online learning capabilities (like SGDClassifier from sklearn) and either a "frozen" tfidf or one modified "by hand", so that the previous dimensions stay the same as before and you only add new ones; you would then manually feed the last classifier in as the starting point of the new one, with the weights for the new dimensions set to 0. In general, incremental learning is not simple in production. – lejlot Dec 31 '15 at 20:14
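A minimal sketch of the "frozen" tfidf variant mentioned in the comment above, assuming made-up toy data and SGDClassifier with its default loss. The vectorizer is fitted once and afterwards only used through transform(), so both the vocabulary and the idf weights stay fixed. Words in the feedback data that were not seen at fit time are ignored. The "by hand" vocabulary extension is not covered here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data, invented for this sketch only.
X_train = ["good movie", "great film", "bad movie", "awful film"]
y_train = [1, 1, 0, 0]
X_feedback = ["great movie", "awful movie"]
y_feedback = [1, 0]

# Fit the vectorizer once; from here on it is "frozen".
vect = TfidfVectorizer(sublinear_tf=True)
X_train_vec = vect.fit_transform(X_train)

# The first partial_fit call must be told the full set of class labels.
clf = SGDClassifier(random_state=0)
clf.partial_fit(X_train_vec, y_train, classes=np.array([0, 1]))

# Later: update the model on the feedback data only, reusing the
# frozen vocabulary so the columns mean the same as before.
X_feedback_vec = vect.transform(X_feedback)
clf.partial_fit(X_feedback_vec, y_feedback)

predictions = clf.predict(vect.transform(["good film"]))
```

Hyperparameters are not re-tuned in this sketch; whether a single incremental pass is good enough depends on the data.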

David Maust · Answer 2 · 2015-12-31T16:12:07.797

2

@lejlot is correct that you should call fit() on the gridSearchClassifier.

Provided refit=True is set on the GridSearchCV, which is the default, you can access best_estimator_ on the fitted gridSearchClassifier.

You can access the already fitted steps:

tfidf = gridSearchClassifier.best_estimator_.named_steps['vect']
clf = gridSearchClassifier.best_estimator_.named_steps['clf']

You can then transform new text in new_X using:

X_vec = tfidf.transform(new_X)

You can make predictions using this X_vec with:

x_pred = clf.predict(X_vec)

You can also make predictions for the text by going through the entire pipeline with:

X_pred = gridSearchClassifier.predict(new_X)
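Put together, the two routes can be compared in a small self-contained sketch. The toy data and the `clf__C` grid are assumptions for illustration, not part of the original question. Predicting step by step through the extracted vectorizer and classifier should give the same labels as calling predict on the fitted GridSearchCV object, and best_score_ stays available because the grid search object itself is never overwritten.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch only.
X_train = ["good movie", "great film", "bad movie", "awful film",
           "great acting", "awful plot", "good plot", "bad acting"]
y_train = [1, 1, 0, 0, 1, 0, 1, 0]
new_X = ["great plot", "bad film"]

pipeline = Pipeline([
    ("vect", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression()),
])
# refit=True is the default; it is spelled out here for clarity.
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0]}, cv=2,
                    scoring="accuracy", refit=True)
grid.fit(X_train, y_train)

# Route 1: use the fitted steps one at a time.
tfidf = grid.best_estimator_.named_steps["vect"]
clf = grid.best_estimator_.named_steps["clf"]
step_pred = clf.predict(tfidf.transform(new_X))

# Route 2: let the grid search object delegate to the whole pipeline.
pipe_pred = grid.predict(new_X)

print(step_pred, pipe_pred, grid.best_score_)
```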

edited Dec 31 '15 at 16:12

answered Dec 31 '15 at 16:01

David Maust


You mean like gridSearchClassifier.fit(Xnew, yNew) followed by gridSearchClassifier.best_estimator_.named_steps['vect'].transform(Xnew)? – AbtPst Dec 31 '15 at 16:03
I also want to use the gridSearchClassifier to get gridSearchClassifier.best_score_ – AbtPst Dec 31 '15 at 16:04
Basically, once the fit and transform have been completed, I want to use the classifier to predict and do other things. – AbtPst Dec 31 '15 at 16:05
Yes, exactly. Provided `refit=True` is passed, as in `GridSearchCV(clf, params, refit=True)`, you can call transform on any of the transformation steps, or you can call predict on the final estimator step. – David Maust Dec 31 '15 at 16:07
@AbtPst, I updated my answer with more examples of how the fitted gridSearchClassifier can be used. – David Maust Dec 31 '15 at 16:12
