I want to use GridSearchCV for parameter tuning. Is it also possible to check with GridSearchCV whether CountVectorizer or TfidfVectorizer works best? My idea:

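from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    # The vectorizer step is named 'vect', so its parameters use that prefix.
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2', None),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}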

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)

My idea: CountVectorizer is the same as TfidfVectorizer with use_idf=False and norm=None. If GridSearchCV returns those parameters as the best result, then CountVectorizer is the best option. Is that correct?

Thank you in advance :)

yatu
Abtc

1 Answer

Once a step is in the Pipeline under a given name, you can refer to it by that name in the parameter grid, either to tune its parameters (e.g. `vect__max_df`) or to replace the whole step with another estimator, in this case a different vectorizer (`'vect': (CountVectorizer(),)`). GridSearchCV also accepts a list of grids for a single pipeline:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
parameters = [{
    # Grid 1: keep the TfidfVectorizer from the pipeline and tune it.
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2', None),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}, {
    # Grid 2: replace the 'vect' step with a CountVectorizer.
    'vect': (CountVectorizer(),),
    # count_vect_params...
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}]

grid_search = GridSearchCV(pipeline, parameters)
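
To see which vectorizer won after fitting, one option is to name the vectorizer explicitly in every grid, so that it appears in `best_params_`, and then inspect the `vect` step of `best_estimator_`. The following is only a minimal sketch: the toy corpus, the labels, the reduced grids and `cv=2` are assumptions made up for illustration, not part of the original setup.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    "good movie", "great film", "nice plot", "good acting",
    "bad movie", "awful film", "poor plot", "bad acting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Listing the vectorizer in each grid makes 'vect' show up in best_params_.
parameters = [
    {"vect": [TfidfVectorizer()], "vect__use_idf": [True, False]},
    {"vect": [CountVectorizer()], "vect__ngram_range": [(1, 1), (1, 2)]},
]

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(docs, labels)

# The refitted best pipeline exposes the winning vectorizer by step name.
best_vect = grid_search.best_estimator_.named_steps["vect"]
print(type(best_vect).__name__)
print(grid_search.best_params_)
```

On a corpus this small the scores are not meaningful; the point is only where to look for the selected vectorizer.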
yatu
  • Just out of curiosity: how is this different from fitting `GridSearchCV` on two different pipelines? – anddt Oct 08 '20 at 08:35
  • Well, this really amounts to having two different grids of parameters. I think it's cleaner to have a single pipeline though, since there is this possibility @anddt – yatu Oct 08 '20 at 08:36
  • Absolutely, I was wondering if this had any practical implication. – anddt Oct 08 '20 at 09:43
  • Hi, I was just trying to implement the solution. Since I also include feature selection with SelectK and several parameters for my classifier, this resulted in 4608000 fits, so my program crashed. But other than that the solution worked for me. Thank you – Abtc Oct 08 '20 at 09:48
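
The number of fits mentioned in the last comment can be estimated before launching a search: it is the number of parameter combinations across all grids multiplied by the number of CV folds. A rough sketch using `ParameterGrid`, with hypothetical grids that are much smaller than the ones discussed above:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grids for illustration only; the key names follow the
# 'step__parameter' convention used in the pipeline above.
parameters = [
    {
        "vect__max_df": (0.5, 0.75, 1.0),
        "vect__use_idf": (True, False),
        "clf__alpha": (1e-5, 1e-6),
    },
    {
        "clf__alpha": (1e-5, 1e-6),
        "clf__max_iter": (10, 50, 80),
    },
]
cv = 5

# Combinations add up across grids and multiply within each grid.
n_candidates = len(ParameterGrid(parameters))  # 3*2*2 + 2*3 = 18
n_fits = n_candidates * cv
print(n_candidates, n_fits)  # 18 90
```

If the count is too large, trimming the grids or switching to `RandomizedSearchCV`, which samples a fixed number of candidates set by `n_iter`, may keep the search tractable.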