
We all know the common approach: define a pipeline with a dimensionality reduction technique followed by a model for training and testing, then apply GridSearchCV for hyperparameter tuning.

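from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

grid = GridSearchCV(
    Pipeline([
        ('reduce_dim', PCA()),
        ('classify', RandomForestClassifier(n_jobs=-1))
    ]),
    param_grid=[
        {
            # range() only accepts integers, so the float values are listed explicitly;
            # a float n_components in (0, 1) is the fraction of variance PCA keeps
            'reduce_dim__n_components': [0.7, 0.8],
            'classify__n_estimators': range(10, 50, 5),
            # 'auto' meant 'sqrt' for classifiers and is rejected by recent scikit-learn releases
            'classify__max_features': ['sqrt', 0.2],
            'classify__min_samples_leaf': [40, 50, 60],
            'classify__criterion': ['gini', 'entropy']
        }
    ],
    cv=5, scoring='f1')
# X, y: your own features and labels ('f1' scoring assumes a binary target)
grid.fit(X, y)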

I understand the code above.

Today I was going through the documentation and found a piece of code that looks a bit strange.

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),                        # How does this work??
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,   ### No PCA is used..??
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
X, y = load_digits(return_X_y=True)
grid.fit(X, y)

  1. First of all, while defining the pipeline, it uses the string 'passthrough' instead of an estimator object.

        ('reduce_dim', 'passthrough'),

  2. Then, while defining the different dimensionality reduction techniques for the grid search, it uses a different strategy. How does [PCA(iterated_power=7), NMF()] work?

        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,  # here

Could someone please explain this code to me?

Solved. In one line: the grid search tries the reducers one at a time, as separate candidates, in the order ['PCA', 'NMF', 'KBest(chi2)'].

Courtesy of seralouk (see the answer below).

For reference, if someone is looking for more details: 1 2 3

desertnaut
teddcp

1 Answer


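As far as I know, using 'passthrough' and filling the step in from param_grid is equivalent to putting the estimator directly in the pipeline.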


In the documentation you have this:

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', 'passthrough'),
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]

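Initially the step is ('reduce_dim', 'passthrough'), a placeholder that leaves the data untouched, and then the param_grid entry 'reduce_dim': [PCA(iterated_power=7), NMF()] replaces it with each listed estimator.

So the PCA is actually defined in that second line, inside param_grid.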
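To make the replacement concrete, here is a minimal sketch of what I understand the grid search to do for a single candidate: it clones the pipeline and calls set_params with the step name as the key. The variable names `candidate` and `X_reduced` are only for illustration, and the choice of 4 components is arbitrary.

```python
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ('reduce_dim', 'passthrough'),  # placeholder step, does nothing
    ('classify', LinearSVC(dual=False, max_iter=10000)),
])

# Right now the step really is just the string 'passthrough'.
print(pipe.named_steps['reduce_dim'])

# One grid candidate: 'reduce_dim' replaces the whole step, and
# 'reduce_dim__n_components' then sets a parameter on the new estimator.
candidate = clone(pipe).set_params(
    reduce_dim=PCA(iterated_power=7),
    reduce_dim__n_components=4,
)
candidate.fit(X, y)

# The fitted reducer now maps the 64 pixel features down to 4 components.
X_reduced = candidate.named_steps['reduce_dim'].transform(X)
print(X_reduced.shape)
```

The original `pipe` is left unchanged because the parameters are set on a clone, which is why the placeholder can be reused for every candidate.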


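Alternatively, you could put PCA directly in the pipeline (note that this variant no longer searches over NMF):

pipe = Pipeline([
    # reduce_dim starts as PCA; the second param_grid entry swaps in SelectKBest
    ('reduce_dim', PCA(iterated_power=7)),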
    ('classify', LinearSVC(dual=False, max_iter=10000))
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
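If it helps, the candidates that the documentation's grid expands to can be listed without fitting anything, using ParameterGrid from sklearn.model_selection. This is only a sketch to inspect the grid; the names `candidates` and `counts` are mine. Each candidate carries exactly one reducer, so PCA, NMF and SelectKBest are alternatives and are never chained one after another.

```python
from collections import Counter

from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import ParameterGrid

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]

# Every element is one dict of parameters, i.e. one pipeline configuration.
candidates = list(ParameterGrid(param_grid))

# Count how many candidates use each reducer class.
counts = Counter(type(c['reduce_dim']).__name__ for c in candidates)

print(len(candidates))  # 2*3*4 + 1*3*4 = 36
print(dict(counts))     # 12 candidates per reducer
```

Each dictionary in the list is expanded on its own, which is why 'reduce_dim__k' is only ever combined with SelectKBest and 'reduce_dim__n_components' only with PCA or NMF.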
seralouk
  • So, later it assigns an object in place of 'passthrough'. But how does `'reduce_dim': [PCA(iterated_power=7), NMF()]` work? Does grid search try them one by one? – teddcp Jun 05 '20 at 13:15
  • It uses both! To reduce the dimensions – seralouk Jun 05 '20 at 13:16
  • So it will first apply PCA and then NMF to the dataset. Then it will check with `SelectKBest(chi2)`. Whichever gets the higher score, it will select that, right? – teddcp Jun 05 '20 at 13:18
  • Exactly. The order is `['PCA', 'NMF', 'KBest(chi2)']` – seralouk Jun 05 '20 at 13:19
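
For anyone who wants to check which reducer ends up selected, a small sketch follows. It assumes a deliberately reduced grid (PCA and SelectKBest only, two sizes each, default C, cv=3) so that it runs quickly; it is not the full grid from the documentation. Each reducer is scored on its own, and the best-scoring candidate is stored in best_params_.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ('reduce_dim', 'passthrough'),
    ('classify', LinearSVC(dual=False, max_iter=10000)),
])

# Reduced grid (an assumption made for speed): 2 + 2 = 4 candidates.
param_grid = [
    {'reduce_dim': [PCA(iterated_power=7)], 'reduce_dim__n_components': [4, 8]},
    {'reduce_dim': [SelectKBest(chi2)], 'reduce_dim__k': [4, 8]},
]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X, y)

# One row per candidate: the reducer it used and its mean cross-validated score.
results = grid.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(type(params['reduce_dim']).__name__, round(float(score), 3))

# The reducer of the best-scoring candidate.
best_reducer = type(grid.best_params_['reduce_dim']).__name__
print(best_reducer)
```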