9

I have a problem where I'd like to test multiple models that don't all have the same named parameters. How would you use a list of parameters for a pipeline in RandomizedSearchCV like you can use in this example with GridSearchCV?

Example from:
https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', None),
    ('classify', LinearSVC())
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]

grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)
Tryph
PL3
  • Have you found a solution? – Simon Hessner Aug 31 '18 at 11:36
  • I never did find one already implemented, unfortunately. It seems less difficult to implement myself now, though. You would need a function that accepts a dict of input parameters (possibly a dict keyed by model, with each value being a dict of that model's parameters) and returns the CV score. You probably want to set up the CV train/test splits first so each experiment uses the same data. Then I think you just need an iterator over random permutations of the parameters that calls the evaluation function and stores the results. – PL3 Sep 01 '18 at 13:47
  • "I'd like to test multiple models that don't all have the same named parameters. " Your example code does not demonstrate this requirement. – Bert Kellerman Nov 19 '18 at 07:34
  • @BertKellerman yes it does: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/decomposition/pca.py#L126 https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/feature_selection/univariate_selection.py#L464 – PL3 Nov 19 '18 at 14:24
  • I see. You want to search over different Transformers. The way I've done this is by making wrapper classes for the Transformers that have a boolean `enabled` parameter, then including them all in the Pipeline. If a transformer wrapper is not enabled, its `fit` and `transform` do nothing. I can post code if you want. – Bert Kellerman Nov 19 '18 at 16:21
  • I'm no longer working on this problem (new job) so it won't help me, but it may be useful to others that stumble on this question. – PL3 Nov 19 '18 at 18:43
  • https://stackoverflow.com/a/61684583/6347629 – Venkatachalam Nov 08 '22 at 05:35
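
The wrapper idea from the comments above was never posted as code. A minimal sketch of how it might look follows; the class name `OptionalTransformer` and its `transformer`/`enabled` parameters are made up for illustration, and the list of dictionaries passed to `param_distributions` assumes a scikit-learn version that accepts one.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class OptionalTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: applies `transformer` only when `enabled` is True."""

    def __init__(self, transformer=None, enabled=True):
        self.transformer = transformer
        self.enabled = enabled

    def fit(self, X, y=None):
        if self.enabled:
            # Fit a clone so the estimator passed in is left untouched.
            self.transformer_ = clone(self.transformer).fit(X, y)
        return self

    def transform(self, X):
        # A disabled wrapper passes the data through unchanged.
        return self.transformer_.transform(X) if self.enabled else X


pipe = Pipeline([
    ('pca', OptionalTransformer(PCA())),
    ('kbest', OptionalTransformer(SelectKBest(chi2))),
    ('classify', LinearSVC()),
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
# Each dictionary enables exactly one reducer and tunes only its parameters.
param_distributions = [
    {'pca__enabled': [True], 'kbest__enabled': [False],
     'pca__transformer__n_components': N_FEATURES_OPTIONS,
     'classify__C': C_OPTIONS},
    {'pca__enabled': [False], 'kbest__enabled': [True],
     'kbest__transformer__k': N_FEATURES_OPTIONS,
     'classify__C': C_OPTIONS},
]

search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                            n_iter=6, cv=3, random_state=0)
digits = load_digits()
search.fit(digits.data, digits.target)
print(search.best_params_)
```

Enabling exactly one reducer per dictionary matters here: chi2 rejects negative inputs, so running SelectKBest(chi2) on PCA output would fail.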

3 Answers

1

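This is an old issue that has been resolved for a while now (I'm not sure in which scikit-learn version).

You can now pass a list of dictionaries to RandomizedSearchCV through the param_distributions parameter. For each candidate, one dictionary is sampled uniformly from the list first, and the parameters are then sampled from that dictionary. Your example code would become: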

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

pipe = Pipeline([
    # the reduce_dim stage is populated by the param_grid
    ('reduce_dim', None),
    ('classify', LinearSVC())
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]

grid = RandomizedSearchCV(pipe, cv=3, n_jobs=2, param_distributions=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

I'm using sklearn version 0.23.1.
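
To see what the search would draw without fitting anything, the same kind of list can be handed to ParameterSampler directly. The sketch below is only an illustration: it swaps the fixed C_OPTIONS list for a continuous `scipy.stats.loguniform` distribution (my assumption, not part of the question) to show that lists and distributions can be mixed.

```python
from scipy.stats import loguniform
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import ParameterSampler

N_FEATURES_OPTIONS = [2, 4, 8]
param_distributions = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        # Assumption: a continuous distribution in place of the C_OPTIONS list.
        'classify__C': loguniform(1e0, 1e3),
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': loguniform(1e0, 1e3),
    },
]

# Each draw picks one dictionary, then samples every parameter in it.
samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))
for params in samples:
    reducer = type(params['reduce_dim']).__name__
    size = params.get('reduce_dim__n_components', params.get('reduce_dim__k'))
    print(reducer, size, round(params['classify__C'], 2))
```

Every sampled candidate carries either reduce_dim__n_components or reduce_dim__k, never both, which is what keeps the differently named parameters from clashing.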

Qusai Alothman
0

I have found a workaround that relies on duck typing and doesn't get too much in the way.

It relies on passing complete estimators as parameters to the pipeline. We first sample the kind of model, and then its parameters. For that, we define two classes that can be sampled:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import ParameterSampler
from sklearn.utils import check_random_state


class EstimatorSampler:
    """
    Class that holds a model and its parameter distributions.
    When sampled, the parameters are drawn first and set on a clone of
    the model, which is returned.

    # Arguments
    ===========
    model : sklearn.base.BaseEstimator
    param_distributions : dict
        Input to ParameterSampler

    # Returns
    =========
    sampled : sklearn.base.BaseEstimator
    """
    def __init__(self, model, param_distributions):
        self.model = model
        self.param_distributions = param_distributions

    def rvs(self, random_state=None):
        sampled_params = next(iter(
            ParameterSampler(self.param_distributions,
                             n_iter=1,
                             random_state=random_state)))
        # Clone before setting parameters: set_params returns the same
        # object, so without a clone every draw would share one instance.
        return clone(self.model).set_params(**sampled_params)


class ListSampler:
    """
    List container that, when sampled, returns one of its items,
    with probabilities defined by `probs`.

    # Arguments
    ===========
    items : 1-D array-like
    probs : 1-D array-like of floats
        If not None, it should be the same length as `items`
        and sum to 1.

    # Returns
    =========
    sampled item
    """
    def __init__(self, items, probs=None):
        self.items = items
        self.probs = probs

    def rvs(self, random_state=None):
        # Draw an index with the given random_state so that the search
        # is reproducible (np.random.choice would ignore it).
        rng = check_random_state(random_state)
        item = self.items[rng.choice(len(self.items), p=self.probs)]
        if hasattr(item, 'rvs'):
            return item.rvs(random_state=rng)
        return item

And the rest of the code is defined below.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    from sklearn.decomposition import PCA, NMF
    from sklearn.feature_selection import SelectKBest, chi2

    pipe = Pipeline([
        # the reduce_dim stage is populated by the param_grid
        ('reduce_dim', None),
        ('classify', None)
    ])

    N_FEATURES_OPTIONS = [2, 4, 8]
    dim_reducers = ListSampler([EstimatorSampler(est, {'n_components': N_FEATURES_OPTIONS})
                                for est in [PCA(iterated_power=7), NMF()]] + 
                               [EstimatorSampler(SelectKBest(chi2), {'k': N_FEATURES_OPTIONS})])

    C_OPTIONS = [1, 10, 100, 1000]
    classifiers = EstimatorSampler(LinearSVC(), {'C': C_OPTIONS})

    param_dist = {
        'reduce_dim': dim_reducers, 
        'classify': classifiers
    }

    grid = RandomizedSearchCV(pipe, cv=3, n_jobs=2, scoring='accuracy', param_distributions=param_dist)
    digits = load_digits()
    grid.fit(digits.data, digits.target)
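
Samplers that hand back estimators need some care, because `set_params` modifies the estimator in place and returns that same object. As far as I can tell, the search draws all of its candidates before fitting any of them, so draws that share one instance can end up with identical parameters. A minimal sketch of the difference that `clone` makes:

```python
from sklearn.base import clone
from sklearn.decomposition import PCA

template = PCA()

# set_params returns the estimator itself, so every "draw" is the same object.
shared = [template.set_params(n_components=k) for k in (2, 4, 8)]
print([est.n_components for est in shared])       # [8, 8, 8]

# Cloning first gives each draw its own independent copy.
independent = [clone(template).set_params(n_components=k) for k in (2, 4, 8)]
print([est.n_components for est in independent])  # [2, 4, 8]
```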
Jacquot
0

Hyperopt supports hyperparameter tuning across multiple estimators; see this wiki for more details (section "2.2 A Search Space Example: scikit-learn").

Check out this post if you want to do that with sklearn's GridSearchCV. It suggests an implementation of an EstimatorSelectionHelper estimator that can run different estimators, each with its own grid of parameters.
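
The general idea of running one search per estimator, each with its own parameter space, can be sketched in a few lines. This is not the post's EstimatorSelectionHelper; the helper name `search_each` and the candidate names are hypothetical, and RandomizedSearchCV is used here in place of GridSearchCV to match the question.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]

# One (estimator, parameter space) pair per candidate; names are illustrative.
candidates = {
    'pca_svc': (
        Pipeline([('reduce_dim', PCA(iterated_power=7)),
                  ('classify', LinearSVC())]),
        {'reduce_dim__n_components': N_FEATURES_OPTIONS,
         'classify__C': C_OPTIONS},
    ),
    'kbest_svc': (
        Pipeline([('reduce_dim', SelectKBest(chi2)),
                  ('classify', LinearSVC())]),
        {'reduce_dim__k': N_FEATURES_OPTIONS,
         'classify__C': C_OPTIONS},
    ),
}


def search_each(candidates, X, y, n_iter=4, cv=3, random_state=0):
    """Run one randomized search per candidate and return the fitted searches."""
    results = {}
    for name, (estimator, space) in candidates.items():
        search = RandomizedSearchCV(estimator, param_distributions=space,
                                    n_iter=n_iter, cv=cv,
                                    random_state=random_state)
        results[name] = search.fit(X, y)
    return results


digits = load_digits()
results = search_each(candidates, digits.data, digits.target)
# Pick the candidate whose best cross-validated score is highest.
best_name = max(results, key=lambda name: results[name].best_score_)
print(best_name, results[best_name].best_params_)
```

With an integer `cv` and a classifier, the folds are stratified and not shuffled, so every candidate is scored on the same splits.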

Amine Benatmane