
I would like to use the first step of a scikit-learn pipeline to generate a toy data set in order to evaluate the performance of my analysis. An as-simple-as-it-gets example solution I came up with looks like the following:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn import cluster

class FeatureGenerator(TransformerMixin):

    def __init__(self, num_features=None):
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, **transform_params):
        # Ignore X entirely and emit a generated toy data set instead.
        return np.arange(self.num_features * self.num_features).reshape(
            self.num_features, self.num_features)

    def get_params(self, deep=True):
        return {"num_features": self.num_features}

    def set_params(self, **parameters):
        self.num_features = parameters["num_features"]
        return self

In action, this transformer would be called e.g. like this:

pipeline = Pipeline([
    ('pick_features', FeatureGenerator(100)),
    ('kmeans', cluster.KMeans())
])

pipeline = pipeline.fit(None)
classes = pipeline.predict(None)
print(classes)

It gets tricky for me as soon as I try to grid search over this pipeline:

parameter_sets = {
    'pick_features__num_features' : [10,20,30],
    'kmeans__n_clusters' : [2,3,4]
}

pipeline = Pipeline([
    ('pick_features', FeatureGenerator()),
    ('kmeans', cluster.KMeans())
])

g_search_estimator = GridSearchCV(pipeline, parameter_sets)

g_search_estimator.fit(None,None)

The grid search expects the samples and the labels as input and is not as forgiving as the pipeline, which does not complain about None as an input parameter:

TypeError: Expected sequence or array-like, got <type 'NoneType'>

This makes sense, because the grid search needs to divide the data set into different CV partitions.


Unlike in the above example, I have a lot of parameters that can be adjusted in the data-set generation step. I thus need a way to include this step in my parameter-selection cross-validation.

Question: Is there a way to set the Xs and ys of the GridSearch from inside the first transformer? Or what would a solution look like that calls the GridSearch with multiple different data sets (preferably in parallel)? Or has anyone tried to customize GridSearchCV, or can you point to some reading material on this?

Milla Well

1 Answer


Your code is very clean so it is a pleasure to offer you this quick and dirty solution:

g_search_estimator.fit([1., 1., 1.], [1., 0., 0.])
g_search_estimator.best_params_
g_search_estimator.best_params_

Output:

[tons of int64 to float64 conversion warnings]
{'kmeans__n_clusters': 4, 'pick_features__num_features': 10}

Note you need 3 samples because you're doing a (default) 3-fold cross validation.
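A minimal, self-contained way to see this constraint (sketch using the current sklearn.model_selection import; the single-point grid and the KMeans settings are only illustrative): with cv=3 you need at least three samples, and in recent scikit-learn versions X must also be 2-D.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans

# Three samples (as a 2-D array) are enough for 3-fold CV;
# fewer samples than folds would make the split fail.
X = np.array([[1.], [2.], [3.]])
search = GridSearchCV(KMeans(n_init=10), {"n_clusters": [2]}, cv=3)
search.fit(X)
print(search.best_params_)  # → {'n_clusters': 2}
```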

The error you get happens because of a check performed by the GridSearchCV object, so it happens before your transformer has a chance to do anything. So I would say "no" to your first question:

Is there a way to set the Xs and ys of the GridSearch from inside the first transformer?

EDIT:
I realize this was unnecessarily confusing; the three following lines are equivalent:

g_search_estimator.fit([1., 1., 1.], [1., 0., 0.])
g_search_estimator.fit([1., 1., 1.], None)
g_search_estimator.fit([1., 1., 1.])

Sorry for hastily throwing random ys in there.

Some explanations about how the grid search computes scores for the different grid points: when you pass scoring=None to the GridSearchCV constructor (this is the default, so that's what you have here), it asks the estimator for a score function and uses it if one exists. For KMeans the default score function is essentially the opposite of the sum of distances to the cluster centers.
This is an unsupervised metric, so y is not necessary here.
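You can verify this yourself on a tiny, well-separated data set (the data and KMeans settings below are only illustrative): on the training data, KMeans.score is exactly minus the fitted inertia_.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters around 0.5 and 10.5.
X = np.array([[0.], [1.], [10.], [11.]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# score(X) is the opposite of the K-means objective: minus the sum of
# squared distances to the nearest cluster center (the inertia).
print(km.score(X), -km.inertia_)
```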

Wrapping it up, you will always be able to:

set the Xs of the GridSearch from inside the first transformer

Just 'transform' the input X into something totally unrelated; no one will complain about it. You do need some input random_X, though.
Now if you want to use supervised metrics (I get this feeling from your question) you'll need to specify y as well.
An easy scenario is one where you have a fixed y vector and you want to try several X with it. Note that scoring is a parameter of the GridSearchCV constructor, not of fit, so you can just do:

g_search_estimator = GridSearchCV(pipeline, parameter_sets, scoring=my_scoring_function)
g_search_estimator.fit(random_X, y)

and it should run fine. If you want to search over different values of y it will probably be a bit trickier.
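For the "multiple different data sets, preferably in parallel" part of the question, one straightforward option is to run one independent grid search per candidate data set and parallelize over data sets with joblib (which scikit-learn itself uses). Everything here is a sketch: make_dataset, the grid, and the KMeans settings are hypothetical stand-ins for your own generation parameters.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans

# Hypothetical generator: each parameter value yields a different toy data set.
def make_dataset(n_samples, seed=0):
    rng = np.random.RandomState(seed)
    return rng.rand(n_samples, 2)

def search_one(X):
    # One full grid search per data set.
    search = GridSearchCV(KMeans(n_init=10), {"n_clusters": [2, 3]}, cv=3)
    search.fit(X)
    return search.best_params_, search.best_score_

# Run the independent searches in parallel, one per candidate data set.
results = Parallel(n_jobs=2)(
    delayed(search_one)(make_dataset(n)) for n in [30, 60]
)
```

Because each data set gets its own GridSearchCV, the per-data-set best scores are comparable only if the scoring function is; with the default KMeans score that caveat applies.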

ldirer
  • doesn't this use the `y`s you pass in to score the solution? – Milla Well Jul 28 '15 at 08:55
  • No, when you pass scoring=None (default) to `GridSearchCV`, it uses the score function of the estimator if there is such a function. If you try specifying a scoring function with this `y` you'll get an error in its execution. For `KMeans` there is a 'default' score function, so it is this one that is used. See `km = KMeans()` then the `km.score` method. It is an unsupervised metric, essentially (minus) the sum of distances to cluster centers. Do you want to use supervised metrics? If so please add some details to your question. – ldirer Jul 28 '15 at 09:40