I would like to use the first step of a scikit-learn pipeline to generate a toy data set in order to evaluate the performance of my analysis. The simplest example I could come up with looks like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn import cluster
class FeatureGenerator(TransformerMixin):

    def __init__(self, num_features=None):
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, **transform_params):
        return np.array(
            range(self.num_features*self.num_features)
        ).reshape(self.num_features,
                  self.num_features)

    def get_params(self, deep=True):
        return {"num_features": self.num_features}

    def set_params(self, **parameters):
        self.num_features = parameters["num_features"]
        return self
In action, this transformer would be called like this, for example:
pipeline = Pipeline([
    ('pick_features', FeatureGenerator(100)),
    ('kmeans', cluster.KMeans())
])
pipeline = pipeline.fit(None)
classes = pipeline.predict(None)
print classes
It gets tricky for me as soon as I try to grid search over this pipeline:
parameter_sets = {
    'pick_features__num_features': [10, 20, 30],
    'kmeans__n_clusters': [2, 3, 4]
}
pipeline = Pipeline([
    ('pick_features', FeatureGenerator()),
    ('kmeans', cluster.KMeans())
])
g_search_estimator = GridSearchCV(pipeline, parameter_sets)
g_search_estimator.fit(None,None)
The grid search expects the samples and the labels as input and is less forgiving than the pipeline, which does not complain about None as an input parameter:
TypeError: Expected sequence or array-like, got <type 'NoneType'>
This makes sense, because the grid search needs to divide the data set into different CV partitions.
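One workaround I can imagine is to give the grid search something it can split, namely a placeholder array of sample indices, and let the first step turn those indices into generated rows. The following is only a minimal sketch under several assumptions: it targets Python 3 and a recent scikit-learn that provides sklearn.model_selection, and the class name IndexedFeatureGenerator, its random_state parameter, and the Gaussian toy data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class IndexedFeatureGenerator(TransformerMixin, BaseEstimator):
    """Hypothetical generator: maps placeholder sample indices to toy rows."""

    def __init__(self, num_features=10, random_state=0):
        self.num_features = num_features
        self.random_state = random_state

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X only carries integer sample indices; the real data is generated here.
        idx = np.asarray(X).ravel().astype(int)
        rng = np.random.RandomState(self.random_state)
        # Build a pool so that the same index always maps to the same row.
        pool = rng.normal(size=(idx.max() + 1, self.num_features))
        return pool[idx]


pipeline = Pipeline([
    ("pick_features", IndexedFeatureGenerator()),
    ("kmeans", KMeans(n_init=10, random_state=0)),
])
parameter_sets = {
    "pick_features__num_features": [10, 20, 30],
    "kmeans__n_clusters": [2, 3, 4],
}

# Placeholder input: one row per sample, so the CV splitter has something to divide.
X_index = np.arange(60).reshape(-1, 1)

search = GridSearchCV(pipeline, parameter_sets, cv=3)
search.fit(X_index)  # no labels; falls back to the pipeline's own score method
print(search.best_params_)
```

Without an explicit scoring argument the search ranks candidates by KMeans.score (the negative inertia), which is not comparable across different numbers of features, so a custom scorer would be needed before the selected parameters mean anything.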
Unlike in the example above, I have many parameters that can be adjusted in the data set generation step. I therefore need a way to include this step in my parameter-selection cross-validation.
Question: Is there a way to set the X and y of the GridSearch from inside the first transformer? Or what could a solution look like that calls the GridSearch with multiple different data sets (preferably in parallel)? Or has anyone tried to customize GridSearchCV, or can anyone point to some reading material on this?
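To make the "multiple different data sets" variant concrete, here is a rough sketch of what I have in mind: an outer loop over the data-generation parameters, with one ordinary grid search per generated data set. It assumes Python 3 and a recent scikit-learn; make_blobs merely stands in for my real generator, search_one_data_set is a hypothetical helper, and the parameter values are made up for illustration.

```python
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, ParameterGrid


def search_one_data_set(data_params, model_grid):
    """Generate one toy data set and run a regular grid search on it."""
    # make_blobs is a stand-in for the actual data set generation step.
    X, _ = make_blobs(random_state=0, **data_params)
    search = GridSearchCV(KMeans(n_init=10, random_state=0), model_grid, cv=3)
    search.fit(X)
    return data_params, search.best_params_, search.best_score_


# Outer grid: parameters of the data generation (illustrative values).
data_grid = ParameterGrid({
    "n_samples": [60],
    "n_features": [2, 5],
    "centers": [2, 3],
})
# Inner grid: parameters of the model itself.
model_grid = {"n_clusters": [2, 3, 4]}

# n_jobs=1 runs sequentially; a larger value would spread the data sets over workers.
results = Parallel(n_jobs=1)(
    delayed(search_one_data_set)(params, model_grid) for params in data_grid
)

for data_params, best_params, best_score in results:
    print(data_params, best_params, best_score)
```

This keeps GridSearchCV unmodified, at the cost of the data-generation parameters living outside the search object, so comparing results across data sets is left to whatever is done with the results list.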