I am using scikit-learn and building a pipeline with it.
This is the last (and main) part of the pipeline:
    preprocessor_steps = [('data_transformer', data_transformer),
                          ('reduce_dim', TruncatedSVD())]
    preprocessor = Pipeline(steps=preprocessor_steps)
    clustering_steps = [('preprocessor', preprocessor),
                        ('cluster', DummyEstimator())]
    clustering = Pipeline(steps=clustering_steps)
data_transformer has steps like OneHotEncoder, KNNImputer, etc.
Now I have a GridSearchCV:
    param_grid = [{
        'cluster': [KMeans()],
        'cluster__n_clusters': range(1, 11),
        'cluster__init': ['k-means++', 'random']
    },
    {
        'cluster': [DBSCAN()],
        'cluster__eps': [0.5, 0.7, 1],
    }]
    grid_search = GridSearchCV(estimator=clustering, param_grid=param_grid,
                               scoring='accuracy', verbose=2, n_jobs=1,
                               error_score='raise')
    grid_search.fit(X_train, y_train)
It works fine for every KMeans hyperparameter combination but fails for DBSCAN with this error:
AttributeError: 'DBSCAN' object has no attribute 'predict'
I think this is because DBSCAN implements 'fit_predict' but not 'predict', and scoring='accuracy' makes GridSearchCV call 'predict' on the fitted pipeline. I don't want to change my overall layout (selecting the best pipeline through GridSearchCV), because I have many more parameters and algorithms that I want to compare.
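
To check that guess, here is a minimal sketch that only inspects which methods the two estimators expose. It assumes a standard scikit-learn install; the variable name has_predict is made up for this illustration and is not part of my pipeline.

```python
from sklearn.cluster import DBSCAN, KMeans

# Record whether each clustering estimator exposes a predict method.
# (has_predict is a name chosen for this sketch only.)
has_predict = {}
for est in (KMeans(), DBSCAN()):
    name = type(est).__name__
    has_predict[name] = hasattr(est, "predict")
    print(name,
          "| predict:", hasattr(est, "predict"),
          "| fit_predict:", hasattr(est, "fit_predict"))
```

Both estimators report fit_predict, but only KMeans reports predict, which seems consistent with the AttributeError above.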