I am using scikit-learn to build a pipeline.

This is the last (and main) part of the pipeline:

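from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

# Preprocessing: column transforms followed by dimensionality reduction
preprocessor_steps = [('data_transformer', data_transformer),
                      ('reduce_dim', TruncatedSVD())]
preprocessor = Pipeline(steps=preprocessor_steps)

# DummyEstimator is a placeholder that the grid search swaps out
clustering_steps = [('preprocessor', preprocessor),
                    ('cluster', DummyEstimator())]
clustering = Pipeline(steps=clustering_steps)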

data_transformer contains steps such as OneHotEncoder and KNNImputer.

I then run GridSearchCV over the pipeline:
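For readers who want something concrete to run, here is a minimal sketch of what such a data_transformer might look like. The question does not show the real one, so the ColumnTransformer layout, the toy frame, and the column names (age, income, city) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30.0, 45.0, np.nan, 52.0],
    "city": ["a", "b", "a", "c"],
})

# Assumed layout: impute the numeric columns, one-hot encode the categorical one.
data_transformer = ColumnTransformer(transformers=[
    # Fill missing numeric values from the nearest rows.
    ("num", KNNImputer(n_neighbors=2), ["age", "income"]),
    # Expand the categorical column into indicator columns.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

Xt = data_transformer.fit_transform(df)
print(Xt.shape)
```

An object built this way can be dropped into the 'data_transformer' step of the preprocessor pipeline.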

from sklearn.cluster import DBSCAN, KMeans
from sklearn.model_selection import GridSearchCV

# One sub-grid per clustering algorithm; each replaces the 'cluster' step
param_grid = [
    {
        'cluster': [KMeans()],
        'cluster__n_clusters': range(1, 11),
        'cluster__init': ['k-means++', 'random'],
    },
    {
        'cluster': [DBSCAN()],
        'cluster__eps': [0.5, 0.7, 1],
    },
]

grid_search = GridSearchCV(estimator=clustering, param_grid=param_grid,
                           scoring='accuracy', verbose=2, n_jobs=1,
                           error_score='raise')

grid_search.fit(X_train, y_train)

It works fine for every KMeans hyperparameter combination but fails for DBSCAN with this error:

AttributeError: 'DBSCAN' object has no attribute 'predict'

I think this is because DBSCAN implements 'fit_predict' but not 'predict', and the scorer calls 'predict' on the fitted pipeline. I don't want to change my setup (picking the best pipeline via GridSearchCV) because I have many more parameters and algorithms that I want to compare.
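That diagnosis can be checked directly. The following is a small sketch, assuming a standard scikit-learn install, that reports which of the clustering classes mentioned here expose predict and which only offer fit_predict.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Report which estimators expose predict() and which only offer fit_predict().
support = {
    est.__name__: {
        "predict": hasattr(est, "predict"),
        "fit_predict": hasattr(est, "fit_predict"),
    }
    for est in (KMeans, DBSCAN, AgglomerativeClustering)
}

for name, methods in support.items():
    print(name, methods)
```

Any estimator that lacks predict will trigger the same AttributeError when a scorer tries to call it.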

Tomerikoo
jainakki

1 Answer


I ran into the same problem with AgglomerativeClustering and resolved it with a wrapper class like this:

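from sklearn.cluster import AgglomerativeClustering

class AgglomerativeClusteringWrapper(AgglomerativeClustering):
    def predict(self, X):
        # Returns the labels computed during fit(); X itself is not used
        return self.labels_.astype(int)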

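You can do the same thing by subclassing DBSCAN instead. Keep in mind that labels_ holds the labels of the data passed to fit, so this predict ignores X and only lines up with the targets when it is called on that same data.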
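Here is a sketch of the same idea applied to DBSCAN inside a grid search. It differs from the wrapper above in one assumed detail: predict re-runs fit_predict on whatever X it receives, so the returned labels always match the length of the validation fold. The class name DBSCANWrapper, the make_blobs toy data, the StandardScaler step, and the switch from 'accuracy' to 'adjusted_rand_score' (chosen because cluster label numbering is arbitrary) are all illustrative choices, not part of the original question or answer.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DBSCANWrapper(DBSCAN):
    """Hypothetical wrapper giving DBSCAN a predict() method."""

    def predict(self, X):
        # DBSCAN cannot assign unseen points to existing clusters, so this
        # re-clusters X from scratch; labels are not comparable across calls.
        return self.fit_predict(X).astype(int)


# Toy data standing in for X_train / y_train.
X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

pipe = Pipeline(steps=[("scale", StandardScaler()),
                       ("cluster", DBSCANWrapper())])

grid_search = GridSearchCV(
    estimator=pipe,
    param_grid={"cluster__eps": [0.5, 0.7, 1]},
    scoring="adjusted_rand_score",  # assumption: permutation-invariant metric
    cv=3,
    error_score="raise",
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Because the wrapper inherits DBSCAN's constructor, cloning and cluster__eps parameter routing keep working without extra code. The trade-off is that each predict call discards the earlier fit, so the score measures how well DBSCAN clusters the validation fold on its own.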
