Sklearn BaggingClassifier doesn't work with a pipeline(preprocessor, KNeighborsClassifier)

Question

Using sklearn, I have a pipeline that works perfectly. It basically looks like this:

model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier())

model_1_KNeighborsClassifier.fit(X_train, y_train)

But if I use this pipeline as the base estimator for bagging:

model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)

model_bagging.fit(X_train,y_train)

It no longer works:

File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\__init__.py:423, in _get_column_indices(X, key)
    422 try:
--> 423     all_columns = X.columns
    424 except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[9], line 6
      1 from sklearn.ensemble import BaggingClassifier
      2 model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)
----> 6 model_bagging.fit(X_train,y_train)
      7 #model_bagging.score(X_test,y_test)

File c:\Users\gui-r\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:1151, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1144     estimator._validate_params()
   1146 with config_context(
   1147     skip_parameter_validation=(
   1148         prefer_skip_nested_validation or global_skip_validation
   1149     )
   1150 ):
...
    428     )
    429 if isinstance(key, str):
    430     columns = [key]

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

It is as if bagging cannot pass the data through the pipeline's preprocessing step.

The full code is:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import make_column_transformer
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


titanic = sns.load_dataset('titanic')

y = titanic['survived']
X = titanic.drop('survived', axis=1) 

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

numerical_features = [ 'age', 'fare'] 
categorical_features = ['sex', 'deck', 'alone'] 
other_features=['pclass']


numerical_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
other_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent')) 


preprocessor = make_column_transformer(
    (numerical_pipeline, numerical_features),
    (categorical_pipeline, categorical_features),
    (other_pipeline, other_features),
)

processed_data=preprocessor.fit_transform(titanic)


model_1_KNeighborsClassifier = make_pipeline(preprocessor, KNeighborsClassifier(algorithm='ball_tree',metric='manhattan',n_neighbors=11))


model_bagging = BaggingClassifier(base_estimator=model_1_KNeighborsClassifier,n_estimators=10)


# These two lines work:
# model_1_KNeighborsClassifier.fit(X_train, y_train)
# print(model_1_KNeighborsClassifier.score(X_test, y_test))

model_bagging.fit(X_train,y_train)
print(model_bagging.score(X_test,y_test))

Any idea what's wrong?

Again, the pipeline works on its own.

Matt Hall · Accepted Answer · 2023-08-10T13:10:18.573

The error tells you what is wrong: Specifying the columns using strings is only supported for pandas DataFrames.

I believe this is because estimator classes (like BaggingClassifier) are subclasses of BaseEstimator, which performs validation on its inputs. Part of this process casts X and y to NumPy arrays using sklearn.utils.check_array(). You can try running this function on your own DataFrame to see that it produces arrays.
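A quick way to see the conversion is the minimal sketch below. It uses a small made-up DataFrame (the column names and values are illustrative, not the Titanic data) and only shows what `check_array` returns; it does not reproduce everything `BaggingClassifier` does internally.

```python
import pandas as pd
from sklearn.utils import check_array

# Hypothetical toy frame; any all-numeric DataFrame would do.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.93]})

arr = check_array(df)

# The DataFrame carries column names; the validated result does not.
print(type(df).__name__, "->", type(arr).__name__)
print("has .columns:", hasattr(arr, "columns"))
```

Once the data is a plain array, anything downstream that selects columns by name has nothing to look up.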

The net result is that when you pass a DataFrame into a pipeline with the preprocessor as the first step, the column transformer can see your feature names. But when you wrap everything in the bagging classifier, the names are removed by its validation process before the pipeline ever sees the data.
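The same failure can be reproduced without any bagging, which suggests the column transformer is reacting to the missing names rather than to anything specific to the ensemble. This is a sketch on a hypothetical two-column frame; the exact wording of the error message may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the real data.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                   "fare": [7.25, 71.28, 7.93, 53.1]})

# Columns are selected by name, as in the question.
by_name = make_column_transformer((StandardScaler(), ["age", "fare"]))

by_name.fit(df)  # works: the DataFrame has column names

try:
    by_name.fit(df.to_numpy())  # same values, but no column names
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```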

I think using positional indices instead will work, but there are probably other ways. For example, you could give the KNN directly to the bagging classifier, then put the result in the pipeline:

knn = KNeighborsClassifier(algorithm='ball_tree',
                           metric='manhattan',
                           n_neighbors=11)
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2.
classifier = BaggingClassifier(base_estimator=knn, n_estimators=10)
pipeline = make_pipeline(preprocessor, classifier)
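The positional-index idea mentioned above could look roughly like the sketch below. It assumes a small synthetic frame (`X_toy`, `y_toy` and the `*_idx` names are hypothetical, not from the question) and passes the base estimator positionally so the code does not depend on whether the parameter is called `base_estimator` or `estimator`. I have not run it against the full Titanic code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the Titanic frame.
rng = np.random.default_rng(0)
n_rows = 60
age = rng.uniform(1, 80, n_rows)
age[::7] = np.nan  # a few missing values for the imputer to fill
X_toy = pd.DataFrame({
    "age": age,
    "fare": rng.uniform(5, 100, n_rows),
    "sex": rng.choice(["male", "female"], n_rows),
})
y_toy = (X_toy["fare"] > 50).astype(int)

# Translate column names into integer positions once, up front.
numerical_idx = [X_toy.columns.get_loc(c) for c in ["age", "fare"]]
categorical_idx = [X_toy.columns.get_loc(c) for c in ["sex"]]

# Integer selectors still work after bagging has turned the frame into an array.
preprocessor_by_position = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), numerical_idx),
    (OneHotEncoder(handle_unknown="ignore"), categorical_idx),
)

base = make_pipeline(preprocessor_by_position, KNeighborsClassifier(n_neighbors=5))
bagged = BaggingClassifier(base, n_estimators=10, random_state=0)
bagged.fit(X_toy, y_toy)

predictions = bagged.predict(X_toy)
print(predictions[:10])
```

The trade-off is that the pipeline now depends on column order, so the frame passed to `predict` must have its columns in the same positions as the one used in `fit`.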

But since the pipeline works perfectly with my pandas DataFrames, why wouldn't it work through bagging (same X_train,y_train)? — Guigui, Aug 09 '23 at 09:17
I updated my answer and suggested another way around the issue. — Matt Hall, Aug 10 '23 at 13:10
I think `BaggingClassifier` is more unique among meta-estimators here than this answer suggests. It has to do the bootstrap resampling for each base estimator, so it converts to arrays to facilitate that. I think sklearn has some "safe indexing" functions that could be used to allow bagging from frames though; consider opening an Issue on their github? — Ben Reiniger, Aug 26 '23 at 20:17
