
I'm trying to use XGBoost and optimize the eval_metric as auc (as described here).

This works fine when using the classifier directly, but fails when I try to use it inside a pipeline.

What is the correct way to pass a .fit argument to the sklearn pipeline?

Example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn

print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)

X, y = load_iris(return_X_y=True)

# Without using the pipeline: 
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc')  # works fine

# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

# using the pipeline, but not optimizing for 'auc': 
pipe.fit(X, y)  # works fine

# however this does not work (even after correcting the underscores): 
pipe.fit(X, y, classifier__eval_metric='auc')  # fails

The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'

Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2.
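A quick way to confirm which installation Python actually imports (a diagnostic sketch; the printed path depends on your environment):

import xgboost

# The version reported at runtime and the file it was imported from;
# if these disagree with pip3 freeze, two installations are likely present.
print(xgboost.__version__)
print(xgboost.__file__)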

sapo_cosmico

2 Answers


The error is because you are using a single underscore between the estimator name and its parameter when using it in a pipeline. It should be two underscores.

From the documentation of Pipeline.fit(), we see that this is the correct way of supplying params in fit:

Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

So in your case, the correct usage is:

pipe.fit(X_train, y_train, classifier__eval_metric='auc')

(Notice the two underscores between the step name and the parameter.)
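For reference, here is a minimal runnable version using the iris data from the question (a sketch; it assumes an XGBoost version whose fit method still accepts eval_metric — in recent XGBoost releases the metric is passed to the constructor instead):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

# Pipeline.fit routes any '<step>__<param>' keyword argument to the
# fit method of the step with that name, here the 'classifier' step.
pipe.fit(X, y, classifier__eval_metric='auc')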

Vivek Kumar
  • Unfortunately that didn't work either, added it to the tested options on the original question. I think it would have worked if it were a parameter of the classifier (e.g. `nr_estimators`), but it is an argument of the fit method of that particular classifier. – sapo_cosmico Mar 15 '17 at 10:53
  • I am using iris data from sklearn, and it is working fine (Not throwing any errors). Please update your scikit and / or xgboost version and try again – Vivek Kumar Mar 15 '17 at 10:58
  • Interesting, can you please tell me the versions you are using? I'm using xgboost version '0.6' and sklearn version '0.18.1' – sapo_cosmico Mar 15 '17 at 11:00
  • There is definitely something strange going on. xgboost.__version__ shows something different from pip freeze for some reason. I changed the example to make it replicable with the iris dataset, could I ask you to see if it runs on yours? (huge thanks, even if it isn't SO policy to thank people) – sapo_cosmico Mar 15 '17 at 11:58
  • I have edited your code for copy and paste. Also I tried on both Python 2.7.6 and Python 3.4.3 and it works fine. – Vivek Kumar Mar 15 '17 at 12:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138106/discussion-between-sapo-cosmico-and-vivek-kumar). – sapo_cosmico Mar 15 '17 at 12:17

When the goal is to optimize a metric, I suggest using the sklearn wrapper together with GridSearchCV:

from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV

It looks like this:

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

score = 'roc_auc'
# No need to fit the pipeline beforehand; GridSearchCV clones and fits it itself.

param = {
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as an example
}

gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)

This also gives you cross-validation, which GridSearchCV performs internally when you call fit:

gsearch.fit(X, y)

And you get the best params and the best score:

gsearch.best_params_, gsearch.best_score_
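
If you also want to forward the eval_metric fit argument through the search, newer scikit-learn versions (0.19+) forward keyword arguments of GridSearchCV.fit to the underlying estimator's fit, using the same step__param naming (a sketch under that assumption):

# Forward a fit parameter to the 'classifier' step inside the pipeline;
# eval_metric as a fit argument assumes an older XGBoost version.
gsearch.fit(X, y, classifier__eval_metric='auc')
print(gsearch.best_params_, gsearch.best_score_)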
Edward