
xgboost's plotting API states:

xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', importance_type='weight', max_num_features=None, grid=True, **kwargs)

Plot importance based on fitted trees.

Parameters:

booster (Booster, XGBModel or dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()
...
max_num_features (int, default None) – Maximum number of top features displayed on plot. If None, all features will be displayed.

In my implementation, however, running:

from xgboost import XGBClassifier

booster_ = XGBClassifier(learning_rate=0.1, max_depth=3, n_estimators=100, 
                      silent=False, objective='binary:logistic', nthread=-1, 
                      gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, 
                      colsample_bytree=1, colsample_bylevel=1, reg_alpha=0,
                      reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0)

booster_.fit(X_train, y_train)

from xgboost import plot_importance
plot_importance(booster_, max_num_features=10)

fails with:

AttributeError: Unknown property max_num_features

Running it without the max_num_features parameter, however, correctly plots the entire feature set (which in my case is gigantic, ~10k features). Any ideas of what's going on?

Thanks in advance.

Details:

> python -V
  Python 2.7.12 :: Anaconda custom (x86_64)

> pip freeze | grep xgboost
  xgboost==0.4a30
Carlo Mazzaferro

4 Answers


Try to upgrade your xgboost library to 0.6. It should solve the problem. To upgrade the package, try this:

$ pip install -U xgboost

If you get an error (e.g. a compiler failure on macOS), try this:

$ brew install gcc@5
$ pip install -U xgboost

(Refer to https://github.com/dmlc/xgboost/issues/1501.)
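
After upgrading, a quick sanity check (a minimal sketch; booster_ is the fitted model from the question):

import xgboost
print(xgboost.__version__)  # should now report 0.6 or later

from xgboost import plot_importance
plot_importance(booster_, max_num_features=10)  # no longer raises AttributeError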

slfan
Tamirlan
  • Yep! XGBoost doesn't have the nicest docs, but after figuring that out it worked. I'll accept your answer as it is more relevant now (kinda forgot about having asked this question). – Carlo Mazzaferro May 05 '17 at 02:37

Until further notice I've solved the problem (at least partially) with this script:

import matplotlib.pyplot as plt

def feat_imp(df, model, n_features):
    # map each column name to the model's importance score for it
    d = dict(zip(df.columns, model.feature_importances_))
    # sort feature names by importance, descending, and keep the top n
    ss = sorted(d, key=d.get, reverse=True)
    top_names = ss[:n_features]

    plt.figure(figsize=(15, 15))
    plt.title("Feature importances")
    plt.bar(range(n_features), [d[i] for i in top_names], color="r", align="center")
    plt.xlim(-1, n_features)
    plt.xticks(range(n_features), top_names, rotation='vertical')
    plt.show()

feat_imp(filled_train_full, booster_, 20)

(resulting plot: a bar chart of the top 20 feature importances)

Carlo Mazzaferro
  • With XGBRegressor, I get `feature_importances_` not found error. – xgdgsc Mar 02 '17 at 13:57
  • @xgdgsc you may need to update xgboost. feature_importances_ is clearly part of their most recent API. See this post for more info: http://stackoverflow.com/questions/38212649/feature-importance-with-xgbclassifier – Carlo Mazzaferro Mar 02 '17 at 19:21

Despite the title of the documentation webpage ("Python API Reference - xgboost 0.6 documentation"), it does not contain the documentation for the 0.6 release of xgboost. Instead it seems to contain the documentation for the latest git master branch.

The 0.6 release of xgboost was made on Jul 29 2016:

This is a stable release of 0.6 version

@tqchen tqchen released this on Jul 29 2016 · 245 commits to master since this release

The commit that added plot_importance()'s max_num_features was made on Jan 16 2017.

As a further check, let's inspect the 0.60 release tarball:

pushd /tmp
curl -SLO https://github.com/dmlc/xgboost/archive/v0.60.tar.gz
tar -xf v0.60.tar.gz 
grep num_features xgboost-0.60/python-package/xgboost/plotting.py
# ... no output: max_num_features appears nowhere in the 0.60 sources.
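
You can run the same check against whatever copy you have installed (a minimal sketch using only the standard library; inspect.getargspec also works on the Python 2.7 from the question):

import inspect
from xgboost import plot_importance

# list the keyword arguments the installed plot_importance() accepts;
# if 'max_num_features' is missing, your version predates the commit above
print(inspect.getargspec(plot_importance).args)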

Therefore this seems to be a documentation bug with the xgboost project.

Ray Donnelly

Just something to add here: I still get this error, and I believe others do too. So until the issue is resolved, here is another way to achieve the same thing:

top_n = 50  # renamed from `max` to avoid shadowing the built-in
# sort the F scores in descending order, keep the top_n, and plot the resulting dict
top_features = dict(sorted(bst.get_fscore().items(), reverse=True, key=lambda x: x[1])[:top_n])
xgboost.plot_importance(top_features, height=0.8)

Since you can also pass a dict to the plot, you basically get the F scores with get_fscore(), sort the items in descending order, keep the desired number of top features, and convert the result back to a dict.
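
Note that get_fscore() lives on the Booster object. If you trained through the sklearn wrapper as in the question, you first have to unwrap it (a sketch reusing top_n from above; it assumes a recent xgboost, where the accessor is get_booster() — older releases exposed booster() instead):

# unwrap the sklearn wrapper to reach the underlying Booster, then proceed as above
scores = booster_.get_booster().get_fscore()
top_features = dict(sorted(scores.items(), reverse=True, key=lambda x: x[1])[:top_n])
xgboost.plot_importance(top_features, height=0.8)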

I hope this helps anyone else with the same issue who is trying to plot only a certain number of features, ordered by importance starting from the top feature, instead of plotting them all.

mj1261829