24

Hopefully I'm reading this wrong, but in the XGBoost library documentation there is a note about extracting the feature importances via the feature_importances_ attribute, much like sklearn's random forest.

However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'

My code snippet is below:

from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Y = iris.target[Y < 2] # arbitrarily removing class 2 so it can be 0 and 1
X = X[:len(Y)] # cutting X down to match the rows in Y
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_

It seems that you can compute feature importance using the Booster object by calling the get_fscore method. The only reason I'm using XGBClassifier over Booster is that it can be wrapped in a sklearn pipeline. Any thoughts on extracting the feature importances? Is anyone else experiencing this?
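For reference, a minimal sketch of the Booster route I mean, using the plain xgboost API rather than the sklearn wrapper (the printed keys below are illustrative):

import xgboost as xg
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target
X, Y = X[Y < 2], Y[Y < 2]  # keep only classes 0 and 1, as above
dtrain = xg.DMatrix(X, label=Y)
bst = xg.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)
print(bst.get_fscore())  # e.g. {'f2': 4, 'f3': 7}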

Minh
  • I can't reproduce the problem with your snippet. What version of XGBoost do you have? – BrenBarn Jul 05 '16 at 21:06
  • from my `pip freeze` , i have `xgboost==0.4a30` – Minh Jul 05 '16 at 21:22
  • Does this help? https://www.kaggle.com/mmueller/liberty-mutual-group-property-inspection-prediction/xgb-feature-importance-python/comments – Chong Tang Jul 05 '16 at 21:33
  • I have seen this before. The problem, however, is that the `get_fscore` method is bound to the `Booster` object rather than `XGBClassifier`, from my understanding. See the doc [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster) – Minh Jul 05 '16 at 21:36
  • I have 0.4 and your snippet works with no problem. – BrenBarn Jul 06 '16 at 01:56
  • Hrm this is odd. The current version is `0.4a30` right? It appears so looking at their [repo](https://github.com/dmlc/xgboost) – Minh Jul 06 '16 at 16:35
  • @MinhMai using `feature_importances_` via booster(), are you able to get the column names accurately? In my case, it throws a KeyError saying that certain features are not present in the data. – fixxxer Jun 23 '17 at 18:23

10 Answers

17

As the comments indicate, I suspect your issue is a versioning one. However, if you do not want to or cannot update, then the following function should work for you.

def get_xgb_imp(xgb, feat_names):
    from numpy import array
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]:float(imp_vals.get('f'+str(i),0.)) for i in range(len(feat_names))}
    total = array(imp_dict.values()).sum()
    return {k:v/total for k,v in imp_dict.items()}


>>> import numpy as np
>>> from xgboost import XGBClassifier
>>> 
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>> 
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}
David
  • Interesting approach! However, would it matter if I tune my parameters for `XGBClassifier`? How would I ensure that they would match the parameters for `Booster`? – Minh Jul 06 '16 at 16:33
  • you're referencing the booster() object within your XGBClassifier() object, so it will match: `xgb.booster()` – David Jul 06 '16 at 18:32
  • I noticed something strange; is that supposed to happen? Shouldn't the values returned from xgb.booster().get_fscore() contain entries for all the columns the model was trained on? I find 2 columns missing from imp_vals; they are present in the training columns, but not as keys in imp_cols – Debasish Kanhar Dec 22 '16 at 13:07
  • I had to use `xgb.get_booster().get_fscore()`. Otherwise I was getting `TypeError: 'str' object is not callable`. I am using xgboost 0.6. – Luís Bianchin Jun 09 '17 at 08:25
  • I pickled my XGB object and am unable to call `get_booster()`: `File "/usr/local/lib/python3.5/dist-packages/xgboost/sklearn.py", line 193, in get_booster raise XGBoostError('need to call fit or load_model beforehand') ` – Max Candocia Sep 15 '19 at 01:04
15

For xgboost, if you use xgb.fit(), then you can use the following method to get the feature importance.

import pandas as pd

# xgb is an already-instantiated XGBClassifier
xgb_model = xgb.fit(x, y)
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
print(xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')

from xgboost import plot_importance
plot_importance(xgb_model)
rosefun
8

I found out the answer. It appears that version 0.4a30 does not have the feature_importances_ attribute. Therefore, if you install the xgboost package using pip install xgboost you will be unable to conduct feature extraction from the XGBClassifier object; you can refer to @David's answer if you want a workaround.

However, what I did was build it from source by cloning the repo and running . ./build.sh, which installs version 0.4, where the feature_importances_ attribute works.
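To double-check which version actually ends up installed (the same information pip freeze shows), you can query it from Python:

import xgboost
print(xgboost.__version__)  # e.g. '0.4' after building from source, vs '0.4a30' from pip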

Hope this helps others!

Minh
5

Get Feature Importance as a sorted data frame

import pandas as pd
import numpy as np
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.booster().get_fscore()
    feats_imp = pd.DataFrame(imp_vals,index=np.arange(2)).T
    feats_imp.iloc[:,0]= feats_imp.index    
    feats_imp.columns=['feature','importance']
    feats_imp.sort_values('importance',inplace=True,ascending=False)
    feats_imp.reset_index(drop=True,inplace=True)
    return feats_imp

feature_importance_df = get_xgb_imp(xgb, feat_names)
Ioannis Nasios
2

For those having the same problem as Luís Bianchin, "TypeError: 'str' object is not callable", I found a solution (that works for me at least) here.

In short, I found modifying David's code from

imp_vals = xgb.booster().get_fscore()

to

imp_vals = xgb.get_fscore()

worked for me.
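For completeness, a sketch of David's function with that single change applied (I also compute the total with a plain sum so it runs on Python 3 as well):

def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.get_fscore()  # the modified line
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    total = sum(imp_dict.values())
    return {k: v/total for k, v in imp_dict.items()}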

For more detail I would recommend visiting the link above.

Big thanks to David and ianozsvald

connor.p
1

You can also use the built-in plot_importance function:

from xgboost import XGBClassifier, plot_importance
fit = XGBClassifier().fit(X,Y)
plot_importance(fit)

[plot_importance output]

Ferro
1

An alternative to the built-in feature importance:

I really like the shap package because it provides additional plots. Examples:

Importance Plot

[xgboost shap importance plot]

Summary Plot

[xgboost shap summary plot]

Dependence Plot

[xgboost shap dependence plot]
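A minimal sketch of how these plots can be produced with shap (toy data, and the TreeExplainer API, which may differ slightly between shap versions):

import numpy as np
import shap
from xgboost import XGBClassifier

# toy data just for illustration
X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, plot_type="bar")  # importance plot
shap.summary_plot(shap_values, X)                   # summary plot
shap.dependence_plot(0, shap_values, X)             # dependence plot for feature 0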

You can read about alternative ways to compute feature importance in Xgboost in this blog post of mine.

desertnaut
pplonski
0

An update of the accepted answer since it no longer works:

def get_xgb_imp(xgb_model, feat_names):
    imp_vals = xgb_model.get_fscore()
    imp_dict = {feat: float(imp_vals.get(feat, 0.)) for feat in feat_names}
    total = sum(list(imp_dict.values()))
    return {k: round(v/total, 5) for k,v in imp_dict.items()}
Jeroen Boeye
0

It seems like the API keeps changing. For xgboost version 1.0.2, changing imp_vals = xgb.booster().get_fscore() to imp_vals = xgb.get_booster().get_fscore() in @David's answer does the trick (on Python 3 the dict values also need to be wrapped in list() before handing them to numpy). The updated code is:

from numpy import array

def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.get_booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    total = array(list(imp_dict.values())).sum()
    return {k: v/total for k, v in imp_dict.items()}
Aditya Mishra
0

I used the following code to get the feature importance. Also, I used DictVectorizer() in the pipeline for one-hot encoding:

from sklearn.feature_extraction import DictVectorizer
import xgboost as xgb

# best_model is a fitted XGBClassifier; X is a pandas DataFrame
v = DictVectorizer()
X_to_dict = X.to_dict("records")
X_transformed = v.fit_transform(X_to_dict)
feature_names = v.get_feature_names()
best_model.get_booster().feature_names = feature_names
xgb.plot_importance(best_model.get_booster())

You can obtain the f_score plot. But I wanted to plot the feature importance against the feature names, so I modified it further:

import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(10, 30))
plt.barh(feature_names, best_model.feature_importances_)
plt.xticks(rotation=90)
plt.show()