Determine WHY Features Are Important in Decision Tree Models

Question

Often-times stakeholders don't want a black-box model that's good at predicting; they want insights about features to have a better understanding about their business, and so they can explain it to others.

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?

Is there a way to explain not only what features are important but also WHY they're important?

I was told to use shap but running even some of the boilerplate examples throws errors so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away other than a plot_importance() plot).

In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed).

from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X,y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()

Update: What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!

I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see if a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y).

Update2: Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.

Partial Dependence Plots (PDP) were introduced by Friedman (2001) with purpose of interpreting complex Machine Learning algorithms. Interpreting a linear regression model is not as complicated as interpreting Support Vector Machine, Random Forest or Gradient Boosting Machine models, this is were Partial Dependence Plot can come into use. For some statistical explaination you can refer hereand More Advance. Some of the algorithms have methods for finding variable importance but they do not express whether a varaible is positively or negatively affecting the model .

score 1 · Answer 1 · answered Nov 04 '17 at 01:20

The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, a maximum-entropy criterion is often used. This means that the feature set is the one that allows classification with the fewer decision steps.

score 1 · Answer 2 · answered Nov 04 '17 at 01:58

1

When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?

Yes we do. Feature importance is not some magical object, it is a well defined mathematical criterion - its exact definition depends on particular model (and/or some additional choices), but it is always an object which tells "why". The "why" is usually the most basic thing possible, and boils down to "because it has the strongest predictive power". For example for random forest feature importance is a measure of how probable it is for this feature to be used on a decision path when randomly selected training data point is pushed through the tree. So it gives "why" in a proper, mathematical sense.

answered Nov 04 '17 at 01:58

lejlot

64,777
8
131
164

But I'm thinking more in terms of say a linear regression. In linear regression, we get coefficients. Some coefficients influence the y variable positively (positive coefficient) and some negatively (negative coefficient). In decision trees, can we determine a feature's positive or negative contribution to its predictiveness? And if so, an example would go a long way - procedurally how to tell a feature's positive or negative contribution. – Jarad Nov 04 '17 at 04:05
Programmatically, from the example above, how would you inspect the model to show that f19 contributes either positively or negatively? – Jarad Nov 04 '17 at 04:07
1

The "positive influence" is not a valid concept, it only works for trivial models - to be more precise with linear models. Anything more complex, that can deduce rules like "class 1 if x > 10 AND x < 20" make this concept useless, as "x" is not "positively" or "negatively" influencing. The problem lies in the concept, not model. You could try to relate it to P(x), and see how much of the probability mass falls into these intervals, but this requires disecting model to the set of classification rules, which is not feasible – lejlot Nov 04 '17 at 13:22
It seems I need to do some homework and probably dissect how decision trees work under the hood. Your comments have been helpful. – Jarad Nov 06 '17 at 02:40

score 1 · Accepted Answer · answered Nov 05 '17 at 15:32

tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html

I'd like to clear up some of the wording to make sure we're on the same page.

Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively correlated, i.e., does a change in the feature X cause the prediction y to increase/decrease

1. Predictive power

Your feature importance shows you what retains the most information, and are the most significant features. Power could imply what causes the biggest change - you would have to check by plugging in dummy values to see their overall impact, much like you would have to do with linear regression coefficients.

2. Correlation/Dependence

As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM are additively building a committee of stubs (decision trees with a low number of trees, usually only one split).

In a regression problem, the trees are typically using a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.

You'll see that at each step it calculates a vector for the "direction" of the weak learner, so you in principle know the direction of the influence from it (but keep in mind it may appear many times in one tree, in multiple steps of the additive model).

But, to cut to the chase; you could just fix all your features apart from f19 and make a prediction for a range of f19 values and see how it is related to the response value.

Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html

There's also a chapter on it in Elements of Statistical Learning, Chapter 10.13.2.

Is it unreasonable to ask what direction a feature has on the output variable? For example, say you were doing binary classification and feature f19 is "important". Then let's say you inspected the proportion of f19 value counts and saw it had a high proportion predicting a 1 instead of a 0. OK, so you might conclude it influences the output in a positive direction. But let's say you did the same thing and noticed it had a higher proportion of 0 (zeros). In this case, its influences is important because of what it lacks instead of what it helps predict. This is what I'm getting at. — Jarad, Nov 06 '17 at 02:48
You could do a partial dependency plot on the classification prediction, so you'd get `f19` values vs. prediction, for a fixed set of features. "Unreasonable to ask" is really context specific, in a stakeholder setting I think you could look at some important features in a dependency plot, but wouldn't ever attempt to explain the details, or push it as proof of causation. — jonnybazookatone, Nov 06 '17 at 05:10
I'm selecting your answer as it's the most actionable and directionable. Thank you! — Jarad, Nov 06 '17 at 23:21

Determine WHY Features Are Important in Decision Tree Models

3 Answers3