Using xgboost.Booster.predict I can only get the prediction for the whole ensemble, or the predicted leaf index of each tree. How can I get the prediction value of each individual tree?

K_Augus
  • https://stackoverflow.com/questions/37677496/how-to-get-access-of-individual-trees-of-a-xgboost-model-in-python-r – LocoGris Jan 28 '19 at 20:14
  • Can anyone show how to get the prediction value per tree for a regressor (hopefully also showing how to get each prediction's residual)? So far, all the answers are for classifiers. – Michael Anderson Jan 24 '23 at 21:10

3 Answers


Recently xgboost introduced a slicing API, so Raul's answer, while valid, is more complicated than necessary.

To get individual predictions all you need is to iterate through the booster object.

individual_preds = []
for tree_ in model.get_booster():
    # iterating over a Booster yields one single-tree Booster per boosting round
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
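
Equivalently, the slicing API mentioned above avoids the loop. A minimal sketch, assuming xgboost >= 1.3 (where Booster slicing was introduced) and the same fitted model:

# booster_[i : i + 1] is a Booster containing only the i-th tree
booster_ = model.get_booster()
individual_preds = [
    booster_[i : i + 1].predict(xgb.DMatrix(X))
    for i in range(model.n_estimators)
]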

Note, however, that those individual predictions are not individual contributions: summing them will not reproduce the final prediction, because each one has already been passed through the sigmoid. To aggregate them we need to transform them back into log-odds, sum, and apply the sigmoid again:

from scipy.special import expit as sigmoid, logit as inverse_sigmoid
individual_preds = np.vstack(individual_preds)
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)
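
This round trip works because, with the default base_score of 0.5, the global bias term logit(0.5) = 0 vanishes; with a non-default base_score each per-tree logit would contain one copy of it, and the extra copies would have to be subtracted from the sum.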

Here is a fully reproducible example, replicating Raul's answer below:

import numpy as np
import xgboost as xgb
from sklearn import datasets
from scipy.special import expit as sigmoid, logit as inverse_sigmoid

# Load data
iris = datasets.load_iris()
X, y = iris.data, (iris.target == 1).astype(int)

# Fit a model
model = xgb.XGBClassifier(
    n_estimators=10,
    max_depth=10,
    use_label_encoder=False,
    objective='binary:logistic'
)
model.fit(X, y)
booster_ = model.get_booster()

# Extract individual predictions
individual_preds = []
for tree_ in booster_:
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
individual_preds = np.vstack(individual_preds)

# Aggregate individual predictions into final predictions
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)

# Verify correctness
xgb_preds = booster_.predict(xgb.DMatrix(X))
np.testing.assert_almost_equal(final_preds, xgb_preds)
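
For a regressor there is no sigmoid, so the aggregation is simpler (this addresses the comment below). A sketch, assuming reg:squarederror and an explicitly fixed base_score; each single-tree prediction includes the base score once, so the duplicates must be stripped before summing:

import numpy as np
import xgboost as xgb
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
model = xgb.XGBRegressor(n_estimators=10, max_depth=3, base_score=0.5)
model.fit(X, y)
booster_ = model.get_booster()

# One prediction vector per tree, each equal to base_score + tree_i(x)
individual_preds = np.vstack(
    [tree_.predict(xgb.DMatrix(X)) for tree_ in booster_]
)

# Strip the repeated base score, sum the raw tree outputs, add it back once
tree_contribs = individual_preds - 0.5
final_preds = tree_contribs.sum(axis=0) + 0.5
np.testing.assert_allclose(
    final_preds, booster_.predict(xgb.DMatrix(X)), rtol=1e-4)

# Residual of every sample after each successive tree
cumulative_preds = np.cumsum(tree_contribs, axis=0) + 0.5
residuals = y - cumulative_preds  # shape: (n_trees, n_samples)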
Ufos
    Would you be willing/able to add an example for the XGBRegressor? Perhaps also showing residuals of the individual predictions... – Michael Anderson Jan 17 '23 at 20:23

xgboost.core.Booster has two methods that allow you to do this:

  1. First, xgboost.core.Booster.predict with the parameter pred_leaf set to True gives you the predicted leaf indices. Then it is just a matter of looking up the scores of those leaves.

  2. To get the leaf scores, we resort to the method xgboost.core.Booster.dump_model, which dumps the structure of the tree ensemble as plain text or JSON. The dump contains the leaf scores.

Below I show an example.

First, train an xgboost model on the Iris dataset.

import os
import json

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import datasets

# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
y = (y == 1).astype(int)

# Fit a model
n_estimators = 10
max_depth = 10
model = xgb.XGBClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth,
    min_child_weight=1)
model.fit(X, y)
booster = model.get_booster()

Then get the predicted leaf indices, one per sample per tree:

pred_leaf_index = booster.predict(
    xgb.DMatrix(X),
    pred_leaf=True
).reshape(X.shape[0], n_estimators)

To get the leaf scores, we dump the model as a JSON file. The resulting dump contains the tree structure.

# Dump the model and load the dump
model_json_path = '/tmp/model.json'
booster.dump_model(model_json_path, dump_format='json')
with open(model_json_path, 'r') as f:
    model_dict = json.loads(f.read())

Now comes perhaps the most complex part of the process. The following functions extract the leaf scores, first for a single tree and then for the entire ensemble:


def get_tree_leaf_scores(tree):
    """Retrieve a single tree leaf scores.

    Parameters
    ----------
    tree : dict
        A dictionary representing a single xgboost decision tree 
        (one item of the dump generated by `booster.dump_model`).

    Returns
    -------
    leafs : list
        The leaf nodes of the tree; each item is a dict containing
        (among other fields) the `nodeid` and its `leaf` score.
    """

    if 'leaf' in tree:
        return tree
    else:
        branch_0 = get_tree_leaf_scores(tree['children'][0])
        branch_1 = get_tree_leaf_scores(tree['children'][1])

        if not isinstance(branch_0, list):
            branch_0 = [branch_0]
        if not isinstance(branch_1, list):
            branch_1 = [branch_1]

        return branch_0 + branch_1

def get_trees_leaf_as_dataframe(model_dict):
    """Retrieve the tree ensemble leaf scores.

    Parameters
    ----------
    model_dict : dict
        The dictionary from loading the dump resulting from:
        `xgboost.core.Booster.dump_model`

    Returns
    -------
    trees_leaf_df : pandas.DataFrame
        Tree/node ids with their leaf score.
    """
    # Get tree nodes
    trees_leaf_df = []
    for tree_idx, tree in enumerate(model_dict):
        tree_leafs = get_tree_leaf_scores(tree)
        tree_leafs = pd.DataFrame(tree_leafs)
        tree_leafs['treeid'] = tree_idx

        trees_leaf_df.append(tree_leafs)

    trees_leaf_df = pd.concat(
        trees_leaf_df
    ).sort_values(['treeid', 'nodeid'])

    trees_leaf_df['id'] = \
        trees_leaf_df.apply(
            lambda x: '%s-%s' % (int(x['treeid']), int(x['nodeid'])), axis=1)

    trees_leaf_df = trees_leaf_df[
        ['treeid', 'nodeid', 'id', 'leaf']
    ].set_index('id')

    return trees_leaf_df

Here is how you get the leaf scores as a DataFrame:

trees_leaf_df = get_trees_leaf_as_dataframe(model_dict)
trees_leaf_df.head()

Out[1]: 
     treeid  nodeid      leaf
id
0-1       0       1 -0.555556
0-4       0       4 -0.528000
0-6       0       6 -0.120000
0-7       0       7  0.150000
0-8       0       8  0.550000

At this point we are ready to get the model predicted leaf scores, with the help of the following function:


def get_pred_leaf_scores(pred_leaf_index, trees_leaf_df):
    """
    Return
    ------
        The predicted leaf scores.
    """
    tree_ids = range(0, n_estimators)
    pred_leaf_scores = []
    for single_instance_pred_leafs in pred_leaf_index:
        tree_node_id_predictions = [
            '%s-%s' % (treeid, nodeid)
            for treeid, nodeid in zip(tree_ids, single_instance_pred_leafs)]

        single_instance_pred_leaf_scores = trees_leaf_df.loc[
            tree_node_id_predictions]['leaf'].values

        pred_leaf_scores.append(single_instance_pred_leaf_scores)

    pred_leaf_scores = pd.DataFrame(pred_leaf_scores)

    return pred_leaf_scores
pred_leaf_scores = get_pred_leaf_scores(pred_leaf_index, trees_leaf_df)
pred_leaf_scores
Out[2]:
            0         1         2  ...         7         8         9
0   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
1   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
2   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
3   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
4   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
..        ...       ...       ...  ...       ...       ...       ...
145 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
146 -0.528000 -0.410725 -0.374272  ... -0.024406 -0.236201 -0.185685
147 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
148 -0.528000 -0.410725 -0.374272  ... -0.250879 -0.236201 -0.215589
149 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543

[150 rows x 10 columns]    

If you want to make sure that the leaf scores yield the same probability predictions, do the following:

def from_leafs_scores_to_proba(pred_leaf_scores):
    """Turn per-tree leaf scores into positive-class probabilities."""

    # Sum the leaf scores across trees to get the logit per sample.
    logit = pred_leaf_scores.sum(axis=1)

    # Apply the logistic function to get the positive class probability.
    pos_class_probability = 1 / (1 + np.exp(-logit))

    return pos_class_probability

y_scores_from_leafs = from_leafs_scores_to_proba(pred_leaf_scores)

y_scores_from_leafs.values[:10]
Out[9]: 
array([0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579,
       0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579])
y_scores = model.predict_proba(X)[:, 1]
y_scores[:10]
Out[10]: 
array([0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578,
       0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578],
      dtype=float32)
Raul

A much simpler option is the following.

In Python, you can dump the trees as a list of strings:

Example:

m = xgb.XGBClassifier(max_depth=2, n_estimators=3).fit(X, y)
m.get_booster().get_dump()

This is what you'll get:

booster[0]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.0417] yes=3,no=4,missing=4
        3:leaf=-0.0965415
        4:leaf=-0.0679503
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0992546
        6:leaf=-0.0984374
booster[1]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<16.8917] yes=3,no=4,missing=4
        3:leaf=-0.0928132
        4:leaf=-0.0676056
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0945284
        6:leaf=-0.0937463
booster[2]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.175] yes=3,no=4,missing=4
        3:leaf=-0.0878571
        4:leaf=-0.0610089
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0904395
        6:leaf=-0.0896808
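
If you only need the per-tree leaf values, a small parsing step can pull them out of this dump. A sketch using re from the standard library (the regex is illustrative, matching the leaf=... fields shown above):

import re

leaf_re = re.compile(r'leaf=(-?\d+(?:\.\d+)?(?:[eE]-?\d+)?)')
per_tree_leaf_scores = [
    [float(v) for v in leaf_re.findall(tree_dump)]
    for tree_dump in m.get_booster().get_dump()
]
print(per_tree_leaf_scores[0])  # leaf scores of the first tree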
Vojtech Stas