
I have tried this by reading How to get each individual tree's prediction in xgboost?

import numpy as np
import xgboost as xgb
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=1000)
model.fit(X_train, y_train)

# iterate over the per-iteration boosters and collect each one's prediction
booster_ = model.get_booster()
individual_preds = []
for tree_ in booster_:
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X_test)),
    )
individual_preds = np.vstack(individual_preds)

The results from the individual trees (centered at 0.5) are far away from the results of booster_.predict(xgb.DMatrix(X_test)). How can I get each individual tree's prediction value for an XGBoost regressor, and how can I make them comparable to the ensemble prediction?

Zhang Yongheng
  • What do you mean "*it seems*", and why *exactly* do you think that the returned value is a probability, when regressor trees by default do *not* return probabilities (only classification trees do so). – desertnaut Oct 13 '22 at 22:34
  • @desertnaut because those predictive values from the individual tree estimators are centered at zero and far from the predictive value of the ensembled model, but you are right, they may not be probabilities. It is just my guess – Zhang Yongheng Oct 13 '22 at 22:48
  • @desertnaut centered at 0.5* and ranges from 0-1, which is totally different than the results by doing ```booster_.predict(xgb.DMatrix(X_test))``` which ranges from -118 to 119 – Zhang Yongheng Oct 14 '22 at 03:31

2 Answers


From the xgboost API, iteration_range seems to be suitable for this request, if I have understood the question correctly:

iteration_range (Tuple[int, int]) –

Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

For illustration, I used the California housing data to train an XGB regressor model:

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X_train, X_valid, y_train, y_valid = train_test_split(housing.data, housing.target,
                                                      test_size=0.33, random_state=11)
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dvalid = xgb.DMatrix(data=X_valid, label=y_valid, feature_names=list(housing.feature_names))

# define model and train
params_reg = {"max_depth": 4, "eta": 0.3, "objective": "reg:squarederror", "subsample": 1}
xgb_model_reg = xgb.train(params=params_reg, dtrain=dtrain, num_boost_round=100,
                          early_stopping_rounds=20, evals=[(dtrain, "train")])

# predict
y_pred = xgb_model_reg.predict(dvalid)

The prediction for a randomly chosen row (index 500) is 1.9630624. I used iteration_range below to include one tree at a time and then displayed the prediction against each tree index:

for tree in range(0, 100):
    print(tree, xgb_model_reg.predict(dvalid, iteration_range=(tree, tree + 1))[500])

Here is the output extract:

0 0.9880972
1 0.5706124
2 0.59768033
3 0.51785016
4 0.58512527
5 0.5990092
6 0.6660166
7 0.46186835
8 0.5213114
9 0.5857907
10 0.4683379
11 0.54352343
12 0.46028078
13 0.4823497
14 0.51296484
15 0.49818778
16 0.50080884
...
97 0.5000746
98 0.49949
99 0.5004089
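
To reconcile these per-tree values with the full prediction of 1.9630624: every predict() call adds the model's base_score (the global intercept), even when iteration_range selects a single tree, which is why each value above hovers around 0.5. A minimal sketch of that reconciliation (the save_config() key path, the assumption that all 100 rounds were kept, and the tolerance are mine, not from the original answer):

import json
import numpy as np

# base_score is added to every predict() call, even a single-tree slice;
# read it back from the booster's JSON config (key path is an assumption)
config = json.loads(xgb_model_reg.save_config())
base_score = float(config["learner"]["learner_model_param"]["base_score"])

# per-tree raw contribution = single-tree prediction minus the intercept
contribs = np.stack([
    xgb_model_reg.predict(dvalid, iteration_range=(k, k + 1)) - base_score
    for k in range(100)  # assuming all 100 boosting rounds were kept
])

# summing the contributions and adding the intercept back once should
# reproduce the full ensemble prediction (up to floating-point error)
reconstructed = contribs.sum(axis=0) + base_score
print(np.allclose(reconstructed, xgb_model_reg.predict(dvalid), atol=1e-4))  # expected: True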
Heelara
  • Thanks! I ran your code and I can still see that the results from my y_pred (-2.62 for the first row) and the predictions for each tree (ranging from 0 to 1 for the first row) are far apart. Can you think of any reason why? It also seems like your 100 outputs range from 0 to 1 and are centered at 0.5 as well. @Heelara – Zhang Yongheng Oct 14 '22 at 12:54

I think I have mostly figured out how to construct individual predictions that sum up to the overall prediction.

The first thing is about base_score. According to https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters, if base_score is not set it is estimated, and it is difficult to retrieve and properly apply that estimate. So, to get predictable behavior from the boosters, I suggest explicitly setting the initial bias to zero. Second, all the transformations with sigmoids are only valid for classifiers; regressors do not need them at all.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import xgboost

housing = fetch_california_housing()
x_train, x_valid, y_train, y_valid = train_test_split(housing.data, housing.target, \
test_size = 0.33, random_state = 11)
reg = xgboost.XGBRegressor(n_estimators=7, base_score=0)  # explicitly set base_score to zero
reg.fit(x_train, y_train)

xm = xgboost.DMatrix(x_valid)
individual_preds = [booster.predict(xm) for booster in reg.get_booster()]
y = reg.predict(x_valid)
print(sum(individual_preds) - y)  # this should output a (near-)zero vector
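
For a floating-point-safe check of that claim, one possible sketch (the np.allclose tolerance below is an arbitrary choice, not from the original post):

import numpy as np

# with base_score=0 each booster slice returns only its trees' leaf values,
# so summing them should match the ensemble output up to float error
stacked = np.vstack(individual_preds)                    # shape: (n_estimators, n_samples)
print(np.allclose(stacked.sum(axis=0), y, atol=1e-5))    # expected: True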
Askold Ilvento